Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Abstract
Self-evolving agents based on Large Language Models can deviate in unintended ways, leading to various risks such as safety misalignment and vulnerability introduction, necessitating new safety paradigms.
Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.
Community
We systematically propose and investigate the phenomenon of "Misevolution" for the first time. It refers to the case that self-evolving agents may "go astray" during the process of self-evolution, leading to unexpected risks and vulnerabilities on their own, even in the absence of external attacks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents (2025)
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems (2025)
- Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents (2025)
- Do Large Language Model Agents Exhibit a Survival Instinct? An Empirical Study in a Sugarscape-Style Simulation (2025)
- Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark (2025)
- SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs (2025)
- LM Agents May Fail to Act on Their Own Risk Knowledge (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper