Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
Abstract
Self-evolving LLM agents can abandon alignment constraints post-deployment, leading to rapid misalignment and collective failure in multi-agent systems.
As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
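The two paradigms named in the abstract can be made concrete with a toy simulation. The sketch below is purely illustrative and not the paper's testbed or training setup: the reward values, the preference-update rule, the imitation rule, and the `Agent` class are all hypothetical assumptions chosen to show how repeated high-reward deviations (Self-Interested Exploration) and copying of successful peers (Imitative Strategy Diffusion) can jointly push an initially aligned population toward deviance.

```python
import random

# Toy sketch (not the paper's implementation) of the two ATP paradigms:
# agents repeatedly choose between an "aligned" and a "deviant" action,
# shift their preference toward whatever paid off (Self-Interested
# Exploration), and occasionally copy a more successful peer's strategy
# (Imitative Strategy Diffusion). All constants below are assumptions.

ALIGNED_REWARD = 0.5   # assumed: modest, steady payoff for compliant behavior
DEVIANT_REWARD = 1.0   # assumed: higher short-term payoff for violating a constraint
LEARNING_RATE = 0.1    # how fast a single success shifts the preference
IMITATION_RATE = 0.2   # probability of copying a better-performing neighbor

class Agent:
    def __init__(self):
        self.p_deviate = 0.05   # starts almost fully aligned
        self.total_reward = 0.0

    def act(self):
        deviate = random.random() < self.p_deviate
        reward = DEVIANT_REWARD if deviate else ALIGNED_REWARD
        self.total_reward += reward
        # Self-Interested Exploration: a deviation that beats the aligned
        # baseline makes future deviation more likely.
        if deviate and reward > ALIGNED_REWARD:
            self.p_deviate = min(1.0, self.p_deviate + LEARNING_RATE)
        return reward

def step(agents):
    for a in agents:
        a.act()
    # Imitative Strategy Diffusion: agents sometimes adopt the strategy of a
    # randomly chosen peer who has accumulated more reward.
    for a in agents:
        peer = random.choice(agents)
        if peer.total_reward > a.total_reward and random.random() < IMITATION_RATE:
            a.p_deviate = max(a.p_deviate, peer.p_deviate)

if __name__ == "__main__":
    random.seed(0)
    population = [Agent() for _ in range(20)]
    for _ in range(200):
        step(population)
    mean_p = sum(a.p_deviate for a in population) / len(population)
    print(f"mean deviation probability after 200 steps: {mean_p:.2f}")
```

In this toy setting the population's mean deviation probability climbs from near zero toward one, echoing the feedback-driven decay of alignment that the abstract describes; the actual ATP experiments instead evolve LLM agents (Qwen3-8B and Llama-3.1-8B-Instruct) in controllable testbeds.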
Community
🚨 Introducing ATP — the Alignment Tipping Process!
🔥 Beware: Self-evolution is quietly pushing LLM agents off the rails.
⚠️ Even an agent that is perfectly aligned at deployment can gradually forget human alignment and shift toward self-serving strategies. Over time, LLM agents stop following human values, imitate bad strategies, and even spread misaligned behaviors to others!
🧠 Alignment isn’t static — it’s fragile, dynamic, and decays through experience. Let’s rethink alignment as something to maintain, not just achieve!
📄 Paper: https://arxiv.org/abs/2510.04860
🔗 Code & data: https://github.com/aiming-lab/ATP

This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions (2025)
- Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment (2025)
- Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning (2025)
- AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning (2025)
- Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents (2025)
- Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning (2025)
- Interactive Learning for LLM Reasoning (2025)