The Era of Real-World Human Interaction: RL from User Conversations
Abstract
Reinforcement Learning from Human Interaction (RLHI) uses in-the-wild user conversations to improve conversational models, enhancing personalization and instruction-following through user-guided rewrites and persona-conditioned rewards.
We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, and (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.
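To make the persona-conditioned preference optimization concrete, one plausible instantiation (an assumption here; the paper's exact optimizer and notation may differ) is a DPO-style objective in which both the policy and the reference model are conditioned on the persona alongside the conversation context:

$$
\mathcal{L}_{\text{RLHI}}(\theta) = -\,\mathbb{E}_{(p,\,x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid p, x)}{\pi_{\mathrm{ref}}(y_w \mid p, x)} - \beta \log \frac{\pi_\theta(y_l \mid p, x)}{\pi_{\mathrm{ref}}(y_l \mid p, x)}\right)\right]
$$

where $p$ is the persona distilled from the user's long-term history, $x$ the current conversation context, and $(y_w, y_l)$ the preferred and dispreferred responses mined from that user's interactions.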
Community
Era of Real-World Human Interaction
- RL directly from User Conversations
- Organic replies + long-term history are learning signals
- Trained on WildChat, beats RLHF at the user level
- -> the future of personal superintelligence?
Recipe for RL from Human Interaction (RLHI):
- User-guided Rewrites:
  - An LLM detects whether a user's follow-up reply indicates a preference or gives clear signals for how to improve
  - The LLM then rewrites the prior response using that signal; the original and the rewrite form a preference pair
- User-based Rewards:
  - An LLM builds a user profile (persona) from long-term interaction history, and the reward model is conditioned on it
  - User-specific preference data is then built from that persona-conditioned reward model (see the sketch after this list)
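A minimal sketch of how these two data-construction steps might be wired together. The helper callables (`detect_feedback`, `rewrite_with_feedback`, `build_persona`, `persona_reward`, `sample_responses`) and the conversation attributes (`model_turns()`, `user_followup`, `user_history`) are hypothetical stand-ins for LLM calls and data accessors; the paper's actual prompts, models, and filtering rules are not reproduced here.

```python
# Sketch of the RLHI data-construction recipe (assumptions, not the paper's code).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    context: str        # conversation history up to the model turn
    persona: str        # long-term user profile ("" if unused)
    chosen: str
    rejected: str

def user_guided_rewrite_pairs(conversations, detect_feedback, rewrite_with_feedback):
    """RLHI with User-Guided Rewrites: mine organic follow-up replies for feedback."""
    pairs = []
    for conv in conversations:
        for turn in conv.model_turns():
            # LLM judge: did the user's follow-up signal how to improve the response?
            feedback = detect_feedback(turn.user_followup)
            if feedback is None:
                continue
            # LLM rewrite guided by the user's signal; original vs. rewrite = preference pair
            better = rewrite_with_feedback(turn.context, turn.model_response, feedback)
            pairs.append(PreferencePair(turn.context, "", chosen=better, rejected=turn.model_response))
    return pairs

def user_based_reward_pairs(conversations, build_persona, persona_reward, sample_responses):
    """RLHI with User-Based Rewards: rank candidates with a persona-conditioned reward model."""
    pairs = []
    for conv in conversations:
        persona = build_persona(conv.user_history)       # LLM-distilled long-term profile
        for turn in conv.model_turns():
            candidates = sample_responses(turn.context)  # e.g. samples from the current policy
            scored = sorted(candidates, key=lambda y: persona_reward(persona, turn.context, y))
            pairs.append(PreferencePair(turn.context, persona, chosen=scored[-1], rejected=scored[0]))
    return pairs
```

Both functions emit persona-tagged preference pairs, which could then feed a persona-conditioned objective like the one sketched under the abstract.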
Main Results:
- RLHI outperforms the Instruct model, RLHF applied on top of it, and rewriting without the implicit conversational feedback
- Evaluation uses WildChat users with >=10 multi-turn chats (WildChat UserEval); RLHI performs well on both user-personalization and instruction-following metrics
Additional Results:
- RLHI still performs well on standard (user-free) instruction-following benchmarks
- RLHI can also improve reasoning benchmarks when the user feedback is knowledgeable
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions (2025)
- Language Models Can Learn from Verbal Feedback Without Scalar Rewards (2025)
- HumAIne-Chatbot: Real-Time Personalized Conversational AI via Reinforcement Learning (2025)
- CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization (2025)
- T-POP: Test-Time Personalization with Online Preference Feedback (2025)
- PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions (2025)
- ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning (2025)