arXiv:2509.25137

The Era of Real-World Human Interaction: RL from User Conversations

Published on Sep 29
· Submitted by Chuanyang Jin on Sep 30
Authors: Chuanyang Jin, Bo Liu, et al.
AI-generated summary

Reinforcement Learning from Human Interaction (RLHI) uses in-the-wild user conversations to improve conversational models, enhancing personalization and instruction-following through user-guided rewrites and persona-conditioned rewards.

Abstract

We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.
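The abstract's "persona-conditioned preference optimization" can be read as a standard pairwise preference objective whose prompts are prefixed with a summary of the user's long-term persona. The sketch below illustrates that reading only; the DPO-style loss, the prompt format, and all names are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of persona-conditioned preference optimization, assuming a
# DPO-style pairwise loss and persona conditioning via prompt concatenation.
# Function and variable names are illustrative, not from the paper's code.
import torch
import torch.nn.functional as F

def build_prompt(persona_summary: str, conversation_context: str) -> str:
    # Condition on the user's long-term persona by prepending it to the context.
    return f"[User persona]\n{persona_summary}\n\n[Conversation]\n{conversation_context}"

def persona_dpo_loss(policy_chosen_logp: torch.Tensor,
                     policy_rejected_logp: torch.Tensor,
                     ref_chosen_logp: torch.Tensor,
                     ref_rejected_logp: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective applied to persona-conditioned preference pairs.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```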

Community

Paper author Paper submitter

🌀 Era of Real-World Human Interaction 🌀

  • RL directly from User Conversations
  • Organic replies + long-term history are learning signals
  • Trained on WildChat, beats RLHF at user level
  • -> the future for personal Super Intelligence?

Recipe for RL from Human Interaction (RLHI) 🧑‍🍳:

  1. User-guided Rewrites:
  • An LLM detects whether a user's follow-up response indicates a preference or gives clear signals for how to improve
  • The LLM rewrites the response using that signal; the rewrite and the original form a preference pair (see the sketch after this list)
  2. User-based Rewards:
  • An LLM builds a user profile (persona) from long-term interaction history and incorporates it into the reward model
  • User-specific preference data is then built from these persona-conditioned reward scores
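A minimal sketch of how these two steps could yield turn-level preference pairs follows. Here `llm_judge`, `llm_rewrite`, and `reward_model` are placeholder callables (assumptions), not the authors' actual components; the paper's exact prompts and filtering will differ.

```python
# Illustrative sketch of the two RLHI data-construction steps above.
# llm_judge, llm_rewrite, and reward_model are placeholder callables.

def user_guided_rewrite_pair(context, model_reply, user_followup,
                             llm_judge, llm_rewrite):
    """Step 1: turn an organic user follow-up into a preference pair."""
    # Keep only follow-ups that carry a clear signal for how to improve.
    if not llm_judge(context=context, reply=model_reply, followup=user_followup):
        return None
    # Rewrite the unsatisfactory reply according to that signal.
    improved = llm_rewrite(context=context, original=model_reply,
                           feedback=user_followup)
    return {"prompt": context, "chosen": improved, "rejected": model_reply}

def user_based_reward_pair(persona, context, candidates, reward_model):
    """Step 2: rank candidate replies with a persona-conditioned reward model."""
    scored = sorted(candidates,
                    key=lambda reply: reward_model(persona, context, reply),
                    reverse=True)
    return {"prompt": context, "chosen": scored[0], "rejected": scored[-1]}
```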

Main Results:

  • RLHI outperforms the Instruct baseline, RLHF applied on top of it, and rewriting without implicit conversational feedback
  • Evaluation uses WildChat users with >= 10 multi-turn chats (WildChat UserEval); RLHI performs well on both user-personalization and instruction-following metrics

Additional Results:

  • RLHI still performs well on standard (user-free) instruction-following benchmarks
  • RLHI can also improve reasoning benchmarks, given knowledgeable user feedback


