arXiv:2509.25137

The Era of Real-World Human Interaction: RL from User Conversations

Published on Sep 29
· Submitted by Chuanyang Jin on Sep 30
Authors: Chuanyang Jin, Bo Liu, et al.
AI-generated summary

Reinforcement Learning from Human Interaction (RLHI) uses in-the-wild user conversations to improve conversational models, enhancing personalization and instruction-following through user-guided rewrites and persona-conditioned rewards.

Abstract

We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.
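The abstract's "persona-conditioned preference optimization" can be read as a standard pairwise preference objective whose prompts are prefixed with a summary of the user's long-term persona. The sketch below illustrates that reading only; the DPO-style loss, the prompt format, and all names are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of persona-conditioned preference optimization, assuming a
# DPO-style pairwise loss and persona conditioning via prompt concatenation.
# Function and variable names are illustrative, not from the paper's code.
import torch
import torch.nn.functional as F

def build_prompt(persona_summary: str, conversation_context: str) -> str:
    # Condition on the user's long-term persona by prepending it to the context.
    return f"[User persona]\n{persona_summary}\n\n[Conversation]\n{conversation_context}"

def persona_dpo_loss(policy_chosen_logp: torch.Tensor,
                     policy_rejected_logp: torch.Tensor,
                     ref_chosen_logp: torch.Tensor,
                     ref_rejected_logp: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective applied to persona-conditioned preference pairs.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```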

Community

Paper author Paper submitter

🌀 Era of Real-World Human Interaction 🌀

  • RL directly from User Conversations
  • Organic replies + long-term history are learning signals
  • Trained on WildChat, beats RLHF at user level
  • -> the future for personal Super Intelligence?

Recipe for RL from Human Interaction (RLHI) 🧑‍🍳:

  1. User-guided Rewrites:
  • An LLM detects whether a user's follow-up response indicates a preference or gives clear signals for how to improve
  • The LLM rewrites the response using that signal; the rewrite and the original form a preference pair (see the sketch after this list)
  2. User-based Rewards:
  • An LLM builds a user profile (persona) from long-term interaction history and incorporates it into the reward model
  • User-specific preference data is then built from these persona-conditioned reward scores
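A minimal sketch of how these two steps could yield turn-level preference pairs follows. Here `llm_judge`, `llm_rewrite`, and `reward_model` are placeholder callables (assumptions), not the authors' actual components; the paper's exact prompts and filtering will differ.

```python
# Illustrative sketch of the two RLHI data-construction steps above.
# llm_judge, llm_rewrite, and reward_model are placeholder callables.

def user_guided_rewrite_pair(context, model_reply, user_followup,
                             llm_judge, llm_rewrite):
    """Step 1: turn an organic user follow-up into a preference pair."""
    # Keep only follow-ups that carry a clear signal for how to improve.
    if not llm_judge(context=context, reply=model_reply, followup=user_followup):
        return None
    # Rewrite the unsatisfactory reply according to that signal.
    improved = llm_rewrite(context=context, original=model_reply,
                           feedback=user_followup)
    return {"prompt": context, "chosen": improved, "rejected": model_reply}

def user_based_reward_pair(persona, context, candidates, reward_model):
    """Step 2: rank candidate replies with a persona-conditioned reward model."""
    scored = sorted(candidates,
                    key=lambda reply: reward_model(persona, context, reply),
                    reverse=True)
    return {"prompt": context, "chosen": scored[0], "rejected": scored[-1]}
```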

Main Results:

  • RLHI outperforms the Instruct baseline, RLHF applied on top of it, and rewriting without implicit conversational feedback
  • Evaluation uses WildChat users with >= 10 multi-turn chats (WildChat UserEval); RLHI performs well on both user-personalization and instruction-following metrics

Additional Results:

  • RLHI still performs well on standard (user-free) instruction-following benchmarks
  • RLHI can also improve reasoning benchmarks, given knowledgeable user feedback


