arXiv:2509.02492

GRAM-R^2: Self-Training Generative Foundation Reward Models for Reward Reasoning

Published on Sep 2, 2025

Abstract

AI-generated summary

GRAM-R^2, a generative reward model, uses self-training on unlabeled data to produce preference labels and reward rationales, achieving strong performance across a range of tasks with minimal or no additional fine-tuning.

Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R^2, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R^2 can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R^2 consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
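
The abstract describes GRAM-R^2 as a generative reward model that outputs a reward rationale together with a preference label, and a self-training procedure that pseudo-labels unlabeled data to elicit this reasoning. The sketch below is purely illustrative of that interface, not the paper's released code or exact method: the checkpoint path, prompt template, "Preferred: A/B" tag, and swap-consistency filter are all assumptions made for the example.

```python
# Illustrative sketch only: checkpoint name, prompt format, and filtering rule
# are assumptions, not the paper's released interface.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/generative-reward-model"  # hypothetical placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

PROMPT = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{resp_a}\n\n"
    "Response B:\n{resp_b}\n\n"
    "Explain which response better follows the instruction, then end with "
    "'Preferred: A' or 'Preferred: B'."
)

def judge(instruction: str, resp_a: str, resp_b: str) -> tuple[str, str | None]:
    """Generate a reward rationale plus a pairwise preference label."""
    inputs = tokenizer(
        PROMPT.format(instruction=instruction, resp_a=resp_a, resp_b=resp_b),
        return_tensors="pt",
    )
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens (the rationale and the final label).
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"Preferred:\s*([AB])", text)
    return text, (match.group(1) if match else None)

def pseudo_label(unlabeled_pairs):
    """Self-training-style pseudo-labeling: keep a pair only if its label is
    stable when the two responses are swapped (a simple consistency filter;
    the paper's actual selection criteria may differ)."""
    kept = []
    for instruction, resp_a, resp_b in unlabeled_pairs:
        rationale, label = judge(instruction, resp_a, resp_b)
        _, swapped = judge(instruction, resp_b, resp_a)
        if label and swapped and label != swapped:  # consistent under swapping
            kept.append({
                "instruction": instruction,
                "chosen": resp_a if label == "A" else resp_b,
                "rejected": resp_b if label == "A" else resp_a,
                "rationale": rationale,
            })
    return kept
```

The retained pseudo-labeled pairs, each carrying a rationale, could then be used for response ranking or as supervision for further reward-model fine-tuning, in the spirit of the downstream uses the abstract lists.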

Models citing this paper: 2

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 0
