Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Abstract
Off-policy reinforcement learning for large language models is explored through a new derivation of group-relative REINFORCE, offering insights into importance sampling, clipping, and data-weighting strategies.
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance for off-policyness, we present in this work a first-principles derivation of group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
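To ground the terms the abstract uses, the sketch below shows a generic group-relative, clipped surrogate loss in PyTorch: rewards are normalized within a group of responses to the same prompt, and per-token importance ratios between the trained policy and the (possibly stale) behavior policy are clipped before being weighted by the advantage. This is a minimal illustration under assumed tensor shapes and names (`grpo_style_loss`, `clip_eps`, etc.), not the paper's implementation; see the linked Trinity-RFT example for the actual code.

```python
import torch

def grpo_style_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """Minimal sketch of a group-relative, clipped surrogate loss (GRPO-style).

    logp_new: (G, T) per-token log-probs of G sampled responses under the
              policy being trained.
    logp_old: (G, T) per-token log-probs under the behavior policy that
              generated the data (possibly stale, i.e. off-policy).
    rewards:  (G,) scalar reward per response in the group.
    mask:     (G, T) float mask, 1.0 for response tokens, 0.0 for padding.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    adv = adv.unsqueeze(-1)                                    # (G, 1)

    # Token-level importance ratio between trained and behavior policy.
    ratio = torch.exp(logp_new - logp_old)                     # (G, T)

    # Clipped surrogate objective (to be maximized).
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.minimum(unclipped, clipped)

    # Average over valid tokens; negate to obtain a loss to minimize.
    return -(surrogate * mask).sum() / mask.sum()
```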
Community
We reveal a native off-policy interpretation of group-relative REINFORCE and discuss several of its implications: demystifying myths about the roles of importance sampling and clipping in GRPO, reinterpreting two recent algorithms (OPMD and AsymRE) as regularized forms of the REINFORCE loss, and providing principled justification for data-weighting heuristics.
Code: https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k
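As a companion to the sketch above, the following hypothetical snippet illustrates the first principle, regularizing policy updates: a plain REINFORCE term on off-policy data plus a simple sample-based penalty on deviation from the behavior policy. It conveys the general idea of a regularized REINFORCE loss only; it is not the exact OPMD or AsymRE objective, and `kl_coef` and the tensor shapes are assumptions carried over from the previous sketch.

```python
import torch

def regularized_reinforce_loss(logp_new, logp_old, rewards, mask, kl_coef=0.1):
    """Hypothetical sketch: REINFORCE on off-policy data with a penalty that
    discourages drifting away from the behavior policy. Shapes as in
    grpo_style_loss above (this is an illustrative example, not the paper's loss)."""
    # Group-relative advantages, broadcast to the token level.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)

    # Plain REINFORCE term: advantage-weighted negative log-likelihood.
    reinforce = -(adv * logp_new * mask).sum() / mask.sum()

    # Simple sample-based estimate of KL(behavior || trained) on the sampled
    # tokens; minimizing it pulls the trained policy toward the data.
    kl_penalty = ((logp_old - logp_new) * mask).sum() / mask.sum()

    return reinforce + kl_coef * kl_penalty
```

The regularization weight (`kl_coef` here) governs how much off-policyness is tolerated before the update is pulled back toward the data-generating policy; stronger regularization trades learning speed for stability when the data grows stale.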
The following papers, recommended by the Semantic Scholar API (via the automated Librarian Bot), are similar to this paper:
- It Takes Two: Your GRPO Is Secretly DPO (2025)
- Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs? (2025)
- FlowRL: Matching Reward Distributions for LLM Reasoning (2025)
- HAEPO: History-Aggregated Exploratory Policy Optimization (2025)
- Towards a Unified View of Large Language Model Post-Training (2025)
- CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning (2025)
- One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient (2025)