Abstract
Mixed Advantage Policy Optimization (MAPO) dynamically reweights the advantage function to improve trajectory ranking in reinforcement learning for foundation models.
Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking trajectory importance. However, existing approaches suffer from both the advantage reversion and the advantage mirror problem, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple yet effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We show that trajectories exhibit different degrees of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparisons with related state-of-the-art methods, along with ablation studies on different advantage variants, validate the effectiveness of our approach.
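The abstract does not spell out the exact formulas, so the following is a minimal sketch of the idea under stated assumptions: the standard GRPO advantage is taken as the within-group z-score of the reward, the "advantage percent deviation" is read as the relative deviation from the group mean reward, and trajectory certainty is approximated by how far the group's success rate is from 0.5. The mixing rule and the certainty measure here are illustrative assumptions, not the paper's definitive formulation.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard GRPO advantage: z-score of each trajectory reward within its group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def percent_deviation_advantage(rewards):
    """Advantage as relative (percent) deviation from the group mean reward.
    This is one possible reading of the paper's "advantage percent deviation"."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean()
    return (rewards - mean) / (abs(mean) + 1e-8)

def mixed_advantage(rewards, correct_mask):
    """Blend the two advantage variants according to trajectory certainty.

    Certainty is approximated here by how far the group's success rate p
    is from 0.5 (an assumption; the paper's certainty measure may differ).
    High-certainty groups lean on the percent-deviation advantage,
    low-certainty groups on the standard z-score advantage.
    """
    p = np.asarray(correct_mask, dtype=float).mean()
    certainty = 2.0 * abs(p - 0.5)  # 0 = maximally uncertain, 1 = fully certain
    a_std = grpo_advantage(rewards)
    a_pct = percent_deviation_advantage(rewards)
    return certainty * a_pct + (1.0 - certainty) * a_std

# Example: a group of 4 sampled trajectories for one query
rewards = [1.0, 1.0, 0.0, 1.0]   # outcome rewards
correct = [1, 1, 0, 1]           # correctness labels used for the certainty estimate
print(mixed_advantage(rewards, correct))
```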
Community
We propose a mixed advantage strategy for GRPO to address the advantage reversion and advantage mirror problems.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- COPO: Consistency-Aware Policy Optimization (2025)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- G2RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance (2025)
- FlowRL: Matching Reward Distributions for LLM Reasoning (2025)
- Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting (2025)
- DCPO: Dynamic Clipping Policy Optimization (2025)
- PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning (2025)
Models citing this paper: 10
Datasets citing this paper: 2
Spaces citing this paper: 0