PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning Paper • 2508.21104 • Published 11 days ago • 28
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning Paper • 2509.02479 • Published 6 days ago • 78