Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training • Paper • 2509.03403 • Published 4 days ago • 18
Self-Rewarding Vision-Language Model via Reasoning Decomposition • Paper • 2508.19652 • Published 12 days ago • 79
Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback • Paper • 2310.11550 • Published Oct 17, 2023 • 1