---
license: mit
library_name: transformers
---

# ReasonFlux-PRM

[Code](https://github.com/Gen-Verse/ReasonFlux) | [Paper](https://arxiv.org/abs/2506.18896)

We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) explicitly designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. ReasonFlux-PRM supports both offline and online reward supervision: it selects high-quality training data for model distillation, provides dense process-level rewards for policy optimization during reinforcement learning, and enables reward-guided test-time scaling.
| Model | Type | Size | Capabilities | Use Cases | Download |
| --- | --- | --- | --- | --- | --- |
| ReasonFlux-PRM | PRM | 7B | • Trajectory-aware scoring<br>• Online/offline supervision<br>• Dense process rewards | Data selection, RL training, test-time scaling | 🤗 7B |
| ReasonFlux-PRM | PRM | 1.5B | • Lightweight scoring<br>• Efficient inference<br>• Edge deployment | Resource-constrained applications | 🤗 1.5B |
| ReasonFlux-PRM-Qwen-2.5 | End-to-end trained policy model | 7B | • Long CoT reasoning<br>• Solving complex tasks and problems | Math and science reasoning | 🤗 7B |
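
## Quick Start

Below is a minimal sketch of how one might load the PRM with `transformers` and score a reasoning trajectory step by step to obtain dense process-level rewards. It assumes the checkpoint loads as a single-logit sequence-classification reward head and that the Hub id is `Gen-Verse/ReasonFlux-PRM-7B`; both are assumptions, so please consult the [ReasonFlux repository](https://github.com/Gen-Verse/ReasonFlux) for the exact scoring interface.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hub id and head type; see the ReasonFlux repo for the official usage.
model_id = "Gen-Verse/ReasonFlux-PRM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

question = "What is 12 * 13?"
steps = [
    "First, 12 * 10 = 120.",
    "Then, 12 * 3 = 36.",
    "Adding them gives 120 + 36 = 156.",
]

# Score each prefix of the trajectory: the reward after step i serves as a
# dense process-level signal for that step.
with torch.no_grad():
    for i in range(1, len(steps) + 1):
        trajectory = question + "\n" + "\n".join(steps[:i])
        inputs = tokenizer(trajectory, return_tensors="pt").to(model.device)
        # Assumes a single-logit reward head, so logits squeeze to a scalar.
        reward = model(**inputs).logits.squeeze().item()
        print(f"reward after step {i}: {reward:.4f}")
```

The same scoring loop covers the offline and online use cases above: rank candidate trajectories by their final reward to select distillation data or to pick the best of N sampled responses at test time, or feed the per-step rewards into an RL objective during policy optimization.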