MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
This repository introduces the MMR1 family of multimodal reasoning models, presented in the paper "MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources".
MMR1 addresses two critical obstacles to progress in large multimodal reasoning models: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms during post-training.
Key Contributions
- Variance-Aware Sampling (VAS): A novel data selection strategy guided by a Variance Promotion Score (VPS). VAS combines outcome variance and trajectory diversity to promote reward variance, stabilize policy optimization, and improve convergence, especially in scenarios where Group Relative Policy Optimization (GRPO) is prone to gradient vanishing (an illustrative sketch of the scoring idea follows this list).
- Large-scale Curated Resources: The project releases ~1.6M long CoT cold-start examples and ~15k RL QA pairs, curated for quality, difficulty, and diversity.
- Open-source Codebase & Models: A fully reproducible end-to-end training codebase and a family of open-source multimodal reasoning models across multiple scales (3B, 7B, 32B), establishing standardized baselines for the community.
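To make the VPS idea concrete, here is a rough illustrative sketch that combines the two ingredients named above: the variance of per-rollout outcome rewards for a prompt and a trajectory-diversity term. The exact scoring function, diversity measure, and weighting used in MMR1 are defined in the paper and codebase; the function name, the Jaccard-based diversity proxy, and the mixing weight below are assumptions made only for illustration.

```python
import numpy as np

def variance_promotion_score(rewards, trajectories, alpha=0.5):
    """Illustrative VPS: outcome-reward variance plus a trajectory-diversity term.

    rewards      : list of scalar outcome rewards for G rollouts of one prompt.
    trajectories : list of token-id sequences, one per rollout.
    alpha        : assumed mixing weight between the two terms (not from the paper).
    """
    rewards = np.asarray(rewards, dtype=float)
    outcome_var = rewards.var()  # zero when all rollouts succeed or all fail

    # Diversity proxy: average pairwise dissimilarity of rollouts, measured as
    # 1 - Jaccard overlap of their token sets. The paper may use a different
    # measure; this is only a placeholder.
    div, pairs = 0.0, 0
    for i in range(len(trajectories)):
        for j in range(i + 1, len(trajectories)):
            a, b = set(trajectories[i]), set(trajectories[j])
            div += 1.0 - len(a & b) / max(len(a | b), 1)
            pairs += 1
    diversity = div / pairs if pairs else 0.0

    return alpha * outcome_var + (1.0 - alpha) * diversity
```

Prompts whose rollouts all succeed or all fail receive a low score, which is exactly the regime where GRPO's group-normalized advantages collapse and gradients vanish.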
Methodology Overview
MMR1 introduces Variance-Aware Sampling (VAS) to mitigate the gradient vanishing problem in reinforcement learning fine-tuning with GRPO. The framework balances exploration and coverage by combining a random sampler with a weighted sampler guided by the Variance Promotion Score (VPS). This ensures that training focuses on prompts providing strong learning signals, with VPS scores periodically re-estimated for dynamic adaptation.
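A minimal sketch of how such a mixed sampler might look, assuming a fixed mixing ratio and per-prompt VPS values that are refreshed every few training steps. The function name, the 50/50 ratio, and the use of Python's standard samplers are illustrative assumptions, not the repository's API.

```python
import random

def sample_batch(prompts, vps_scores, batch_size, mix_ratio=0.5):
    """Draw a training batch from a uniform sampler and a VPS-weighted sampler.

    prompts    : list of prompt records.
    vps_scores : VPS values aligned with `prompts`, re-estimated periodically.
    mix_ratio  : assumed fraction of the batch drawn by the weighted sampler.
    """
    n_weighted = int(batch_size * mix_ratio)
    n_random = batch_size - n_weighted

    # Exploration / coverage: uniform sampling keeps the full prompt pool in play.
    random_part = random.sample(prompts, n_random)

    # Exploitation: sample (with replacement) proportionally to VPS so that
    # high-variance prompts, which carry the strongest learning signal under
    # GRPO, are visited more often.
    weighted_part = random.choices(prompts, weights=vps_scores, k=n_weighted)

    return random_part + weighted_part
```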
Open Resources
The project open-sources the following resources for the community:
- MMR1-SFT Dataset (~1.6M): Supervised fine-tuning dataset with long Chain-of-Thought (CoT) cold-start trajectories.
- MMR1-RL Dataset (~15k): Reinforcement learning dataset with question-answer pairs.
- MMR1-3B-SFT, MMR1-7B-SFT, MMR1-32B-SFT: Checkpoints trained with MMR1-SFT.
- MMR1-3B-RL, MMR1-7B-RL, MMR1-32B-RL: Checkpoints trained with MMR1-SFT and MMR1-RL.
These resources cover diverse domains, including mathematics, science, charts/figures, document tables, and general understanding, integrating existing public resources with newly curated data.
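If the datasets are hosted on the Hugging Face Hub, they should be loadable with the standard `datasets` library. The repository IDs below are assumptions based on the project name; check the GitHub repository or the organization page for the exact paths.

```python
from datasets import load_dataset

# Assumed Hub repository IDs; adjust to the actual dataset paths.
sft_data = load_dataset("MMR1/MMR1-SFT", split="train")  # ~1.6M long-CoT cold-start examples
rl_data = load_dataset("MMR1/MMR1-RL", split="train")    # ~15k RL question-answer pairs

# Inspect the available fields of one example.
print(sft_data[0].keys())
```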
Evaluation Results
MMR1 models have been evaluated on a suite of mathematics-related multimodal reasoning benchmarks (MathVerse, MathVista, MathVision, LogicVista, and ChartQA).
- MMR1-7B-RL achieves an average score of 58.4, setting a new state-of-the-art among 7B-scale reasoning models.
- MMR1-3B-RL achieves an average score of 52.7, demonstrating strong reasoning capability even at a smaller scale. These results underscore the effectiveness of Variance-Aware Sampling (VAS) and the curated long CoT training data.
For detailed instructions on installation, training, and further evaluation, please refer to the GitHub repository.
Citation
If you find MMR1 useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{leng2025mmr1,
      title={MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources},
      author={Sicong Leng and Jing Wang and Jiaxi Li and Hao Zhang and Zhiqiang Hu and Boqiang Zhang and Yuming Jiang and Hang Zhang and Xin Li and Lidong Bing and Deli Zhao and Wei Lu and Yu Rong and Aixin Sun and Shijian Lu},
      year={2025},
      eprint={2509.21268},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.21268},
}
```
License
This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use only, subject to the model license of Qwen, the Terms of Use of data generated by OpenAI and Gemini, and the privacy practices of ShareGPT. Please contact us if you find any potential violations.