---
pipeline_tag: image-text-to-text
library_name: transformers
license: apache-2.0
---

[arXiv](https://arxiv.org/abs/2509.21268) | [Paper page](https://huggingface.co/papers/2509.21268) | [GitHub](https://github.com/LengSicong/MMR1)

# MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

This repository introduces the **MMR1** family of multimodal reasoning models, presented in the paper "[MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources](https://huggingface.co/papers/2509.21268)". MMR1 addresses two critical limitations in the development of large multimodal reasoning models: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms during post-training.

## Key Contributions

- **Variance-Aware Sampling (VAS)**: A data selection strategy guided by a Variance Promotion Score (VPS). VAS combines outcome variance and trajectory diversity to promote reward variance, stabilize policy optimization, and improve convergence, especially in scenarios where Group Relative Policy Optimization (GRPO) is prone to gradient vanishing.
- **Large-scale Curated Resources**: Carefully curated resources, including ~1.6M long-CoT cold-start examples and ~15k RL question-answer pairs, designed to ensure quality, difficulty, and diversity.
- **Open-source Codebase & Models**: A fully reproducible end-to-end training codebase and a family of open-source multimodal reasoning models at multiple scales (3B, 7B, 32B), establishing standardized baselines for the community.

## Methodology Overview

MMR1 introduces **Variance-Aware Sampling (VAS)** to mitigate the *gradient vanishing problem* in reinforcement learning fine-tuning with GRPO: when all rollouts for a prompt receive the same reward, the group-normalized advantages are all zero and the prompt contributes no learning signal. The framework balances exploration and coverage by combining a random sampler with a weighted sampler guided by the Variance Promotion Score (VPS).
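To make the sampling idea concrete, here is a minimal sketch of VPS-weighted prompt selection. This is an illustration under stated assumptions, not the released implementation: the functional form of VPS, the weight parameters, and the mixing ratio between the random and weighted samplers are all assumptions.

```python
import random
import statistics

def variance_promotion_score(rewards, diversity, w_var=1.0, w_div=1.0):
    """Sketch of a VPS: combine the variance of rollout rewards for a
    prompt with a trajectory-diversity term. Weights and the exact
    combination are assumptions for illustration."""
    var = statistics.pvariance(rewards) if len(rewards) > 1 else 0.0
    return w_var * var + w_div * diversity

def sample_prompts(prompts, vps, batch_size, mix=0.5, rng=random):
    """Mix a uniform random sampler with a VPS-weighted sampler, as the
    framework description suggests (the mixing ratio is an assumption)."""
    n_weighted = int(batch_size * mix)
    n_random = batch_size - n_weighted
    total = sum(vps)
    # Fall back to uniform sampling if every VPS is zero.
    weights = [v / total for v in vps] if total > 0 else None
    batch = rng.choices(prompts, weights=weights, k=n_weighted)
    batch += rng.choices(prompts, k=n_random)  # exploration component
    return batch
```

A prompt whose rollouts all receive the same reward gets zero outcome variance and is down-weighted, while prompts with mixed outcomes are sampled more often.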
VPS scores are periodically re-estimated during training, so that sampling adapts dynamically and keeps focusing on prompts that provide strong learning signals.

## Open Resources

The project open-sources the following resources for the community:

- **[MMR1-SFT Dataset](https://huggingface.co/datasets/MMR1/MMR1-SFT)** (~1.6M): Supervised fine-tuning dataset with long chain-of-thought (CoT) cold-start trajectories.
- **[MMR1-RL Dataset](https://huggingface.co/datasets/MMR1/MMR1-RL)** (~15k): Reinforcement learning dataset of question-answer pairs.
- **[MMR1-3B-SFT](https://huggingface.co/MMR1/MMR1-3B-SFT)**, **[MMR1-7B-SFT](https://huggingface.co/MMR1/MMR1-7B-SFT)**, **[MMR1-32B-SFT](https://huggingface.co/MMR1/MMR1-32B-SFT)**: Checkpoints trained with MMR1-SFT.
- **[MMR1-3B-RL](https://huggingface.co/MMR1/MMR1-3B-RL)**, **[MMR1-7B-RL](https://huggingface.co/MMR1/MMR1-7B-RL)**, **[MMR1-32B-RL](https://huggingface.co/MMR1/MMR1-32B-RL)**: Checkpoints trained with MMR1-SFT and MMR1-RL.

These resources span diverse domains, including mathematics, science, charts/figures, document tables, and general understanding, and integrate existing public resources with newly curated data.

## Evaluation Results

MMR1 models have been evaluated on a suite of **mathematics-related multimodal reasoning benchmarks** (MathVerse, MathVista, MathVision, LogicVista, and ChartQA).
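As background for the gradient-vanishing issue that motivates VAS in the Methodology section above, the sketch below shows GRPO's group-relative advantage in its standard form (reward minus group mean, divided by group standard deviation; the `eps` stabilizer value is an assumption). When every rollout in a group earns the same reward, all advantages collapse to zero and the prompt yields no gradient.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its group (eps avoids division by zero)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with uniform rewards (all correct or all wrong) produces zero
# advantages for every rollout -- the degenerate case VAS down-weights.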