Large Language Models (LLMs) have recently shown significant advances in multi-step reasoning, establishing mathematical problem solving as a rigorous benchmark for evaluating advanced capabilities. While proprietary models such as GPT-4o and Claude Sonnet 4 lead in performance, their closed nature hinders transparency and reproducibility. To address these gaps, MiroMind AI has released the MiroMind-M1 series, a fully open-source pipeline spanning datasets, models, training code, and evaluation scripts, setting new standards for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.
Foundation and training strategy
MiroMind-M1 is built on the Qwen-2.5 backbone and tailored explicitly for mathematical reasoning. The team adopted a two-stage training protocol:
- Supervised fine-tuning (SFT): The model is first fine-tuned on carefully curated and verified mathematical problems, instilling strong step-by-step reasoning ability.
- Reinforcement learning with verifiable rewards (RLVR): The model then undergoes RL on 62K challenging and rigorously verifiable mathematical problems, leveraging the reward signal from a robust external verifier.
This approach is motivated by the need for strong mathematical reasoning and by lessons from leading reasoning language models: imitating high-quality reasoning traces improves general reasoning, while reinforcement learning further improves accuracy and efficiency under the guidance of precise rewards.
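To make the verifiable-reward idea concrete, here is a minimal sketch of a binary correctness reward based on the model's final boxed answer. The actual MiroMind-M1 verifier handles symbolic equivalence and messier formats; the extraction regex, function names, and numeric tolerance below are illustrative assumptions, not the project's code.

```python
# Minimal sketch of a verifiable reward for RLVR: reward 1.0 if the model's final
# \boxed{...} answer matches the reference, else 0.0. The real verifier is far more
# robust (units, pi, percentages, symbolic equality); this is a simplified assumption.
import re

def extract_final_answer(solution_text: str):
    """Pull the content of the last \\boxed{...} in a model solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_text)
    return matches[-1].strip() if matches else None

def verifiable_reward(solution_text: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(solution_text)
    if answer is None:
        return 0.0
    try:  # prefer numeric comparison when both sides parse as numbers
        return float(abs(float(answer) - float(reference_answer)) < 1e-6)
    except ValueError:
        return float(answer == reference_answer.strip())

print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... hence \boxed{x+1}", "x - 1"))         # 0.0
```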
Data transparency and quality
The MiroMind-M1 project is distinguished by the full openness and cleanliness of its training data:
- SFT corpus composition: Sourced from OpenR1, OpenThoughts, Light-R1, and Synthetic-1, ensuring problems with verified solutions and rich multi-step reasoning traces.
- Strict deduplication: N-gram overlap filtering eliminates duplicates and prevents data leakage from evaluation sets (e.g., AIME24, AIME25, MATH500); a minimal sketch of this filtering appears after this list.
- Preference for long traces: Experiments show that training samples with longer reasoning traces consistently yield higher benchmark scores, highlighting the importance of deep reasoning content in the supervision signal.
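Below is a minimal sketch of n-gram overlap filtering against evaluation sets, in the spirit of the deduplication step above. The n-gram size, overlap threshold, and function names are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch of n-gram overlap filtering to remove near-duplicates and
# potential evaluation-set leakage (e.g., AIME24/25, MATH500 problems).
# N-gram size and threshold are illustrative assumptions.

def ngrams(text: str, n: int = 10) -> set:
    """Return the set of word-level n-grams for a normalized text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate: str, reference: str, n: int = 10) -> float:
    """Fraction of the candidate's n-grams that also appear in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    return len(cand & ref) / len(cand)

def filter_training_set(train_problems, eval_problems, n=10, threshold=0.6):
    """Drop training problems whose n-gram overlap with any eval problem is too high."""
    kept = []
    for problem in train_problems:
        if all(overlap_ratio(problem, ev, n) < threshold for ev in eval_problems):
            kept.append(problem)
    return kept

# Toy usage: the first training problem is a verbatim copy of an eval problem and is dropped.
train = ["Find the remainder when 2^10 is divided by 7.",
         "Prove the AM-GM inequality for two variables."]
evals = ["Find the remainder when 2^10 is divided by 7."]
print(filter_training_set(train, evals, n=3, threshold=0.5))
```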
The final dataset provides 719K verified training trajectories, a clear step forward in openness and reproducibility compared with prior efforts.
Supervised fine-tuning: design and results
For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with an extended context window (up to 32,768 tokens) and a no-packing strategy that avoids cross-sample attention contamination (see the sketch after the table below). It outperforms comparable open models on key math benchmarks:
| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-Distill | 55.5 | 40.4 | 92.8 |
| MiMo-7B-SFT | 58.7 | 44.3 | 93.0 |
| MiroMind-SFT-7B | 60.4 | 45.0 | 94.6 |
These results validate the efficacy of the data curation and training design: richer, deeper samples and the no-packing strategy lead to consistently superior performance.
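The following is a minimal sketch of the no-packing idea mentioned above: each sample is padded to the batch maximum rather than concatenated with other samples, so self-attention never spans two different problems. The pad id, shapes, and function name are illustrative assumptions, not MiroMind-M1's training code.

```python
# Minimal sketch of a no-packing collate function: samples are padded independently
# and given per-sample attention masks, so attention cannot cross sample boundaries.
import torch

PAD_ID = 0  # assumed pad token id for illustration

def no_packing_collate(tokenized_samples: list, max_len: int = 32768):
    """Pad each sample independently and build a per-sample attention mask."""
    batch_len = min(max(len(s) for s in tokenized_samples), max_len)
    input_ids, attention_mask = [], []
    for sample in tokenized_samples:
        sample = sample[:batch_len]
        pad = batch_len - len(sample)
        input_ids.append(sample + [PAD_ID] * pad)
        attention_mask.append([1] * len(sample) + [0] * pad)  # padded positions masked out
    return torch.tensor(input_ids), torch.tensor(attention_mask)

ids, mask = no_packing_collate([[5, 6, 7], [8, 9]])
print(ids.shape, mask.tolist())  # torch.Size([2, 3]) [[1, 1, 1], [1, 1, 0]]
```

The trade-off is throughput: packing several short samples into one sequence is faster, but it lets tokens from one problem attend to another unless block-diagonal masking is used, which is the contamination the no-packing strategy avoids.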
CAMPO: Context-Aware Multi-Stage Policy Optimization
The key innovation in MiroMind-M1's RLVR phase is the CAMPO algorithm. CAMPO addresses two key RL challenges, training instability and token inefficiency, through:
- Multi-stage training with expanding context limits: Training starts with a constrained output length (e.g., 16K tokens) and gradually increases it to allow deeper reasoning, balancing efficiency and thoroughness.
- Dynamic repetition penalty: A dedicated repetition critic penalizes prematurely or excessively repetitive outputs, preventing reward collapse and preserving diversity in generated solutions (a minimal sketch of these mechanisms follows this list).
- Accurate external verifier: An improved verifier robustly checks mathematical answers (including tricky cases with units, π, and percentages), ensuring the training signal is closely aligned with true correctness.
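Here is a minimal sketch of two CAMPO-style ingredients, a staged output-length schedule and a repetition penalty folded into the verifier reward. The stage boundaries, penalty weight, and n-gram size are illustrative assumptions and not the published algorithm's exact formulation.

```python
# Minimal sketch of CAMPO-style mechanics: (1) an expanding cap on generation length
# across training stages, (2) a repetition penalty combined with verifier correctness.
# Stage steps, lengths, and weights are illustrative assumptions.

def max_length_for_step(step: int, stages=((0, 16384), (2000, 32768))) -> int:
    """Return the output-length cap for the current training step (expands over stages)."""
    cap = stages[0][1]
    for start_step, length in stages:
        if step >= start_step:
            cap = length
    return cap

def repetition_fraction(token_ids: list, n: int = 4) -> float:
    """Fraction of n-grams in the output that repeat an earlier n-gram."""
    grams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def shaped_reward(correct: bool, token_ids: list, penalty_weight: float = 0.5) -> float:
    """Combine verifier correctness with a penalty for repetitive generations."""
    base = 1.0 if correct else 0.0
    return base - penalty_weight * repetition_fraction(token_ids)

print(max_length_for_step(100), max_length_for_step(5000))        # 16384 32768
print(shaped_reward(True, [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]))  # correct but penalized for repetition
```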
CAMPO not only stabilizes RL dynamics but also trains the model to solve problems with fewer, more relevant tokens, speeding up inference and reducing cost without sacrificing accuracy.
Benchmark performance: state-of-the-art efficiency
MiroMind's open models achieve highly competitive or state-of-the-art results among open Qwen-2.5-based math reasoning models (7B/32B parameters):
| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-7B | 55.5 | 39.2 | – |
| MiMo-7B-RL | 68.2 | 55.4 | 95.8 |
| Skywork-OR1-7B | 72.2 | 54.6 | – |
| MiroMind-M1-RL-7B | 73.4 | 57.8 | 96.7 |
| Skywork-OR1-32B | 77.1 | 68.2 | 97.5 |
| MiroMind-M1-RL-32B | 77.5 | 65.6 | 96.4 |
Notably, the MiroMind-M1-RL models not only match or exceed peer accuracy but also operate with higher token efficiency: thanks to CAMPO training, the 32B model produces shorter, more concise solutions without losing correctness.
Complete stack and reproducibility
Every component of the MiroMind-M1 stack is publicly released:
- Model weights (SFT and RL checkpoints at 7B and 32B scales)
- Datasets (the full 719K-sample SFT corpus and the 62K-problem RLVR set)
- Training scripts (supporting multi-node distributed training with Ray)
- Evaluation code (standardized scripts and benchmark configurations)
Researchers can replicate, audit, and extend MiroMind-M1 from raw data to trained models, improving reproducibility and accelerating open research.
Conclusion
MiroMind-M1 demonstrates that through careful data curation, an innovative RL algorithm (CAMPO), and a commitment to transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. The project sets a new standard for reproducibility and collaborative advancement of reasoning LLMs, providing high-quality resources and a strong platform for future innovation.
Check out the Paper, GitHub page, and models on Hugging Face. All credit for this research goes to the researchers of the project. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.
