Shanghai AI Lab Releases OREAL-7B and OREAL-32B: Advancing Mathematical Reasoning through Outcome Reward-Based Reinforcement Learning

Due to the complexity of problem solving and the need for structured logical thinking, mathematical reasoning remains a difficult area for artificial intelligence (AI). Despite significant progress, large language models (LLMs) often struggle with tasks that require multi-step reasoning. Reinforcement learning (RL) shows promise for improving these abilities, but when rewards are sparse and binary, traditional methods struggle: the model receives little feedback beyond whether its final answer is correct or incorrect.
Shanghai AI Laboratory has developed Outcome Reward-based Reinforcement Learning (OREAL), a series of mathematical reasoning models released as OREAL-7B and OREAL-32B. The framework is designed for settings where only binary rewards (correct or incorrect) are available. Unlike conventional RL methods that rely on dense feedback, OREAL uses best-of-N (BoN) sampling for behavior cloning and reshapes negative rewards to maintain gradient consistency.
OREAL-7B and OREAL-32B demonstrate that smaller models can achieve performance competitive with significantly larger ones. OREAL-7B scores 94.0% pass@1 on the MATH-500 benchmark, a result comparable to earlier 32B models, and OREAL-32B reaches 95.0% pass@1, surpassing previous models trained by distillation.
Technical insights and advantages
The OREAL framework introduces several key techniques to improve mathematical reasoning:
- Best-of-N sampling for behavior cloning: BoN sampling selects the best positive reasoning trajectories, allowing the model to learn from well-formed, correct solutions (see the first sketch after this list).
- Reward reshaping for negative samples: by adjusting the rewards assigned to incorrect samples, the framework keeps gradients consistent between correct and incorrect trajectories, refining model optimization (also shown in the first sketch below).
- Token-level reward model for long reasoning chains: mathematical reasoning often involves long sequences of logical steps. OREAL assigns importance weights to critical reasoning tokens, addressing the challenge of sparse binary feedback (see the second sketch below).
- On-policy reinforcement learning: the model dynamically improves itself based on responses it samples for training queries, improving training efficiency and adaptability.
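To make the first two ideas concrete, here is a minimal toy sketch of a best-of-N training step under binary outcome rewards: solutions are sampled on-policy, verified, and split into behavior-cloning targets (correct) and reweighted negatives (incorrect). The helper functions and the specific reshaping factor are assumptions for illustration only, not the authors' implementation.

```python
# Toy sketch (illustrative assumption, not the authors' code): a single
# best-of-N training step with binary outcome rewards.
import random

def sample_solutions(question, n=16):
    """Stand-in for sampling n reasoning chains from the current policy."""
    return [f"candidate solution {i} for: {question}" for i in range(n)]

def verify(solution):
    """Stand-in for the binary outcome verifier (True = final answer correct)."""
    return random.random() < 0.3  # pretend ~30% of samples are correct

def training_step(question, n=16):
    solutions = sample_solutions(question, n)
    positives = [s for s in solutions if verify(s)]
    negatives = [s for s in solutions if s not in positives]

    # Behavior cloning: imitate only the correct (best-of-N) trajectories.
    bc_targets = positives

    # Reward reshaping: scale the penalty on incorrect samples so the total
    # weight on negatives matches the total weight on positives, keeping the
    # gradient contributions of the two sides consistent. This factor is one
    # plausible choice, not the paper's exact formulation.
    neg_weight = len(positives) / max(len(negatives), 1)
    reshaped_negatives = [(s, -neg_weight) for s in negatives]

    return bc_targets, reshaped_negatives

if __name__ == "__main__":
    targets, negatives = training_step("What is 17 * 23?")
    print(f"{len(targets)} behavior-cloning targets, "
          f"{len(negatives)} negatives reweighted to "
          f"{negatives[0][1] if negatives else 0}")
```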
Together, these techniques provide more stable training and better performance on long-horizon reasoning tasks, making reinforcement learning a viable alternative to conventional distillation methods.
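The token-level weighting idea can be illustrated just as simply. In the sketch below, a stand-in heuristic plays the role of a learned token-level reward model and re-weights a per-token loss so that pivotal reasoning tokens contribute more to the update; the function names and the scoring rule are hypothetical, not OREAL's actual reward model.

```python
# Toy sketch (illustrative assumption): applying per-token importance weights,
# as a token-level reward model would produce, to a per-token loss so that
# pivotal reasoning steps dominate the update.
from typing import List

def token_importance(tokens: List[str]) -> List[float]:
    """Stand-in for a learned token-level reward model: a crude heuristic
    that up-weights tokens marking key reasoning moves."""
    key_tokens = {"=", "therefore", "thus", "substitute"}
    return [2.0 if t.lower() in key_tokens else 1.0 for t in tokens]

def weighted_sequence_loss(per_token_loss: List[float], tokens: List[str]) -> float:
    """Importance-weighted average of the per-token loss."""
    weights = token_importance(tokens)
    return sum(w * l for w, l in zip(weights, per_token_loss)) / sum(weights)

if __name__ == "__main__":
    tokens = ["x", "+", "3", "=", "7", ",", "therefore", "x", "=", "4"]
    per_token_loss = [0.8, 0.2, 0.1, 0.4, 0.3, 0.1, 0.9, 0.5, 0.4, 0.6]
    print(round(weighted_sequence_loss(per_token_loss, tokens), 3))
```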
Performance and evaluation
The OREAL models have been evaluated on several benchmarks:
- MATH-500 benchmark:
  - OREAL-7B achieves 94.0% pass@1, a performance level previously seen only in 32B models.
  - OREAL-32B achieves 95.0% pass@1, setting a new standard for mathematical reasoning.
- AIME 2024 and OlympiadBench:
  - The OREAL models outperform multiple baselines, showing strong generalization across problem types.
- Comparison with OpenAI o-series and DeepSeek models:
  - OREAL-32B surpasses DeepSeek-R1-Distill-Qwen-32B and OpenAI o1-preview, demonstrating the effectiveness of its training strategy.
  - OREAL-7B achieves results comparable to QwQ-32B-Preview and OpenAI o1-mini, underscoring the impact of its reinforcement learning approach.

Conclusion
Shanghai AI Laboratory's OREAL-7B and OREAL-32B models offer a refined approach to reinforcement learning for mathematical reasoning. By addressing the challenge of sparse binary rewards with best-of-N sampling, reward reshaping, and token-level importance weighting, these models achieve competitive performance even at smaller scales. The OREAL framework provides valuable insights into how reinforcement learning can be optimized for complex reasoning tasks, pointing to new directions for improving AI's problem-solving capabilities in structured domains.
Check out the Paper, OREAL-7B, and OREAL-32B. All credit for this research goes to the researchers on the project.
