Large reasoning models (LRMs) are improving rapidly, showing impressive performance on complex problem-solving tasks across fields such as mathematics, coding, and scientific reasoning. However, current evaluation methods focus primarily on single-question testing, which has significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem solving and better reflect their real-world multi-context reasoning capabilities.
Why Current Evaluation Benchmarks Fall Short for Large Reasoning Models
Most current benchmarks (such as GSM8K and MATH) evaluate LRMs by asking one question at a time. Although effective for initial model development, this isolated-question approach faces two key drawbacks:
- Reduced discriminative power: Many state-of-the-art LRMs now achieve near-perfect scores on popular benchmarks (for example, DeepSeek-R1 reaches 97% accuracy on MATH500). These saturated results make it increasingly difficult to distinguish genuine model improvements, forcing the expensive, continuous creation of harder datasets to tell models apart.
- Lack of real-world multi-context evaluation: Real-world applications (such as educational tutoring, technical support, or multitasking AI assistants) require reasoning over multiple, potentially interfering questions at the same time. Single-question testing does not capture these dynamic, multi-problem challenges that reflect true cognitive load and reasoning robustness.


Introducing REST: Stress-Testing LRMs with Multiple Problems at Once

To address these challenges, researchers from Tsinghua University, OpenDataLab, Shanghai AI Laboratory, and Renmin University developed REST, a simple yet powerful evaluation method that tests LRMs on multiple questions simultaneously by bundling them into a single prompt.
- Multi-problem benchmark reconstruction: REST repurposes existing benchmarks by concatenating multiple questions into one prompt, with an adjustable stress level parameter that controls how many questions are posed at once (see the sketch after this list).
- Holistic capability assessment: REST evaluates key reasoning abilities beyond basic problem-solving, including contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management.
- Broad applicability: The framework is validated on 34 advanced LRMs ranging from 1.5 billion to 671 billion parameters, across 7 benchmarks spanning a range of difficulty levels, from the simple GSM8K to the challenging AIME and GPQA.
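To make the bundling idea concrete, here is a minimal sketch in Python. It is an illustration only: the function name build_rest_prompt, the instruction wording, and the "Question N:" labeling are assumptions for demonstration, not the authors' exact prompt template.

```python
from typing import List

def build_rest_prompt(questions: List[str], stress_level: int) -> List[str]:
    """Bundle `stress_level` benchmark questions into each prompt.

    The stress level controls how many questions appear at once; the exact
    instruction text below is illustrative, not the paper's template.
    """
    prompts = []
    for start in range(0, len(questions), stress_level):
        batch = questions[start:start + stress_level]
        numbered = "\n\n".join(
            f"Question {i + 1}: {q}" for i, q in enumerate(batch)
        )
        header = (
            f"Solve the following {len(batch)} problems. "
            "Answer each one separately and label your answers."
        )
        prompts.append(f"{header}\n\n{numbered}")
    return prompts

# Example: repack a single-question benchmark at stress level 3.
bench = ["What is 2 + 2?", "Factor x^2 - 1.", "Compute 7 * 8.", "What is 10 / 4?"]
for p in build_rest_prompt(bench, stress_level=3):
    print(p, end="\n\n---\n\n")
```

Because the questions come from existing benchmarks, this kind of repackaging raises difficulty without requiring any new data collection.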
REST Reveals Key Insights About LRM Reasoning Capabilities
Evaluations with REST uncovered several groundbreaking findings:
1. Significant performance degradation under multi-problem stress
Even the most advanced LRMs, such as DeepSeek-R1, show a notable drop in accuracy when handling multiple questions at once. For example, under REST, DeepSeek-R1's accuracy on challenging benchmarks such as AIME24 drops by nearly 30% compared with isolated-question testing. This contradicts the prior assumption that large language models can effortlessly multitask across problems.
2. Enhanced discrimination between similar models
REST greatly amplifies the differences between models with nearly identical single-question scores. For example, on MATH500:
- R1-7B and R1-32B achieve close single-question accuracies of 93% and 94.6%, respectively.
- Under REST, R1-7B's accuracy drops to 66.75% while R1-32B stays high at 88.97%, revealing a stark gap of more than 22 percentage points.
Similarly, for same-sized models such as AReaL-boba-RL-7B and OpenThinker2-7B, REST captures significant differences in multi-problem handling that single-question evaluation masks.
3. Post-training methods may not guarantee robust multi-problem reasoning
Models fine-tuned with reinforcement learning or supervised fine-tuning on single-problem reasoning often fail to retain their advantages in REST's multi-question setting. This calls for rethinking training strategies to optimize reasoning robustness in realistic multi-context scenarios.
4. "Long2short" training enhances performance under stress
Models trained with "long2short" techniques, which encourage concise and efficient reasoning chains, maintain higher accuracy under REST. This suggests a promising avenue for designing models better suited to simultaneous multi-problem reasoning.
How REST Simulates Realistic Reasoning Challenges
By increasing the cognitive load on LRMs through simultaneously presented problems, REST simulates real-world demands in which reasoning systems must dynamically prioritize, avoid overthinking any one problem, and resist interference from concurrent tasks.
REST also systematically analyzes error types, revealing common failure modes, such as:
- Question omission: Ignoring later questions in a multi-question prompt.
- Summary errors: Incorrectly summarizing answers across questions.
- Reasoning errors: Logical or computational mistakes within the reasoning process.
These nuanced insights are largely invisible in single-question assessments.
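As an illustration of how one of these failure modes might be detected automatically, the hypothetical sketch below (not from the paper) flags question omission by checking whether each expected answer label appears in a model's response; the "Answer N:" labeling convention is an assumption.

```python
import re

def find_omitted_questions(response: str, num_questions: int) -> list:
    """Return the 1-based indices of questions that have no labeled answer.

    Assumes the prompt asked the model to label answers as 'Answer 1:',
    'Answer 2:', ... which is an illustrative convention, not the paper's
    exact evaluation protocol.
    """
    answered = {
        int(m.group(1))
        for m in re.finditer(r"Answer\s+(\d+)\s*:", response, flags=re.IGNORECASE)
    }
    return [i for i in range(1, num_questions + 1) if i not in answered]

# A response that silently drops the third question is flagged as an omission.
response = "Answer 1: 4\nAnswer 2: (x-1)(x+1)"
print(find_omitted_questions(response, num_questions=3))  # -> [3]
```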
Evaluation Setup and Benchmark Coverage
- REST evaluated 34 LRMs spanning 1.5B to 671B parameters.
- The benchmarks for testing include:
  - Simple: GSM8K
  - Medium: MATH500, AMC23
  - Challenging: AIME24, AIME25, GPQA Diamond, LiveCodeBench
- Model generation parameters follow each model's official guidelines, with an output token limit of 32K for reasoning models.
- The standardized OpenCompass toolkit is used to ensure consistent, reproducible results.
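For concreteness, the snippet below sketches how such an evaluation sweep might be organized. Apart from the 32K output limit and the benchmark names taken from the setup above, everything here is an assumption: the sampling values, stress levels, and dictionary structure are illustrative and do not reflect OpenCompass's actual configuration syntax.

```python
# Hypothetical sweep over benchmarks and stress levels.
# Values marked "assumed" are illustrative, not the paper's published settings.
generation_config = {
    "max_output_tokens": 32_768,   # 32K output limit for reasoning models
    "temperature": 0.6,            # assumed; follow each model's official guidelines
    "top_p": 0.95,                 # assumed
}

stress_levels = [2, 4, 8]          # assumed example counts of questions per prompt

benchmarks = {
    "simple": ["GSM8K"],
    "medium": ["MATH500", "AMC23"],
    "challenging": ["AIME24", "AIME25", "GPQA Diamond", "LiveCodeBench"],
}

for level in stress_levels:
    for tier, names in benchmarks.items():
        for name in names:
            print(f"Evaluating {name} ({tier}) at stress level {level} "
                  f"with max_output_tokens={generation_config['max_output_tokens']}")
```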


Conclusion: REST as a Realistic, Future-Proof LRM Evaluation Paradigm
REST represents a significant leap forward in evaluating large reasoning models:
- Solves benchmark saturation: Revitalizes existing datasets without requiring expensive full replacements.
- Reflects real-world multitasking demands: Tests models under realistic, high cognitive load conditions.
- Guides model development: Highlights the importance of training methods such as long2short for alleviating overthinking and encouraging adaptive reasoning prioritization.
All in all, REST paves the way for more reliable and relevant benchmarking of next-generation reasoning AI systems.
Check out the Paper, Project Page, and Code. All credit for this research goes to the researchers on the project.

Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technology and its real-world impact. He aims to explain complex AI concepts in a clear and accessible way.