This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving

Reasoning is a fundamental capability in artificial intelligence, spanning commonsense understanding, mathematical problem solving, and symbolic reasoning. These tasks often require multiple steps of logical inference, which large language models (LLMs) attempt to emulate through structured methods such as chain-of-thought (CoT) prompting. However, as LLMs grow in size and capability, they tend to produce long outputs on every task regardless of difficulty, resulting in significant inefficiency. The field has been working to balance reasoning depth against computational cost, while also ensuring that models can adapt their reasoning strategy to the needs of each problem.
A key problem with current reasoning models is their inability to tailor the reasoning process to different task complexities. Most models, including well-known ones such as OpenAI's o1 and DeepSeek-R1, apply a uniform strategy, typically relying on long CoT across all tasks. This leads to an "overthinking" problem, where the model generates unnecessarily detailed explanations for simple tasks. Because excessive reasoning can introduce irrelevant information, it not only wastes resources but can also reduce accuracy. Approaches such as prompt-guided generation or token budget estimation attempt to mitigate this problem. Nevertheless, these methods are limited by their reliance on predefined assumptions, which are not always reliable.
Attempts to solve these problems include methods such as Group Relative Policy Optimization (GRPO), length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to "format collapse," in which models increasingly rely on long CoT and squeeze out more efficient formats such as short CoT or direct answers. Length-control techniques, such as those applied in ThinkPrune, cap output length during training or inference, but often at the expense of accuracy, especially on complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for adaptive approaches.
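To make this concrete, the sketch below shows the group-relative advantage computation at the core of GRPO: several completions are sampled per prompt, and each is scored against the statistics of its own group, so no separate value network is needed. This is a minimal illustration assuming a simple 0/1 correctness reward, not the exact implementation used in the paper.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled completion is scored
    against the mean/std of its own group, with no value network."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four completions sampled for one prompt, rewarded 1 if correct else 0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```

Because only correctness is rewarded, the format that most reliably yields correct answers (long CoT) tends to crowd out the rest, which is the format collapse described above.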
A team of researchers at Fudan University and Ohio State University introduced the Adaptive Reasoning Model (ARM), which dynamically adjusts the reasoning format based on task difficulty. ARM supports four reasoning formats: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem solving, and Long CoT for deep multi-step reasoning. It runs in Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided modes for explicit control or aggregation across formats. The key innovation lies in its training process, which uses Ada-GRPO, an extension of GRPO that introduces a format-diversity reward mechanism. This prevents long CoT from dominating and ensures that ARM continues to explore and use simpler reasoning formats when appropriate.
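The snippet below illustrates how the three inference modes could be exposed at the prompt level. The control tags and the consensus rule here are hypothetical placeholders: the paper trains special format markers during SFT, and its exact consensus procedure may differ from this simplified fallback logic.

```python
FORMATS = ["direct_answer", "short_cot", "code", "long_cot"]  # ARM's four reasoning formats

def build_prompt(question, mode="adaptive", fmt=None):
    """Illustrative prompt construction for ARM's three inference modes."""
    if mode == "adaptive":
        return question                        # the model chooses the format itself
    if mode == "instruction_guided":
        assert fmt in FORMATS
        return f"<{fmt}>\n{question}"          # hypothetical tag forcing one format
    raise ValueError(f"unknown mode: {mode}")

def consensus_answer(question, generate):
    """Consensus-guided mode (simplified): run the cheaper formats first and
    fall back to Long CoT only when their answers disagree."""
    cheap = [generate(build_prompt(question, "instruction_guided", f))
             for f in ("direct_answer", "short_cot", "code")]
    if len(set(cheap)) == 1:                   # all cheap formats agree
        return cheap[0]
    return generate(build_prompt(question, "instruction_guided", "long_cot"))
```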
The ARM approach is built on a two-stage framework. First, the model undergoes supervised fine-tuning (SFT) on 10.8K problems, each annotated with the four reasoning formats, drawn from datasets such as AQuA-Rat and generated using tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptability. Ada-GRPO is applied in the second stage, where the model receives a scaled-up reward for using less frequent formats such as Direct Answer or Short CoT. A decay factor ensures that this reward gradually shifts back toward accuracy as training progresses, preventing a long-term bias toward inefficient exploration. This design enables ARM to avoid format collapse and to dynamically match reasoning strategies to task difficulty, balancing efficiency and performance.
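A rough sketch of the second-stage reward shaping is shown below. It scales the correctness reward inversely with how often a format appears in the sampled group, then anneals that bonus away over training. The exact scaling and decay schedule in Ada-GRPO may differ, so treat this as an illustration of the idea rather than the paper's formula.

```python
def ada_grpo_reward(base_reward, fmt, group_formats, step, total_steps):
    """Illustrative Ada-GRPO-style reward shaping (not the paper's exact formula).
    Rarely sampled formats get a boosted reward so they are not squeezed out,
    and the boost decays toward the plain accuracy reward as training proceeds."""
    freq = group_formats.count(fmt) / len(group_formats)  # share of this format in the group
    rarity_bonus = 1.0 / max(freq, 1e-6)                  # rarer formats -> larger scaling
    decay = 1.0 - step / total_steps                      # anneal the diversity bonus to zero
    scale = decay * rarity_bonus + (1.0 - decay)
    return scale * base_reward
```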
ARM shows impressive results across a variety of benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduces token usage by an average of 30% compared to models that rely solely on long CoT, and by up to 70% on simpler tasks. ARM also achieves a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieves 75.9% accuracy on the challenging AIME'25 task while using 32.5% fewer tokens. Compared with the Qwen2.5-SFT+GRPO model, ARM-14B achieves 85.6% accuracy on OpenBookQA and 86.4% on the MATH dataset, with token usage reduced by more than 30%. These figures show that ARM maintains competitive performance while delivering significant efficiency gains.
Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling adaptive format selection based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that the model no longer wastes computation on overthinking. Instead, ARM offers a flexible, practical balance between accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and effective large language models.
Check out the Paper, Model on Hugging Face, and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
