AbstRaL: Improving LLM robustness on GSM benchmarks by teaching abstract reasoning

Recent research shows that LLMs, especially smaller ones, often struggle with robust reasoning. They tend to perform well on familiar problems but falter when those same problems are altered slightly, for example by changing names or numbers, or by adding irrelevant but plausible-sounding information. This weakness, known as poor out-of-distribution (OOD) generalization, causes notable accuracy drops even on simple math tasks. One promising remedy is to create synthetic variations of reasoning problems, helping models learn to focus on the underlying logic rather than surface details. Strengthening reasoning in this way is crucial for developing more general and reliable AI systems.
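To make that failure mode concrete, the toy Python snippet below generates surface variants of a single GSM-style word problem by swapping names and numbers and optionally inserting a distractor sentence. The template, names, and distractor text are invented for illustration; the actual GSM8K perturbation suites are more systematic.

```python
# Toy illustration of the surface perturbations that trip up LLMs:
# same underlying logic, different names/numbers, optional distractor.
import random

TEMPLATE = (
    "{name} has {a} apples. {name} buys {b} more. "
    "{distractor}How many apples does {name} have now?"
)

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(["Alice", "Bob", "Priya", "Wei"])
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    # An irrelevant but plausible-looking sentence: the answer is unchanged,
    # yet such distractors measurably degrade LLM accuracy.
    distractor = rng.choice(["", f"{name}'s friend has {rng.randint(2, 20)} oranges. "])
    return TEMPLATE.format(name=name, a=a, b=b, distractor=distractor), a + b

question, answer = make_variant(0)
print(question, "->", answer)
```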

The core fragility of LLM reasoning under distribution shift

LLMs show impressive reasoning ability but often falter when exposed to distribution shifts, such as changes in wording or numerical values, or the introduction of distractions. This vulnerability is evident across benchmarks in logic, mathematics, and commonsense reasoning. Prior solutions rely on large-scale data augmentation to expose models to a broader variety of inputs, which improves robustness but increases computational demands. Researchers have also explored structured formats, such as abstraction-of-thought rationales, to teach abstract reasoning, alongside planning techniques such as chain-of-thought prompting and tool-assisted reasoning. Reinforcement learning and preference-based approaches provide additional support for developing reasoning skills that go beyond pattern memorization.

AbstRaL's symbolic learning method to improve reasoning consistency

Researchers from Apple and EPFL propose AbstRaL, which teaches LLMs to grasp abstract reasoning patterns rather than memorize surface details. Instead of generating many varied training examples, which is computationally expensive, AbstRaL uses reinforcement learning to help LLMs learn the underlying structure of reasoning problems. The method then connects these abstract patterns to symbolic tools, enabling more reliable problem solving. Tested on GSM benchmarks, AbstRaL significantly improves LLM performance, especially in the face of input changes or distracting information, and it outperforms models trained only with standard supervision by promoting more consistent, context-independent reasoning.

Four steps to reasoning with AbstRaL's abstract symbols

AbstRaL is a four-step framework designed to teach LLMs to reason abstractly rather than rely on surface patterns. First, it identifies the key variables in a problem and replaces them with symbolic placeholders. Then, using specially constructed data (GranulAR), the model learns to reason step by step with these abstract symbols. Next, it retrieves the general reasoning structure (the abstraction) from the symbolic answer. Finally, it instantiates this abstraction with the original values to compute the correct answer. Reinforcement learning with two rewards, one for correctness and one for symbolic similarity, further improves the model's ability to produce accurate, context-independent reasoning patterns.
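The minimal Python sketch below shows how these four steps might fit together on a toy problem, with the model's role stubbed out. The function names and the reward combination are illustrative assumptions rather than the authors' implementation; in AbstRaL, the symbolic reasoning in steps 2 and 3 is produced by the LLM trained on GranulAR-style rationales.

```python
import re

def abstract_question(question: str) -> tuple[str, dict]:
    """Step 1: detect key numeric variables and swap in placeholders x0, x1, ..."""
    values: dict[str, int] = {}
    def repl(match: re.Match) -> str:
        sym = f"x{len(values)}"
        values[sym] = int(match.group())
        return sym
    return re.sub(r"\d+", repl, question), values

def symbolic_reasoning(abstract_q: str) -> str:
    """Steps 2-3: reason over the symbols and return the answer's abstraction.
    In AbstRaL this is the trained LLM's job; it is hard-coded here so the
    toy pipeline runs end to end."""
    return "x0 + x1"

def instantiate(abstraction: str, values: dict) -> int:
    """Step 4: ground the abstraction with the original values. The paper
    connects abstractions to symbolic tools; a restricted eval stands in here."""
    return eval(abstraction, {"__builtins__": {}}, dict(values))

def rl_reward(pred_ans: int, gold_ans: int, pred_abs: str, gold_abs: str) -> float:
    """The two rewards described above, sketched: one term for answer
    correctness, one for similarity of the predicted abstraction to a
    reference (exact string match is a crude stand-in for the paper's
    symbolic-similarity measure)."""
    correctness = float(pred_ans == gold_ans)
    similarity = float(pred_abs.replace(" ", "") == gold_abs.replace(" ", ""))
    return correctness + similarity

question = "Alice has 7 apples and buys 5 more. How many apples does she have?"
abstract_q, values = abstract_question(question)
abstraction = symbolic_reasoning(abstract_q)
answer = instantiate(abstraction, values)
print(abstract_q)   # Alice has x0 apples and buys x1 more. ...
print(answer)       # 12
print(rl_reward(answer, 12, abstraction, "x0 + x1"))  # 2.0
```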

GSM8K variations reveal robustness across LLM sizes

The researchers evaluated AbstRaL on mathematical reasoning tasks using models such as Llama-3 and Qwen2, training them with GranulAR, a dataset that rewrites math problems in abstract symbolic form. This helps the models focus on structure rather than surface details. To test robustness, they used perturbed versions of GSM8K problems in which numbers, names, and wording were changed. Compared with baselines such as standard chain-of-thought prompting, AbstRaL showed stronger consistency and smaller accuracy drops on these variations. Especially for smaller models, it improved reliability on rephrased inputs. The results show that teaching models to abstract makes them more adaptable and less dependent on memorized patterns.
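The sketch below shows one generic way this kind of robustness measurement can be set up: compare accuracy on original versus perturbed problem sets and report the drop. `model_answer` is a placeholder for whatever inference call is used, and the drop statistic is a common choice, not necessarily the paper's exact metric.

```python
from typing import Callable

def accuracy(model_answer: Callable[[str], int],
             items: list[tuple[str, int]]) -> float:
    """Fraction of (question, gold_answer) pairs the model gets right."""
    return sum(model_answer(q) == gold for q, gold in items) / len(items)

def robustness_report(model_answer: Callable[[str], int],
                      original: list[tuple[str, int]],
                      perturbed: list[tuple[str, int]]) -> dict:
    """Accuracy on clean vs. perturbed GSM8K-style items; a robust
    (e.g., abstraction-trained) model should show a small drop."""
    acc_orig = accuracy(model_answer, original)
    acc_pert = accuracy(model_answer, perturbed)
    return {"original": acc_orig, "perturbed": acc_pert, "drop": acc_orig - acc_pert}

# Example with a trivial stand-in "model" that always answers 12:
report = robustness_report(
    lambda q: 12,
    original=[("Alice has 7 apples and buys 5 more. Total?", 12)],
    perturbed=[("Bob has 9 apples and buys 4 more. Total?", 13)],
)
print(report)  # {'original': 1.0, 'perturbed': 0.0, 'drop': 1.0}
```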

Reinforcing abstract thinking in LLMs produces robust reasoning

In short, AbstRaL is a method designed to enhance abstract reasoning in LLMs, making them more resilient to superficial changes in problems. Unlike traditional fine-tuning or data augmentation, AbstRaL uses reinforcement learning to train models on GranulAR rationales that mix Socratic chain-of-thought with detailed abstraction. This approach helps models strip away surface-level distractions and connect better with symbolic tools. On challenging GSM8K perturbation benchmarks, AbstRaL markedly reduces performance degradation under distribution shifts, particularly in smaller models. The study shows that learning to abstract improves reasoning robustness more effectively than relying on direct supervision alone.


Check out the Paper. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
