
Can we improve Llama 3's reasoning through post-training alone? ASTRO shows +16% to +20% benchmark gains

Improving the reasoning capabilities of large language models (LLMs) without architectural changes is a core challenge in advancing AI alignment and usability. Researchers from Meta AI and the University of Washington have introduced ASTRO (Autoregressive Search-Taught Reasoner), a novel post-training framework designed to enhance reasoning in Llama-3.1-70B-Instruct. ASTRO is unique in teaching models to perform in-context search, self-reflection, and backtracking, mechanisms often associated with human problem solving and traditional symbolic search algorithms. Through this approach, ASTRO delivers substantial gains in Llama 3's mathematical performance across several competitive benchmarks:

  • MATH-500: 65.8% ➝ 81.8%
  • AMC 2023: 37.5% ➝ 64.4%
  • AIME 2024: 10.0% ➝ 30.0%

Search-guided chain-of-thought generation

ASTRO's methodology begins with a Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. This search explores both correct and incorrect reasoning paths. The key innovation is procedure cloning: entire search trees are linearized into long chains of thought (CoT) that naturally encode both failure and recovery through self-reflection and backtracking. These linearized traces are rewritten in natural language and used as the basis for supervised fine-tuning (SFT).

This yields a model that not only solves problems step by step but also re-evaluates its own trajectory, often backtracking after self-assessment to correct intermediate reasoning errors. For example, the model might insert phrases like "Let's go back to where we set up the equation" when its internal confidence drops.
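To make the procedure-cloning idea concrete, here is a minimal sketch of how a search tree containing both failed and successful branches might be linearized into one long chain of thought with explicit backtracking phrases. The node structure, traversal order, and wording are illustrative assumptions, not ASTRO's actual pipeline.

```python
# Sketch: linearize a search tree into a long CoT that keeps failed
# branches and inserts self-reflection / backtracking phrases after them.
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                        # natural-language reasoning step
    correct: bool                    # whether this branch ultimately succeeds
    children: list["Node"] = field(default_factory=list)

def linearize(node: Node, trace: list[str]) -> None:
    """Depth-first walk that preserves failures and marks recoveries."""
    trace.append(node.step)
    for child in node.children:
        linearize(child, trace)
        if not child.correct:
            trace.append("Wait, this doesn't look right. Let's go back to the previous step.")

root = Node("Set up the equation for the problem: (x - 1)^2 = 4.", True, [
    Node("Attempt 1: expand the square directly; the terms do not simplify.", False),
    Node("Attempt 2: take square roots: x - 1 = ±2, so x = 3 or x = -1.", True),
])

trace: list[str] = []
linearize(root, trace)
print("\n".join(trace))   # one long CoT containing failure, reflection, and recovery
```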

Supervised fine-tuning: injecting search priors

ASTRO fine-tunes Llama-3.1-70B-Instruct on 36.1K curated CoT solutions drawn from MATH, AMC/AIME, and AoPS-style datasets. The ASTRO-SFT model achieves:

  • MATH-500: 69.6%
  • AMC 2023: 51.9%
  • AIME 2024: 16.3%

These scores are competitive with or exceed those of baselines and SPOC/Step-KTO variants trained without explicit search priors. Importantly, even SFT alone improves performance by exposing the model to search-structured reasoning data, without any reinforcement learning.
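For orientation, the sketch below shows what supervised fine-tuning on such search-linearized traces could look like with standard PyTorch and Hugging Face tooling. The model name (a smaller 8B stand-in), the toy dataset, and the hyperparameters are placeholders; ASTRO's reported setup fine-tunes Llama-3.1-70B-Instruct on 36.1K curated traces with its own recipe.

```python
# Minimal SFT sketch on search-linearized CoT traces (illustrative only).
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # smaller stand-in; any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = AdamW(model.parameters(), lr=1e-5)

# Each example pairs a math problem with a linearized search trace (long CoT).
dataset = [
    {"prompt": "Solve: (x - 1)^2 = 4.",
     "solution": "Attempt 1: expand directly... Wait, let's go back. "
                 "Attempt 2: take square roots: x - 1 = ±2, so x = 3 or x = -1."},
]

model.train()
for example in dataset:
    text = example["prompt"] + "\n" + example["solution"] + tok.eos_token
    batch = tok(text, return_tensors="pt")
    # Mask (approximately) the prompt tokens so loss falls only on the CoT solution.
    labels = batch["input_ids"].clone()
    prompt_len = len(tok(example["prompt"] + "\n")["input_ids"])
    labels[:, :prompt_len] = -100
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```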

Reinforcement learning with search-aware initialization

ASTRO then performs reinforcement learning (RL) by initializing from the SFT checkpoint and running an RL loop with a modified Group Relative Policy Optimization (GRPO). Unlike standard preference-based RL, ASTRO uses verifiable reward signals (+1 for correct, -1 for incorrect) over 8.7K moderately difficult prompts. During training, the model's CoT generations grow longer, from roughly 1.8K to 6K tokens, reflecting deeper internal exploration.
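A simplified sketch of the verifiable-reward and group-relative advantage computation behind a GRPO-style update is shown below; the verifier, group size, and update details are schematic assumptions, not ASTRO's exact modified GRPO.

```python
# Sketch: verifiable reward (+1 / -1) plus group-relative advantages.
import torch

def verifiable_reward(answer: str, gold: str) -> float:
    """+1 if the final answer matches the ground truth, else -1."""
    return 1.0 if answer.strip() == gold.strip() else -1.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within the group of samples drawn for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# For one prompt, sample a group of candidate solutions and score them.
gold = "x = 3"
group_answers = ["x = 3", "x = -1", "x = 3", "x = 2"]       # stand-in samples
rewards = torch.tensor([verifiable_reward(a, gold) for a in group_answers])
advantages = grpo_advantages(rewards)                        # roughly [0.87, -0.87, 0.87, -0.87]

# Each sampled sequence's token log-probabilities would then be reweighted
# by its sequence-level advantage in a clipped policy-gradient loss.
```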

The resulting ASTRO-RL model achieves:

  • MATH-500: 81.8%
  • AMC 2023: 64.4%
  • AIME 2024: 30.0%

These results rival or exceed those of models with larger parameter counts and confirm the importance of ASTRO's search-aware initialization.

Backtracking behavior is related to reasoning success

A striking empirical observation is the positive correlation between backtracking frequency and performance. As training progresses, ASTRO-RL exhibits more self-corrective actions and deeper exploration. Pearson correlation coefficients across benchmarks exceed 0.8, indicating that self-reflection and backtracking are not merely cosmetic behaviors but are functionally tied to better accuracy.
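As a rough illustration, such an analysis can be reproduced by tracking backtracking frequency and accuracy across training checkpoints and computing their Pearson correlation; the numbers below are made-up placeholders, not ASTRO's data.

```python
# Sketch: correlate backtracking frequency with benchmark accuracy.
import numpy as np

# Hypothetical per-checkpoint statistics (placeholders).
backtracks_per_solution = np.array([0.4, 0.9, 1.6, 2.3, 3.1])
benchmark_accuracy      = np.array([0.66, 0.71, 0.75, 0.79, 0.82])

r = np.corrcoef(backtracks_per_solution, benchmark_accuracy)[0, 1]
print(f"Pearson r = {r:.2f}")   # the paper reports r > 0.8 across benchmarks
```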

Comparative insights and broader impacts

Controlled experiments compare ASTRO with models trained on direct CoT solutions (no search priors). Even when trained on the same problem sets and search trees, ASTRO consistently outperforms them. For example, ASTRO-RL beats Direct-RL by:

  • +2.0% on MATH-500
  • +3.9% on AMC 2023
  • +2.9% on AIME 2024

In addition, ASTRO's outputs can be visualized as directed graphs, with nodes as reasoning steps and edges capturing transitions, reflections, and corrections, which improves interpretability.
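A minimal sketch of this graph view follows, assuming a simple hand-labeled parse of a trace into nodes and edge types; the labels and parsing heuristic are illustrative, not ASTRO's tooling.

```python
# Sketch: represent a reasoning trace as a directed graph of steps.
import networkx as nx

steps = [
    ("s0", "Set up the equation."),
    ("s1", "Attempt 1: expand directly."),
    ("s2", "Wait, this doesn't work. Go back."),
    ("s3", "Attempt 2: substitute u = x - 1."),
    ("s4", "Final answer: x = 3."),
]

g = nx.DiGraph()
g.add_nodes_from((sid, {"text": text}) for sid, text in steps)
g.add_edge("s0", "s1", kind="transition")
g.add_edge("s1", "s2", kind="self_reflection")
g.add_edge("s2", "s0", kind="backtrack")      # the correction returns to an earlier node
g.add_edge("s0", "s3", kind="transition")
g.add_edge("s3", "s4", kind="transition")

print(g.number_of_nodes(), "steps,", g.number_of_edges(), "edges")
```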

ASTRO key takeaways

Conclusion

ASTRO shows that LLMs like Llama 3 can learn to reason more effectively, not through larger models or longer training, but through principled post-training techniques. By imitating search algorithms in natural language, ASTRO enables models to think before answering, doubt their own steps, and correct themselves mid-reasoning. The framework sets a new benchmark for fine-tuning open LLMs with search-inspired behaviors.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100K+ ML SubReddit and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news in a way that is both technically sound and understandable to a wide audience. The platform has over 2 million monthly views, demonstrating its popularity among readers.
