Researchers from Shanghai Jiao Tong University propose OctoThinker for reinforcement learning-scalable LLM development

Introduction: Reinforcement learning progress through chain-of-thought prompting
LLMs have shown outstanding progress on complex reasoning tasks by combining chain-of-thought (CoT) prompting with large-scale reinforcement learning (RL). Models such as DeepSeek-R1-Zero demonstrate strong reasoning ability by applying RL directly to a base model. Similarly, methods such as SimpleRL-Zoo and Open-Reasoner-Zero show improvements on smaller models like the Qwen series. However, achieving this success across different base-model families remains a challenge. Moreover, the difficulty of applying R1-Zero-style training to base models such as the Llama series raises a fundamental question about which underlying factors cause different base models to behave inconsistently during reinforcement learning.
Limitations of RL scaling on Llama models
OpenAI’s o1 and o3 and DeepSeek’s R1 have achieved breakthroughs on competition-level mathematical problems through large-scale RL, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these efforts are largely limited to the Qwen model family, and the results are difficult to reproduce on families such as Llama. The lack of transparency in pre-training pipelines makes it hard to understand how pre-training affects RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Projects such as OpenWebMath, MathPile, InfiMM-WebMath, and FineMath have curated high-quality mathematical pre-training corpora, but these remain limited in scale to under 100B tokens.
Exploring mid-training with a Stable-then-Decay strategy
Researchers at Shanghai Jiao Tong University investigated how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights: first, high-quality mathematical corpora such as MegaMath-Web-Pro improve both the base model and RL outcomes. Second, using QA-style data, especially examples with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability into RL training. Finally, scaling up mid-training leads to stronger downstream RL performance. The researchers introduced a two-stage mid-training strategy called Stable-then-Decay, in which the base model is first trained on 200B tokens, followed by 20B tokens across three CoT-focused branches, yielding the OctoThinker models, which show strong RL compatibility.
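To make the two-stage schedule concrete, the sketch below shows what a Stable-then-Decay learning-rate schedule could look like in Python: a constant learning rate across the 200B-token stable phase, then a cosine decay across a 20B-token branch phase. Only the token budgets come from the article; the peak and final learning rates and the cosine shape are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Minimal sketch of a "Stable-then-Decay" mid-training schedule.
# Assumptions: a constant learning rate during the 200B-token stable phase and
# a cosine decay during the 20B-token branch phase; the LR values below are
# illustrative and not taken from the paper.
import math

STABLE_TOKENS = 200e9   # stable phase budget: 200B tokens (from the article)
DECAY_TOKENS = 20e9     # decay phase budget: 20B tokens per CoT-focused branch
PEAK_LR = 3e-4          # assumed peak learning rate
FINAL_LR = 3e-5         # assumed final learning rate after decay


def mid_training_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` tokens of mid-training."""
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR  # phase 1: hold the learning rate constant
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine  # phase 2: cosine decay


if __name__ == "__main__":
    for t in (50e9, 200e9, 205e9, 210e9, 220e9):
        print(f"{t / 1e9:>5.0f}B tokens -> lr {mid_training_lr(t):.2e}")
```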
RL configuration and benchmark evaluation
The researchers used the MATH8K dataset for RL training prompts. The configuration included a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments performed on the Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, base language models use few-shot prompting, while RL-tuned models are evaluated zero-shot on indicator tasks including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, the Qwen models exhibit increasing response lengths that remain reasonable throughout, whereas Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further shows that the RL-tuned Qwen2.5-3B improves across benchmarks, while Llama-3.2-3B achieves only marginal gains.
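For readers who want the setup at a glance, the sketch below collects the reported hyperparameters into a plain configuration object. The field names are assumptions chosen for readability, not the API of any specific RL library; only the numeric values and the model, dataset, and benchmark names come from the article.

```python
# Illustrative summary of the RL setup described above.
# Field names are hypothetical; numeric values and names are from the article.
from dataclasses import dataclass


@dataclass
class RLConfig:
    prompt_dataset: str = "MATH8K"                    # RL training prompts
    base_models: tuple = ("Llama-3.2-3B-Base", "Qwen2.5-3B-Base")
    global_batch_size: int = 128                      # prompts per RL step
    rollouts_per_query: int = 16                      # sampled responses per prompt
    ppo_mini_batch_size: int = 64                     # mini-batch for PPO updates
    eval_benchmarks: tuple = ("GSM8K", "MATH500", "OlympiadBench", "AMC23")


if __name__ == "__main__":
    print(RLConfig())
```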
OctoThinker outperforms Llama in RL compatibility
When evaluated on 13 mathematical benchmarks, each OctoThinker branch showed a 10%-20% improvement over the original Llama base model, with consistent gains over the stable-stage model across all sizes. The OctoThinker-Zero family revealed diverse thinking behaviors during RL scaling, with the OctoThinker-Long variant performing particularly well. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperformed the original Llama-3.2-3B model and reached performance parity with Qwen2.5-3B, a model known for its strong reasoning capabilities and extensive pre-training. The hybrid and short branches showed slightly lower performance, especially on challenging benchmarks.
Conclusion and future work: Foundation models built for RL readiness
This paper examines why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in the OctoThinker models. Future research directions include:
- Curating higher-quality mathematical corpora to improve mid-training.
- Creating RL-friendly base models using open recipes, without distillation from long CoT reasoning models.
- Separating the QA format and content to understand their individual contributions.
- Extending the OctoThinker family with new branches, such as tool-integrated reasoning.
Check out the Paper, Hugging Face page, and GitHub page. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.
