Google AI releases Supervised Reinforcement Learning (SRL): a step-wise framework that uses expert trajectories to teach small language models to reason through hard problems

How can a small model learn to solve tasks it currently cannot solve, without falling back on rote imitation or relying on correct rollouts? A team of researchers from Google Cloud AI Research and UCLA has released Supervised Reinforcement Learning (SRL), a training framework that enables 7B-scale models to actually learn from very hard math and agentic trajectories that ordinary supervised fine-tuning (SFT) and outcome-based reinforcement learning with verifiable rewards (RLVR) cannot learn from.

Even with strong teacher traces, small open-source models such as Qwen2.5 7B Instruct cannot solve the hardest problems in s1K-1.1. If supervised fine-tuning is applied to the full DeepSeek R1-style solutions, the model imitates them token by token; the sequences are very long, the dataset has only 1,000 items, and the final scores end up lower than the base model.

The core idea of Supervised Reinforcement Learning (SRL)

Supervised Reinforcement Learning (SRL) retains RL-style optimization but injects supervision through the reward channel instead of the loss. Each expert trajectory in s1K-1.1 is parsed into a sequence of actions. For every prefix of that sequence, the research team creates a new training example: the model first generates a private reasoning span, then outputs the action for that step, and only the action is compared with the teacher's action using a difflib-based sequence similarity measure. The rewards are dense because every step contributes a signal, even when the final answer is wrong. The reasoning text itself is unconstrained, so the model can search its own chain of thought without being forced to copy teacher tokens.
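To make the step-wise reward concrete, here is a minimal Python sketch of the prefix construction and the difflib-based similarity described above; the helper names and the toy trajectory are assumptions for illustration, not the research team's actual code.

```python
import difflib

# A toy expert trajectory already parsed into a sequence of actions
# (illustrative only, not from s1K-1.1).
expert_actions = [
    "Let x be the number of apples.",    # step 1 of a toy teacher solution
    "Set up the equation 3x + 2 = 14.",  # step 2
    "Solve to get x = 4.",               # step 3
]

def build_step_examples(actions):
    """Turn one expert trajectory into per-step training examples:
    each prefix of expert actions is the context for the next action."""
    return [
        {"context": actions[:t], "target_action": actions[t]}
        for t in range(len(actions))
    ]

def step_reward(model_action: str, expert_action: str) -> float:
    """Dense reward: sequence similarity between the model's emitted action
    and the teacher's action for this step. The free-form reasoning the model
    generates before the action is not scored."""
    return difflib.SequenceMatcher(None, model_action, expert_action).ratio()

examples = build_step_examples(expert_actions)
print(len(examples))                                          # 3 step-wise examples
print(step_reward("Solve to get x = 4.", expert_actions[2]))  # 1.0, exact match
print(step_reward("Hence x equals 4.", expert_actions[2]))    # partial credit < 1.0
```

Because every prefix yields its own example and its own similarity score, the model is rewarded for partially correct steps instead of only for a fully correct final answer.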

Math results

All models are initialized from Qwen2.5 7B Instruct and trained on the same s1K-1.1 set in DeepSeek R1 format, so the comparison is clean. The specific numbers in Table 1 are:

  • Base Qwen2.5 7B Instruct: AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7.
  • SRL: AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3.
  • SRL then RLVR: AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0.

The key point is that SRL alone removes the SFT degradation and improves AIME24 and AIME25, and when RLVR is run after SRL, the pipeline reaches the best open-source scores in the study. The research team is explicit that the best recipe is SRL followed by RLVR, not SRL in isolation.

Software engineering results

The research team also applied SRL to Qwen2.5 Coder 7B Instruct, using 5,000 verified agent trajectories generated by Claude 3.7 Sonnet. Each trajectory is decomposed into step-wise instances, yielding about 134,000 step items in total. Evaluation is on SWE-Bench Verified. The base model reaches a 5.8% resolve rate in oracle file edit mode and 3.2% end to end. SWE-Gym 7B reaches 8.4% and 4.2%. SRL reaches 14.8% and 8.6%, roughly 2x the base model and clearly above the SFT baseline. A rough sketch of the decomposition follows.
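As an illustration of how 5,000 trajectories can expand into roughly 134,000 step items, the hypothetical helper below turns one verified agent trajectory into per-step training instances; the field names and the demo trajectory are assumptions, not the paper's actual data format.

```python
from typing import Dict, List

def expand_trajectory(task: str, actions: List[str]) -> List[Dict]:
    """Hypothetical expansion: every step of an expert agent trajectory becomes
    one training item whose target is the single action taken at that step."""
    return [
        {
            "task": task,                  # e.g. the SWE-Bench Verified issue text
            "prior_actions": actions[:t],  # expert actions already taken
            "target_action": actions[t],   # the action rewarded by similarity
        }
        for t in range(len(actions))
    ]

# One verified trajectory with 4 agent actions yields 4 step items; across
# thousands of trajectories the step count multiplies into the ~134k total.
demo = expand_trajectory(
    "Fix the off-by-one bug in pagination",
    ["open utils/paginate.py", "view lines 40-60", "edit line 52", "run tests"],
)
print(len(demo))  # 4
```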

Main points

  1. SRL reformulates hard reasoning as step-wise action generation: the model first produces an internal monologue, then outputs a single action, and only that action is rewarded with a sequence similarity score, so the model receives a signal even when the final answer is wrong.
  2. SRL runs on the same DeepSeek R1-format s1K-1.1 data as SFT and RLVR, but unlike SFT it does not overfit to long demonstrations, and unlike RLVR it does not collapse when no rollout is correct.
  3. On math, the exact sequence that gave the strongest results in the study is to initialize Qwen2.5 7B Instruct with SRL and then apply RLVR, which improves the reasoning benchmarks over either method alone.
  4. The same SRL recipe generalizes to agentic software engineering, using 5,000 verified trajectories from Claude 3.7 Sonnet (20250219) and lifting SWE-Bench Verified well above the base Qwen2.5 Coder 7B Instruct and the SFT-style SWE-Gym 7B baselines.
  5. Compared with other step-wise reinforcement learning methods that require additional reward models, SRL keeps the GRPO-style objective and uses only expert actions plus a lightweight string similarity, making it easy to run on small, hard datasets (see the sketch after this list).
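As a sketch of how the step-wise similarity rewards could plug into a GRPO-style update without a separate reward model, the snippet below normalizes a group of sampled-action rewards into group-relative advantages; this is an assumption about how the pieces fit together, not the paper's implementation.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-relative advantages: each sampled action's similarity
    reward is centered and scaled within its own rollout group, so no learned
    reward model is needed."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean_r) / std_r for r in rewards]

# Example: difflib similarity rewards for G = 4 sampled actions at one step.
rewards = [0.92, 0.40, 0.75, 0.10]
print(group_advantages(rewards))  # higher-similarity actions get positive advantage
```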

Supervised Reinforcement Learning (SRL) is the research team's practical contribution. It keeps the GRPO-style reinforcement learning setup but replaces fragile outcome-level rewards with supervised step-wise rewards computed directly from expert trajectories, so the model always receives an informative signal, even in hard regimes where both RLVR and SFT stagnate. Importantly, the research team demonstrated SRL on math and, with the same recipe, on SWE-Bench Verified, and showed that the strongest configuration is SRL followed by RLVR rather than either one alone. This makes SRL a realistic approach for teaching open models hard tasks. Overall, SRL is a clear bridge between process supervision and RL that open-model teams can adopt immediately.





