RA3: Mid-training with temporal action abstractions to speed up post-training reinforcement learning (RL) in code LLMs
TL;DR: A new Apple study formalizes what mid-training should accomplish before post-training reinforcement learning (RL) and introduces RA3 (Reasoning as Action Abstractions), an EM-style procedure that learns temporally consistent latent actions from expert trajectories and then fine-tunes on the abstraction-annotated traces. The analysis indicates that mid-training should (1) prune the action space to a compact, near-optimal subset and (2) shorten the effective planning horizon, both of which improve RL convergence. Empirically, RA3 improves HumanEval/MBPP by about 8/4 points over base/NTP baselines and accelerates RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
What are the research results?
The research team offers the first formal account of how mid-training shapes post-training RL, decomposing the outcome into two factors: (i) pruning efficiency, how well mid-training selects a compact, near-optimal subset of actions to shape the initial policy prior, and (ii) RL convergence, how quickly post-training can improve the policy within that restricted space. The analysis argues that mid-training is most effective when the decision space is compact and the effective planning horizon is short, which favors temporal abstractions over raw next-token actions.

Algorithm: RA3 in one pass
RA3 derives a sequential variational lower bound (a temporal ELBO) and optimizes it with an EM-like loop:
- E-step (latent discovery): use RL to infer temporally consistent latent structures (abstractions) that align with the expert sequences.
- M-step (model update): run next-token prediction on the bootstrapped, latent-annotated traces, baking the discovered abstractions into the model's policy.
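The alternating structure above can be sketched with a toy count-based model. This is an illustrative hard-EM stand-in (a greedy argmax E-step in place of the paper's RL-based latent discovery); all names, data, and the two-latent setup are hypothetical, not from the paper:

```python
# Toy sketch of an RA3-style EM loop over latent action abstractions.
# Hard-EM stand-in: argmax latent assignment replaces RL-based discovery.
from collections import defaultdict

LATENTS = ["plan", "write"]          # hypothetical action abstractions
TRAJS = [["def", "f", ":", "ret"],   # toy stand-ins for expert code traces
         ["def", "g", ":", "ret"],
         ["def", "f", ":", "ret"]]

# Model: P(token | prev_token, latent), additive-smoothed counts.
counts = defaultdict(lambda: defaultdict(float))

def prob(tok, prev, z):
    c = counts[(prev, z)]
    return (c[tok] + 1.0) / (sum(c.values()) + 10.0)  # smoothing over ~10-token vocab

def e_step(traj):
    # Assign each step the latent that maximizes current likelihood (hard EM).
    labels, prev = [], "<s>"
    for tok in traj:
        z = max(LATENTS, key=lambda z: prob(tok, prev, z))
        labels.append(z)
        prev = tok
    return labels

def m_step(traj, labels):
    # Next-token prediction on the latent-annotated trace.
    prev = "<s>"
    for tok, z in zip(traj, labels):
        counts[(prev, z)][tok] += 1.0
        prev = tok

for _ in range(3):  # a few EM rounds
    annotated = [(t, e_step(t)) for t in TRAJS]
    for t, labels in annotated:
        m_step(t, labels)

labels = e_step(TRAJS[0])  # final abstraction assignment for the first trace
```

In RA3 itself, the E-step runs RL to discover the latents and the M-step fine-tunes the full LLM on the annotated traces; the toy above only mirrors the alternating structure.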
Results: Code Generation and RLVR
On Python code tasks, the research team reports that across multiple base models, RA3 improves average pass@k on HumanEval by about 8 points and on MBPP by about 4 points over the base model and an NTP mid-training baseline. In post-training, RLVR converges faster and reaches higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces when initialized from RA3. These are mid-training and post-training effects, respectively; the evaluation scope is code generation.
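For reference, pass@k on these benchmarks is conventionally computed with the unbiased estimator introduced alongside HumanEval: given n sampled completions per problem, c of which pass the tests, it estimates the probability that at least one of k draws is correct:

```python
# Standard unbiased pass@k estimator (Chen et al., HumanEval).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one correct among k draws without replacement
    from n samples, of which c are correct)."""
    if n - c < k:
        return 1.0  # not enough failures to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=10 samples and c=3 correct, pass@1 is 1 - 7/10 = 0.3; benchmark scores average this quantity over all problems.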
Main points
- The research team formalizes mid-training through two determinants, pruning efficiency and impact on RL convergence, arguing that effectiveness increases when the decision space is compact and the effective horizon is short.
- RA3 optimizes a sequential variational lower bound by iteratively discovering temporally consistent latent structures with RL and then fine-tuning on the bootstrapped traces (EM-style).
- On code generation, RA3 reports average pass@k gains of roughly +8 (HumanEval) and +4 (MBPP) across multiple model scales over base/NTP mid-training baselines.
- Initializing post-training from RA3 accelerates RLVR convergence and improves asymptotic performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
RA3's contribution is specific and narrow: it formalizes mid-training around two determinants (pruning efficiency and RL convergence) and operationalizes them via a temporal ELBO optimized in an EM loop to learn temporally consistent action abstractions before RLVR. The researchers report average pass@k gains of ~+8 (HumanEval) and ~+4 (MBPP) relative to base/NTP, and faster RLVR convergence on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
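For readers unfamiliar with the objective class, a sequential variational lower bound over latent abstractions has the generic shape below; this is the standard form of such bounds, shown for intuition only, and not necessarily the paper's exact objective:

```latex
\log p_\theta(x_{1:T}) \;\ge\;
\mathbb{E}_{q_\phi(z_{1:T}\mid x_{1:T})}
\!\left[\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t},\, z_t\right)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z_{1:T}\mid x_{1:T}) \,\|\, p_\theta(z_{1:T})\right)
```

Here \(x_{1:T}\) is an expert token trace, \(z_{1:T}\) the latent abstractions, and the EM loop alternates between improving the posterior \(q_\phi\) (E-step) and the generative model \(p_\theta\) (M-step).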
Check out the technical paper for full details.

Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.