Microsoft releases Agent Lightning: a new AI framework that enables reinforcement learning (RL)-based LLM training for any AI agent

How do you turn real agent traces into reinforcement learning (RL) transitions that improve a policy LLM without changing the existing agent stack? The Microsoft AI team released Agent Lightning to answer that question, including for multi-agent systems. Agent Lightning is an open-source framework that makes reinforcement learning applicable to any AI agent without a rewrite. It separates training from execution, defines a unified trace format, and introduces LightningRL, a hierarchical method that converts complex agent runs into transitions that standard single-turn RL trainers can optimize.

What does Agent Lightning do?

The framework models an agent as a decision process. It formalizes the agent as a partially observable Markov decision process, where the observation is the current input to the policy LLM, the action is the model call, and the reward can be terminal or intermediate. From each run it extracts only the calls made by the policy model, together with their inputs, outputs, and rewards. This strips away the noise of the surrounding framework and yields clean transitions for training.

LightningRL performs credit assignment over multi-step episodes, then optimizes the policy with a single-turn RL objective. The research team describes compatibility with single-turn RL methods, and in practice teams often use trainers that implement PPO or GRPO, such as VeRL, which fit this interface.
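To make the idea concrete, here is a minimal sketch of the simplest possible credit-assignment rule: spreading an episode-level return uniformly across the policy model's calls. This is an illustration only, not LightningRL's actual algorithm, which supports more refined schemes.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """A single-turn training example: one prompt, one response, one scalar reward."""
    prompt: str
    response: str
    reward: float

def assign_credit(calls: list[tuple[str, str]], episode_return: float) -> list[Transition]:
    """Turn a multi-step agent episode into single-turn transitions by sharing
    the final return equally across the policy model's calls."""
    per_call_reward = episode_return / max(len(calls), 1)
    return [Transition(prompt, response, per_call_reward) for prompt, response in calls]

# Example: a two-step episode that ended with a return of 1.0.
episode = [("Plan the query", "I will join orders and regions."),
           ("Write the SQL", "SELECT region, SUM(amount) FROM orders GROUP BY region;")]
transitions = assign_credit(episode, episode_return=1.0)
```

Each resulting transition is an ordinary prompt/response pair with a scalar reward, which is exactly what a single-turn PPO or GRPO trainer expects.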

System architecture

Agent Lightning uses Training Agent Disaggregation. A Lightning Server runs training and serving, and exposes an OpenAI-like API for the updated model. A Lightning Client runs the agent runtime where it already lives, captures traces of prompts, tool calls, and rewards, and streams them back to the server. This keeps tools, browsers, shells, and other dependencies close to production, while GPU training stays on the server side.
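Because the server exposes an OpenAI-like API, an existing agent can keep using its usual client and simply point at the training server's endpoint. A hedged sketch follows; the URL, API key, and model identifier are placeholders, not the framework's real defaults.

```python
from openai import OpenAI

# Hypothetical endpoint served by the Lightning Server; the actual host,
# port, and model name depend on your deployment.
client = OpenAI(base_url="http://lightning-server:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # the policy model being trained and served
    messages=[{"role": "user", "content": "Write a SQL query for total sales by region."}],
)
print(response.choices[0].message.content)
```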

The runtime supports two trace paths. The default uses OpenTelemetry spans, so agent telemetry can flow through standard collectors. There is also a lightweight embedded tracer for teams that do not want to deploy OpenTelemetry. Both paths land in the same store that feeds training.
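For illustration, this is how an agent runtime could emit a span through the standard OpenTelemetry Python API. The attribute names are illustrative, not Agent Lightning's actual schema, and the model call is stubbed out.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def call_policy_model(prompt: str) -> str:
    # Record one policy-model call as a span; an OpenTelemetry collector can
    # forward it to the store that the training side reads from.
    with tracer.start_as_current_span("policy_llm_call") as span:
        span.set_attribute("llm.prompt", prompt)
        response = "..."  # placeholder for the actual model call
        span.set_attribute("llm.response", response)
        span.set_attribute("reward", 0.0)  # intermediate reward, if any
        return response
```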

Unified data interface

Agent Lightning records each model call and each tool call as a span with inputs, outputs, and metadata. The algorithm layer then adapts these spans into ordered triplets of prompt, response, and reward. This selective extraction lets you optimize a single agent in a multi-agent workflow, or several agents at once, without touching the orchestration code. The same traces can also drive automatic prompt optimization or supervised fine-tuning.
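A minimal sketch of that selective extraction, assuming each span carries an agent name alongside its prompt, response, and reward (the field names are illustrative, not the framework's schema):

```python
def spans_to_triplets(spans: list[dict], agent_to_train: str) -> list[tuple[str, str, float]]:
    """Keep only LLM-call spans produced by the agent we want to optimize and
    flatten them into (prompt, response, reward) triplets, leaving the rest of
    the multi-agent workflow untouched."""
    triplets = []
    for span in spans:
        if span.get("agent") == agent_to_train and span.get("kind") == "llm_call":
            triplets.append((span["prompt"], span["response"], span.get("reward", 0.0)))
    return triplets

# Example: optimize only the "writer" in a writer/rewriter/checker workflow.
collected_spans = [
    {"agent": "writer",  "kind": "llm_call", "prompt": "Draft SQL ...", "response": "SELECT ...", "reward": 1.0},
    {"agent": "checker", "kind": "llm_call", "prompt": "Check SQL ...", "response": "Looks valid."},
]
writer_data = spans_to_triplets(collected_spans, agent_to_train="writer")
```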

Experiments and datasets

The research team reports results on three tasks. For text-to-SQL, the team uses the Spider benchmark. Spider contains more than 10,000 questions across 200 databases spanning 138 domains. The policy model is Llama 3.2 3B Instruct. The implementation uses LangChain with a writer agent, a rewriter agent, and a checker. The writer and rewriter are optimized, while the checker is kept fixed. The reward improves steadily during training and on the test set.

For retrieval-augmented generation, the setup uses the MuSiQue benchmark and a Wikipedia index of roughly 21 million documents. The retriever uses BGE embeddings with cosine similarity. The agent is built with the OpenAI Agents SDK. The reward is a weighted sum of a format score and an F1 correctness score. The reward curves show stable gains during training and evaluation with the same base model.
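The article does not give the exact weights, but a hedged sketch of a reward with this shape, using placeholder weights, looks like the following:

```python
def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not pred_tokens or not ref_tokens or not common:
        return 0.0
    overlap = sum(min(pred_tokens.count(t), ref_tokens.count(t)) for t in common)
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def rag_reward(output: str, reference: str, follows_format: bool,
               w_format: float = 0.2, w_correct: float = 0.8) -> float:
    """Weighted sum of a format score and an F1 correctness score.
    The weights here are placeholders, not the paper's values."""
    format_score = 1.0 if follows_format else 0.0
    return w_format * format_score + w_correct * f1_score(output, reference)
```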

For math tool use, the agent is built with AutoGen and calls a calculator tool. The dataset is Calc-X, and the base model is again Llama 3.2 3B Instruct. Training improves the model's ability to call the tool correctly and to integrate its results into the final answer.
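As a hedged illustration of the kind of tool involved (not the paper's actual implementation), a calculator tool in this setting is just a function the agent can call, with a terminal reward judged on the final numeric answer:

```python
import ast
import operator

# Minimal safe calculator: evaluates arithmetic expressions like "12 * (3 + 4)".
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv, ast.USub: operator.neg}

def calculator(expression: str) -> float:
    def _eval(node):
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

def answer_reward(predicted: str, gold: str) -> float:
    """Terminal reward: 1.0 if the final numeric answer matches the reference."""
    try:
        return float(abs(float(predicted) - float(gold)) < 1e-6)
    except ValueError:
        return 0.0
```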

Main points

  1. Agent Lightning uses Training Agent Disaggregation and a unified trace interface, so existing agents built on LangChain, the OpenAI Agents SDK, AutoGen, or CrewAI connect with near-zero code changes.
  2. LightningRL converts trajectories into transitions. It applies credit assignment to multi-step runs, then optimizes the policy with single-turn RL methods such as PPO or GRPO in standard trainers.
  3. Automatic intermediate rewards (AIR) provide dense feedback. AIR turns system signals such as tool return status into intermediate rewards, mitigating reward sparsity in long workflows.
  4. The study evaluates text-to-SQL on Spider, RAG on MuSiQue with a Wikipedia index using BGE embeddings and cosine similarity, and math tool use on Calc-X, all with Llama 3.2 3B Instruct as the base model.
  5. The runtime records traces via OpenTelemetry, streams them to the training server, and exposes OpenAI-compatible endpoints for the updated model, enabling scalable rollouts without moving the agent's tools.

Agent Lightning is a practical bridge between agent execution and reinforcement learning, not another framework rewrite. It formalizes agent runs as a Markov decision process (MDP), introduces LightningRL for credit assignment, and extracts transitions that feed a single-turn RL trainer. The Training Agent Disaggregation design separates the client that runs the agent from the server that trains and serves OpenAI-compatible endpoints, so teams keep their existing stack. Automatic intermediate rewards convert runtime signals into dense feedback, reducing reward sparsity in long workflows. Overall, Agent Lightning offers a clean, minimal-integration path that lets agents learn from their own traces.


Check out the paper and the GitHub repository for tutorials, code, and notebooks.


Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.

