Moonshot AI researchers release Seer: an online context learning system for fast synchronous reinforcement learning (RL) rollout

How do you keep reinforcement learning for a large reasoning model from stalling on a handful of very long, very slow rollouts while GPUs sit idle? A team of researchers from Moonshot AI and Tsinghua University has introduced Seer, a new online context learning system that targets specific system bottlenecks in reinforcement learning for large language models. In a synchronous, on-policy setting, the rollout phase dominates the cost of each iteration. Seer restructures this phase and reports a 74% to 97% improvement in rollout throughput and a 75% to 93% reduction in tail latency compared with a strong synchronous baseline, veRL.

Why is synchronous rollout slow for reasoning models?

Modern reasoning RL workloads produce long chain-of-thought style outputs. In the Seer experiments, the researchers applied GRPO to three different models: Moonlight, Qwen2 VL 72B, and Kimi K2. These workloads run on 32 compute nodes, each with 8 H800 GPUs. The three tasks used 32, 128, and 256 GPUs respectively, with 400, 600, and 800 prompts per iteration and 8 or 16 responses per prompt.

The maximum generation lengths are large. Moonlight is configured for 65,536 tokens, Qwen2 VL 72B for 40,960 tokens, and Kimi K2 for 98,304 tokens. As decoding proceeds, the KVCache of a long chain-of-thought request can grow from a few hundred megabytes to tens of gigabytes. This memory growth forces an instance to reduce concurrency or preempt requests, triggering expensive re-decodes.
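To make that scale concrete, the short sketch below estimates how a single request's KVCache grows with generated length. The layer count, KV-head count, head dimension, and FP16 storage are illustrative assumptions, not the published configurations of Moonlight, Qwen2 VL 72B, or Kimi K2.

```python
# Rough per-request KVCache estimate for a long chain-of-thought rollout.
# Model dimensions below are illustrative assumptions, not the actual
# Moonlight / Qwen2 VL 72B / Kimi K2 configurations.

def kvcache_bytes(num_tokens: int,
                  num_layers: int = 60,
                  num_kv_heads: int = 8,
                  head_dim: int = 128,
                  bytes_per_value: int = 2) -> int:
    """Bytes of KVCache for one request: a K and a V vector per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

for n in (1_024, 16_384, 98_304):
    print(f"{n:>6} tokens -> {kvcache_bytes(n) / 2**30:.1f} GiB")
# ~0.2 GiB at 1K tokens, ~3.8 GiB at 16K, ~22.5 GiB at 98K under these assumptions.
```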

The research team defines tail requests as the last 10% of requests to complete in a rollout. For Moonlight and Qwen2 VL 72B, this tail alone can consume 50% of the total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows down RL training.

Seer architecture on top of Mooncake and vLLM

Seer preserves the same RL algorithm as synchronous veRL. Each training iteration uses only data generated in the current rollout iteration, so the system retains on-policy behavior. Megatron is used for distributed optimization in the training phase. The rollout phase uses an internal implementation of vLLM as the inference engine.

To support aggressive request scheduling, Seer relies on a global KVCache pool built on the Mooncake disaggregated KVCache architecture used in Kimi production. Mooncake provides two-tier DRAM and SSD KV cache storage shared across inference nodes, which lets Seer migrate requests without recomputing prefill.
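As a rough mental model, the sketch below shows the kind of interface such a shared pool exposes to the rollout scheduler. The class and method names are hypothetical stand-ins, not Mooncake's actual API.

```python
# Hypothetical interface to a Mooncake-style global KVCache pool, as seen by
# the rollout scheduler. Names and signatures are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class GlobalKVCachePool:
    """Stand-in for a two-tier (DRAM + SSD) KVCache store shared by all nodes."""
    _store: dict = field(default_factory=dict)  # request_id -> serialized KV blocks

    def put(self, request_id: str, kv_blocks: bytes) -> None:
        # In the real system the blocks live in DRAM/SSD across nodes;
        # a dict is used here only to illustrate the contract.
        self._store[request_id] = kv_blocks

    def get(self, request_id: str) -> bytes | None:
        # An instance that picks up a migrated request fetches its KVCache
        # here instead of recomputing the prefill.
        return self._store.get(request_id)
```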

On top of this substrate, Seer introduces three key mechanisms:

  1. Segmented rollout
  2. Context-aware scheduling
  3. Adaptive grouped speculative decoding

These are orchestrated by a request buffer, a context manager, and a pool of inference engines connected to the global KVCache pool.

Segmented Rollout, fine-grained scheduling and migration

Traditional synchronous rollout assigns an entire GRPO group to one inference instance. A group is a set of requests that share the same prompt. Once assigned, the group stays on that instance until all responses are complete. Because output lengths vary widely, this leads to load imbalance and long-running stragglers.

Seer splits this work in two steps. It first breaks each group into individual requests. It then splits each request into chunks based on generation length. When the scheduler dispatches a request from the request buffer, it sets a small maximum token budget for the chunk, such as 8,000 tokens. After each chunk, the request is re-queued until it emits an end-of-sequence token or reaches its original maximum token limit.
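A minimal sketch of this chunked dispatch loop is shown below. It assumes a hypothetical `engine.generate` call that decodes at most a fixed number of new tokens before returning; it illustrates the control flow rather than Seer's actual implementation.

```python
# Sketch of segmented rollout: each request decodes in chunks and is re-queued
# after every chunk. `engine.generate` is a hypothetical call that decodes at
# most `max_new_tokens` tokens and reports whether EOS was emitted; it stands
# in for the vLLM-based engine, and `pool` is the global KVCache pool.
from collections import deque

CHUNK_MAX_TOKENS = 8_000  # per-chunk token budget mentioned above

def segmented_rollout(requests, engine, pool):
    queue = deque(requests)              # the request buffer
    finished = []
    while queue:
        req = queue.popleft()
        budget = min(CHUNK_MAX_TOKENS, req.max_tokens - req.generated_tokens)
        out = engine.generate(req, max_new_tokens=budget, kv_pool=pool)
        req.generated_tokens += out.num_new_tokens
        if out.hit_eos or req.generated_tokens >= req.max_tokens:
            finished.append(req)         # sequence complete
        else:
            queue.append(req)            # re-queue; its KVCache stays in the
                                         # pool, so any instance can resume it
    return finished
```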

Because KVCache is kept in the global KVCache pool, segmented requests can be moved between instances at chunk boundaries without re-running prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces wasted work and smooths KVCache usage across an iteration.

Context-aware scheduling using group length statistics

The research team observed that requests within the same group tend to have correlated output lengths. Seer uses this structure as online context. For each prompt group, it designates one request as a speculative request. The scheduler keeps speculative requests in a high-priority queue and serves them with a smallest-first policy based on the tokens generated so far. Short requests complete quickly and exit; long requests remain, identifying the groups that are likely tail candidates.

The context manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among the group's completed requests. If no request has completed, it uses the original maximum token limit as a conservative bound. Once a group's speculative request is running or has completed, Seer schedules the remaining requests at the group level using an approximately longest-first policy. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths ahead of time.
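The following sketch captures the gist of this policy: a conservative per-group length estimate that tightens as requests finish, speculative requests served smallest-first, and remaining requests ordered approximately longest-first. Class and function names are illustrative, not taken from Seer's code.

```python
# Sketch of context-aware scheduling driven by group-level length estimates.
from collections import defaultdict

class ContextManager:
    """Tracks an output-length estimate for every prompt group."""
    def __init__(self, groups):
        self.max_tokens = {g.id: g.max_tokens for g in groups}
        self.completed = defaultdict(list)        # group_id -> finished lengths

    def record_completion(self, group_id, generated_len):
        self.completed[group_id].append(generated_len)

    def estimate(self, group_id):
        done = self.completed[group_id]
        # Max generated length among finished requests, otherwise the
        # original max-token limit as a conservative bound.
        return max(done) if done else self.max_tokens[group_id]

def next_batch(speculative_reqs, regular_reqs, ctx, k):
    """Pick up to k requests: speculative requests smallest-first, then the
    remaining requests approximately longest-first by group estimate."""
    spec = sorted(speculative_reqs, key=lambda r: r.generated_tokens)
    rest = sorted(regular_reqs,
                  key=lambda r: ctx.estimate(r.group_id), reverse=True)
    return (spec + rest)[:k]
```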

Adaptive grouped speculative decoding

Seer adds adaptive packet speculative decoding to the first two components to speed up decoding, especially for long requests at the tail. It introduces the Distributed Group Draft Server (DGDS). DGDS maintains a compressed suffix tree for each group and aggregates token sequences from all requests in that group. The instance asynchronously appends the generated token to the DGDS, periodically fetches updated suffix trees and performs local speculative decoding based on shared mode statistics.

The system adapts the draft length and the number of draft paths based on model architecture, batch size, and measured acceptance length. For dense and mixture-of-experts models, it precomputes different speculation thresholds and uses them to bound draft depth for each batch. In later stages of a rollout, concurrency is lower, so Seer increases draft depth and enables multi-path drafting to raise the number of tokens accepted per step.
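The heuristic below sketches this kind of adaptation, with invented thresholds: shallow single-path drafts under high concurrency, deeper multi-path drafts once the batch shrinks and measured acceptance lengths grow.

```python
# Sketch of an adaptive draft-length heuristic in the spirit described above.
# All thresholds are invented for illustration.
def choose_draft_config(batch_size: int, mean_accepted_len: float,
                        base_depth: int = 4, max_depth: int = 16) -> dict:
    if batch_size > 64:
        # High concurrency: speculation competes with batched decode,
        # so keep drafts shallow and single-path.
        return {"draft_len": base_depth, "num_paths": 1}
    # Low concurrency (late, tail-dominated stage of the rollout): scale depth
    # with recently measured acceptance and allow multi-path drafts.
    depth = min(max_depth, max(base_depth, int(2 * mean_accepted_len)))
    return {"draft_len": depth, "num_paths": 4 if batch_size <= 8 else 2}
```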

Ablation results show that segmented rollout alone improves throughput by up to 35% over the baseline. Adding context-aware scheduling raises the speedup to 47% over the baseline. Enabling grouped speculative decoding brings the overall speedup over the baseline to 77% to 87% across the evaluated iterations.

End-to-end impact on reinforcement learning training

The research team evaluated Seer on three RL tasks built on Moonlight, Qwen2 VL 72B, and Kimi K2. They ran 10 rollout iterations for each task and measured output tokens per second and the completion time of each rollout. Relative to veRL using the same RL algorithm and a vLLM-based inference engine, Seer improves rollout throughput on these workloads by 74% to 97%.

Tail latency is reduced by 75% to 93%. For memory-constrained tasks, the baseline system spends up to half of its time on the last 10% of requests. Seer removes much of this tail by combining segmented rollout, context-aware scheduling, and adaptive grouped speculative decoding on top of a Mooncake-based global KVCache pool.

Main points

  • Rollout bottleneck: Seer targets the rollout phase of synchronous reinforcement learning, which accounts for approximately 63% to 87% of iteration time and is dominated by long-tail requests and KVCache fragmentation.
  • Three core mechanisms: Seer combines segmented rollout, context-aware scheduling, and adaptive grouped speculative decoding to exploit similarities in output length and generation patterns across GRPO responses that share a prompt.
  • Fine-grained scheduling over a global KVCache: Requests are broken into chunks and migrated across a Mooncake-style global KVCache pool, which keeps RL synchronous and on-policy while keeping GPU memory utilization high and reducing preemption.
  • Online context to cut tail latency: Group-level length statistics from speculative requests drive context-aware scheduling that approximates an oracle longest-first scheduler and sharply reduces the time spent on the last 10% of requests.
  • Measured end-to-end gains: On production-grade RL workloads using Moonlight, Qwen2 VL 72B, and Kimi K2, Seer improves rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% relative to a state-of-the-art synchronous vLLM-based baseline.

Seer is a notable systems contribution because it optimizes the rollout phase of synchronous RL without changing the underlying GRPO algorithm, so it preserves on-policy guarantees and reproducibility while fixing real infrastructure bottlenecks. The combination of segmented rollout, context-aware scheduling, and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain-of-thought reasoning models with large KVCache footprints. Overall, Seer shows that system-level online context learning now matters as much as model architecture choices for scaling reasoning RL efficiently.

