
Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Effective Reinforcement Learning Algorithm Powering the Qwen3 Models

Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics remains a challenge when scaling RL to larger computational resources. Current state-of-the-art algorithms, such as GRPO, exhibit severe stability issues when training giant language models, often leading to catastrophic failures. These instabilities arise from the incorrect application of importance sampling weights, which introduces high-variance noise. This noise accumulates over longer responses and is further amplified by the clipping mechanism, causing model collapse and hindering progress.
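
For context, a sketch of the token-level importance ratio that GRPO-style objectives apply to each token (the notation is ours: $x$ is the query, $y_{i,t}$ the $t$-th token of the $i$-th sampled response):

```latex
% Token-level importance ratio used in GRPO-style objectives (sketch; notation ours)
w_{i,t}(\theta) \;=\; \frac{\pi_\theta\!\left(y_{i,t}\mid x,\, y_{i,<t}\right)}
                           {\pi_{\theta_{\mathrm{old}}}\!\left(y_{i,t}\mid x,\, y_{i,<t}\right)}
```

Because each $w_{i,t}$ is estimated from a single token sample, its noise compounds across long responses, which is the variance accumulation described above.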

Existing approaches such as PPO and GRPO rely on mechanisms like clipping to address the off-policy learning challenge, where responses are drawn from outdated policies. However, these approaches face limitations stemming from their ill-posed objectives, particularly in large models handling long-response tasks. GRPO's token-level importance sampling introduces high-variance noise and leads to irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, highlighting a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards underscores the need for a new approach that optimizes directly at the sequence level to ensure stability and scalability (a sketch of the standard GRPO objective follows below).
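
As a point of reference, the GRPO objective (as commonly written; a sketch in our notation) applies a single sequence-level advantage $\hat{A}_i$ at every token while clipping each token ratio independently, which is precisely the token-level/sequence-level mismatch noted above:

```latex
% GRPO-style clipped objective (sketch; notation ours), with w_{i,t}(\theta) as defined above
J_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\!\left[
    \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
    \min\!\Big( w_{i,t}(\theta)\,\hat{A}_i,\;
                \mathrm{clip}\big(w_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big)
  \right]
```

Here $G$ is the number of responses sampled per query, $\hat{A}_i$ is the group-normalized reward of response $y_i$, and $\varepsilon$ is the clipping range.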

Researchers from Alibaba have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO's main innovation lies in its theoretically grounded importance ratio, derived from sequence likelihoods, which aligns with the principles of importance sampling. Moreover, it computes normalized rewards as advantages across multiple responses to a query, promoting consistency between sequence-level rewards and the optimization objective. Empirical evaluations show that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving the stability challenges of training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
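
A sketch of the sequence-level formulation described in the paper (notation ours): the importance ratio is defined from whole-sequence likelihoods, length-normalized, and clipped per response rather than per token, with group-normalized rewards serving as advantages.

```latex
% GSPO sequence-level ratio, group-normalized advantage, and clipped objective (sketch; notation ours)
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|},
\qquad
\hat{A}_i = \frac{r(x, y_i) - \mathrm{mean}\,\{r(x, y_j)\}_{j=1}^{G}}{\mathrm{std}\,\{r(x, y_j)\}_{j=1}^{G}}

J_{\mathrm{GSPO}}(\theta) =
  \mathbb{E}\!\left[
    \frac{1}{G}\sum_{i=1}^{G}
    \min\!\Big( s_i(\theta)\,\hat{A}_i,\;
                \mathrm{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big)
  \right]
```

Compared with the GRPO objective above, both the importance ratio and the clipping now operate on the whole response, matching the granularity of the reward.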

For the experiments, the researchers used a cold-start model fine-tuned from the Qwen3-30B-A3B base model, reporting training reward curves and model performance curves on the AIME'24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch was split into four minibatches for gradient updates. GSPO clips entire responses rather than individual tokens, with the clipping ranges in its formulation set to 3e-4 and 4e-4. This results in a two-orders-of-magnitude difference in the fraction of clipped tokens compared to GRPO. Despite removing more tokens from gradient estimation, GSPO achieves higher training efficiency, which highlights the inefficiency of GRPO's noisy token-level estimates.
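
To make the sequence-level clipping concrete, here is a minimal PyTorch-style sketch (our own illustrative code, not the authors' implementation; the function name, parameters, and default clip ranges are assumptions mirroring the values mentioned above):

```python
import torch

def gspo_surrogate(logp_new, logp_old, advantages, mask,
                   clip_eps_low=3e-4, clip_eps_high=4e-4):
    """Illustrative sketch of a GSPO-style sequence-level surrogate loss.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the current
        and rollout policies; mask: (batch, seq_len) float, 1.0 for response tokens;
    advantages: (batch,) group-normalized rewards. Clip ranges are illustrative,
    echoing the 3e-4 / 4e-4 values reported in the article.
    """
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Length-normalized sequence-level importance ratio:
    # exp( (sum_t logp_new - sum_t logp_old) / |y| )
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    # Clipping acts on the whole response, not on individual tokens.
    clipped = ratio.clamp(1.0 - clip_eps_low, 1.0 + clip_eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()  # negate so gradient descent maximizes the objective
```

Because the ratio is defined per response and stays close to 1, such tight clipping ranges are plausible here, whereas token-level methods typically use much wider ranges.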

GSPO offers a significant advantage for MoE training by stabilizing the process through consistent expert activation across gradient updates, unlike GRPO, which struggles with expert-activation volatility. This removes the need for complex workarounds such as Routing Replay, simplifying the infrastructure and allowing the model to exploit its full capacity. On the RL-infrastructure side, GSPO's sequence-level optimization reduces its dependence on token-level likelihoods, making it more robust to precision mismatches. This makes it possible to use likelihoods from the inference engine directly, avoiding costly recomputation and improving the efficiency of partial-rollout and multi-turn RL. GSPO thereby also simplifies the RL infrastructure for large-scale language model training.

In summary, the researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency observed with GRPO. Its superior performance in training stability, efficiency, and scalability, particularly for MoE models, underscores its importance as a strong algorithmic foundation. The advances enabled by GSPO played a key role in the outstanding performance of the Qwen3 models. Building on GSPO as a foundation, the researchers plan to scale up their RL methods, opening the door to groundbreaking advances in AI.


Check out the Paper.


Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.