
Off-Policy Reinforcement Learning with KL-Divergence Regularization Yields Stronger Reasoning in Large Language Models

Policy gradient methods have significantly improved the reasoning ability of LLMs, particularly through reinforcement learning (RL). A key tool for stabilizing these methods is Kullback-Leibler (KL) regularization, which discourages large shifts between the current policy and a reference policy. Although KL regularization is widely used in algorithms such as PPO, there are many ways to estimate and apply its different variants (forward KL, reverse KL, and their unnormalized forms) inside loss functions. These choices, together with the variety of gradient estimators and on-policy versus off-policy settings, shape training stability and performance in nuanced ways that are not yet fully understood.
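For orientation, here is the standard KL-regularized objective written in LaTeX; the notation is assumed for this article rather than copied from the paper, and the forward/reverse labels follow the common convention in which the forward direction takes the expectation under the reference policy:

\[
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\]

\[
\underbrace{\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})}_{\text{reverse KL}} = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right],
\qquad
\underbrace{\mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta)}_{\text{forward KL}} = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}\!\left[\log\frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_\theta(y \mid x)}\right].
\]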

Fine-tuning LLMs with human feedback is central to building aligned AI systems. Two main strategies are used: optimizing against a learned reward model with policy gradient methods (e.g., PPO), and training directly on human preferences with methods such as Direct Preference Optimization (DPO). While PPO trains stably through a reward model, DPO and its variants use pairwise comparisons to simplify and scale learning, and they have become increasingly popular in recent models. Reinforcement learning is also increasingly used to strengthen LLM reasoning, especially on complex tasks such as mathematics and coding. In this setting, computational cost can often be reduced and training stability improved by replacing the value network or modifying the KL penalty.

Researchers from UCLA, Tsinghua University, and the Shanghai Qi Zhi Institute introduced Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradients in online reinforcement learning. They derive policy gradients and corresponding surrogate loss functions for both forward and reverse KL divergences, covering normalized and unnormalized policy distributions. RPG supports both fully differentiable objectives and REINFORCE-style estimators, tailored to off-policy training with importance sampling. The study also identifies and addresses theoretical issues in existing methods such as GRPO and examines how KL regularization is handled in REINFORCE++. Experiments on LLM reasoning tasks show that RPG improves stability and performance over strong baselines, including GRPO, REINFORCE++, and DAPO.

This study derives policy gradient methods that incorporate KL-divergence regularization in online, off-policy settings, using importance sampling from older policies. For the forward KL, the gradient involves importance-weighted reward and regularization terms, and the loss reduces to a maximum-likelihood-style loss when the reward is zero. The unnormalized forward KL adds a correction term for mismatched distribution mass. Similarly, the reverse KL and its unnormalized form penalize deviation from the reference policy, weighted by the log-probability ratio. All of these methods share a REINFORCE-like gradient structure, which admits surrogate-loss implementations built with a stop-gradient operator, supporting stable and efficient optimization in practice.
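As an illustration only, here is a minimal PyTorch-style sketch of one such surrogate loss under the assumptions just described; the function and variable names are my own, not taken from the paper's code. It combines an importance-weighted reward term with a forward-KL-style regularizer toward the reference policy, and uses detach() as the stop-gradient operator so the weights rescale the gradient without being differentiated through. With zero rewards, the loss reduces to a weighted maximum-likelihood loss toward the reference policy, matching the behavior described above.

import torch

def rpg_style_forward_kl_loss(logp_cur, logp_old, logp_ref, rewards, beta=0.1):
    """Sketch of an off-policy, KL-regularized policy-gradient surrogate loss.

    logp_cur: log pi_theta(y|x) under the current policy (differentiable w.r.t. theta)
    logp_old: log pi_old(y|x) under the behavior policy that sampled y (no grad)
    logp_ref: log pi_ref(y|x) under the frozen reference policy (no grad)
    rewards:  scalar reward r(x, y) per sampled response
    beta:     KL regularization coefficient (assumed value)
    """
    # Stop-gradient on the importance weights: they correct for sampling
    # from the older policy but are not differentiated through.
    w_cur = torch.exp(logp_cur - logp_old).detach()   # pi_theta / pi_old
    w_ref = torch.exp(logp_ref - logp_old).detach()   # pi_ref   / pi_old

    # Importance-weighted REINFORCE-style reward term.
    reward_term = w_cur * rewards * logp_cur

    # Forward-KL regularizer toward pi_ref; with rewards == 0 this is a
    # weighted maximum-likelihood loss on reference-weighted samples.
    kl_term = beta * w_ref * logp_cur

    # Maximize (reward_term + kl_term)  ->  minimize the negative mean.
    return -(reward_term + kl_term).mean()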

The researchers evaluated the proposed RPG methods (both the fully differentiable and the REINFORCE-style variants) against several established baselines on complex mathematical reasoning tasks using Qwen2.5 language models. Training used the DAPO-Math-17K dataset, and performance was evaluated on benchmarks such as AMC23 and AIME. The RPG variants consistently showed strong accuracy, training stability, and efficient memory usage. The implementation builds on the verl framework and uses techniques such as KL regularization, PPO-style clipping, and a schedule-free AdamW optimizer for smoother optimization. RPG models generally outperformed the alternatives in reward curves, entropy stability, and response length, highlighting their robustness and suitability for stable, high-performance learning.
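To make the "PPO-style clipping" reference concrete, here is a minimal, illustrative sketch; the function and variable names are assumptions on my part and are not taken from the verl codebase. It clips the importance ratio to a trust region around 1 and takes the pessimistic (elementwise minimum) objective, which is the standard PPO clipping mechanism.

import torch

def ppo_clipped_objective(logp_cur, logp_old, advantages, clip_eps=0.2):
    """Illustrative PPO-style clipped surrogate objective (names assumed)."""
    ratio = torch.exp(logp_cur - logp_old)            # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Elementwise minimum gives the pessimistic bound; negate for minimization.
    return -torch.min(unclipped, clipped).mean()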

In summary, RPG is a unified framework for designing and analyzing policy gradient methods that incorporate KL regularization into online, off-policy reinforcement learning. The authors explore a range of configurations, including forward and reverse KL divergences, normalized and unnormalized policy distributions, and two types of estimators: fully differentiable and REINFORCE-style. RPG is intended to provide a structured approach to understanding and implementing these variations. Applied to reasoning tasks with large language models, it exhibits more stable training and competitive or improved performance compared to established baselines such as GRPO, REINFORCE++, and DAPO.


Check out the paper and GitHub page. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
