
From Exploration Collapse to Predictable Limits: Shanghai AI Laboratory Proposes an Entropy-Based Scaling Law for Reinforcement Learning in LLMs

Recent advances in reasoning-centric large language models (LLMs) have expanded the scope of reinforcement learning (RL) beyond narrow, task-specific applications, enabling broader generalization and reasoning capabilities. However, this shift introduces significant challenges, particularly in scaling the training compute required to learn from experience. Unlike pre-training and fine-tuning, which rely on imitating existing data, RL requires models to learn from their own exploration. A core issue is the collapse of policy entropy, which upsets the balance between exploiting known strategies and exploring new ones. This exploitation-exploration trade-off is fundamental to RL, and controlling policy entropy is crucial for maintaining effective exploration during training.
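In practice, the policy entropy tracked here is the average entropy of the model's next-token distribution over generated positions. Below is a minimal sketch of how it can be measured, assuming a PyTorch model that exposes raw logits; the shapes and toy inputs are illustrative only.

```python
# Minimal sketch (PyTorch assumed) of measuring policy entropy for an
# autoregressive LLM: average the entropy of the next-token distribution
# over all token positions.
import torch
import torch.nn.functional as F

def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size) raw model outputs.
    Returns the mean per-token entropy H = -sum_v p(v) * log p(v)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)  # (batch, seq_len)
    return token_entropy.mean()

# Toy example with random logits standing in for real model outputs.
logits = torch.randn(2, 16, 32_000)
print(float(policy_entropy(logits)))
```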

Existing efforts address the exploration-exploitation trade-off in RL by regulating policy entropy. Maximum-entropy RL adds an entropy regularization term to the reward, promoting uncertainty in action selection and encouraging broader exploration. Although this technique is widely adopted in conventional RL algorithms, its usefulness for LLMs is still debated. Moreover, the predictability of RL training itself has received little attention: while neural scaling laws guide LLM development, comparable predictive frameworks for RL training remain limited. Existing RL approaches for LLMs with verifiable rewards show promise for improving reasoning, but lack a deep understanding of their core mechanisms.
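For context, the maximum-entropy formulation mentioned above is commonly implemented as an entropy bonus added to the policy-gradient loss. Here is a hedged sketch of that conventional trick; the coefficient and tensor shapes are illustrative and not taken from the paper.

```python
# Hedged sketch of entropy regularization in a policy-gradient loss:
# the optimizer is rewarded for keeping the action distribution uncertain.
import torch

def pg_loss_with_entropy_bonus(
    logp_taken: torch.Tensor,     # (N,) log-probs of the sampled tokens
    advantages: torch.Tensor,     # (N,) advantage estimates
    token_entropy: torch.Tensor,  # (N,) per-token entropy of the policy
    ent_coef: float = 0.01,       # hypothetical regularization weight
) -> torch.Tensor:
    pg_loss = -(logp_taken * advantages).mean()        # standard REINFORCE-style term
    return pg_loss - ent_coef * token_entropy.mean()   # entropy bonus encourages exploration
```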

Researchers from Shanghai AI Laboratory, Tsinghua University, UIUC, Peking University, and CUHK propose a solution to policy entropy collapse in RL for reasoning-centric LLMs. They establish a transformation equation, R = -a·exp(H) + b, where H is the policy entropy, R is the downstream performance, and a and b are fitted coefficients. This empirical law strongly suggests that policy performance is traded off against policy entropy and is therefore bottlenecked by its exhaustion. The researchers also study entropy dynamics, and their derivation shows that the change in policy entropy is driven by the covariance between an action's probability and the change in its logit. Building on this, they propose two techniques, Clip-Cov and KL-Cov, which respectively clip the gradient updates of, and impose a KL penalty on, tokens with high covariance.
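The fitted relation can be estimated directly from training logs. The sketch below uses SciPy's curve_fit on made-up (entropy, performance) pairs purely to show the mechanics; as H approaches zero, the fitted curve predicts a performance ceiling of roughly R = -a + b.

```python
# Sketch of fitting the reported relation R = -a * exp(H) + b to logged
# (entropy, performance) pairs. The data points below are fabricated solely
# to illustrate estimating the coefficients a and b.
import numpy as np
from scipy.optimize import curve_fit

def perf_from_entropy(H, a, b):
    return -a * np.exp(H) + b

# Hypothetical training log: policy entropy H and downstream accuracy R.
H = np.array([0.55, 0.40, 0.30, 0.22, 0.15, 0.10])
R = np.array([0.35, 0.44, 0.49, 0.53, 0.56, 0.58])

(a, b), _ = curve_fit(perf_from_entropy, H, R, p0=(0.5, 1.0))
print(f"fitted a={a:.3f}, b={b:.3f}")
print("predicted ceiling as H -> 0:", perf_from_entropy(0.0, a, b))  # R ~ -a + b
```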

To investigate and verify entropy collapse in LLMs trained with RL, the researchers applied RL to LLMs on verifiable tasks such as math and coding in an autoregressive generation setting, where the model generates a sequence of tokens conditioned on the input prompt. The study covered 11 widely adopted open-source models spanning four families, Qwen2.5, Mistral, LLaMA, and DeepSeek, with parameter counts ranging from 0.5B to 32B. Evaluation was conducted on eight public benchmarks, including MATH500, AIME 2024, AMC, and Eurus-2-RL-Code. RL training followed the veRL framework in a zero-shot setting, using algorithms such as GRPO, REINFORCE++, and PRIME to optimize policy performance while the entropy dynamics were observed.

The proposed Clip-Cov and KL-Cov techniques were evaluated on the Qwen2.5 models for mathematical tasks using the DAPO-MATH dataset. Both methods achieve non-trivial performance gains across all benchmarks. Compared with the GRPO baseline, they improve performance by an average of 2.0% for the 7B model and 6.4% for the 32B model. For example, when the baseline's entropy reaches a plateau, the KL-Cov method still maintains an entropy level more than 10 times higher, and both methods keep entropy elevated throughout training. Furthermore, the gains are larger on the bigger Qwen2.5-32B model on the most challenging benchmarks, AIME24 and AIME25, where improvements reach 15.0% and 14.6%, respectively.
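To make the idea concrete, here is a simplified, hedged sketch of a KL-Cov-style penalty: a per-token covariance signal between log-probability and advantage selects the tokens to regularize, and an approximate KL term pulls them back toward the rollout policy. The selection fraction, penalty weight, and KL estimator are illustrative assumptions rather than the paper's exact choices.

```python
# Hedged, simplified sketch of selectively penalizing high-covariance tokens.
import torch

def kl_cov_penalty(
    logp: torch.Tensor,        # (N,) log-probs of sampled tokens, current policy
    logp_old: torch.Tensor,    # (N,) log-probs under the rollout policy (detached)
    advantages: torch.Tensor,  # (N,) advantage estimates
    top_frac: float = 0.02,    # assumed fraction of tokens to penalize
    kl_coef: float = 1.0,      # assumed penalty weight
) -> torch.Tensor:
    # Per-token covariance contribution: centered log-prob times centered advantage.
    cov = (logp - logp.mean()) * (advantages - advantages.mean())
    k = max(1, int(top_frac * cov.numel()))
    idx = torch.topk(cov, k).indices  # tokens most responsible for entropy loss
    # Quadratic log-ratio as a common non-negative estimate of per-token KL
    # to the rollout policy; added to the training loss for the selected tokens.
    kl = 0.5 * (logp[idx] - logp_old[idx]).pow(2)
    return kl_coef * kl.mean()
```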

In summary, the researchers tackle the challenge of policy entropy collapse in RL for reasoning-centric LLMs. Their findings highlight a trade-off between performance improvement and shrinking exploration, which ultimately limits further gains. Through theoretical analysis and empirical verification, they identify entropy dynamics as a key bottleneck and propose two effective regularization strategies, Clip-Cov and KL-Cov, that manage high-covariance tokens to maintain exploration. As RL emerges as a critical axis for scaling beyond pre-training, addressing entropy collapse becomes essential. This work provides foundational insights into the role of entropy, guiding future efforts to scale RL toward smarter and more capable language models.


Check out the Paper and GitHub page. All credit for this research goes to the researchers on the project. Also, feel free to follow us on Twitter, join our 95k+ ML SubReddit, and subscribe to our newsletter.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.
