High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs

The step-by-step response generated by a large language model (LLM) is called a chain of thought (CoT), in which every token contributes to a coherent and logical narrative. To improve the quality of reasoning, various reinforcement learning techniques have been adopted. These methods allow the model to learn from feedback by aligning its generated outputs with correctness criteria. As LLMs grow in complexity and capability, researchers have begun to probe the internal structure of token generation to identify patterns that enhance or limit performance. One area drawing attention is the token entropy distribution, a measure of uncertainty in token prediction, which has now been linked to the model’s ability to make meaningful logical decisions during reasoning.
The core problem in training reasoning models with reinforcement learning is that all output tokens are treated equally. When optimizing models with Reinforcement Learning with Verifiable Rewards (RLVR), the update process traditionally covers every token in the generated sequence regardless of its functional role. This uniform treatment fails to distinguish tokens that drive significant reasoning shifts from tokens that merely extend existing linguistic structure. As a result, much of the training signal may be spent on tokens that contribute minimally to the model’s reasoning ability. Because these methods do not prioritize the few tokens that play a decisive role in navigating between logical paths, they miss an opportunity for focused and efficient optimization.
Most RLVR frameworks, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Dynamic Sampling Policy Optimization (DAPO), work by evaluating entire sequences of output tokens against a reward function that assesses correctness. PPO relies on a clipped objective function to stabilize policy updates. GRPO improves on this by estimating advantage values from groups of sampled responses rather than a separate value network. DAPO introduces further enhancements such as the clip-higher mechanism and overlong reward shaping. None of these methods, however, consider token-level entropy or distinguish the importance of individual tokens in the reasoning chain; instead, they apply uniform gradient updates across the board.
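To make the "uniform gradient update" point concrete, here is a minimal sketch, not the authors' released code, of how a GRPO-style RLVR objective assigns one group-relative advantage to an entire response and lets the gradient flow through every token equally. The function names, tensor shapes, and toy rewards are illustrative assumptions.

```python
# A minimal sketch (illustrative, not the paper's implementation) of a
# GRPO-style RLVR update that treats every token in a response identically.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize verifiable rewards within a group of
    responses sampled for the same prompt, instead of using a value network."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def uniform_policy_gradient_loss(logps, advantages, mask):
    """Every generated token receives the same scalar advantage as its
    response, so 'forking' tokens and filler tokens are updated identically.
    logps: (G, T) log-probs of sampled tokens under the current policy
    advantages: (G,) one scalar per response
    mask: (G, T) 1 for generated tokens, 0 for padding."""
    per_token = -logps * advantages.unsqueeze(1)   # broadcast over all tokens
    return (per_token * mask).sum() / mask.sum()

# Toy example: 4 sampled responses to one prompt with 0/1 verifiable rewards.
G, T = 4, 6
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logps = torch.randn(G, T, requires_grad=True)
mask = torch.ones(G, T)
loss = uniform_policy_gradient_loss(logps, group_relative_advantages(rewards), mask)
loss.backward()  # gradients flow through every token equally
```

The point of the sketch is the broadcast in `uniform_policy_gradient_loss`: the same scalar advantage multiplies the log-probability of every token, which is exactly the uniformity the Alibaba and Tsinghua work questions.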
To refine how RLVR training affects LLM reasoning, researchers at Alibaba and Tsinghua University have proposed a new approach focused on token entropy patterns. They observed that, in the CoT sequences generated by Qwen3 models, only a small fraction of tokens (roughly 20%) show significantly higher entropy. These tokens, labeled “forking tokens”, typically correspond to moments where the model must decide between multiple reasoning paths. The remaining 80% of tokens generally exhibit low entropy and act as extensions of what has already been stated. By restricting policy-gradient updates to these high-entropy tokens alone, the research team was able not only to maintain but, in many cases, improve performance on challenging reasoning benchmarks.
To quantify token entropy, the researchers applied the entropy formula to the probability distribution over possible token choices at each generation step. They found that more than half of all generated tokens had entropy values below 0.01, indicating near-deterministic behavior. Only 20% exceeded an entropy of 0.672, marking them as the decision hubs within the CoT. High-entropy tokens often include logical operators and connective words, such as “suppose”, “since”, or “thus”, that introduce new conditions or transitions in the logic. Low-entropy tokens, in contrast, include predictable symbols, suffixes, or code fragments. Controlled experiments made it clear that manipulating the entropy of these forking tokens directly affects the model’s reasoning performance, while altering low-entropy tokens has little effect.
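As a rough illustration of the selection step, the sketch below, a toy PyTorch version and not the paper's implementation, computes the per-token entropy H_t = -Σ_j p_{t,j} log p_{t,j} over the vocabulary at each step and restricts the policy-gradient loss to roughly the top 20% highest-entropy tokens. The helper names, the 0.2 keep ratio taken from the article, and the toy tensors are assumptions for illustration.

```python
# A minimal sketch (an assumption, not the released code) of the two steps
# described above: measure per-token entropy, then train only on the
# high-entropy minority of "forking" tokens.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """H_t = -sum_j p_{t,j} * log p_{t,j} over the vocabulary.
    logits: (..., T, V) -> entropy: (..., T)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def high_entropy_mask(entropy: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Keep only the top `keep_ratio` fraction of tokens by entropy."""
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = entropy.flatten().topk(k).values.min()
    return (entropy >= threshold).float()

def masked_policy_gradient_loss(logps, advantages, entropy, keep_ratio=0.2):
    """Same objective as before, but gradients flow only through the
    high-entropy tokens selected by the mask."""
    mask = high_entropy_mask(entropy, keep_ratio)
    per_token = -logps * advantages.unsqueeze(1)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with random logits standing in for a model's outputs.
G, T, V = 4, 6, 32
logits = torch.randn(G, T, V, requires_grad=True)
entropy = token_entropy(logits)                                  # (G, T)
sampled = torch.randint(0, V, (G, T))                            # pretend sampled token ids
logps = F.log_softmax(logits, dim=-1).gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
advantages = torch.tensor([1.0, -1.0, -1.0, 1.0])
loss = masked_policy_gradient_loss(logps, advantages, entropy.detach())
loss.backward()
```

In this toy usage, only tokens above roughly the 80th percentile of entropy contribute to the gradient, mirroring the 20% forking-token budget described above; all other tokens are left untouched by the update.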
The research team conducted extensive experiments on three models: Qwen3-8B, Qwen3-14B, and Qwen3-32B. When trained on only the top 20% of high-entropy tokens, the Qwen3-32B model achieved scores of 63.5 on AIME’24 and 56.7 on AIME’25, setting new performance benchmarks for models under 600B parameters. Furthermore, increasing the maximum response length from 20k to 29k raised the AIME’24 score to 68.1. In contrast, training on the bottom 80% of low-entropy tokens caused a significant drop in performance. The Qwen3-14B model showed gains on both AIME’25 (+4.79) and AIME’24, while Qwen3-8B maintained results competitive with full-token training. An ablation study further confirmed the importance of the 20% threshold: reducing the fraction to 10% omitted essential decision points, while raising it to 50% or 100% diluted the effect by including too many low-entropy tokens, reducing entropy diversity and hindering exploration.
Essentially, this study offers a new direction for enhancing the reasoning capabilities of language models: identifying and selectively training the small minority of tokens that contribute disproportionately to reasoning success. It avoids inefficient training over every token and instead proposes a scalable approach that aligns the reinforcement learning objective with the actual decision-making moments in the token sequence. The success of the strategy lies in using entropy as a guide for distinguishing useful tokens from filler.
Key takeaways from the research include:
- About 20% of tokens exhibit high entropy and act as forking points that direct the reasoning path.
- Training only on these high-entropy tokens delivers performance equal to or better than training on the full token set.
- Qwen3-32B achieved 63.5 on AIME’24 and 56.7 on AIME’25, outperforming larger models trained conventionally.
- Extending the maximum response length from 20k to 29k further increased the AIME’24 score to 68.1.
- Training on the remaining 80% of low-entropy tokens leads to a sharp drop in performance.
- Keeping the high-entropy token threshold at 20% offers the best balance between exploration and performance.
- Larger models gain more from this strategy, as their capabilities benefit from enhanced exploration.
- The strategy scales well and can guide more efficient training of next-generation reasoning models.
In summary, this study rethinks how reinforcement learning is applied to language models by directing attention to token-level entropy. By optimizing only the minority of tokens that shape reasoning paths, the approach improves performance while reducing computational overhead. It offers a practical roadmap for future efforts to strengthen LLM reasoning without unnecessary complexity.
Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 98k+ ML SubReddit, and subscribe to our newsletter.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
