
Improving AI Reasoning: The Art of Learnability-Based Sampling in LLM Training


Reinforcement learning (RL) has long been a core component of training large language models (LLMs) for reasoning tasks, especially mathematical problem-solving. Yet a great deal of inefficiency arises during training, because many problems are either always solved or never solved. When a problem's success rate shows no variation, it produces no gradient signal and therefore cannot improve the model's performance. Traditional RL-based fine-tuning strategies also tend to carry expensive computing costs, increased energy use, and inefficient use of resources. Correcting this is a necessary condition for improving training efficiency and enabling language models to learn from the problems that most improve their reasoning.

Standard training schemes for large language models (LLMs) use policy gradient techniques, such as Proximal Policy Optimization (PPO), in which the model repeatedly attempts problems and is updated based on signals of success or failure. One of the biggest drawbacks of this approach is that most training examples fall into the extreme groups: always right or always wrong. When an example is always solved correctly, repeated attempts provide no further learning information; conversely, impossible queries yield no feedback from which to improve. As a result, valuable computing resources are wasted on training scenarios that teach the model nothing. Curriculum learning techniques, such as Unsupervised Environment Design (UED), attempt to control training difficulty dynamically, but they rely on heuristics such as regret-based selection that are largely insufficient for anticipating the optimal problem difficulty and do not generalize well to the reasoning tasks involved in LLM training.
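To see why examples the model always or never solves contribute nothing, consider the spread of rewards across repeated rollouts of the same problem. The snippet below is a minimal illustration, not code from the paper; rollout_reward_spread is a hypothetical helper that simulates binary rewards for a problem with a given success rate.

```python
import numpy as np

def rollout_reward_spread(success_probability: float, num_rollouts: int = 8) -> float:
    """Simulate binary rewards across several rollouts of one problem and
    return their variance, a rough proxy for learning signal: if every
    rollout succeeds or every rollout fails, the rewards are identical,
    advantage estimates collapse, and the policy gradient is ~0."""
    rewards = np.random.binomial(1, success_probability, size=num_rollouts)
    return float(rewards.var())

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"success rate {p:.1f} -> reward variance ~{rollout_reward_spread(p):.3f}")
```

Problems near the extremes produce nearly identical rewards on every attempt, so the gradient update learns nothing from them; mid-range problems produce the most varied feedback.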

To address this inefficiency, a novel training policy has been proposed: sampling focused on the variance of success rates, which steers the model toward problems that are neither too easy nor too hard. By identifying problems on which the model performs inconsistently, this approach concentrates training on examples that provide the most useful learning signals. This systematic selection differs from previous strategies that rely on randomly sampled batches, and it improves update efficiency by eliminating problems that cannot yield meaningful improvement. The process adapts during training, continuously optimizing problem selection to track the model's changing strengths. By targeting instances of moderate difficulty, the approach enables the model to learn more effectively and generalize better to new tasks.
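A minimal sketch of this selection rule follows. It is an illustration under stated assumptions rather than the authors' implementation: problems are scored by the variance of a Bernoulli success indicator, p(1 − p), and the highest-variance candidates are kept. The helper names are hypothetical.

```python
def learnability_score(success_rate: float) -> float:
    """Bernoulli variance p * (1 - p): maximal at p = 0.5, zero at p = 0 or 1."""
    return success_rate * (1.0 - success_rate)

def select_most_learnable(success_rates: dict[str, float], k: int) -> list[str]:
    """Rank candidate problems by p * (1 - p) and keep the top k."""
    return sorted(success_rates,
                  key=lambda q: learnability_score(success_rates[q]),
                  reverse=True)[:k]

# Problems the model always or never solves rank last; mid-difficulty ones win.
rates = {"easy": 1.0, "impossible": 0.0, "medium_a": 0.45, "medium_b": 0.70}
print(select_most_learnable(rates, k=2))  # -> ['medium_a', 'medium_b']
```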

The structured selection process runs through a multi-step pipeline that starts by identifying candidate problems in each training iteration. Multiple rollouts are generated to estimate each problem's probability of success, and the variance of these success rates is computed with the function p(1 − p), where p is the probability of a correct solution. The most learnable problems, those with a moderate probability of success, are preferred and stored in a dynamic buffer. A training batch is then formed by combining high-variance problems from the buffer with additional examples sampled at random from the dataset. These carefully constructed batches are used to compute the policy gradient and update the model parameters. The strategy was applied with two reinforcement learning algorithms, PPO and VinePPO, on two mathematical reasoning datasets: MATH, comprising 12,000 competition-level problems, and GSM8K, comprising 8,000 grade-school problems. Additional tests on the CollegeMath and OlympiadBench datasets quantified generalization beyond the original training distribution. The full framework combines VinePPO with implementation optimizations such as gradient accumulation, multi-rollout estimation, and DeepSpeed ZeRO for scalable performance.
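Putting the pipeline together, the sketch below shows how a training batch might be assembled each iteration. It is an assumption-laden illustration rather than the paper's code: model.solve_and_check, the candidate pool size, and the 50/50 buffer-to-random mixing ratio are hypothetical placeholders.

```python
import random
from collections import deque

def estimate_success_rate(model, problem, n_rollouts: int = 8) -> float:
    """Hypothetical helper: generate n_rollouts solutions, check each one,
    and return the fraction that are correct (the empirical p)."""
    return sum(model.solve_and_check(problem) for _ in range(n_rollouts)) / n_rollouts

def build_training_batch(model, dataset, buffer: deque, batch_size: int,
                         buffer_fraction: float = 0.5, n_rollouts: int = 8) -> list:
    """One iteration of learnability-driven batch construction (sketch):
    estimate p for candidate problems, rank by p * (1 - p), keep the
    highest-variance ones in a rolling buffer, and mix them with fresh
    random samples before the PPO / VinePPO policy-gradient update."""
    candidates = random.sample(dataset, k=min(len(dataset), 4 * batch_size))
    scored = [(prob, estimate_success_rate(model, prob, n_rollouts)) for prob in candidates]
    scored.sort(key=lambda item: item[1] * (1.0 - item[1]), reverse=True)
    buffer.extend(prob for prob, _ in scored[:batch_size])

    n_from_buffer = min(int(buffer_fraction * batch_size), len(buffer))
    batch = [buffer.popleft() for _ in range(n_from_buffer)]
    batch += random.sample(dataset, k=batch_size - len(batch))
    return batch
```

Mixing buffered high-variance problems with random samples keeps coverage of the full dataset while concentrating gradient updates where p(1 − p) is largest.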

The learnability-driven selection mechanism greatly improves the speed and efficiency of model training. Models trained with this curriculum reach the accuracy of traditionally trained models in roughly a quarter of the training steps, a significant increase in convergence speed. Performance improves consistently across datasets, with better test accuracy on both GSM8K and MATH. The structured curriculum also generalizes to out-of-distribution tasks, yielding better results on datasets such as CollegeMath and OlympiadBench. Training batch composition is optimized by eliminating problems with zero learning signal, resulting in more efficient training. The method is also computationally advantageous because sampling scales efficiently without redundant model updates. The combination of faster convergence, better generalization, and lower computational overhead makes this adaptive learning process a valuable tool for reinforcement learning-based LLM fine-tuning.

Targeting problem selection at high-variance learning opportunities effectively addresses the inefficiency seen in reinforcement learning-based fine-tuning of language models. Focusing on the problems that generate the most useful training signals improves learning efficiency, enables faster improvement, and yields better adaptability to new samples. Large-scale experiments verify that the strategy improves training speed, test accuracy, and generalization across multiple datasets. These findings highlight the promise of structured sample selection for improving model training and optimizing computational resources. Future research could investigate its applicability to other reinforcement learning settings, such as reward model optimization, preference-based fine-tuning, and broader decision-making tasks in AI.


Check out the Paper. All credit for this research goes to the researchers of this project.


