
Meta and NYU’s new AI approach uses semi-online reinforcement learning to improve LLM alignment

Optimizing LLMs for human alignment with reinforcement learning

Large language models often require a further alignment phase to make them useful for human interaction. At this stage, reinforcement learning plays a central role, enabling models to learn from human feedback or task-based correctness signals. This fine-tuning brings models closer to user expectations, making them better suited for instruction-following applications and precise mathematical tasks.

Challenges in choosing between offline and online reinforcement learning strategies

A major difficulty arises in choosing the most effective fine-tuning method. Training approaches span two extremes: offline methods that rely on static, pre-generated data, and fully online methods that update continuously with each new interaction. Each approach brings its own challenges. Offline models cannot adapt during training, which limits performance, while online models often demand more computing resources. Furthermore, ensuring that a model performs well on both mathematical (verifiable) and open-ended (non-verifiable) tasks adds yet more complexity to this choice.

Overview of alignment algorithms: DPO and GRPO

Historically, tools such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been used for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency, but it lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by computing relative advantages across groups of sampled outputs. While GRPO adapts in real time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.
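For intuition, the minimal PyTorch sketch below shows the core of both objectives: a pairwise DPO loss and the group-relative advantage computation behind GRPO. The function names, the default beta value, and the normalization details are illustrative assumptions rather than the paper’s implementation.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss. Each argument is a tensor of summed token
    log-probabilities for the chosen / rejected completion under the
    current policy or the frozen reference model."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Maximize the log-sigmoid of the scaled difference in margins.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within each group of
    completions sampled for the same prompt. `rewards` has shape [num_prompts, G]."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```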

A balanced alternative for LLM alignment

Research from Meta and NYU explores a way around these limitations through a semi-online training setup. This technique modulates how frequently the model’s generation and training components are synchronized, rather than updating at every training step (as in a fully online approach) or never updating at all (as in an offline setup). The semi-online method strikes a middle ground by adjusting the synchronization rate. The researchers designed this approach to reduce training time while maintaining high model adaptability. Their modular setup also lets them flexibly apply either DPO or GRPO with task-specific reward models.
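A rough sketch of the semi-online loop, under simplifying assumptions, looks like the following: a rollout copy of the policy generates responses and is re-synchronized only every `sync_interval` steps. The helper callables (prompt sampler, generator, scorer, and update step) are hypothetical placeholders, not the released training code.

```python
import copy
import torch.nn as nn

def semi_online_training(policy: nn.Module, sample_prompts, generate, score, update,
                         num_steps: int, sync_interval: int, batch_size: int = 32):
    """Semi-online RL fine-tuning loop (illustrative sketch).

    `sample_prompts`, `generate`, `score`, and `update` are caller-supplied
    callables: a prompt sampler, a rollout generator, a reward function
    (reward model or math verifier), and a DPO- or GRPO-style optimizer step.
    """
    rollout_model = copy.deepcopy(policy)  # frozen copy used only for generation
    for step in range(num_steps):
        prompts = sample_prompts(batch_size)
        responses = generate(rollout_model, prompts)  # may lag behind the policy
        rewards = score(prompts, responses)
        update(policy, prompts, responses, rewards)

        # sync_interval = 1 recovers fully online training; never syncing
        # approaches the purely offline case. Semi-online sits in between.
        if (step + 1) % sync_interval == 0:
            rollout_model.load_state_dict(policy.state_dict())
    return policy
```

Sweeping `sync_interval` between these extremes is what lets the approach trade generation cost against how fresh the rollouts are relative to the current policy.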

Instruction following and mathematical reasoning

The method involves fine-tuning the Llama-3.1-8B-Instruct model on two types of tasks: open-ended instruction following and mathematical problem solving. For non-verifiable tasks, user prompts are sampled from the WildChat-1M dataset and responses are evaluated with the Athene-RM-8B reward model, which assigns a scalar score to each response. For verifiable tasks, the team leverages the NuminaMath dataset together with a math verification toolkit that checks whether generated answers match the expected output. Training ran on 32 NVIDIA H200 GPUs, with inference on 8 separate GPUs, and different setups were compared across offline, semi-online, and online synchronization intervals.
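To illustrate how the two reward types might be routed in such a setup, the sketch below assigns a binary correctness reward to verifiable math prompts and defers to a learned reward model for open-ended ones. The answer-extraction rule and the `reward_model.score` interface are simplifying assumptions, not the actual toolkit used in the paper.

```python
import re

def extract_final_answer(response: str) -> str:
    """Naive final-answer extraction: take the last number in the response.
    Real math verifiers are far more careful; this is only for illustration."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""

def compute_reward(prompt, response, task_type, reward_model=None, reference_answer=None):
    """Route each sample to the appropriate reward signal.

    Verifiable (math) prompts receive a binary correctness reward; non-verifiable
    (open-ended) prompts receive a scalar score from a learned reward model.
    `reward_model.score` is a hypothetical interface, not a real API.
    """
    if task_type == "verifiable":
        return 1.0 if extract_final_answer(response) == str(reference_answer) else 0.0
    # Open-ended instruction following: scalar preference score.
    return reward_model.score(prompt, response)
```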

Performance gains on verifiable and non-verifiable tasks

Clear performance differences were observed. On Math500, offline DPO reached 53.7% accuracy, while semi-online DPO with a synchronization interval of s = 100 reached 58.9%. Online DPO and GRPO achieved comparable results of 58.7% and 58.1%, respectively. A similar trend appeared on the NuminaMath benchmark, where offline DPO reached 36.4% and the semi-online variant raised this to 39.4% (s = 10). The gains were not limited to math tasks. When non-verifiable tasks were evaluated with AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently well. Combining verifiable and non-verifiable rewards in a single training setup produced stronger average scores, suggesting that the method generalizes effectively.

A flexible, scalable reinforcement learning approach for LLMs

This study shows that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research teams at Meta and NYU improved training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and synchronization frequency yields models that perform well across task types without incurring high computational costs.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, YouTube, and Spotify, join our 100K+ ML SubReddit, and subscribe to our newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who studies applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
