ByteDance Research releases DAPO: A fully open-source LLM reinforcement learning system


Reinforcement learning (RL) has become central to advancing large language models (LLMs), equipping them with the improved reasoning capabilities required for complex tasks. However, the research community struggles to reproduce state-of-the-art RL techniques because key industry players withhold essential training details. This opacity limits broader scientific progress and collaborative research.

Researchers from ByteDance, Tsinghua University, and the University of Hong Kong recently introduced DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), an open-source, large-scale reinforcement learning system designed to enhance the reasoning abilities of large language models. The system addresses the reproducibility gap by openly sharing all algorithmic details, training procedures, and datasets. Built on the verl framework, DAPO includes training code and a carefully curated dataset called DAPO-Math-17K, designed for mathematical reasoning tasks.
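
As a quick illustration of how such a released dataset could be pulled for experimentation, here is a minimal sketch using the Hugging Face datasets library. The repository id and record structure below are assumptions, not confirmed by the article; check the DAPO project page for the actual dataset location.

```python
from datasets import load_dataset

# Assumed repository id for the released math-reasoning dataset; verify on the
# DAPO project page before use.
dataset = load_dataset("BytedTsinghua-SIA/DAPO-Math-17k", split="train")

# Inspect one record; entries are expected to pair a math prompt with an
# answer usable for rule-based (verifiable) reward computation.
print(dataset[0])
```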

DAPO’s technical foundation comprises four core innovations, each addressing a key challenge in reinforcement learning for LLMs. First, “Clip-Higher” tackles entropy collapse, where models prematurely settle into limited exploration patterns. By decoupling and raising the upper clipping range in policy updates, the technique encourages greater diversity in model outputs. “Dynamic Sampling” counters training inefficiency by filtering out prompts whose sampled responses are all correct or all incorrect, ensuring a more consistent and useful gradient signal. “Token-Level Policy Gradient Loss” refines the loss calculation to weight individual tokens rather than whole samples, better accommodating reasoning sequences of varying length. Finally, “Overlong Reward Shaping” applies a controlled penalty to excessively long responses, gently steering the model toward concise yet effective reasoning. A minimal code sketch of these ideas follows.
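
To make the four techniques concrete, here is a minimal PyTorch sketch of a clip-higher, token-level surrogate loss, a dynamic-sampling filter, and a soft overlong penalty. This is not the authors’ verl implementation; the function names, tensor shapes, clip values, and length thresholds are illustrative assumptions.

```python
import torch

def dapo_policy_loss(log_probs, old_log_probs, advantages, mask,
                     eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss with a decoupled ("clip higher") upper range,
    averaged over tokens rather than per sample.
    log_probs, old_log_probs, mask: [batch, seq_len]; advantages broadcastable."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    # Token-level aggregation: every response token contributes equally, so
    # long reasoning chains are not down-weighted relative to short ones.
    return (per_token * mask).sum() / mask.sum()

def keep_for_training(rewards_per_prompt):
    """Dynamic sampling: drop prompts whose sampled responses are all correct
    or all wrong, since they produce zero advantage and hence no gradient."""
    return [r for r in rewards_per_prompt if 0.0 < sum(r) / len(r) < 1.0]

def overlong_penalty(length, max_len=20480, buffer=4096):
    """Soft overlong reward shaping: linearly penalize responses that enter
    the last `buffer` tokens before the hard length limit."""
    if length <= max_len - buffer:
        return 0.0
    if length >= max_len:
        return -1.0
    return -(length - (max_len - buffer)) / buffer
```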

In practical experiments, DAPO showed significant improvements. On the American Invitational Mathematics Examination (AIME) 2024 benchmark, a model trained with DAPO on the Qwen2.5-32B base model scored 50 points, surpassing previous approaches such as DeepSeek-R1-Zero-Qwen-32B, which achieved 47 points. Notably, DAPO reaches this result with roughly half the training steps, underscoring the efficiency of the method. A systematic ablation shows incremental gains from each technique, moving from a 30-point baseline (GRPO alone) to 50 points with the full DAPO recipe.

Beyond the quantitative results, DAPO’s training dynamics offer insight into how the model’s reasoning patterns evolve. Early in training, the model shows little reflective behavior, typically working through a task linearly without revisiting earlier steps. With continued training, however, it gradually exhibits more reflective behavior, an iterative form of self-review and correction. This shift highlights how reinforcement learning not only strengthens existing reasoning pathways but also cultivates entirely new cognitive strategies over time.

In summary, the open-sourcing of DAPO is a meaningful contribution to the reinforcement learning community, removing barriers around previously inaccessible methods. By clearly documenting the system and providing full access to its techniques, dataset, and code, the collaboration invites further research and innovation. The joint effort of ByteDance, Tsinghua University, and the University of Hong Kong demonstrates how transparent, collaborative research can advance both the collective understanding and the practical capabilities of large-scale reinforcement learning systems.


Check out the Paper and project page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.
