OpenThoughts: A scalable supervised fine-tuning (SFT) data curation pipeline for reasoning models

The growing complexity of reasoning data curation

Recent reasoning models such as DeepSeek-R1 and o3 have performed outstandingly in mathematics, coding, and science, leveraging post-training techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL). However, the complete methodology behind these frontier reasoning models is not public, which makes research on building reasoning models difficult. Although SFT data curation has become a powerful method for developing strong reasoning capabilities, most existing efforts explore only a limited set of design choices, such as relying solely on human-written questions or a single teacher model. In addition, exploring the wide design space of techniques for generating question-answer pairs incurs high costs for teacher inference and model training.

Reasoning traces provided by models such as Gemini, QwQ, and DeepSeek-R1 have enabled knowledge-distillation techniques for training smaller reasoning models. Projects such as OpenR1, OpenMathReasoning, and OpenCodeReasoning collect problems from public forums and competition sites, while NaturalReasoning uses pre-training corpora as seed data. Some efforts, such as s1 and LIMO, focus on manually curating small, high-quality datasets of challenging prompts. Other methods, such as DeepMath-103K and NVIDIA Nemotron, introduce innovations in the data sourcing, filtering, and scaling stages. RL methods, including AceReason and Skywork-OR1, go beyond traditional SFT and further enhance reasoning capabilities.

OpenThoughts: A scalable framework for SFT dataset development

Researchers from Stanford University, the University of Washington, BespokeLabs.ai, Toyota Research Institute, UC Berkeley, and 12 other organizations have proposed a new state-of-the-art open reasoning data recipe. OpenThoughts takes an incremental approach across three iterations: OpenThoughts-114K scales the Sky-T1 pipeline with automated verification; OpenThoughts2-1M increases data scale by improving question diversity and synthetic generation strategies; and OpenThoughts3-1.2M distills the findings of more than 1,000 ablation experiments into a simple, scalable data curation pipeline. The resulting OpenThinker3-7B model achieves state-of-the-art performance among open-data models at the 7B scale.

OpenThoughts3-1.2M is constructed by independently ablating each pipeline component while holding the other stages constant: each strategy generates 31,600 data points, and Qwen2.5-7B-Instruct is fine-tuned on each resulting dataset. The goal throughout training is to find the best question-answer dataset for reasoning SFT. Evaluation covers eight reasoning benchmarks spanning mathematics (AIME24, AMC23, MATH500), coding (CodeElo, CodeForces, LiveCodeBench), and science (GPQA Diamond, JEEBench). The experimental design includes a rigorous decontamination process that removes highly similar samples and holds the benchmark setup fixed to test generalization. Evalchemy serves as the primary evaluation tool, ensuring a consistent evaluation protocol.
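The decontamination step described above can be sketched in a few lines. The snippet below is an illustrative assumption, not the authors' code: it drops training questions that are near-duplicates of benchmark problems using character n-gram Jaccard similarity, with an arbitrary threshold chosen for demonstration.

```python
# Hypothetical decontamination sketch: remove training questions that
# closely resemble benchmark problems. The similarity measure (character
# n-gram Jaccard) and threshold are illustrative assumptions, not the
# exact procedure used by OpenThoughts.

def char_ngrams(text: str, n: int = 8) -> set:
    """Lowercased, whitespace-normalized character n-grams of a string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def decontaminate(train_questions, benchmark_questions, threshold=0.5):
    """Keep only training questions whose similarity to every
    benchmark question stays below the threshold."""
    bench_grams = [char_ngrams(q) for q in benchmark_questions]
    kept = []
    for q in train_questions:
        grams = char_ngrams(q)
        if all(jaccard(grams, bg) < threshold for bg in bench_grams):
            kept.append(q)
    return kept
```

In practice, large-scale pipelines typically replace the exact pairwise comparison with approximate methods such as MinHash LSH to keep the cost linear in dataset size.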

Evaluation insights and benchmark performance

OpenThoughts' pipeline evaluation reveals key insights across question sourcing, question mixing, question filtering, answer filtering, and teacher models. Question-sourcing experiments show that CodeGolf and competitive-coding problems achieve the highest performance on code tasks (25.3 and 27.5 average scores), LLM-generated and human-written problems excel in mathematics (58.8 and 58.5 average scores), and physics problems and chemistry textbook extracts perform best in science (43.2 and 45.2 average scores). Question-mixing experiments show that combining many problem sources reduces performance: restricting to a few top sources yields roughly a 5% accuracy improvement over broader mixing strategies. Among teacher models, QwQ-32B outperforms DeepSeek-R1 as a distillation teacher, improving accuracy by 1.9-2.6%.
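The per-domain comparison above can be reproduced mechanically once ablation results are collected. The sketch below is an illustrative helper (not from the paper, and the scores in the test are placeholders): it averages each strategy's benchmark scores within a domain and ranks question sources, mirroring how per-domain winners like CodeGolf for code are identified.

```python
# Illustrative sketch: rank question-sourcing strategies per domain by
# averaging their benchmark scores. The benchmark-to-domain mapping
# follows the eight benchmarks named in the article; the ranking logic
# itself is an assumption for demonstration.
from collections import defaultdict

DOMAIN = {
    "AIME24": "math", "AMC23": "math", "MATH500": "math",
    "CodeElo": "code", "CodeForces": "code", "LiveCodeBench": "code",
    "GPQA Diamond": "science", "JEEBench": "science",
}

def rank_sources(results):
    """results: {source: {benchmark: score}} ->
    {domain: [(source, avg_score), ...]} sorted best-first."""
    by_domain = defaultdict(dict)
    for source, scores in results.items():
        per_domain = defaultdict(list)
        for bench, score in scores.items():
            per_domain[DOMAIN[bench]].append(score)
        for domain, vals in per_domain.items():
            by_domain[domain][source] = sum(vals) / len(vals)
    return {d: sorted(m.items(), key=lambda kv: -kv[1])
            for d, m in by_domain.items()}
```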

In summary, the researchers introduced the OpenThoughts project, showing that systematic experimentation can significantly improve SFT data curation for reasoning models. They developed OpenThoughts3-1.2M, the most advanced open reasoning dataset across the fields of science, mathematics, and coding. The resulting OpenThinker3-7B model achieves excellent performance among open-data reasoning models at its scale. However, some directions remain unexplored, including RL approaches, staged fine-tuning, and curriculum-learning strategies. Future research directions include investigating cross-domain transfer effects, balancing single-domain optimization against overall performance, and understanding the scaling dynamics of student models as they approach their teachers' abilities.


Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.
