
ByteDance researchers introduce Seed-Coder: a model-centric code LLM trained on 6 trillion tokens

Rethinking code LLM training via a scalable, automated data pipeline

Code data plays a key role in training LLMs, benefiting not only coding tasks but also broader reasoning capabilities. Although many open-source models rely on manual filtering and expert-crafted rules to curate code datasets, these approaches are time-consuming, prone to bias, and difficult to scale across languages. Proprietary models such as Claude 3.7 and OpenAI o3 perform well on coding tasks but do not share detailed information about their data. Even open-source models such as DeepSeek and Qwen2.5 still depend heavily on human-designed filters. This dependence limits progress and runs counter to the "bitter lesson" that true breakthroughs come from scalable, data-driven approaches rather than handcrafted heuristics.

Seed-Coder's model-first pipeline minimizes human dependency in data curation

Researchers at ByteDance introduced Seed-Coder, a family of 8B open-source LLMs comprising base, instruct, and reasoning models designed to reduce human involvement in code data curation. Their model-centric pipeline relies not on manual rules but on LLMs to score and filter large-scale code data from sources such as GitHub and code-related websites, producing a 6-trillion-token dataset. The instruct model is fine-tuned using synthetic data and preference optimization, while the reasoning model improves multi-step code logic through long-chain-of-thought (LongCoT) reinforcement learning. Seed-Coder achieves top performance for its size, often surpassing larger models, and is shared publicly to encourage further research and development.
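To make the idea of LLM-based filtering concrete, here is a minimal two-stage sketch. The function names and the heuristic stand-in for the LLM scorer are assumptions for illustration, not Seed-Coder's actual implementation; in a real pipeline the scoring step would prompt a trained model.

```python
def basic_filter(file_text: str) -> bool:
    """Cheap sanity check: drop empty or binary-looking files."""
    if not file_text.strip():
        return False
    # Reject files dominated by non-printable characters.
    printable = sum(ch.isprintable() or ch.isspace() for ch in file_text)
    return printable / len(file_text) > 0.95

def llm_quality_score(file_text: str) -> float:
    """Stand-in for an LLM scorer rating code quality in [0, 1].
    A real pipeline would query a trained model here; this heuristic
    exists only so the sketch runs end to end."""
    has_comments = '"""' in file_text or "#" in file_text
    return 0.9 if has_comments else 0.3

def curate(files: list[str], threshold: float = 0.5) -> list[str]:
    """Two-stage filter: cheap rules first, model scoring on survivors."""
    return [f for f in files
            if basic_filter(f) and llm_quality_score(f) >= threshold]
```

The key design point is that the expensive model-based scorer only sees files that survive the cheap rule-free checks, which is what makes this approach scale to trillions of tokens.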

A 6-trillion-token corpus built with LLM quality filters over GitHub and web data

Seed-Coder is trained using a model-first approach that minimizes manual intervention. The pretraining corpus comprises approximately 6 trillion tokens drawn from a variety of sources, including GitHub code, commit histories, and code-related web data. Initially, basic filtering removes files with syntax problems or inappropriate content. Large language models are then used to evaluate and score the remaining code, ensuring high-quality data without relying on handcrafted rules. Pretraining proceeds in two stages: the first uses core code and web data; the second adds more complex structures such as full repositories and novel tasks such as fill-in-the-middle (FIM) to strengthen the model's coding capabilities.
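The fill-in-the-middle objective mentioned above can be sketched with a small data transform: a document is split into prefix, middle, and suffix, then reordered so the model learns to predict the middle from both surrounding contexts. The sentinel tokens `<PRE>`, `<SUF>`, `<MID>` below are illustrative; the actual special tokens vary by model.

```python
import random

def to_fim(code: str, rng: random.Random) -> str:
    """Split a document at two random points and reorder it for
    fill-in-the-middle training (PSM ordering: prefix, suffix, middle)."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # The model is conditioned on prefix and suffix, and the middle
    # becomes the prediction target.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"
```

Concatenating the three recovered spans in prefix + middle + suffix order always reconstructs the original document, which is what makes FIM a free training signal on existing code.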

Instruction tuning and LongCoT reinforcement learning enable multi-step code reasoning

After pretraining, Seed-Coder is further refined through two post-training stages. First, the instruct model is trained with supervised fine-tuning on a diverse instruction dataset generated and filtered by LLMs, helping it better understand and follow human prompts. Its performance is then improved using Direct Preference Optimization (DPO), which aligns model responses more closely with human preferences. For complex reasoning tasks, the reasoning model is improved with LongCoT reinforcement learning, strengthening its ability to handle multi-step coding challenges. These steps substantially improve Seed-Coder's performance across code generation and reasoning tasks.
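The DPO step can be summarized by its per-example loss. The sketch below is the standard DPO objective, not a claim about Seed-Coder's internals: the four log-probabilities would in practice be sums of per-token log-probs from the trainable policy and a frozen reference model over the chosen and rejected responses.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where the margin
    measures how much more the policy (relative to the frozen reference)
    prefers the chosen response over the rejected one."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss shrinks as the policy shifts probability mass toward the preferred response, which is the intended alignment pressure.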

Seed-Coder stands out in code generation, editing, and multi-step reasoning benchmarks

Evaluations show that the three Seed-Coder models (base, instruct, and reasoning) perform strongly across a range of coding tasks. The base model outperforms other open-source models of similar size on code generation, posting strong scores on benchmarks such as HumanEval and MultiPL-E. The instruct model excels at tasks requiring code editing and instruction following, performing well in evaluations such as CodeEditorBench and FullStack Bench. The reasoning model, trained with long-chain-of-thought techniques, shows excellent multi-step problem-solving skills, especially on challenging benchmarks such as LiveCodeBench and Codeforces, even surpassing models several times its size.

In short, Seed-Coder is a family of efficient, high-performing open-source language models designed specifically for coding tasks. By relying primarily on LLMs rather than humans to filter and curate training data, the models greatly reduce the manual effort of dataset construction. Although trained on fewer tokens than some larger counterparts, Seed-Coder performs excellently on tasks such as code generation, completion, editing, and reasoning. However, its general language understanding remains limited by the lack of broad web data and mathematical content. Future releases aim to expand the model family and improve its capabilities across different model sizes.


Check out the Paper, Model Series, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
