Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset that Enhances LLM Accuracy Across Benchmarks

The quality of data used to pretrain LLMs has become increasingly important. To build informative corpora, researchers have moved from heuristic filtering, such as rule-based noise removal and deduplication, to model-driven filtering that uses neural classifiers to identify high-quality samples. Despite its benefits, this approach faces a critical problem: it lacks an efficient verification mechanism to assess data quality in a timely manner, and it often relies on manually curated seed datasets that introduce subjectivity. Although early datasets such as C4 and The Pile laid the foundation for model development, more recent efforts such as RefinedWeb, Dolma, and DCLM have scaled up significantly, encompassing up to trillions of tokens. Model-driven filtering has gained traction in these newer corpora because of its ability to refine massive datasets and improve LLM performance on downstream tasks.
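
As a rough illustration of what the heuristic stage looks like in practice, the sketch below combines simple rule-based checks with exact hash-based deduplication; the specific rules and thresholds are illustrative assumptions, not those of any particular released corpus.

```python
# Illustrative heuristic filtering: simple quality rules plus exact deduplication.
# Thresholds are assumptions for demonstration, not the rules of any released corpus.
import hashlib
import re


def passes_rules(text: str) -> bool:
    """Keep a document only if it clears basic quality heuristics."""
    if len(text) < 200:  # too short to carry useful signal
        return False
    words = text.split()
    if not words:
        return False
    # Reject documents dominated by very long tokens (often markup or encoding noise).
    if sum(len(w) > 30 for w in words) / len(words) > 0.05:
        return False
    # Require a reasonable share of alphabetic characters.
    return sum(c.isalpha() for c in text) / len(text) > 0.6


def deduplicate(docs):
    """Exact dedup via content hashing; production pipelines also use fuzzy methods like MinHash."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(re.sub(r"\s+", " ", doc).strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Example web document ...", "Example web document ...", "Another page ..."]
cleaned = [doc for doc in deduplicate(corpus) if passes_rules(doc)]
```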

However, the effectiveness of model-driven filtering is limited by the high cost and inefficiency of current verification methods and the absence of clear criteria for seed data selection. Recent datasets, such as FineWeb-Edu and Chinese FineWeb-Edu, demonstrate improved model performance by using multiple classifiers to cross-validate data quality. These datasets outperform their predecessors on benchmarks such as MMLU, ARC, and C-Eval, suggesting that refined filtering methods can enhance both English and Chinese understanding. To further optimize this process, some studies propose multi-dimensional data evaluation by prompting LLMs or by leveraging token-level perplexity scores. These innovations aim to reduce computational overhead while improving data quality, ultimately enabling models to be trained more efficiently on fewer tokens.
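
The token-level perplexity idea can be sketched with a small causal language model scoring each document and keeping the more fluent, low-perplexity ones. The model choice (GPT-2), the truncation length, and the cutoff below are assumptions for illustration rather than the setup used in these studies.

```python
# Illustrative perplexity-based scoring: a small causal LM rates each document, and
# low-perplexity (more fluent) documents are kept. Model choice (gpt2), max length,
# and the threshold are assumptions for demonstration, not the cited papers' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Mean token-level perplexity of `text` under the scoring model (lower = more natural)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**enc, labels=enc["input_ids"])  # loss = mean negative log-likelihood per token
    return torch.exp(out.loss).item()


documents = [
    "Gradient descent updates parameters in the direction of the negative gradient.",
    "win free $$$ now click click click",
]
kept = [doc for doc in documents if perplexity(doc) < 100.0]  # placeholder cutoff
```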

Researchers at ModelBest Inc., Tsinghua University, and Soochow University have developed an efficient data filtering pipeline to improve LLM training. They introduced a verification strategy that uses a nearly-trained LLM to evaluate new data by observing performance gains during the final training steps, sharply reducing computational costs. A lightweight fastText-based classifier further improves filtering speed and accuracy. Applied to the FineWeb and Chinese FineWeb datasets, the method produced the Ultra-FineWeb dataset, containing roughly 1 trillion English and 120 billion Chinese tokens. LLMs trained on Ultra-FineWeb showed significant performance gains, confirming the pipeline's effectiveness in improving data quality and training efficiency.
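
A minimal sketch of the fastText-based filtering step is shown below, assuming a seed_train.txt file in fastText's supervised label format (its preparation is sketched after the next paragraph). The file name, label scheme, hyperparameters, and keep-threshold are illustrative assumptions, not the released pipeline.

```python
# Illustrative fastText-based filtering. The file name, label scheme, hyperparameters,
# and keep-threshold are assumptions for demonstration, not the released Ultra-FineWeb pipeline.
import fasttext

# seed_train.txt holds one example per line in fastText's supervised format, e.g.
#   __label__pos <high-quality seed text>
#   __label__neg <negative sample text>
classifier = fasttext.train_supervised(
    input="seed_train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=100,
)


def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep a document when the classifier's high-quality probability clears the threshold."""
    labels, probs = classifier.predict(document.replace("\n", " "))
    score = probs[0] if labels[0] == "__label__pos" else 1.0 - probs[0]
    return score >= threshold


candidate_documents = [
    "A clear tutorial on linear algebra with worked examples.",
    "buy followers cheap !!! limited offer !!!",
]
filtered = [doc for doc in candidate_documents if keep(doc)]
```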

The study outlines an efficient, high-quality data filtering pipeline that reduces computational cost while preserving data integrity. It first uses a low-cost verification strategy to select reliable seed samples from a candidate pool, which are then used to train a data classifier. Positive seeds are drawn from LLM annotations, curated datasets, textbooks, and synthetic content, while negative samples come from diverse corpora. Classifier training avoids excessive iteration and instead focuses on high-quality seed selection. A fastText-based classifier is used for scalable filtering: compared with LLM-based approaches, it significantly reduces inference costs while delivering competitive performance, and preprocessing steps ensure balanced, clean data input.
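
The seed preparation and balancing step might look like the sketch below, which normalizes text, downsamples the larger class, and writes the fastText-format training file assumed by the earlier classifier sketch; the example sources and cleanup rules are illustrative assumptions.

```python
# Illustrative seed preparation for the classifier sketch above: normalize text, balance
# positives and negatives, and write fastText's '__label__... text' training file.
# The example sources and cleanup rules are assumptions for demonstration.
import random
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting noise does not dominate the features."""
    return re.sub(r"\s+", " ", text).strip().lower()


def write_fasttext_file(positives, negatives, path="seed_train.txt", seed=0):
    """Downsample the larger class to balance labels, then write one example per line."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    with open(path, "w", encoding="utf-8") as f:
        for text in rng.sample(positives, n):
            f.write(f"__label__pos {normalize(text)}\n")
        for text in rng.sample(negatives, n):
            f.write(f"__label__neg {normalize(text)}\n")


# Positive seeds could come from curated sets, textbooks, synthetic or LLM-annotated text;
# negatives from generic web corpora.
positives = ["An explanation of gradient descent with worked examples ..."]
negatives = ["click here to win a free prize !!!"]
write_fasttext_file(positives, negatives)
```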

The models were trained with Megatron-LM using the MiniCPM-1.2B architecture on 100B tokens and evaluated with Lighteval across English and Chinese benchmarks. The results show that models trained on Ultra-FineWeb consistently outperform those trained on FineWeb and FineWeb-Edu, in both individual and mixed-language settings. Ultra-FineWeb-en achieved the highest average English score, while Ultra-FineWeb-zh improved performance on Chinese tasks. Ablation studies show that Ultra-FineWeb maintains balanced token distributions and benefits from the efficient filtering strategy, underscoring its superior quality and effectiveness in improving model performance.

In summary, the study introduces Ultra-FineWeb, a high-quality multilingual dataset comprising about 1 trillion English tokens and 120 billion Chinese tokens. Built on FineWeb and Chinese FineWeb, it leverages a novel, efficient data filtering pipeline featuring a lightweight fastText-based classifier and a low-cost verification strategy. The pipeline improves filtering accuracy, reduces reliance on manual seed data selection, and ensures robust performance with minimal computational overhead. Experimental results show that models trained on Ultra-FineWeb consistently outperform those trained on earlier datasets, demonstrating improved benchmark performance. The approach ensures reproducibility and offers valuable insights for optimizing data quality in future LLM training.


Check out the Paper and Dataset. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 90K+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
