Meta AI Releases “NaturalReasoning”: A Multi-Domain Dataset with 2.8 Million Questions to Enhance the Reasoning Capabilities of LLMs

Large Language Models (LLMs) have made significant progress in their ability to reason through complex tasks. Although models such as OpenAI’s o1 and DeepSeek’s R1 have greatly improved results on challenging reasoning benchmarks such as competition math, competitive coding, and GPQA, key limitations remain in evaluating their true reasoning potential. Current reasoning datasets focus on problem-solving tasks but fail to cover domains that require open-ended reasoning. Furthermore, these datasets offer limited diversity in scale and difficulty, which makes it challenging to evaluate and enhance the reasoning capabilities of LLMs across different domains and complexity levels.
Previous attempts to enhance LLMs’ reasoning abilities have focused on two approaches: synthetic data generation and unsupervised self-training. In synthetic data generation, methods such as STaR and MetaMath augment existing datasets with new chain-of-thought rationales and question variants, but they depend heavily on pre-existing high-quality datasets. Although methods like OpenMathInstruct-2, NuminaMath, and Xwin-Math generate new data from seed examples, they struggle to scale to novel domains. In unsupervised self-training, most methods rely on human-annotated final answers or external reward models, which makes them resource-intensive and expensive, especially for complex multi-step problems that require humans to evaluate LLM outputs.
Researchers from Meta and NYU propose NaturalReasoning, a comprehensive dataset of 2.8 million reasoning problems extracted from pretraining corpora. The dataset spans a variety of fields, including mathematics, physics, computer science, and economics and business. Unlike synthetic datasets such as MetaMathQA and OpenMathInstruct-2, NaturalReasoning reflects authentic real-world reasoning problems because it is backtranslated from pretraining corpora. It uniquely combines verifiable and open-ended problems, including theorem proving, making it valuable for developing algorithms that enhance the reasoning capabilities of LLMs beyond simple verification tasks and for distilling knowledge from stronger models to weaker ones.
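For readers who want to explore the data, the short sketch below shows how such a dataset might be loaded with the Hugging Face `datasets` library. The hub identifier and the field layout are assumptions for illustration and are not confirmed by this article.

```python
# Minimal sketch: loading and inspecting the dataset with the Hugging Face
# `datasets` library. The hub id "facebook/natural_reasoning" is an assumption,
# not a detail stated in this article.
from datasets import load_dataset

ds = load_dataset("facebook/natural_reasoning", split="train")

print(ds)      # number of rows and the column names available
print(ds[0])   # one example record (question plus any reference answer/response fields)
```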
The efficacy of NaturalReasoning is demonstrated in two ways of enhancing reasoning. First, when used for knowledge distillation via supervised finetuning, it yields a steeper scaling trend than existing datasets. Second, it serves as a source for extracting domain-specific seed data. To target science reasoning benchmarks such as GPQA, the method samples 250 benchmark problems and uses cosine similarity between problem embeddings to retrieve 1K similar, decontaminated problems from NaturalReasoning for each seed; these questions are then deduplicated and aggregated into a 15K subset. The evaluation protocol uses zero-shot testing across benchmarks including MATH, GPQA, GPQA-Diamond, and MMLU-Pro, with greedy decoding for consistent performance measurements.
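The retrieval step can be illustrated with a brief sketch: embed the seed benchmark problems and the candidate questions, then rank candidates by cosine similarity. The embedding model, variable names, and toy data below are assumptions for illustration; the article does not specify the authors’ exact implementation.

```python
# Illustrative sketch of embedding-based retrieval with cosine similarity.
# The embedding model and the toy strings are placeholders, not the authors' setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedding model

seed_problems = [
    "Estimate the binding energy per nucleon of iron-56.",      # toy stand-ins for
    "Why does entropy increase in an isolated system?",          # sampled benchmark problems
]
corpus_questions = [
    "Derive the entropy change for free expansion of an ideal gas.",
    "Explain why the nuclear binding energy curve peaks near iron.",
    "What is the time complexity of merge sort?",
]

# With normalized embeddings, cosine similarity reduces to a dot product.
seed_emb = model.encode(seed_problems, normalize_embeddings=True)
corpus_emb = model.encode(corpus_questions, normalize_embeddings=True)
sims = seed_emb @ corpus_emb.T            # shape: (num_seeds, num_corpus)

top_k = min(1000, len(corpus_questions))  # the article describes retrieving ~1K per seed
retrieved = set()
for row in sims:
    idx = np.argsort(-row)[:top_k]        # indices of the most similar corpus questions
    retrieved.update(idx.tolist())

# Retrieved questions would then be deduplicated and aggregated into a
# smaller training subset, per the procedure described above.
print(f"Retrieved {len(retrieved)} unique candidate questions")
```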
The evaluation results show that with only 1.5 million training examples, a model trained on NaturalReasoning outperforms Llama3.1-8B-Instruct, whereas other datasets such as OpenMathInstruct-2 and WebInstruct cannot achieve comparable performance even with 2.8 million data points. While math-focused datasets such as OpenMathInstruct-2 show strong performance on math benchmarks (improving MATH from 50.83 to 59.25), they struggle to generalize, with GPQA accuracy plateauing at around 26-27% and inconsistent performance on MMLU-Pro. In addition, datasets such as WebInstruct show diminishing returns, with GPQA performance peaking at 29.02% with 500K samples but dropping to 26.12% at 2.8 million samples.
In summary, the researchers introduced NaturalReasoning, a dataset that represents a significant advancement in building comprehensive reasoning datasets for LLMs. It collects 2.8 million questions spanning multiple fields, including mathematics, physics, computer science, economics, and social sciences. The results show that using NaturalReasoning for knowledge distillation leads to consistent improvements in benchmark performance as data size increases. Its effectiveness also extends to unsupervised self-training of LLMs with external reward models and self-rewarding techniques, marking a step forward in enhancing LLMs’ reasoning capabilities across diverse fields.
Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.