ZeroSearch from Alibaba uses reinforcement learning and simulated documents to teach LLMs to search without real-time search engines

From coding assistants to academic tutoring and automation agents, large language models now sit at the heart of a wide range of applications. But a fundamental limitation persists in how these models are built: they are trained on static datasets that grow stale over time. This creates a core challenge, because a language model cannot update its knowledge or verify its responses against fresh, real-world data. As a result, even models that perform strongly on reasoning tasks and structured queries can still produce fabricated or outdated information, undermining their reliability in practice. To stay credible, especially in applications that depend on current knowledge such as news, research, or product reviews, models must be able to interact with external data sources in a timely manner.
The core problem is teaching these models to retrieve and integrate external information effectively. Pretraining establishes a strong baseline understanding, but it does not confer the ability to perform meaningful, dynamic searches. Giving a language model this capability introduces practical constraints. Real search engines return documents of varying quality, which injects inconsistency into training. Moreover, training with reinforcement learning against live search requires large-scale interaction with commercial APIs, often hundreds of thousands of calls, which becomes prohibitively expensive. This creates a bottleneck for both academic research and commercial deployment, where cost and training scalability are critical.
Various methods have been developed to enhance the search and retrieval capabilities of language models. Early techniques relied on prompt engineering to guide the model through steps such as generating sub-queries or managing multi-step searches. However, these methods depend heavily on manual tuning and often require substantial compute to keep outputs consistent. Other approaches use supervised fine-tuning to give smaller models more targeted retrieval ability, with systems like Self-RAG and RetroLLM appearing in this space. Techniques such as Monte Carlo Tree Search have also been applied to expand the space of possible answer paths at inference time. Reinforcement-learning-based solutions such as Search-R1 and DeepResearcher let models interact directly with real search engines, bringing the training experience closer to how users actually behave. Even so, these innovations still suffer from complexity, heavy compute demands, or the financial cost of real-time API interaction.
Researchers at Alibaba Group's Tongyi Lab propose an innovative solution called ZeroSearch. This reinforcement learning framework removes the need for real-time API-based search entirely. Instead, it uses another language model to simulate the behavior of a search engine. The simulation model is fine-tuned through supervised training to generate documents that either help or mislead the policy model, depending on whether the content is designed to be relevant or noisy. This gives full control over document quality and cost while still providing a realistic search experience during training. A key innovation is curriculum-based rollout: retrieval tasks are made progressively harder by adjusting how much noise appears in the generated documents. This progression helps the policy model build resilience and stronger reasoning skills over time, without issuing a single real search query.
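To make the idea concrete, here is a minimal sketch of how a simulation LLM could be steered to produce either useful or misleading documents with a small prompt change. The prompt wording, function names, and the `generate` callable are illustrative assumptions, not ZeroSearch's actual templates.

```python
import random

# Illustrative prompt templates; the wording is an assumption for this sketch,
# not ZeroSearch's actual fine-tuning or inference prompts.
USEFUL_TEMPLATE = (
    "You are a search engine. For the query below, write a short document "
    "containing accurate, relevant information.\nQuery: {query}\nDocument:"
)
NOISY_TEMPLATE = (
    "You are a search engine. For the query below, write a short document "
    "that looks plausible but is irrelevant or misleading.\nQuery: {query}\nDocument:"
)

def simulate_search(query: str, noise_prob: float, generate) -> str:
    """Return one simulated document; `generate` wraps the simulation LLM.

    With probability `noise_prob` the document is deliberately noisy,
    which is how difficulty is dialed up during curriculum training.
    """
    template = NOISY_TEMPLATE if random.random() < noise_prob else USEFUL_TEMPLATE
    return generate(template.format(query=query))
```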
ZeroSearch structures the reasoning process into distinct stages. The model first thinks internally inside dedicated tags, then issues a search query if it decides more information is needed, and finally outputs an answer only once enough context has been gathered. This structured approach makes decision-making explicit and has been shown to improve both transparency and answer quality. Minimal changes to the prompt steer the simulation search engine's document generation, controlling whether a document looks useful or misleading. The simulation LLM is fine-tuned on interaction data in which each search trajectory is labeled by the correctness of the final answer. The policy model is then taught to handle both straightforward and difficult retrieval settings by systematically varying document quality. A scaling function determines how much noise is introduced at each training stage, gradually improving the model's ability to navigate uncertainty.
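The curriculum schedule can be made concrete. The sketch below implements an exponential ramp of the kind the paper describes, raising the probability of serving a noisy document from a starting to a final value over training; the parameter names and default values here are illustrative assumptions, not the paper's reported settings.

```python
def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5,
                      base: float = 4.0) -> float:
    """Curriculum schedule: probability of serving a noisy document at `step`.

    A sketch of an exponential ramp from p_start to p_end; defaults are
    assumptions for illustration.
    """
    frac = step / total_steps
    return p_start + (base ** frac - 1.0) / (base - 1.0) * (p_end - p_start)

# Early training sees mostly clean documents; later training mostly noisy ones.
for s in (0, 50, 100):
    print(s, round(noise_probability(s, 100), 3))  # 0.0 -> ~0.167 -> 0.5
```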
A model with as few as 3 billion parameters can simulate the search process effectively, and larger models are even more striking: a 7B retrieval module reached response quality on par with Google Search, while the 14B model surpassed the Google Search baseline. ZeroSearch is also flexible, working well with base and instruction-tuned LLMs of different sizes. It integrates with a range of reinforcement learning algorithms, including PPO, GRPO, and Reinforce++, and it uses a reward based on the F1 score rather than exact match, preventing the model from producing overly long answers just to increase keyword overlap. Furthermore, ZeroSearch applies a masking mechanism during backpropagation so that gradients are computed only on the policy model's own output, keeping training stable without sacrificing performance.
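The F1-based reward is worth spelling out: because padding an answer with extra words raises recall but lowers precision, verbosity is penalized automatically. Below is a minimal sketch assuming lowercased whitespace tokenization; ZeroSearch's exact normalization may differ.

```python
from collections import Counter

def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer.

    A minimal sketch assuming lowercased whitespace tokenization.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A padded answer keeps full recall but loses precision, so its reward drops:
print(f1_reward("Paris", "Paris"))                             # 1.0
print(f1_reward("Paris is a large city in Europe", "Paris"))   # 0.25
```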
This study demonstrates a clear and effective alternative to depending on real-time search engines. Simulation-driven document generation removes the need for costly APIs and gives precise control over the quality of training inputs. The method also strengthens the model's reasoning by progressively introducing noise and uncertainty, effectively mimicking how real-world retrieval can fail or mislead, and training the policy model to extract the most useful information regardless. These features make ZeroSearch a scalable and practical solution for commercial-grade applications.
This approach identifies and solves the twin challenges that have limited real-time search integration in language model training: variable document quality and economic cost. By combining document simulation, structured interaction, and reinforcement learning, it delivers both effectiveness and scalability. Relying solely on simulated data generation, the researchers achieved results that match or exceed existing methods while eliminating API costs entirely.
Several key points of the research include the following:
- A 3B model can simulate realistic retrieval documents efficiently at zero API cost.
- A 7B retrieval module matches Google Search performance on benchmarks.
- The 14B model exceeds real search engine performance.
- Curriculum-based rollout gradually increases document noise to strengthen reinforcement learning.
- The simulation LLM generates both relevant and noisy documents via lightweight supervised fine-tuning.
- Structured interaction stages (<think>, <search>, <answer>) improve the model's clarity and accuracy.
- F1-based rewards deter reward hacking by penalizing irrelevantly long answers.
- Compatible with major RL algorithms, including PPO, GRPO, and Reinforce++.
- Training is stabilized by a gradient masking mechanism that keeps simulated-document tokens out of backpropagation (see the sketch after this list).
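To illustrate the masking point above: tokens that came from simulated documents are zeroed out of the per-token loss, so only the policy model's own tokens receive gradients. This is a hypothetical fragment for illustration (the real objective is an RL loss such as PPO's; the helper below is invented, not ZeroSearch's code):

```python
import torch

def apply_loss_mask(per_token_loss: torch.Tensor,
                    policy_mask: torch.Tensor) -> torch.Tensor:
    """Average a per-token training loss over policy-generated tokens only.

    `policy_mask` is 1 where the policy model emitted the token and 0 where
    the token came from a simulated document, so simulated-document tokens
    contribute no gradient. Hypothetical helper, not ZeroSearch's code.
    """
    mask = policy_mask.float()
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)

# Example: positions 2-4 of this trajectory are simulated-document tokens.
loss = torch.tensor([0.9, 1.1, 2.5, 2.4, 2.6, 0.8])  # per-token loss terms
mask = torch.tensor([1, 1, 0, 0, 0, 1])               # 0 = simulated tokens
print(apply_loss_mask(loss, mask))  # averages only positions 0, 1, and 5
```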
Check out the Paper and the model on Hugging Face. Also, don't forget to follow us on Twitter.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.