ReSearch: A novel AI framework that trains LLMs to reason with search via reinforcement learning, without using any supervised data on reasoning steps

Large language models (LLMs) have shown remarkable progress across a wide range of tasks, particularly in reasoning. However, effectively integrating the reasoning process with external search operations remains challenging, especially for multi-hop questions that require intricate reasoning chains and multiple retrieval steps. Current approaches rely mainly on manually designed prompts or heuristics, which limits scalability and flexibility. Moreover, generating supervised data for multi-step reasoning with search is often prohibitively expensive and, in practice, infeasible.
Researchers from Baichuan Inc., Tongji University, the University of Edinburgh, and Zhejiang University introduced ReSearch, a novel AI framework designed to train LLMs to integrate reasoning with search through reinforcement learning, notably without relying on supervised data for reasoning steps. The core idea of ReSearch is to incorporate search operations directly into the reasoning chain. Using Group Relative Policy Optimization (GRPO), a reinforcement learning technique, ReSearch guides LLMs to autonomously decide when and how to perform search operations, whose results then influence the ongoing reasoning. This approach lets the model iteratively refine its reasoning and naturally encourages advanced capabilities such as reflection and self-correction.
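To make the optimization step concrete, below is a minimal, illustrative sketch of the group-relative advantage computation at the heart of GRPO-style training: several rollouts are sampled per question, each is scored with a scalar reward, and each rollout's advantage is its reward normalized against the group's mean and standard deviation, so no learned critic is needed. The function name and structure are illustrative assumptions, not the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Compute GRPO-style advantages for one group of rollouts.

    Each rollout sampled for the same question receives a scalar reward;
    its advantage is that reward normalized by the group's mean and
    standard deviation, removing the need for a value-function critic.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: four rollouts sampled for the same multi-hop question.
# Rollouts that reach a better answer (higher reward) get positive
# advantages and are reinforced; the others are discouraged.
print(group_relative_advantages([0.9, 0.2, 0.0, 0.7]))
```

In the full training loop, these advantages weight the policy-gradient update over the tokens the model generated, which is how the model learns which search-and-reason trajectories to favor.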
From a technical standpoint, ReSearch adopts a structured output format, embedding specific tags such as <think>, <search>, <result>, and <answer> within the reasoning chain. These tags enable clear interaction between the model and the external retrieval environment and organize the generated output systematically. During training, the text inside <result> tags (i.e., retrieved search results) is deliberately excluded from the loss computation to prevent model bias. The reward signal guiding the reinforcement learning process is based on straightforward criteria: answer accuracy measured by the F1 score and adherence to the predefined structured output format. This design encourages the autonomous emergence of sophisticated reasoning patterns, circumventing the need for manually annotated reasoning datasets.
Experimental evaluation confirmed the robustness of ReSearch. When evaluated on multi-hop question-answering benchmarks, including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, ReSearch consistently outperformed baseline methods. Specifically, ReSearch-Qwen-32B-Instruct achieved improvements ranging from 8.9% to 22.4% over established baselines. Notably, although the model was trained on only a single dataset, these gains held across benchmarks, underscoring its strong generalization ability. Further analysis showed that the model gradually increased its reliance on iterative search operations over the course of training, indicating improved reasoning ability. A detailed case study illustrated the model's capacity to identify suboptimal search queries, reflect on its reasoning steps, and take corrective actions autonomously.

In summary, ReSearch represents a significant methodological advance in training LLMs to seamlessly integrate reasoning with external search mechanisms via reinforcement learning. By eliminating the dependence on supervised reasoning data, the framework effectively addresses the key scalability and adaptability problems inherent in multi-hop reasoning. Its self-reflection and self-correction capabilities enhance its practical applicability in complex, realistic settings. Future research may extend this reinforcement learning-based framework to a broader range of applications and incorporate additional external knowledge resources.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.