
Reinforcement learning teaches LLMs when to search: Ant Group researchers introduce SEM to optimize tool use and reasoning efficiency

Recent advances in LLMs show their potential for performing complex reasoning tasks and for effectively using external tools such as search engines. Nevertheless, teaching models to make informed decisions about when to rely on internal knowledge and when to search remains a key challenge. While simple prompt-based approaches can instruct a model to call a tool, LLMs still struggle with more nuanced behaviors, such as recognizing that an initial search was incorrect and deciding to search again. RL has been explored to improve these behaviors by rewarding effective search usage. However, RL often leads to unnecessary tool use: even on simple tasks, models perform redundant searches, highlighting inefficiencies that must be addressed.

Various RL strategies, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), have been used to align LLM behavior with human expectations. PPO helps balance exploration with policy stability, while DPO simplifies alignment by directly optimizing model responses based on user preferences. GRPO introduces group-based evaluations to better capture subtle improvements in reasoning. Meanwhile, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks is gaining traction. Frameworks such as AutoGPT and LangChain show how these agents refine their outputs through iterative reasoning and searching. However, current agent systems often depend on fixed prompts or heuristic-based tool use, limiting their adaptability and efficiency.
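For intuition, the group-relative idea at the heart of GRPO can be sketched in a few lines: instead of a learned value baseline, each sampled response is scored against the statistics of its own group of rollouts. This is a minimal illustration, not the paper's implementation:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: normalize each sampled response's reward
    against the mean and standard deviation of its own group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four responses sampled for the same prompt, scored 0/1 for correctness:
# correct answers receive positive advantage, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```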

Researchers at Ant Group introduced SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools and when to rely on internal knowledge. By training on a balanced dataset that combines questions requiring external retrieval with questions that do not, SEM guides the model to issue search requests only when necessary. Using a structured reasoning format and GRPO, the framework rewards accurate answers given without searching and penalizes unnecessary tool use. The results show that SEM improves response accuracy and efficiency, helping the model better judge when external information is needed and thereby strengthening reasoning in complex scenarios.
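A minimal sketch of this reward logic, assuming a binary correctness score and a fixed penalty for redundant tool calls (the paper's exact reward shaping and weights may differ):

```python
def sem_style_reward(answer_correct: bool, used_search: bool,
                     search_needed: bool, search_penalty: float = 0.5) -> float:
    """Hedged sketch of SEM's reward idea: reward correct answers and
    penalize searches the question did not require. `search_penalty`
    is a hypothetical knob, not a value taken from the paper."""
    reward = 1.0 if answer_correct else 0.0
    if used_search and not search_needed:
        reward -= search_penalty  # redundant tool call on a known question
    return reward

# A correct answer given directly from internal knowledge scores highest;
# the same correct answer reached via an unnecessary search scores lower.
assert sem_style_reward(True, used_search=False, search_needed=False) == 1.0
assert sem_style_reward(True, used_search=True, search_needed=False) == 0.5
```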

To integrate search tools into the model's reasoning process, SEM uses reinforcement learning to teach the model when and how to search effectively. The training data combines MuSiQue (questions that require external information) and MMLU (questions answerable from prior knowledge) to help the model learn when a search is warranted. Under the GRPO framework, the model is rewarded for accurate, efficient answers, penalized for unnecessary searches, and encouraged to search when internal knowledge is insufficient. A structured response format, with explicit tags such as <think>, <answer>, <search>, and <result>, standardizes training and allows precise reward allocation, improving the quality of both reasoning and search decisions.
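To make the reward allocation concrete, a rollout in this tagged format can be parsed so the trainer knows whether a search was issued and which final answer to grade. The tag names below follow the reconstruction above and should be treated as assumptions rather than the paper's exact markup:

```python
import re

def extract(tag: str, text: str):
    """Return the content of <tag>...</tag>, or None if the tag is absent."""
    m = re.search(fr"<{tag}>(.*?)</{tag}>", text, re.S)
    return m.group(1).strip() if m else None

def parse_rollout(text: str) -> dict:
    """Split a tagged rollout into reasoning, optional search query,
    and final answer so rewards can be assigned precisely."""
    return {
        "think": extract("think", text),
        "search": extract("search", text),  # None => answered from memory
        "answer": extract("answer", text),
    }

rollout = "<think>The capital of France is well known.</think><answer>Paris</answer>"
parsed = parse_rollout(rollout)
used_search = parsed["search"] is not None  # False: no tool call to penalize
```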

The study evaluates whether trained models can determine when to rely on internal knowledge and when to invoke external search. It combines MuSiQue (unfamiliar questions) and MMLU (familiar questions) for training, and evaluates performance on datasets such as HotpotQA, GSM8K, and MMLU. The proposed SEM method outperforms baselines such as Naive RAG in both answer accuracy and search efficiency. SEM reduces unnecessary searches on known questions while improving reasoning on unknown ones. Case studies and training curves confirm stable learning and intelligent decision-making under SEM. Overall, SEM strengthens both retrieval decisions and internal reasoning in large language models.

In short, SEM is a post-training reinforcement learning framework designed to improve how large language models use external search tools. The model is trained on a dataset combining MuSiQue and MMLU, helping it distinguish questions it can answer internally from those that require external retrieval. SEM uses a structured reasoning format and a reward function that penalizes unnecessary searches while promoting accurate and efficient retrieval. Experiments on benchmarks such as HotpotQA, GSM8K, and MMLU show that SEM reduces redundant searches and improves accuracy. The approach enhances reasoning efficiency and the intelligent use of external knowledge in LLMs.


Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
