ByteDance Introduces ToolTrain: A New Tool-Integrated Reinforcement Learning (RL) Framework that Redefines Repo Deep Search
Issue localization involves determining the exact code location that needs to be modified to resolve a software issue, a process that often requires significant manual effort from developers, especially in large repositories. Automating this task has become a key priority due to its complexity and time-consuming nature. LLM-based agents allow language models to use a variety of tools for dynamic repository exploration. However, these models face challenges when performing Repo Deep Search, a sequential navigation task that requires multi-step reasoning and effective tool use. Current LLMs struggle with these high demands, which often leads to incorrect tool calls or reasoning chains that fail to remain coherent during exploration.
Existing work spans fault localization and agent training. In fault localization, methods such as DeepFL and DeepRL4FL use deep neural networks and CNNs to identify faulty code by analyzing test coverage, data dependencies, and static code representations. More recent approaches, such as Agentless, use LLMs to narrow down code locations, but LLMs often lack the capability required for the complex reasoning and tool use that repository exploration demands. To address this, methods such as SWE-Gym and SEAlign fine-tune LLMs on high-quality trajectories. Another approach, LocAgent, builds the ground truth for issue localization from the functions modified by GitHub golden patches.
Researchers from Peking University, ByteDance, and the Beijing Institute of Technology have proposed ToolTrain, a tool-integrated training framework designed to enhance the multi-hop reasoning capabilities of LLMs during issue localization. ToolTrain introduces RepoSearcher, a lightweight agent equipped with simple retrieval tools that let LLMs look up function or class definitions by name. To teach LLMs to use these tools for multi-hop reasoning, the researchers constructed labeled data from open-source repositories and followed a two-stage process: rejection-sampled supervised fine-tuning (SFT) and tool-integrated reinforcement learning (RL). This approach ensures that the model learns to use tools strategically, avoiding redundant exploration while focusing on promising code paths.
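To make the idea of a name-based lookup tool concrete, the sketch below shows one minimal way such a tool could work: index every function and class definition in a repository, then let the agent query a symbol by name. This is an illustrative assumption, not the authors' implementation; the function names and the example repository path are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of a name-based definition lookup
# tool like the one RepoSearcher is described as using.
import ast
from pathlib import Path

def build_definition_index(repo_root: str) -> dict[str, list[tuple[str, int]]]:
    """Map each function/class name to the (file, line) locations where it is defined."""
    index: dict[str, list[tuple[str, int]]] = {}
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files the parser cannot handle
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index.setdefault(node.name, []).append((str(path), node.lineno))
    return index

def search_definition(index: dict[str, list[tuple[str, int]]], name: str) -> list[tuple[str, int]]:
    """The agent-facing tool call: look up a symbol definition by exact name."""
    return index.get(name, [])

# Hypothetical usage: the agent asks where `resolve_dependency` is defined.
# index = build_definition_index("path/to/repo")
# print(search_definition(index, "resolve_dependency"))
```

In a multi-hop setting, the agent would chain such lookups: read the issue, query a suspect symbol, inspect its definition, and follow the callers or callees it finds to the next query.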
The researchers built their evaluation dataset from SWE-Bench Verified, a benchmark derived from real GitHub issues and manually verified by professional developers. The dataset provides ground-truth answers for issue localization by identifying the files and functions modified in the gold patch. To evaluate RepoSearcher's performance, the metrics Recall@K, MAP, MRR, and NDCG@K were used, along with the resolution rate. ToolTrain was applied to two models, Qwen-7B and Qwen-32B, which were then compared against four recent frameworks: Agentless, OrcaLoca, CoSIL, and LocAgent. These baselines represent a variety of design ideas, ensuring a thorough assessment of how precisely and strategically the tool-trained models explore code.
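The ranking metrics mentioned above are standard and easy to compute once predictions are a ranked list of candidate locations and the ground truth is the set of functions touched by the gold patch. The snippet below is a hedged sketch of Recall@K and MRR with made-up example data, not the paper's evaluation code.

```python
# Hedged sketch of two of the ranking metrics used for issue localization.
def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold locations that appear in the top-k predictions."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0

def mrr(ranked: list[str], gold: set[str]) -> float:
    """Reciprocal rank of the first correct prediction (0 if none is found)."""
    for rank, loc in enumerate(ranked, start=1):
        if loc in gold:
            return 1.0 / rank
    return 0.0

# Hypothetical example: the gold patch modified two functions.
gold_locations = {"pkg/module.py::parse_config", "pkg/module.py::load_defaults"}
predicted = ["pkg/module.py::parse_config", "pkg/other.py::main", "pkg/module.py::load_defaults"]
print(recall_at_k(predicted, gold_locations, k=3))  # 1.0 -> both gold functions are in the top 3
print(mrr(predicted, gold_locations))               # 1.0 -> the first prediction is correct
```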
RepoSearcher trained with ToolTrain achieves state-of-the-art performance among similarly sized models, even outperforming larger commercial models on certain metrics. For example, RepoSearcher with ToolTrain-32B achieved a score of 68.55, surpassing Claude-3.7-Sonnet (66.38). The 7B model outperforms other frameworks that rely on 32B models, showing how ToolTrain strengthens the tool-use capabilities of smaller models. In issue resolution, RepoSearcher with ToolTrain-7B reached a Recall@5 of 62.38 and a resolution rate of 14.00, the best among 7B models. However, despite similar localization results, resolution rates varied with the patch generation model used, with ToolTrain-32B reaching 31.60.
In short, the researchers introduced ToolTrain to enhance LLM-based issue localization. By combining SFT with RL, ToolTrain equips models like RepoSearcher to browse code repositories efficiently and perform precise multi-hop reasoning. Evaluated on real-world benchmarks, the ToolTrain-ed models achieve state-of-the-art performance among similarly sized models, even surpassing larger commercial models such as Claude-3.7 on specific tasks. This demonstrates the framework's ability to improve tool usage and reasoning in smaller models, reducing redundancy and increasing efficiency. The study highlights the potential of tool-integrated training to transform issue localization and provide effective solutions to complex software challenges.
Check out the Paper and GitHub Page.

Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.