Microsoft AI Introduces Code Researcher: A Deep Research Agent for Large Systems Code and Commit History

The rise of autonomous coding agents in systems software debugging
The use of AI in software development has gained momentum with the emergence of large language models (LLMs) capable of performing coding-related tasks. This shift has led to the design of autonomous coding agents that assist with, or take over, tasks traditionally performed by human developers. These agents range from simple script writers to complex systems that can navigate codebases and diagnose errors. Recently, the focus has shifted toward applying these agents to more demanding settings, especially large and complex software environments. This includes foundational systems software, where making precise changes requires understanding not only the immediate code but also its architectural context, interdependencies, and historical evolution. There is therefore growing interest in building agents that can perform deep reasoning and propose fixes or changes with minimal human intervention.
Challenges of debugging large-scale systems code
Updating large-scale systems code presents many challenges due to its sheer size, complexity, and historical depth. Such systems, including operating systems and network stacks, are composed of thousands of interdependent files refined over decades by numerous contributors. The result is highly optimized low-level code in which even minor changes can trigger cascading effects. Moreover, bug reports in these environments typically take the form of raw crash reports and stack traces, often without any guiding natural-language description. Diagnosing and fixing problems in such code therefore requires deep contextual understanding: not only a grasp of the code's current logic, but also of its past modifications and global design constraints. Automating such diagnosis and repair has remained elusive, because it demands the kind of extensive reasoning that most coding agents cannot perform.
Limitations of existing coding agents for system-level crashes
Popular coding agents such as SWE-agent and OpenHands use large language models (LLMs) for automatic bug fixing, but they focus primarily on smaller, application-level codebases. These agents often rely on structured problem descriptions provided by humans to narrow their search and propose fixes. Tools such as AutoCodeRover explore codebases using syntax-based techniques, but they are usually limited to specific languages such as Python and avoid system-level complexity. Moreover, none of these methods incorporate insights from a codebase's commit history, an important resource for dealing with long-standing bugs in large repositories. While some agents use heuristics for code navigation or edit generation, their inability to reason deeply across the codebase and its historical context limits their effectiveness on complex system-level crashes.
Code Researcher: Microsoft's deep research agent
Researchers from Microsoft Research have introduced Code Researcher, a deep research agent designed for system-level code debugging. Unlike previous tools, the agent does not rely on predefined knowledge of which files are faulty; it operates without any human assistance. It was tested on Linux kernel crash benchmarks and on a multimedia software project to evaluate its generality. Code Researcher executes a multi-phase strategy. First, it analyzes the crash context using exploratory actions such as symbol-definition lookup and pattern search. Second, it synthesizes candidate patches based on the accumulated evidence. Finally, it validates these patches using an automated testing mechanism. The agent uses tools to explore code semantics, identify control flow, and analyze commit history, a key capability not found in prior systems. Through this structured process, the agent operates not merely as a bug fixer but as an autonomous researcher: it gathers evidence and forms hypotheses before touching the codebase.
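The multi-phase strategy can be pictured as a simple loop. The sketch below is a minimal, hypothetical illustration of that analyze-synthesize-validate flow; the function names, toy tools, and example data are all invented, and the real agent drives each step with an LLM rather than the fixed policy shown here.

```python
# Hypothetical sketch of Code Researcher's three-phase strategy.
# All names and data here are illustrative, not the actual implementation.

def analyze(crash_report, tools):
    """Phase 1: query exploration tools and record every query and
    result in a structured memory."""
    memory = []
    for name, tool in tools.items():
        memory.append({"tool": name, "result": tool(crash_report["symbol"])})
    return memory

def synthesize(memory):
    """Phase 2: filter the memory down to relevant evidence and
    propose a candidate patch."""
    evidence = [entry for entry in memory if entry["result"]]
    return {"patch": "add_null_check", "evidence": evidence}

def validate(patch, reproducer):
    """Phase 3: accept the patch only if the crash reproducer no
    longer triggers the crash."""
    return reproducer(patch)

# Toy stand-ins for real symbol and commit-history search tools.
tools = {
    "symbol_search": lambda sym: f"definition of {sym} found",
    "commit_search": lambda sym: f"3 commits touching {sym}",
}
crash = {"symbol": "nf_tables_newrule"}

memory = analyze(crash, tools)
patch = synthesize(memory)
resolved = validate(patch, lambda p: "add_null_check" in p["patch"])
print(resolved)  # True
```

The key design point mirrored here is that patch generation never happens directly from the crash report: it operates only on the evidence accumulated during analysis.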
Three-phase architecture: analysis, synthesis, and validation
Code Researcher's operation is divided into three defined phases: analysis, synthesis, and validation. During the analysis phase, the agent first processes the crash report and initiates iterative reasoning steps. Each step involves tool calls for searching symbols, scanning code with regular-expression patterns, and exploring historical commit messages and diffs. For example, the agent might search past commits for terms such as "memory leak" to understand when potentially unstable code changes were introduced. The memory it builds is structured, recording every query and its result. Once it determines that enough relevant context has been collected, it transitions to the synthesis phase. Here, it filters out irrelevant data and generates patches by identifying one or more likely faulty fragments, even when they are distributed across multiple files. In the final validation phase, candidate patches are tested against the original crash scenario to verify their effectiveness, and only validated patches are surfaced for use.
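The two exploration tools mentioned above, regex-based code scanning and keyword search over commit history, can be sketched in a few lines. The helpers and the sample data below are hypothetical illustrations (the commit search mimics what `git log --grep` would return), not the agent's actual tooling.

```python
import re

# Hypothetical helpers mirroring the agent's analysis-phase tools:
# a regex scan over source text and a keyword search over commit history.

def scan_pattern(source, pattern):
    """Return 1-based line numbers whose text matches the regex pattern."""
    return [i for i, line in enumerate(source.splitlines(), 1)
            if re.search(pattern, line)]

def search_commits(commits, keyword):
    """Return commits whose message mentions the keyword,
    analogous to `git log --grep=<keyword>`."""
    return [c for c in commits if keyword.lower() in c["message"].lower()]

# Invented sample inputs for illustration.
source = """\
void free_buf(struct buf *b) {
    kfree(b->data);
    kfree(b->data);  /* double free */
}
"""
commits = [
    {"sha": "a1b2c3", "message": "net: fix memory leak in rx path"},
    {"sha": "d4e5f6", "message": "docs: update README"},
]

print(scan_pattern(source, r"kfree"))                               # [2, 3]
print([c["sha"] for c in search_commits(commits, "memory leak")])   # ['a1b2c3']
```

Each (query, result) pair from tools like these is what gets appended to the agent's structured memory before synthesis begins.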
Benchmark performance on the Linux kernel and FFmpeg
In terms of performance, Code Researcher delivers significant improvements over its predecessors. When benchmarked on kBenchSyz, a suite of Linux kernel crashes generated by the Syzkaller fuzzer, it resolved 58% of the crashes using GPT-4o with a budget of five trajectories. By comparison, SWE-agent managed a resolution rate of only 37.5%. On average, Code Researcher explored 10 files per trajectory, far exceeding the 1.33 files navigated by SWE-agent. In the 90 cases where both agents modified all known buggy files, Code Researcher resolved 61.1% of the crashes versus 37.8% for SWE-agent. Moreover, when o1, a reasoning-centric model, was used only in the patch-generation step, the resolution rate remained at 58%, reinforcing the conclusion that strong contextual reasoning during analysis is what drives the improved debugging results. The method was also tested on the open-source multimedia project FFmpeg, where it generated crash-mitigating patches for 7 of 10 reported crashes, demonstrating its applicability beyond kernel code.
Key technical takeaways from the Code Researcher study
- Achieved 58% crash resolution on the Linux kernel benchmark, compared with 37.5% for SWE-agent.
- Explored an average of 10 files per bug, versus 1.33 files for the baseline method.
- Discovered the buggy files without any prior guidance, demonstrating its effectiveness.
- Introduced a novel use of commit-history analysis to support contextual reasoning.
- Generalized to new domains such as FFmpeg, resolving 7 of 10 reported crashes.
- Used structured memory to store and filter context for patch generation.
- Showed that deep-reasoning agents outperform conventional agents even when the latter are given more compute.
- Validated patches against real crash-reproduction scripts to ensure practical effectiveness.
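The validation step in the list above can be sketched as a simple accept/reject gate: a patch counts as validated only if the crash reproducer no longer fails after the patch is applied. Everything below is a hypothetical stand-in; the real pipeline rebuilds the target (e.g. the kernel) and replays a fuzzer-generated reproducer, which the toy shell command here merely simulates.

```python
import subprocess

# Hypothetical validation gate: accept a patch only if the crash
# reproducer exits cleanly (exit code 0) once the patch is applied.

def validate_patch(apply_patch, run_reproducer):
    apply_patch()
    result = run_reproducer()
    return result.returncode == 0  # 0 means no crash after patching

# Toy stand-ins: a fake "patch" flips a flag, and the "reproducer"
# is a shell command that crashes (exit 1) unless the patch is in place.
state = {"patched": False}

def apply_patch():
    state["patched"] = True

def run_reproducer():
    code = "0" if state["patched"] else "1"
    return subprocess.run(["sh", "-c", f"exit {code}"])

print(validate_patch(apply_patch, run_reproducer))  # True
```

Gating on the reproducer's exit status, rather than on model confidence, is what ensures that only patches with demonstrated effect on the original crash are ever surfaced.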
Conclusion: A step towards autonomous system debugging
In short, this research presents a compelling advance in the automated debugging of large-scale systems software. By treating bug resolution as a research problem requiring exploration, analysis, and hypothesis testing, Code Researcher embodies the future of autonomous agents in complex software maintenance. It avoids the pitfalls of previous tools by operating autonomously, thoroughly examining both the current code and its historical evolution, and synthesizing validated fixes. The significant improvement in resolution rate, especially on unfamiliar projects such as FFmpeg, demonstrates the robustness and generality of the method. It shows that software agents can be more than reactive responders: they can act as investigative assistants, making informed decisions in environments previously considered too complex.
Check out the Paper. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform noted for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million views per month, demonstrating its popularity among readers.
