Code Researcher: Deep Research Agent for Large Systems Code and Commit History
Pith reviewed 2026-05-22 02:15 UTC · model grok-4.3
The pith
Code Researcher outperforms existing agents by using multi-step reasoning on code semantics, patterns, and commit history to resolve crashes in large systems codebases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Code Researcher is the first deep research agent for code that retrieves all relevant context from a large codebase and its commit history through multi-step reasoning about semantics, patterns, and commit history, leading to higher crash-resolution rates when generating patches for systems code crashes.
What carries the argument
The multi-step reasoning process in Code Researcher that analyzes semantics, patterns, and commit history to retrieve relevant context from the codebase and history.
If this is right
- Code Researcher achieves a crash-resolution rate of 48% on kBenchSyz using GPT-4o, compared to 31.5% for SWE-agent and 31% for Agentless.
- Scaling sampling budget to 10 trajectories increases the CRR to 54%.
- Using the Gemini 2.5-Flash model, Code Researcher reaches 67% CRR.
- It generalizes to an open-source multimedia software and shows robustness through ablations.
- The experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.
Where Pith is reading between the lines
- Similar deep research approaches could improve agent performance on other complex software engineering tasks involving large repositories.
- Emphasizing commit history might help agents understand evolution of bugs and fixes in any large project.
- Future tests on different systems codebases could confirm if the method scales beyond the Linux kernel.
- Combining this with other tools might further boost resolution rates on crash benchmarks.
Load-bearing premise
The design assumes that multi-step reasoning about semantics, patterns, and commit history can reliably retrieve all relevant context from the codebase and its commit history before generating patches.
What would settle it
Running Code Researcher on a new set of unreported kernel crashes and finding that it retrieves incomplete context or achieves lower resolution rates than the baselines would falsify the claim.
Figures
read the original abstract
Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexities of systems code, making changes to a systems codebase requires researching about many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate (CRR) of 48%, compared to 31.5% by SWE-agent and 31% by Agentless, using OpenAI's GPT-4o model. Scaling up sampling budget to 10 trajectories increases Code Researcher's CRR to 54%. Code Researcher is also robust to model choices, reaching 67% with the newer Gemini 2.5-Flash model. Through another experiment on an open-source multimedia software, we show the generalizability of Code Researcher and also conduct ablations. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Code Researcher, a deep research agent for large systems code that performs multi-step reasoning over code semantics, patterns, and commit history to retrieve relevant context before generating patches for crashes. It evaluates the agent on the kBenchSyz benchmark of Linux kernel crashes, reporting a crash-resolution rate (CRR) of 48% with GPT-4o (outperforming SWE-agent at 31.5% and Agentless at 31%), with further gains to 54% via increased sampling and 67% with Gemini 2.5-Flash, plus generalizability results on multimedia software and ablations highlighting global context gathering.
Significance. If the baseline comparisons control for equivalent commit-history access and the reported gains prove robust, the work would be significant for demonstrating the value of multi-faceted, history-aware reasoning in LLM agents applied to complex systems codebases. The empirical outperformance on a named external benchmark, combined with scaling and ablation experiments, would strengthen the case for deep research capabilities in automated debugging of large code.
major comments (2)
- [Evaluation on kBenchSyz] Evaluation section (kBenchSyz experiments): the central claim of outperformance (48% CRR vs. 31.5%/31%) is load-bearing for attributing gains to Code Researcher's multi-step semantics/pattern/commit-history design. The manuscript provides no indication that SWE-agent or Agentless were given comparable tools or prompting to inspect git history; off-the-shelf versions typically operate on the current snapshot only. If baselines lacked equivalent history access, the measured gap is confounded by information asymmetry rather than the agent's reasoning architecture. Please clarify the exact tool access, retrieval setup, and prompting provided to each baseline.
- [Experimental setup] Experimental setup and results: the abstract and evaluation report concrete CRR numbers without details on statistical significance testing, number of runs, variance, exact prompt templates, retrieval implementations, or how commit history was encoded and accessed. These omissions make it difficult to assess reproducibility and whether the 48% result reliably supports the multi-step reasoning claims.
minor comments (2)
- [Abstract] Abstract: the acronym CRR is used without an initial definition; expand it on first use for readability.
- [Results and ablations] Figure and table captions: ensure all ablation and scaling results include error bars or confidence intervals where applicable to aid interpretation of the reported improvements.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Evaluation on kBenchSyz] Evaluation section (kBenchSyz experiments): the central claim of outperformance (48% CRR vs. 31.5%/31%) is load-bearing for attributing gains to Code Researcher's multi-step semantics/pattern/commit-history design. The manuscript provides no indication that SWE-agent or Agentless were given comparable tools or prompting to inspect git history; off-the-shelf versions typically operate on the current snapshot only. If baselines lacked equivalent history access, the measured gap is confounded by information asymmetry rather than the agent's reasoning architecture. Please clarify the exact tool access, retrieval setup, and prompting provided to each baseline.
Authors: We appreciate the referee for identifying this potential source of confounding. In our experiments, SWE-agent and Agentless were evaluated using their publicly available standard implementations and default configurations as described in the original papers; these do not incorporate explicit git history inspection or multi-step commit reasoning by default. Code Researcher's primary contribution lies in its explicit multi-step reasoning pipeline that retrieves and reasons over commit history in addition to code semantics and patterns. To eliminate ambiguity, we will add a new subsection to the evaluation section that explicitly documents the tool access, retrieval setup, and prompting for every method, confirming that baselines received only the current snapshot unless their original designs specified otherwise. This revision will make clear that the observed gains stem from the agent's architecture rather than unequal information access. revision: yes
-
Referee: [Experimental setup] Experimental setup and results: the abstract and evaluation report concrete CRR numbers without details on statistical significance testing, number of runs, variance, exact prompt templates, retrieval implementations, or how commit history was encoded and accessed. These omissions make it difficult to assess reproducibility and whether the 48% result reliably supports the multi-step reasoning claims.
Authors: We agree that the manuscript would benefit from greater transparency on these experimental details. In the revised version we will expand the experimental setup and results sections to report: the number of runs (main results reflect a single execution per configuration owing to the substantial cost of LLM calls and tool invocations, while ablations incorporate multiple trajectories), observed variance where measured, statistical significance testing of CRR differences (e.g., via bootstrap resampling or suitable non-parametric tests), the precise prompt templates, retrieval implementation details, and the exact encoding and access mechanism for commit history (e.g., summarized diffs and targeted git queries). These additions will allow readers to assess reproducibility and the contribution of multi-step reasoning more rigorously. revision: yes
Circularity Check
No circularity: empirical agent evaluation on external benchmark
full rationale
The paper describes the design of Code Researcher as a multi-step reasoning agent for retrieving context from codebases and commit histories, then reports direct empirical results on the external kBenchSyz benchmark (48% CRR vs. baselines). No equations, fitted parameters, self-referential definitions, or load-bearing self-citations are present that would reduce the claimed performance to the inputs by construction. The evaluation uses standard out-of-sample metrics on a fixed benchmark, making the derivation self-contained as an engineering experiment rather than a tautological reduction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We equip this phase with (a) actions to efficiently search over the codebase and the commit history... (3) search_commits(regex)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
AI-Driven Research for Databases
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
Reference graph
Works this paper leans on
-
[1]
Claude takes research to new places, 2025
Anthropic. Claude takes research to new places, 2025. URL https://www.anthropic.com/ news/research
work page 2025
-
[2]
ccache. ccache. URL https://github.com/ccache/ccache
- [3]
-
[4]
D. Engler, D. Y . Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as deviant behavior: a general approach to inferring errors in systems code. SIGOPS Oper. Syst. Rev., 35(5):57–72, Oct. 2001. ISSN 0163-5980. doi: 10.1145/502059.502041. URL https://doi.org/10.1145/502059. 502041
-
[5]
A. Godbole, N. Monath, S. Kim, A. S. Rawat, A. McCallum, and M. Zaheer. Analysis of plan-based retrieval for grounded text generation. arXiv preprint arXiv:2408.10490, 2024
-
[6]
Oss-fuzz | documentation for oss-fuzz
Google. Oss-fuzz | documentation for oss-fuzz. URL https://google.github.io/ oss-fuzz/
-
[7]
Google. Gemini deep research, 2025. URL https://gemini.google/overview/ deep-research
work page 2025
-
[8]
Generative ai | documentation | long context
Google. Generative ai | documentation | long context. https://cloud.google.com/ vertex-ai/generative-ai/docs/long-context , 2025. Accessed: 2025-03-13
work page 2025
-
[9]
Generative ai | documentation | gemini 2.5 pro
Google. Generative ai | documentation | gemini 2.5 pro. https://cloud.google.com/ vertex-ai/generative-ai/docs/models/gemini/2-5-pro , 2025. Accessed: 2025-05- 16
work page 2025
-
[10]
Google. Kasan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/KASAN.md
work page 2025
-
[11]
Google. Kcsan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/KCSAN.md
work page 2025
- [12]
-
[13]
Google. Ubsan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/UBSAN.md
work page 2025
-
[14]
J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weis- senberger, K. Rong, R. Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
D. Guo, C. Xu, N. Duan, J. Yin, and J. McAuley. Longcoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning, pages 12098–12107. PMLR, 2023
work page 2023
-
[16]
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE- bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=VTF8yNQM66
work page 2024
- [18]
- [19]
-
[20]
X. Li, J. Jin, G. Dong, H. Qian, Y . Zhu, Y . Wu, J.-R. Wen, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [21]
-
[22]
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
work page 2024
- [23]
- [24]
-
[25]
Introducing researcher and analyst in microsoft 365 copilot, 2025
Microsoft. Introducing researcher and analyst in microsoft 365 copilot, 2025. URL https://www.microsoft.com/en-us/microsoft-365/blog/2025/03/25/ introducing-researcher-and-analyst-in-microsoft-365-copilot/
work page 2025
-
[26]
F. Nielson, H. R. Nielson, and C. Hankin. Principles of program analysis. springer, 2015
work page 2015
-
[27]
Introducing openai o1-preview, 2024
OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/ hello-gpt-4o/
work page 2024
-
[28]
Introducing openai o1-preview, 2024
OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/ introducing-openai-o1-preview/
work page 2024
-
[29]
Openai deep research integration with github, 2025
OpenAI. Openai deep research integration with github, 2025. URL https://help.openai. com/en/articles/11145903-connecting-github-to-chatgpt-deep-research
-
[30]
Introducing deep research, 2025
OpenAI. Introducing deep research, 2025. URL https://openai.com/index/ introducing-deep-research/
work page 2025
-
[31]
Introducing perplexity deep research, 2025
Perplexity. Introducing perplexity deep research, 2025. URLhttps://www.perplexity.ai/ de/hub/blog/introducing-perplexity-deep-research
work page 2025
- [32]
-
[33]
G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
universal-ctags. ctags. URL https://github.com/universal-ctags/ctags
- [35]
-
[36]
X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Lea...
work page 2025
- [37]
- [38]
-
[39]
C. S. Xia, Y . Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe- agent: Agent-computer interfaces enable automated software engineering, 2024. URL https: //arxiv.org/abs/2405.15793
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing rea- soning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[42]
Y . Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, page 1592–1604, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400706127. doi: 10.1145/3650212.3680384. URL https://doi.org/10....
-
[43]
There was a fix for a general protection fault in the `smsusb` driver related to endpoint initialization, which suggests that endpoint management might have been problematic in the past
-
[44]
A past commit addresses use-after-free bugs caused by `do_submit_urb()`, which is relevant since improper URB management could lead to synchronization issues
-
[45]
Changes to `cancel_work_sync` function suggest potential race conditions or synchronization problems involving work items. The definition of `cancel_work_sync` is straightforward, but I need to examine `__cancel_work_timer` The contents from the memory are added to prompt Figure 6: Code Researcher trajectory while solving the crash described in Figure 3: ...
-
[46]
can: gs_usb: add extended bt_const feature
for caching build files generated during kernel compilation and our own logic for cachinggit checkouts. kGym has a distributed setup featuring five workers -kBuilder, kReproducer, kScheduler, kDash- board and kmq. (1) kBuilder takes as input a source commit, a kernel config, and (optionally) a patch. It checks out the kernel at the source commit, applies ...
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.