Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Abhav Mehrotra; Aditya Kanade; Nagarajan Natarajan; Nalin Wadhwa; Ramakrishna B Bairi; Ramneet Singh; Sathvik Joel

arxiv: 2506.11060 · v2 · pith:3OKFZJAInew · submitted 2025-05-27 · 💻 cs.SE · cs.AI

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Ramneet Singh , Sathvik Joel , Abhav Mehrotra , Nalin Wadhwa , Ramakrishna B Bairi , Aditya Kanade , Nagarajan Natarajan This is my paper

Pith reviewed 2026-05-22 02:15 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords deep research agentsystems codecrash resolutionLinux kernelLLM coding agentscommit historypatch generationmulti-step reasoning

0 comments

The pith

Code Researcher outperforms existing agents by using multi-step reasoning on code semantics, patterns, and commit history to resolve crashes in large systems codebases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Code Researcher, a deep research agent designed for large systems code like the Linux kernel. It performs multi-step reasoning to gather relevant context from the entire codebase and its extensive commit history before generating patches for reported crashes. This approach addresses the limitation of current LLM-based coding agents that struggle with the size and complexity of systems code. Evaluation on the kBenchSyz benchmark shows it achieves a 48% crash-resolution rate, significantly higher than baselines like SWE-agent and Agentless. The work demonstrates that global context gathering and multi-faceted reasoning are key for improving performance on such tasks.

Core claim

Code Researcher is the first deep research agent for code that retrieves all relevant context from a large codebase and its commit history through multi-step reasoning about semantics, patterns, and commit history, leading to higher crash-resolution rates when generating patches for systems code crashes.

What carries the argument

The multi-step reasoning process in Code Researcher that analyzes semantics, patterns, and commit history to retrieve relevant context from the codebase and history.

If this is right

Code Researcher achieves a crash-resolution rate of 48% on kBenchSyz using GPT-4o, compared to 31.5% for SWE-agent and 31% for Agentless.
Scaling sampling budget to 10 trajectories increases the CRR to 54%.
Using the Gemini 2.5-Flash model, Code Researcher reaches 67% CRR.
It generalizes to an open-source multimedia software and shows robustness through ablations.
The experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar deep research approaches could improve agent performance on other complex software engineering tasks involving large repositories.
Emphasizing commit history might help agents understand evolution of bugs and fixes in any large project.
Future tests on different systems codebases could confirm if the method scales beyond the Linux kernel.
Combining this with other tools might further boost resolution rates on crash benchmarks.

Load-bearing premise

The design assumes that multi-step reasoning about semantics, patterns, and commit history can reliably retrieve all relevant context from the codebase and its commit history before generating patches.

What would settle it

Running Code Researcher on a new set of unreported kernel crashes and finding that it retrieves incomplete context or achieves lower resolution rates than the baselines would falsify the claim.

Figures

Figures reproduced from arXiv: 2506.11060 by Abhav Mehrotra, Aditya Kanade, Nagarajan Natarajan, Nalin Wadhwa, Ramakrishna B Bairi, Ramneet Singh, Sathvik Joel.

**Figure 1.** Figure 1: Code Researcher conducts deep research over code in three phases: (1) Starting with the codebase and crash report as input, the Analysis phase performs multi-step reasoning about semantics, patterns, and commit history of code. It gathers context in a structured memory. (2) The Synthesis phase filters the contents of the memory to keep relevant context and generates a patch. (3) The Validation phase uses e… view at source ↗

**Figure 2.** Figure 2: SWE-agent vs Code Researcher (both at GPT-4o, P@5, 15 max calls) 2) Code Researcher has better overlap with developer-referenced context: We use LLM-as-judge to analyze the context gathered by Code Researcher (GPT-4o) and SWE-agent; and determine the overlap of the contexts with the context mentioned by the developer in the fix commit message (details in Appendix D). This context overlap is 54.18% (over ca… view at source ↗

**Figure 3.** Figure 3: Crash report of the kernel crash example discussed in Appendix [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Code Researcher trajectory while solving the crash described in [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Code Researcher trajectory while solving the crash described in [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Code Researcher trajectory while solving the crash described in [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Code Researcher trajectory while solving the crash described in [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Code Researcher trajectory while solving the crash described in [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Code Researcher trajectory while solving the crash described in [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Developer commit message and patch. 6Bug in Syzbot dashboard: https://syzkaller.appspot.com/bug?id= 92a742e993c8b9e769f8502a0497c88c0afa78af. 7Developer’s fix commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux. git/commit/?id=50d34a0d151dc7abbdbec781bd7f09f2b3cbf01a. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

**Figure 11.** Figure 11: Code Researcher actions (search_commits in green box). 23 [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗

**Figure 12.** Figure 12: Code Researcher patch and analysis. D LLM-as-Judge evaluation of overlap between developer commit and tool-gathered context We use LLM-as-judge to analyze the context gathered by Code Researcher and SWE-agent to determine the overlap of context in their trajectory with the context mentioned by the developer in the ground-truth fix commit message. We first identify code symbols mentioned in the commit mess… view at source ↗

read the original abstract

Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexities of systems code, making changes to a systems codebase requires researching about many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate (CRR) of 48%, compared to 31.5% by SWE-agent and 31% by Agentless, using OpenAI's GPT-4o model. Scaling up sampling budget to 10 trajectories increases Code Researcher's CRR to 54%. Code Researcher is also robust to model choices, reaching 67% with the newer Gemini 2.5-Flash model. Through another experiment on an open-source multimedia software, we show the generalizability of Code Researcher and also conduct ablations. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Code Researcher adds commit history and multi-step retrieval to coding agents for kernel crash fixes and reports a 48% resolution rate, but the edge over baselines could stem from unequal information access rather than the reasoning design.

read the letter

The main point is that this paper builds a coding agent that does explicit multi-step research over code semantics, patterns, and git commit history before generating patches for crashes in large systems code. It tests the idea on the kBenchSyz Linux kernel benchmark and gets 48% crash-resolution rate with GPT-4o, rising to 54% with more samples and 67% with Gemini 2.5-Flash. They also run it on multimedia software and include ablations to show the value of global context gathering.

Referee Report

2 major / 2 minor

Summary. The paper introduces Code Researcher, a deep research agent for large systems code that performs multi-step reasoning over code semantics, patterns, and commit history to retrieve relevant context before generating patches for crashes. It evaluates the agent on the kBenchSyz benchmark of Linux kernel crashes, reporting a crash-resolution rate (CRR) of 48% with GPT-4o (outperforming SWE-agent at 31.5% and Agentless at 31%), with further gains to 54% via increased sampling and 67% with Gemini 2.5-Flash, plus generalizability results on multimedia software and ablations highlighting global context gathering.

Significance. If the baseline comparisons control for equivalent commit-history access and the reported gains prove robust, the work would be significant for demonstrating the value of multi-faceted, history-aware reasoning in LLM agents applied to complex systems codebases. The empirical outperformance on a named external benchmark, combined with scaling and ablation experiments, would strengthen the case for deep research capabilities in automated debugging of large code.

major comments (2)

[Evaluation on kBenchSyz] Evaluation section (kBenchSyz experiments): the central claim of outperformance (48% CRR vs. 31.5%/31%) is load-bearing for attributing gains to Code Researcher's multi-step semantics/pattern/commit-history design. The manuscript provides no indication that SWE-agent or Agentless were given comparable tools or prompting to inspect git history; off-the-shelf versions typically operate on the current snapshot only. If baselines lacked equivalent history access, the measured gap is confounded by information asymmetry rather than the agent's reasoning architecture. Please clarify the exact tool access, retrieval setup, and prompting provided to each baseline.
[Experimental setup] Experimental setup and results: the abstract and evaluation report concrete CRR numbers without details on statistical significance testing, number of runs, variance, exact prompt templates, retrieval implementations, or how commit history was encoded and accessed. These omissions make it difficult to assess reproducibility and whether the 48% result reliably supports the multi-step reasoning claims.

minor comments (2)

[Abstract] Abstract: the acronym CRR is used without an initial definition; expand it on first use for readability.
[Results and ablations] Figure and table captions: ensure all ablation and scaling results include error bars or confidence intervals where applicable to aid interpretation of the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Evaluation on kBenchSyz] Evaluation section (kBenchSyz experiments): the central claim of outperformance (48% CRR vs. 31.5%/31%) is load-bearing for attributing gains to Code Researcher's multi-step semantics/pattern/commit-history design. The manuscript provides no indication that SWE-agent or Agentless were given comparable tools or prompting to inspect git history; off-the-shelf versions typically operate on the current snapshot only. If baselines lacked equivalent history access, the measured gap is confounded by information asymmetry rather than the agent's reasoning architecture. Please clarify the exact tool access, retrieval setup, and prompting provided to each baseline.

Authors: We appreciate the referee for identifying this potential source of confounding. In our experiments, SWE-agent and Agentless were evaluated using their publicly available standard implementations and default configurations as described in the original papers; these do not incorporate explicit git history inspection or multi-step commit reasoning by default. Code Researcher's primary contribution lies in its explicit multi-step reasoning pipeline that retrieves and reasons over commit history in addition to code semantics and patterns. To eliminate ambiguity, we will add a new subsection to the evaluation section that explicitly documents the tool access, retrieval setup, and prompting for every method, confirming that baselines received only the current snapshot unless their original designs specified otherwise. This revision will make clear that the observed gains stem from the agent's architecture rather than unequal information access. revision: yes
Referee: [Experimental setup] Experimental setup and results: the abstract and evaluation report concrete CRR numbers without details on statistical significance testing, number of runs, variance, exact prompt templates, retrieval implementations, or how commit history was encoded and accessed. These omissions make it difficult to assess reproducibility and whether the 48% result reliably supports the multi-step reasoning claims.

Authors: We agree that the manuscript would benefit from greater transparency on these experimental details. In the revised version we will expand the experimental setup and results sections to report: the number of runs (main results reflect a single execution per configuration owing to the substantial cost of LLM calls and tool invocations, while ablations incorporate multiple trajectories), observed variance where measured, statistical significance testing of CRR differences (e.g., via bootstrap resampling or suitable non-parametric tests), the precise prompt templates, retrieval implementation details, and the exact encoding and access mechanism for commit history (e.g., summarized diffs and targeted git queries). These additions will allow readers to assess reproducibility and the contribution of multi-step reasoning more rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical agent evaluation on external benchmark

full rationale

The paper describes the design of Code Researcher as a multi-step reasoning agent for retrieving context from codebases and commit histories, then reports direct empirical results on the external kBenchSyz benchmark (48% CRR vs. baselines). No equations, fitted parameters, self-referential definitions, or load-bearing self-citations are present that would reduce the claimed performance to the inputs by construction. The evaluation uses standard out-of-sample metrics on a fixed benchmark, making the derivation self-contained as an engineering experiment rather than a tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The approach implicitly assumes LLM reasoning steps can gather sufficient context without detailing how that assumption is validated.

pith-pipeline@v0.9.0 · 5823 in / 1290 out tokens · 63236 ms · 2026-05-22T02:15:18.823696+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We equip this phase with (a) actions to efficiently search over the codebase and the commit history... (3) search_commits(regex)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI-Driven Research for Databases
cs.DB 2026-04 unverdicted novelty 6.0

Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

Claude takes research to new places, 2025

Anthropic. Claude takes research to new places, 2025. URL https://www.anthropic.com/ news/research

work page 2025
[2]

ccache. ccache. URL https://github.com/ccache/ccache

work page
[3]

community

F. community. Ffmpeg, 2025. URL https://ffmpeg.org/

work page 2025
[4]

Engler, D

D. Engler, D. Y . Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as deviant behavior: a general approach to inferring errors in systems code. SIGOPS Oper. Syst. Rev., 35(5):57–72, Oct. 2001. ISSN 0163-5980. doi: 10.1145/502059.502041. URL https://doi.org/10.1145/502059. 502041

work page doi:10.1145/502059.502041 2001
[5]

Godbole, N

A. Godbole, N. Monath, S. Kim, A. S. Rawat, A. McCallum, and M. Zaheer. Analysis of plan-based retrieval for grounded text generation. arXiv preprint arXiv:2408.10490, 2024

work page arXiv 2024
[6]

Oss-fuzz | documentation for oss-fuzz

Google. Oss-fuzz | documentation for oss-fuzz. URL https://google.github.io/ oss-fuzz/

work page
[7]

Gemini deep research, 2025

Google. Gemini deep research, 2025. URL https://gemini.google/overview/ deep-research

work page 2025
[8]

Generative ai | documentation | long context

Google. Generative ai | documentation | long context. https://cloud.google.com/ vertex-ai/generative-ai/docs/long-context , 2025. Accessed: 2025-03-13

work page 2025
[9]

Generative ai | documentation | gemini 2.5 pro

Google. Generative ai | documentation | gemini 2.5 pro. https://cloud.google.com/ vertex-ai/generative-ai/docs/models/gemini/2-5-pro , 2025. Accessed: 2025-05- 16

work page 2025
[10]

Kasan, 2025

Google. Kasan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/KASAN.md

work page 2025
[11]

Kcsan, 2025

Google. Kcsan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/KCSAN.md

work page 2025
[12]

Syzkaller, 2025

Google. Syzkaller, 2025. URL https://github.com/google/syzkaller/

work page 2025
[13]

Ubsan, 2025

Google. Ubsan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/UBSAN.md

work page 2025
[14]

Towards an AI co-scientist

J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weis- senberger, K. Rong, R. Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

D. Guo, C. Xu, N. Duan, J. Yin, and J. McAuley. Longcoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning, pages 12098–12107. PMLR, 2023

work page 2023
[16]

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE- bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=VTF8yNQM66

work page 2024
[18]

Kagdi, M

H. Kagdi, M. L. Collard, and J. I. Maletic. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of software maintenance and evolution: Research and practice, 19(2):77–131, 2007

work page 2007
[19]

T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024

work page arXiv 2024
[20]

X. Li, J. Jin, G. Dong, H. Qian, Y . Zhu, Y . Wu, J.-R. Wen, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Linux, 1991

Linus Torvalds. Linux, 1991. URL https://github.com/torvalds/linux

work page 1991
[22]

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

work page 2024
[23]

Mathai, C

A. Mathai, C. Huang, P. Maniatis, A. Nogikh, F. Ivan ˇci´c, J. Yang, and B. Ray. Kgym: A platform and dataset to benchmark large language models on linux kernel crash resolution. Advances in Neural Information Processing Systems, 37:78053–78078, 2024

work page 2024
[24]

Mathai, C

A. Mathai, C. Huang, S. Ma, J. Kim, H. Mitchell, A. Nogikh, P. Maniatis, F. Ivanˇci´c, J. Yang, and B. Ray. Crashfixer: A crash resolution agent for the linux kernel. arXiv preprint arXiv:2504.20412, 2025

work page arXiv 2025
[25]

Introducing researcher and analyst in microsoft 365 copilot, 2025

Microsoft. Introducing researcher and analyst in microsoft 365 copilot, 2025. URL https://www.microsoft.com/en-us/microsoft-365/blog/2025/03/25/ introducing-researcher-and-analyst-in-microsoft-365-copilot/

work page 2025
[26]

Nielson, H

F. Nielson, H. R. Nielson, and C. Hankin. Principles of program analysis. springer, 2015

work page 2015
[27]

Introducing openai o1-preview, 2024

OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/ hello-gpt-4o/

work page 2024
[28]

Introducing openai o1-preview, 2024

OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/ introducing-openai-o1-preview/

work page 2024
[29]

Openai deep research integration with github, 2025

OpenAI. Openai deep research integration with github, 2025. URL https://help.openai. com/en/articles/11145903-connecting-github-to-chatgpt-deep-research

work page arXiv 2025
[30]

Introducing deep research, 2025

OpenAI. Introducing deep research, 2025. URL https://openai.com/index/ introducing-deep-research/

work page 2025
[31]

Introducing perplexity deep research, 2025

Perplexity. Introducing perplexity deep research, 2025. URLhttps://www.perplexity.ai/ de/hub/blog/introducing-perplexity-deep-research

work page 2025
[32]

Y . Shao, Y . Jiang, T. A. Kanell, P. Xu, O. Khattab, and M. S. Lam. Assisting in writing wikipedia- like articles from scratch with large language models. arXiv preprint arXiv:2402.14207, 2024

work page arXiv 2024
[33]

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

universal-ctags. ctags. URL https://github.com/universal-ctags/ctags

work page
[35]

Wadhwa, A

N. Wadhwa, A. Sonwane, D. Arora, A. Mehrotra, S. Utpala, R. B. Bairi, A. Kanade, and N. Natarajan. MASAI: Modular architecture for software-engineering AI agents. In NeurIPS 2024 Workshop on Open-World Agents, 2024. URL https://openreview.net/forum?id= NSINt8lLYB

work page 2024
[36]

X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Lea...

work page 2025
[37]

J. Wu, J. Zhu, and Y . Liu. Agentic reasoning: Reasoning llms with tools for the deep research. arXiv preprint arXiv:2502.04644, 2025

work page arXiv 2025
[38]

W. Wu, S. Huang, Y . Jiang, P. Xie, F. Huang, and H. Zhao. Unfolding the headline: Iterative self- questioning for news retrieval and timeline summarization. arXiv preprint arXiv:2501.00888, 2025

work page arXiv 2025
[39]

C. S. Xia, Y . Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe- agent: Agent-computer interfaces enable automated software engineering, 2024. URL https: //arxiv.org/abs/2405.15793

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing rea- soning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[42]

In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024)

Y . Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, page 1592–1604, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400706127. doi: 10.1145/3650212.3680384. URL https://doi.org/10....

work page doi:10.1145/3650212.3680384 2024
[43]

There was a fix for a general protection fault in the `smsusb` driver related to endpoint initialization, which suggests that endpoint management might have been problematic in the past

work page
[44]

A past commit addresses use-after-free bugs caused by `do_submit_urb()`, which is relevant since improper URB management could lead to synchronization issues

work page
[45]

INIT_WORK

Changes to `cancel_work_sync` function suggest potential race conditions or synchronization problems involving work items. The definition of `cancel_work_sync` is straightforward, but I need to examine `__cancel_work_timer` The contents from the memory are added to prompt Figure 6: Code Researcher trajectory while solving the crash described in Figure 3: ...

work page
[46]

can: gs_usb: add extended bt_const feature

for caching build files generated during kernel compilation and our own logic for cachinggit checkouts. kGym has a distributed setup featuring five workers -kBuilder, kReproducer, kScheduler, kDash- board and kmq. (1) kBuilder takes as input a source commit, a kernel config, and (optionally) a patch. It checks out the kernel at the source commit, applies ...

work page 2017

[1] [1]

Claude takes research to new places, 2025

Anthropic. Claude takes research to new places, 2025. URL https://www.anthropic.com/ news/research

work page 2025

[2] [2]

ccache. ccache. URL https://github.com/ccache/ccache

work page

[3] [3]

community

F. community. Ffmpeg, 2025. URL https://ffmpeg.org/

work page 2025

[4] [4]

Engler, D

D. Engler, D. Y . Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as deviant behavior: a general approach to inferring errors in systems code. SIGOPS Oper. Syst. Rev., 35(5):57–72, Oct. 2001. ISSN 0163-5980. doi: 10.1145/502059.502041. URL https://doi.org/10.1145/502059. 502041

work page doi:10.1145/502059.502041 2001

[5] [5]

Godbole, N

A. Godbole, N. Monath, S. Kim, A. S. Rawat, A. McCallum, and M. Zaheer. Analysis of plan-based retrieval for grounded text generation. arXiv preprint arXiv:2408.10490, 2024

work page arXiv 2024

[6] [6]

Oss-fuzz | documentation for oss-fuzz

Google. Oss-fuzz | documentation for oss-fuzz. URL https://google.github.io/ oss-fuzz/

work page

[7] [7]

Gemini deep research, 2025

Google. Gemini deep research, 2025. URL https://gemini.google/overview/ deep-research

work page 2025

[8] [8]

Generative ai | documentation | long context

Google. Generative ai | documentation | long context. https://cloud.google.com/ vertex-ai/generative-ai/docs/long-context , 2025. Accessed: 2025-03-13

work page 2025

[9] [9]

Generative ai | documentation | gemini 2.5 pro

Google. Generative ai | documentation | gemini 2.5 pro. https://cloud.google.com/ vertex-ai/generative-ai/docs/models/gemini/2-5-pro , 2025. Accessed: 2025-05- 16

work page 2025

[10] [10]

Kasan, 2025

Google. Kasan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/KASAN.md

work page 2025

[11] [11]

Kcsan, 2025

Google. Kcsan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/KCSAN.md

work page 2025

[12] [12]

Syzkaller, 2025

Google. Syzkaller, 2025. URL https://github.com/google/syzkaller/

work page 2025

[13] [13]

Ubsan, 2025

Google. Ubsan, 2025. URL https://github.com/google/kernel-sanitizers/blob/ master/UBSAN.md

work page 2025

[14] [14]

Towards an AI co-scientist

J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weis- senberger, K. Rong, R. Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

D. Guo, C. Xu, N. Duan, J. Yin, and J. McAuley. Longcoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning, pages 12098–12107. PMLR, 2023

work page 2023

[16] [16]

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE- bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=VTF8yNQM66

work page 2024

[18] [18]

Kagdi, M

H. Kagdi, M. L. Collard, and J. I. Maletic. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of software maintenance and evolution: Research and practice, 19(2):77–131, 2007

work page 2007

[19] [19]

T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024

work page arXiv 2024

[20] [20]

X. Li, J. Jin, G. Dong, H. Qian, Y . Zhu, Y . Wu, J.-R. Wen, and Z. Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Linux, 1991

Linus Torvalds. Linux, 1991. URL https://github.com/torvalds/linux

work page 1991

[22] [22]

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

work page 2024

[23] [23]

Mathai, C

A. Mathai, C. Huang, P. Maniatis, A. Nogikh, F. Ivan ˇci´c, J. Yang, and B. Ray. Kgym: A platform and dataset to benchmark large language models on linux kernel crash resolution. Advances in Neural Information Processing Systems, 37:78053–78078, 2024

work page 2024

[24] [24]

Mathai, C

A. Mathai, C. Huang, S. Ma, J. Kim, H. Mitchell, A. Nogikh, P. Maniatis, F. Ivanˇci´c, J. Yang, and B. Ray. Crashfixer: A crash resolution agent for the linux kernel. arXiv preprint arXiv:2504.20412, 2025

work page arXiv 2025

[25] [25]

Introducing researcher and analyst in microsoft 365 copilot, 2025

Microsoft. Introducing researcher and analyst in microsoft 365 copilot, 2025. URL https://www.microsoft.com/en-us/microsoft-365/blog/2025/03/25/ introducing-researcher-and-analyst-in-microsoft-365-copilot/

work page 2025

[26] [26]

Nielson, H

F. Nielson, H. R. Nielson, and C. Hankin. Principles of program analysis. springer, 2015

work page 2015

[27] [27]

Introducing openai o1-preview, 2024

OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/ hello-gpt-4o/

work page 2024

[28] [28]

Introducing openai o1-preview, 2024

OpenAI. Introducing openai o1-preview, 2024. URL https://openai.com/index/ introducing-openai-o1-preview/

work page 2024

[29] [29]

Openai deep research integration with github, 2025

OpenAI. Openai deep research integration with github, 2025. URL https://help.openai. com/en/articles/11145903-connecting-github-to-chatgpt-deep-research

work page arXiv 2025

[30] [30]

Introducing deep research, 2025

OpenAI. Introducing deep research, 2025. URL https://openai.com/index/ introducing-deep-research/

work page 2025

[31] [31]

Introducing perplexity deep research, 2025

Perplexity. Introducing perplexity deep research, 2025. URLhttps://www.perplexity.ai/ de/hub/blog/introducing-perplexity-deep-research

work page 2025

[32] [32]

Y . Shao, Y . Jiang, T. A. Kanell, P. Xu, O. Khattab, and M. S. Lam. Assisting in writing wikipedia- like articles from scratch with large language models. arXiv preprint arXiv:2402.14207, 2024

work page arXiv 2024

[33] [33]

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

universal-ctags. ctags. URL https://github.com/universal-ctags/ctags

work page

[35] [35]

Wadhwa, A

N. Wadhwa, A. Sonwane, D. Arora, A. Mehrotra, S. Utpala, R. B. Bairi, A. Kanade, and N. Natarajan. MASAI: Modular architecture for software-engineering AI agents. In NeurIPS 2024 Workshop on Open-World Agents, 2024. URL https://openreview.net/forum?id= NSINt8lLYB

work page 2024

[36] [36]

X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Lea...

work page 2025

[37] [37]

J. Wu, J. Zhu, and Y . Liu. Agentic reasoning: Reasoning llms with tools for the deep research. arXiv preprint arXiv:2502.04644, 2025

work page arXiv 2025

[38] [38]

W. Wu, S. Huang, Y . Jiang, P. Xie, F. Huang, and H. Zhao. Unfolding the headline: Iterative self- questioning for news retrieval and timeline summarization. arXiv preprint arXiv:2501.00888, 2025

work page arXiv 2025

[39] [39]

C. S. Xia, Y . Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe- agent: Agent-computer interfaces enable automated software engineering, 2024. URL https: //arxiv.org/abs/2405.15793

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing rea- soning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023

[42] [42]

In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024)

Y . Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, page 1592–1604, New York, NY , USA, 2024. Association for Computing Machinery. ISBN 9798400706127. doi: 10.1145/3650212.3680384. URL https://doi.org/10....

work page doi:10.1145/3650212.3680384 2024

[43] [43]

There was a fix for a general protection fault in the `smsusb` driver related to endpoint initialization, which suggests that endpoint management might have been problematic in the past

work page

[44] [44]

A past commit addresses use-after-free bugs caused by `do_submit_urb()`, which is relevant since improper URB management could lead to synchronization issues

work page

[45] [45]

INIT_WORK

Changes to `cancel_work_sync` function suggest potential race conditions or synchronization problems involving work items. The definition of `cancel_work_sync` is straightforward, but I need to examine `__cancel_work_timer` The contents from the memory are added to prompt Figure 6: Code Researcher trajectory while solving the crash described in Figure 3: ...

work page

[46] [46]

can: gs_usb: add extended bt_const feature

for caching build files generated during kernel compilation and our own logic for cachinggit checkouts. kGym has a distributed setup featuring five workers -kBuilder, kReproducer, kScheduler, kDash- board and kmq. (1) kBuilder takes as input a source commit, a kernel config, and (optionally) a patch. It checks out the kernel at the source commit, applies ...

work page 2017