The End of Code Review: Coding Agents Supersede Human Inspection

Martin Monperrus

arxiv: 2606.13175 · v1 · pith:F6P3OWFVnew · submitted 2026-06-11 · 💻 cs.SE

Pith reviewed 2026-06-27 06:09 UTC · model grok-4.3

classification 💻 cs.SE

keywords code review · coding agents · large language models · software quality · AI-assisted development · autonomous software systems

The pith

Coding agents now meet every goal of code review at lower cost and higher throughput, rendering human inspection unnecessary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that LLM-based coding agents have reached a capability level where they can fully replace the human code review process that has been standard since 1976. It maintains that agents can fulfill all traditional objectives of review, including defect detection and quality assurance, while operating at reduced expense and greater volume than human teams. The paper further asserts that retaining humans as required reviewers for agent-generated code fails to deliver real guarantees and cannot handle the volume of changes produced by AI tools. This position implies a fundamental shift away from established software quality practices toward agent-driven pipelines.

Core claim

We argue that coding agents have crossed a threshold of capability at which traditional human code review is no longer a necessary component of a software quality pipeline. Our argument rests on two claims: every stated goal of code review can be served by agents at lower cost and higher throughput; the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

What carries the argument

Coding agents, defined as LLM-based autonomous systems that read, write, test, and repair software, serving as the replacement mechanism for human inspection.

If this is right

Every traditional objective of code review, such as finding defects and improving maintainability, becomes achievable through agent operation alone.
Hybrid setups that require human review of agent output provide neither reliable assurance nor the ability to process increased change volumes.
Software development organizations can remove human code review from their quality pipelines without loss of effectiveness.
Quality assurance shifts entirely to agent capabilities, including testing and repair loops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Development velocity could increase because agents handle review instantly rather than waiting for human availability.
Training programs for developers might redirect from review skills toward agent oversight and prompt design.
New failure modes could appear if agents share systematic blind spots on certain classes of issues.

Load-bearing premise

Coding agents are already capable of serving every stated goal of code review at lower cost and higher throughput.

What would settle it

A direct comparison study measuring defect detection accuracy, review coverage, and total cost per change for coding agents versus human reviewers on identical large-scale codebases.
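
As a rough illustration of how such a comparison could be scored, the sketch below computes precision, recall, and cost per change for one reviewer (agent or human) against a set of known defects. The `ReviewOutcome` type, the `score` function, the defect IDs, and the cost figures are all hypothetical and do not come from the paper; review coverage is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    """Hypothetical record of one reviewer's results on a shared set of changes."""
    flagged: set[str]   # IDs of the defects the reviewer reported
    cost: float         # total review cost, in arbitrary units
    changes: int        # number of changes reviewed


def score(outcome: ReviewOutcome, known_defects: set[str]) -> dict[str, float]:
    """Compute precision, recall, and cost per change against known defects."""
    true_positives = outcome.flagged & known_defects
    precision = len(true_positives) / len(outcome.flagged) if outcome.flagged else 0.0
    recall = len(true_positives) / len(known_defects) if known_defects else 0.0
    cost_per_change = outcome.cost / outcome.changes if outcome.changes else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "cost_per_change": cost_per_change,
    }


if __name__ == "__main__":
    # Made-up data: four seeded defects, one agent run, one human run.
    known = {"D1", "D2", "D3", "D4"}
    agent = ReviewOutcome(flagged={"D1", "D2", "X9"}, cost=5.0, changes=10)
    human = ReviewOutcome(flagged={"D1", "D3", "D4"}, cost=400.0, changes=10)
    print("agent:", score(agent, known))
    print("human:", score(human, known))
```

A real study would also need identical codebases for both arms and an agreed ground truth for what counts as a defect, which this sketch simply assumes.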

Figures

Figures reproduced from arXiv: 2606.13175 by Martin Monperrus.

**Figure 1.** Argument map of the paper. Review goals support three claims that lead to the conclusion, which entails four implications for practice and tooling.

Original abstract

Code review has been the primary quality gate in software development since Fagan formalised code inspection in 1976. For five decades, having a human examine and comment on a colleague's changes before merge has been a cornerstone practice at organisations of every size. Coding agents are large language model (LLM)-based autonomous systems capable of reading, writing, testing, and repairing software. We argue that coding agents have crossed a threshold of capability at which traditional human code review is no longer a necessary component of a software quality pipeline. Our argument rests on two claims: every stated goal of code review can be served by agents at lower cost and higher throughput; the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Position paper asserts coding agents have ended the need for human code review but supplies no data or analysis to support the claim.

This is a position paper arguing that coding agents now handle every goal of code review at lower cost and higher throughput, making human inspection unnecessary. The two supporting claims are that agents can fully replace review functions and that hybrid setups where humans still review AI-generated code will not scale.

The paper does a clean job of laying out those two points and connecting them to the history of code review since Fagan. It frames the hybrid approach as a dead end without much hedging, which makes the logic easy to follow.

The central weakness is that both claims rest on an untested assertion about current agent performance. There are no benchmarks, defect comparisons, case studies, or discussion of where agents still miss issues that humans catch. The argument treats the threshold-crossing as given rather than demonstrated.

This kind of piece could interest people already thinking about AI changes to software engineering workflows and might work for a reading group discussion. It does not add new results or verifiable grounding, so it is not something I would cite.

I would not send it for peer review as a research paper. It reads as an opinion piece that would need either empirical support or explicit positioning as commentary to be worth referee time.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that coding agents (LLM-based autonomous systems for reading, writing, testing, and repairing software) have crossed a capability threshold making traditional human code review unnecessary in software quality pipelines. The argument rests on two claims: agents can serve every goal of code review (bug detection, style enforcement, knowledge transfer, security review) at lower cost and higher throughput than humans, and hybrid workflows (agents write, humans review) are a dead end as they provide no meaningful assurance and fail to scale with AI throughput.

Significance. If the claims held with empirical support, the result would be highly significant for software engineering, challenging a practice formalized since Fagan's 1976 inspections and potentially enabling fully automated quality pipelines with major gains in speed and cost. The paper identifies a possible inflection point in AI-assisted development.

major comments (2)

[Abstract] Abstract, paragraph 2: The central assertion that 'every stated goal of code review can be served by agents at lower cost and higher throughput' is presented without any benchmarks, defect-rate comparisons, case studies, or failure-mode analysis demonstrating agent performance against human reviewers on tasks such as subtle bug detection or security review.
[Abstract] Abstract, paragraph 2: The claim that hybrid integration 'neither provides meaningful assurance nor scales with AI-assisted throughput' is an unsupported assertion; the manuscript contains no data on current hybrid workflow outcomes, assurance metrics, or throughput bottlenecks to substantiate why this approach is a 'dead end'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The manuscript is a position paper presenting an argumentative case for the obsolescence of human code review in light of coding agent capabilities, rather than an empirical study with new benchmarks. We address each major comment below and will revise the abstract and introduction to explicitly frame the work as a position paper synthesizing trends and logical implications.

Referee: [Abstract] Abstract, paragraph 2: The central assertion that 'every stated goal of code review can be served by agents at lower cost and higher throughput' is presented without any benchmarks, defect-rate comparisons, case studies, or failure-mode analysis demonstrating agent performance against human reviewers on tasks such as subtle bug detection or security review.

Authors: The manuscript advances a position based on the observed trajectory of LLM-based coding agents and published reports of their performance on code understanding, generation, testing, and repair tasks. It does not include new head-to-head empirical comparisons because the purpose is to outline the implications of current capabilities crossing a threshold, not to conduct a controlled evaluation. We will revise the abstract to state upfront that this is a position paper and to reference the body of existing agent evaluation literature while noting the absence of comprehensive failure-mode analyses for subtle bugs. revision: yes
Referee: [Abstract] Abstract, paragraph 2: The claim that hybrid integration 'neither provides meaningful assurance nor scales with AI-assisted throughput' is an unsupported assertion; the manuscript contains no data on current hybrid workflow outcomes, assurance metrics, or throughput bottlenecks to substantiate why this approach is a 'dead end'.

Authors: This claim follows from a logical analysis of throughput mismatch: agent code generation can scale to thousands of changes per day while human review capacity remains bounded. The paper does not present new measurements of hybrid workflows because it is not an empirical study of current practices; instead it argues that mandatory human review becomes a bottleneck under high AI throughput. We will expand the abstract and add a short section clarifying this as an argument about scaling limits rather than a data-driven claim about today's hybrid outcomes. revision: yes
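
The throughput-mismatch argument in this response can be made concrete with a toy backlog model. This is a minimal sketch under stated assumptions: the function `review_backlog` and every number in it (2,000 agent changes per day, 20 reviewers, 10 reviews per reviewer per day) are hypothetical and are not measurements from the paper.

```python
def review_backlog(
    changes_per_day: int,
    reviewers: int,
    reviews_per_reviewer_per_day: int,
    days: int,
) -> list[int]:
    """Return the number of unreviewed changes at the end of each day.

    Assumes a fixed daily inflow of changes and a fixed human review capacity.
    """
    capacity = reviewers * reviews_per_reviewer_per_day
    backlog = 0
    history = []
    for _ in range(days):
        # New changes arrive, reviewers clear up to their daily capacity.
        backlog = max(0, backlog + changes_per_day - capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Hypothetical inputs: inflow of 2000 changes/day against capacity of 200/day.
    print(review_backlog(2000, 20, 10, 5))  # [1800, 3600, 5400, 7200, 9000]
```

Under these assumptions the backlog grows linearly whenever inflow exceeds capacity. The model says nothing about whether the reviews that do happen provide assurance, which is the other half of the claim.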

Circularity Check

0 steps flagged

No circularity detected; argument consists of asserted premises without self-referential derivation

The manuscript is an argumentative position paper whose central thesis is explicitly introduced as resting on two stated claims about agent performance and hybrid workflow limitations. No equations, fitted parameters, uniqueness theorems, or self-citations appear in the provided text. The two claims function as premises rather than outputs of any derivation chain, so no reduction to inputs by construction occurs. The paper cites Fagan (1976) for historical context but does not rely on self-citation or prior author work to justify its threshold-crossing assertion. This is a normal non-finding for an opinion piece whose soundness is an external-evidence question, not an internal circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only. The central claim rests on an unstated domain assumption about the current state of coding agent capabilities.

axioms (1)

domain assumption: Coding agents have crossed a capability threshold sufficient to replace human code review for all stated goals.
Invoked directly in the abstract as the basis for the argument that human review is no longer necessary.

pith-pipeline@v0.9.1-grok · 5656 in / 1348 out tokens · 46100 ms · 2026-06-27T06:09:28.078826+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Augmentation with Dilution: A Large-Scale Empirical Study of Human Contributor Ecosystems After AI Coding Agent Adoption
cs.SE · 2026-06 · unverdicted · novelty 7.0

AI coding agent adoption causes no change in human contributor count but reduces contributor density and newcomer share by 3.7 pp while increasing review depth by 5.3%, according to a staggered difference-in-differences (DiD) analysis of 11k GitHub projects.

Reference graph

Works this paper leans on

32 extracted references · 7 linked inside Pith · cited by 1 Pith paper

[1] A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of modern code review,” in Proceedings of the 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 712–721

[2] C. Sadowski, E. Söderberg, L. Church, M. Sipko, and A. Bacchelli, “Modern code review: a case study at Google,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). ACM, 2018, pp. 181–190

[3] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, “Usage, costs, and benefits of continuous integration in open-source projects,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM, 2016, pp. 426–437

[4] A. Bosu, M. Greiler, and C. Bird, “Process aspects and social dynamics of contemporary code review,” in IEEE Transactions on Software Engineering, vol. 43, no. 1. IEEE, 2016, pp. 56–75

[5] A. Murgia, P. Tourani, B. Adams, and M. Ortu, “Do developers feel emotions? an exploratory analysis of emotions in software artifacts,” in Proceedings of the 11th Working Conference on Mining Software Repositories (MSR). ACM, 2014, pp. 262–271

[6] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” arXiv preprint arXiv:2405.15793, 2024

[7] Cognition AI, “Introducing Devin, the first AI software engineer,” Cognition AI Blog, 2024, https://www.cognition.ai/blog/introducing-devin

[8] X. Wang, B. Chen, Y. Yuan, Y. Zhang, B. Li, C. Qian et al., “OpenDevin: An open platform for AI software developers as generalist agents,” arXiv preprint arXiv:2407.16741, 2024

[9] GitHub, “GitHub Copilot Workspace: Welcome to the Copilot-native developer environment,” GitHub Blog, 2024, https://github.blog/2024-04-29-github-copilot-workspace/

[10] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” arXiv preprint arXiv:2310.06770, 2023

[11] Z. Li, S. Lu, D. Guo, N. Duan, S. Jannu, G. Jenks, D. Majumder, J. Green, N. Sundaresan, M. Fu et al., “CodeReviewer: Pre-training for automating code review activities,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 2022, pp. 1536–1546

[12] C. Pornprasit and C. Tantithamthavorn, “Automated code review in practice,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 394–405

[13] M. E. Fagan, “Design and code inspections to reduce errors in program development,” IBM Systems Journal, vol. 15, no. 3, pp. 182–211, 1976

[14] J. Czerwonka, M. Greiler, and J. Tilford, “Code reviews do not find bugs: How the current code review best practice slows us down,” pp. 27–28, 2015

[15] J. Lu, L. Yu, X. Li, L. Yang, and C. Zuo, “Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning,” in Proceedings of the 34th IEEE International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2023, pp. 647–658

[16] R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella, D. Poshyvanyk, and G. Bavota, “Using pre-trained models to boost code review automation.” ACM, 2022, pp. 1–12

[17] X. Tang, K. Kim, Y. Song, C. Lothritz, B. Li, S. Ezzini, H. Tian, J. Klein, and T. F. Bissyande, “CodeAgent: Autonomous communicative agents for code review,” arXiv preprint arXiv:2402.02172, 2024

[18] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

[19] Anthropic, “The Claude 3 model family: Opus, Sonnet, Haiku,” Anthropic Technical Report, 2024

[20] S. bench Team, “SWE-bench leaderboard,” https://www.swebench.com, 2025

[21] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each,” pp. 3–13, 2012

[22] C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” pp. 1482–1494, 2023

[23] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-level code generation with AlphaCode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022

[24] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,” in Proceedings of the 43rd IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 754–768

[25] R. Khoury, A. R. Avci, J. Brunelle, and B. Marc Camara, “How secure is code generated by ChatGPT?” 2023

[26] J. W. Lin, E. K. Jones, D. J. Jasper, E. J.-s. Ho, A. Wu, A. T. Yang, N. Perry, A. Zou, M. Fredrikson, J. Z. Kolter et al., “Comparing ai agents to cybersecurity professionals in real-world penetration testing,” arXiv preprint arXiv:2512.09882, 2025

[27] S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, “The impact of AI on developer productivity: Evidence from GitHub Copilot,” arXiv preprint arXiv:2302.06590, 2023

[28] M. Shahin, M. A. Babar, and L. Zhu, “Continuous integration, delivery and deployment: A systematic review on approaches, tools, challenges and practices,” IEEE Access, vol. 5, pp. 3909–3943, 2017

[29] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. B. Brown, J. Clark... “Language models (mostly) know what they know,” arXiv, 2022

[30] A. Yildiz, S. G. Teo, Y. Lou, Y. Feng, C. Wang, and D. M. Divakaran, “Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,” arXiv preprint arXiv:2503.03586, 2025

[31] Berkeley Risk and Decisions Initiative, “Frontier ai’s impact on the cybersecurity landscape (paper summary and blog),” https://rdi.berkeley.edu/frontier-ai-impact-on-cybersecurity/, 2025

[32] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,” arXiv preprint arXiv:2302.12173, 2023
