arxiv: 2606.13175 · v1 · pith:F6P3OWFVnew · submitted 2026-06-11 · 💻 cs.SE

The End of Code Review: Coding Agents Supersede Human Inspection

Pith reviewed 2026-06-27 06:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords code review · coding agents · large language models · software quality · AI-assisted development · autonomous software systems

The pith

Coding agents now meet every goal of code review at lower cost and higher throughput, rendering human inspection unnecessary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that LLM-based coding agents have reached a capability level where they can fully replace the human code review process that has been standard since 1976. It maintains that agents can fulfill all traditional objectives of review, including defect detection and quality assurance, while operating at reduced expense and greater volume than human teams. The paper further asserts that retaining humans as required reviewers for agent-generated code fails to deliver real guarantees and cannot handle the volume of changes produced by AI tools. This position implies a fundamental shift away from established software quality practices toward agent-driven pipelines.

Core claim

We argue that coding agents have crossed a threshold of capability at which traditional human code review is no longer a necessary component of a software quality pipeline. Our argument rests on two claims: every stated goal of code review can be served by agents at lower cost and higher throughput; the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

What carries the argument

Coding agents, defined as LLM-based autonomous systems that read, write, test, and repair software, serve as the replacement mechanism for human inspection.

If this is right

  • Every traditional objective of code review, such as finding defects and improving maintainability, becomes achievable through agent operation alone.
  • Hybrid setups that require human review of agent output provide neither reliable assurance nor the ability to process increased change volumes.
  • Software development organizations can remove human code review from their quality pipelines without loss of effectiveness.
  • Quality assurance shifts entirely to agent capabilities, including testing and repair loops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Development velocity could increase because agents handle review instantly rather than waiting for human availability.
  • Training programs for developers might redirect from review skills toward agent oversight and prompt design.
  • New failure modes could appear if agents share systematic blind spots on certain classes of issues.

Load-bearing premise

Coding agents are already capable of serving every stated goal of code review at lower cost and higher throughput.

What would settle it

A direct comparison study measuring defect detection accuracy, review coverage, and total cost per change for coding agents versus human reviewers on identical large-scale codebases.
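As a rough illustration of what such a comparison would tabulate, the sketch below computes the three quantities named above for one reviewer, human or agent, over a set of changes with known defects. Everything in it is an assumption made for illustration: the paper does not define a study protocol, and the names `ReviewOutcome` and `summarize`, the fields, and the precision/recall reading of "defect detection accuracy" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    """One change submitted for review (hypothetical illustration data)."""
    true_defects: set[str]  # defect IDs known to be present (ground truth)
    flagged: set[str]       # defect IDs the reviewer reported
    reviewed: bool          # whether the reviewer looked at the change at all
    cost: float             # review cost for this change, in any fixed unit


def summarize(outcomes: list[ReviewOutcome]) -> dict[str, float]:
    """Aggregate detection accuracy, coverage, and cost per change."""
    if not outcomes:
        raise ValueError("need at least one reviewed change")
    tp = sum(len(o.true_defects & o.flagged) for o in outcomes)  # real defects found
    fp = sum(len(o.flagged - o.true_defects) for o in outcomes)  # false alarms
    fn = sum(len(o.true_defects - o.flagged) for o in outcomes)  # missed defects
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "coverage": sum(o.reviewed for o in outcomes) / len(outcomes),
        "cost_per_change": sum(o.cost for o in outcomes) / len(outcomes),
    }
```

Running `summarize` once on the human reviewers' outcomes and once on the agents' outcomes for the same changes would give directly comparable numbers, provided the ground-truth defect set is established independently of both.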

Figures

Figures reproduced from arXiv: 2606.13175 by Martin Monperrus.

Figure 1: Argument map of the paper. Review goals support three claims that lead to the conclusion, which entails four implications for practice and tooling.

Original abstract

Code review has been the primary quality gate in software development since Fagan formalised code inspection in 1976. For five decades, having a human examine and comment on a colleague's changes before merge has been a cornerstone practice at organisations of every size. Coding agents are large language model (LLM)-based autonomous systems capable of reading, writing, testing, and repairing software. We argue that coding agents have crossed a threshold of capability at which traditional human code review is no longer a necessary component of a software quality pipeline. Our argument rests on two claims: every stated goal of code review can be served by agents at lower cost and higher throughput; the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that coding agents (LLM-based autonomous systems for reading, writing, testing, and repairing software) have crossed a capability threshold making traditional human code review unnecessary in software quality pipelines. The argument rests on two claims: agents can serve every goal of code review (bug detection, style enforcement, knowledge transfer, security review) at lower cost and higher throughput than humans, and hybrid workflows (agents write, humans review) are a dead end as they provide no meaningful assurance and fail to scale with AI throughput.

Significance. If the claims held with empirical support, the result would be highly significant for software engineering, challenging a practice formalized since Fagan's 1976 inspections and potentially enabling fully automated quality pipelines with major gains in speed and cost. The paper identifies a possible inflection point in AI-assisted development.

major comments (2)
  1. [Abstract] Abstract, paragraph 2: The central assertion that 'every stated goal of code review can be served by agents at lower cost and higher throughput' is presented without any benchmarks, defect-rate comparisons, case studies, or failure-mode analysis demonstrating agent performance against human reviewers on tasks such as subtle bug detection or security review.
  2. [Abstract] Abstract, paragraph 2: The claim that hybrid integration 'neither provides meaningful assurance nor scales with AI-assisted throughput' is an unsupported assertion; the manuscript contains no data on current hybrid workflow outcomes, assurance metrics, or throughput bottlenecks to substantiate why this approach is a 'dead end'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The manuscript is a position paper presenting an argumentative case for the obsolescence of human code review in light of coding agent capabilities, rather than an empirical study with new benchmarks. We address each major comment below and will revise the abstract and introduction to explicitly frame the work as a position paper synthesizing trends and logical implications.

Point-by-point responses
  1. Referee: [Abstract] Abstract, paragraph 2: The central assertion that 'every stated goal of code review can be served by agents at lower cost and higher throughput' is presented without any benchmarks, defect-rate comparisons, case studies, or failure-mode analysis demonstrating agent performance against human reviewers on tasks such as subtle bug detection or security review.

    Authors: The manuscript advances a position based on the observed trajectory of LLM-based coding agents and published reports of their performance on code understanding, generation, testing, and repair tasks. It does not include new head-to-head empirical comparisons because the purpose is to outline the implications of current capabilities crossing a threshold, not to conduct a controlled evaluation. We will revise the abstract to state upfront that this is a position paper and to reference the body of existing agent evaluation literature while noting the absence of comprehensive failure-mode analyses for subtle bugs.

    Revision: yes

  2. Referee: [Abstract] Abstract, paragraph 2: The claim that hybrid integration 'neither provides meaningful assurance nor scales with AI-assisted throughput' is an unsupported assertion; the manuscript contains no data on current hybrid workflow outcomes, assurance metrics, or throughput bottlenecks to substantiate why this approach is a 'dead end'.

    Authors: This claim follows from a logical analysis of throughput mismatch: agent code generation can scale to thousands of changes per day while human review capacity remains bounded. The paper does not present new measurements of hybrid workflows because it is not an empirical study of current practices; instead it argues that mandatory human review becomes a bottleneck under high AI throughput. We will expand the abstract and add a short section clarifying this as an argument about scaling limits rather than a data-driven claim about today's hybrid outcomes.

    Revision: yes
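The throughput-mismatch argument in the second response can be made concrete with a small backlog calculation. The sketch below is a minimal model under stated assumptions, not something taken from the paper: changes arrive at a constant daily rate, each reviewer clears a fixed number of changes per day, and unreviewed changes carry over. The function name `review_backlog` and all numbers in the usage note are hypothetical.

```python
def review_backlog(
    arrivals_per_day: int,
    reviewers: int,
    reviews_per_reviewer_per_day: int,
    days: int,
) -> list[int]:
    """Return the unreviewed-change backlog at the end of each day.

    Assumes constant arrivals and a fixed, bounded human review capacity.
    """
    capacity = reviewers * reviews_per_reviewer_per_day
    backlog = 0
    history = []
    for _ in range(days):
        # New changes join the queue; reviewers clear at most `capacity` of them.
        backlog = max(0, backlog + arrivals_per_day - capacity)
        history.append(backlog)
    return history
```

With an assumed 1000 agent-generated changes per day and 10 reviewers handling 20 changes each, `review_backlog(1000, 10, 20, 3)` returns `[800, 1600, 2400]`: the queue grows linearly whenever arrivals exceed capacity. The model says nothing about whether the reviews that do happen provide assurance, which is the other half of the claim.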

Circularity Check

0 steps flagged

No circularity detected; argument consists of asserted premises without self-referential derivation

The manuscript is an argumentative position paper whose central thesis is explicitly introduced as resting on two stated claims about agent performance and hybrid workflow limitations. No equations, fitted parameters, uniqueness theorems, or self-citations appear in the provided text. The two claims function as premises rather than outputs of any derivation chain, so no reduction to inputs by construction occurs. The paper cites Fagan (1976) for historical context but does not rely on self-citation or prior author work to justify its threshold-crossing assertion. This is a normal non-finding for an opinion piece whose soundness is an external-evidence question, not an internal circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only. The central claim rests on an unsupported domain assumption about the current state of coding agent capabilities.

axioms (1)
  • domain assumption: Coding agents have crossed a capability threshold sufficient to replace human code review for all stated goals
    Invoked directly in the abstract as the basis for the argument that human review is no longer necessary.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Augmentation with Dilution: A Large-Scale Empirical Study of Human Contributor Ecosystems After AI Coding Agent Adoption

    cs.SE · 2026-06 · unverdicted · novelty 7.0

    In a staggered difference-in-differences (DiD) analysis of 11k GitHub projects, AI coding agent adoption causes no change in human contributor count, but it reduces contributor density and lowers newcomer share by 3.7 pp while increasing review depth by 5.3%.

Reference graph

Works this paper leans on

32 extracted references · 10 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Expectations, outcomes, and challenges of modern code review,

    A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of modern code review,” in Proceedings of the 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 712–721

  2. [2]

    Modern code review: a case study at Google,

    C. Sadowski, E. Söderberg, L. Church, M. Sipko, and A. Bacchelli, “Modern code review: a case study at Google,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). ACM, 2018, pp. 181–190

  3. [3]

    Usage, costs, and benefits of continuous integration in open-source projects,

    M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, “Usage, costs, and benefits of continuous integration in open-source projects,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM, 2016, pp. 426–437

  4. [4]

    Process aspects and social dynamics of contemporary code review,

    A. Bosu, M. Greiler, and C. Bird, “Process aspects and social dynamics of contemporary code review,” in IEEE Transactions on Software Engineering, vol. 43, no. 1. IEEE, 2016, pp. 56–75

  5. [5]

    Do developers feel emotions? an exploratory analysis of emotions in software artifacts,

    A. Murgia, P. Tourani, B. Adams, and M. Ortu, “Do developers feel emotions? an exploratory analysis of emotions in software artifacts,” in Proceedings of the 11th Working Conference on Mining Software Repositories (MSR). ACM, 2014, pp. 262–271

  6. [6]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” arXiv preprint arXiv:2405.15793, 2024

  7. [7]

    Introducing Devin, the first AI software engineer,

    Cognition AI, “Introducing Devin, the first AI software engineer,” Cognition AI Blog, 2024, https://www.cognition.ai/blog/introducing-devin

  8. [8]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    X. Wang, B. Chen, Y. Yuan, Y. Zhang, B. Li, C. Qian et al., “OpenDevin: An open platform for AI software developers as generalist agents,” arXiv preprint arXiv:2407.16741, 2024

  9. [9]

    GitHub Copilot Workspace: Welcome to the Copilot-native developer environment,

    GitHub, “GitHub Copilot Workspace: Welcome to the Copilot-native developer environment,” GitHub Blog, 2024, https://github.blog/2024-04-29-github-copilot-workspace/

  10. [10]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” arXiv preprint arXiv:2310.06770, 2023

  11. [11]

    CodeReviewer: Pre-training for automating code review activities,

    Z. Li, S. Lu, D. Guo, N. Duan, S. Jannu, G. Jenks, D. Majumder, J. Green, N. Sundaresan, M. Fu et al., “CodeReviewer: Pre-training for automating code review activities,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 2022, pp. 1536–1546

  12. [12]

    Automated code review in practice,

    C. Pornprasit and C. Tantithamthavorn, “Automated code review in practice,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 394–405

  13. [13]

    Design and code inspections to reduce errors in program development,

    M. E. Fagan, “Design and code inspections to reduce errors in program development,” IBM Systems Journal, vol. 15, no. 3, pp. 182–211, 1976

  14. [14]

    Code reviews do not find bugs: How the current code review best practice slows us down,

    J. Czerwonka, M. Greiler, and J. Tilford, “Code reviews do not find bugs: How the current code review best practice slows us down,” pp. 27–28, 2015

  15. [15]

    Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning,

    J. Lu, L. Yu, X. Li, L. Yang, and C. Zuo, “Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning,” in Proceedings of the 34th IEEE International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2023, pp. 647–658

  16. [16]

    Using pre-trained models to boost code review automation

    R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella, D. Poshyvanyk, and G. Bavota, “Using pre-trained models to boost code review automation.” ACM, 2022, pp. 1–12

  17. [17]

    CodeAgent: Autonomous communicative agents for code review,

    X. Tang, K. Kim, Y. Song, C. Lothritz, B. Li, S. Ezzini, H. Tian, J. Klein, and T. F. Bissyande, “CodeAgent: Autonomous communicative agents for code review,” arXiv preprint arXiv:2402.02172, 2024

  18. [18]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

  19. [19]

    The Claude 3 model family: Opus, Sonnet, Haiku,

    Anthropic, “The Claude 3 model family: Opus, Sonnet, Haiku,” Anthropic Technical Report, 2024

  20. [20]

    SWE-bench leaderboard,

    SWE-bench Team, “SWE-bench leaderboard,” https://www.swebench.com, 2025

  21. [21]

    A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each,

    C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each,” pp. 3–13, 2012

  22. [22]

    Automated program repair in the era of large pre-trained language models,

    C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” pp. 1482–1494, 2023

  23. [23]

    Competition-level code generation with AlphaCode,

    Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-level code generation with AlphaCode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022

  24. [24]

    Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,

    H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of GitHub Copilot’s code contributions,” in Proceedings of the 43rd IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 754–768

  25. [25]

    How secure is code generated by ChatGPT?

    R. Khoury, A. R. Avci, J. Brunelle, and B. Marc Camara, “How secure is code generated by ChatGPT?” 2023

  26. [26]

    Comparing ai agents to cybersecurity professionals in real-world penetration testing,

    J. W. Lin, E. K. Jones, D. J. Jasper, E. J.-s. Ho, A. Wu, A. T. Yang, N. Perry, A. Zou, M. Fredrikson, J. Z. Kolter et al., “Comparing ai agents to cybersecurity professionals in real-world penetration testing,” arXiv preprint arXiv:2512.09882, 2025

  27. [27]

    The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

    S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, “The impact of AI on developer productivity: Evidence from GitHub Copilot,” arXiv preprint arXiv:2302.06590, 2023

  28. [28]

    Continuous integration, delivery and deployment: A systematic review on approaches, tools, challenges and practices,

    M. Shahin, M. A. Babar, and L. Zhu, “Continuous integration, delivery and deployment: A systematic review on approaches, tools, challenges and practices,” IEEE Access, vol. 5, pp. 3909–3943, 2017

  29. [29]

    Language Models (Mostly) Know What They Know

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. B. Brown, J. Clark...

  30. [30]

    Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,

    A. Yildiz, S. G. Teo, Y. Lou, Y. Feng, C. Wang, and D. M. Divakaran, “Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,” arXiv preprint arXiv:2503.03586, 2025

  31. [31]

    Frontier ai’s impact on the cybersecurity landscape (paper summary and blog),

    Berkeley Risk and Decisions Initiative, “Frontier ai’s impact on the cybersecurity landscape (paper summary and blog),” https://rdi.berkeley.edu/frontier-ai-impact-on-cybersecurity/, 2025

  32. [32]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,” arXiv preprint arXiv:2302.12173, 2023