
arxiv: 2605.22534 · v1 · pith:NPEHJJIInew · submitted 2026-05-21 · 💻 cs.SE

Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study

Pith reviewed 2026-05-22 03:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI coding agents · agentic pull requests · pull request outcomes · review interactions · empirical study · workflow constraints · merge decisions

The pith

Rejection outcomes substantially overstate AI agent errors in pull requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why pull requests submitted by AI coding agents are merged or rejected in open-source projects. It shows that the merge-or-reject outcome alone does not accurately measure an agent's performance: many rejections happen for reasons unrelated to the quality of the agent's code, such as project workflow constraints. This matters because outcome-based evaluation may undervalue agents' contributions. The study analyzed 9,799 human-reviewed PRs and manually inspected 717 representative cases to recover the actual decision reasons.

Core claim

The paper analyzes 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed ones, and manually inspects 717 representative cases. Only 35.7 percent of rejected PRs reflected clear agentic failures, while 31.2 percent were driven by workflow constraints and 33.1 percent lacked observable decision rationale. Among merged PRs, 15.4 percent required explicit reviewer involvement and 5.5 percent showed no visible interaction trace.

What carries the argument

Classification of decision rationales recovered from review interaction artifacts, separating agent failures from workflow constraints and unclear cases.

If this is right

  • PR merge and reject labels alone do not capture agent performance.
  • Evaluation of agents requires interaction-aware methods grounded in review behavior.
  • Agents such as Copilot and Devin appear more often in reviewer-mediated workflows than Codex or Cursor.
  • Counting every rejection as an agent error inflates perceived agent error by a factor of nearly three (100 percent versus the 35.7 percent that reflected clear failures).
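
The "factor of nearly three" can be checked with a few lines of arithmetic. The sketch below assumes the naive reading counts every rejection as an agent error, while the interaction-aware reading counts only the clear agentic failures; the variable names are illustrative and not taken from the paper.

```python
# Shares of rejected PRs by recovered rationale, as reported in the paper (percent).
rejected_shares = {
    "clear_agentic_failure": 35.7,
    "workflow_constraint": 31.2,
    "no_observable_rationale": 33.1,
}

# The three categories are expected to partition the rejected PRs.
total = sum(rejected_shares.values())

# Naive reading: 100% of rejections are agent errors.
# Interaction-aware reading: only clear agentic failures are.
inflation_factor = 100.0 / rejected_shares["clear_agentic_failure"]

print(f"total share: {total:.1f}%")
print(f"inflation factor: {inflation_factor:.2f}")
```

If some of the cases without observable rationale were in fact agent failures, the true factor would be smaller, so this figure is best read as an upper estimate.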

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Project maintainers could reduce unnecessary rejections by clarifying workflow rules for agent submissions.
  • Agent developers might prioritize designs that require less reviewer intervention to improve merge rates.
  • Automated tools could eventually classify rationales at scale to monitor agent behavior over time.

Load-bearing premise

The manual inspection and classification of the 717 representative cases accurately recovers the true underlying decision rationale from the available review interaction artifacts without substantial coder bias or missing context.

What would settle it

An independent reclassification of the same 717 cases by different reviewers that yields substantially different shares of agent failures versus workflow constraints would challenge the reported breakdown.
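
One way to quantify how far such a reclassification diverges from the original labels is a chance-corrected agreement statistic. The sketch below uses Cohen's kappa as an assumed choice (other statistics such as Krippendorff's alpha would serve equally well); the label names and the toy data are hypothetical and are not drawn from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical toy labels for six PRs (not data from the paper).
original = ["failure", "workflow", "unclear", "failure", "workflow", "unclear"]
recoded = ["failure", "workflow", "unclear", "workflow", "workflow", "failure"]

kappa = cohens_kappa(original, recoded)
print(f"kappa = {kappa:.3f}")
```

A low kappa on the real 717 cases would indicate that the category shares depend heavily on who does the coding.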

Figures

Figures reproduced from arXiv: 2605.22534 by Fumika Hoshi, Hidetake Tanaka, Hiroki Mukai, Hironori Washizaki, Inase Kondo, Kazuki Kusama, Naoyasu Ubayashi, Norihiro Yoshida, Sien Reeve O. Peralta, Yoshiki Higo, Youmei Fan.

Figure 1: Overview of the study pipeline

Original abstract

AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone. We hypothesized that these outcome labels do not reliably reflect agent capability without considering review interactions. To test this, we conducted a decision-oriented analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed PRs, and manually inspected 717 representative cases to recover decision rationale from interaction artifacts. We found that rejection outcomes substantially overstate agent error: only 35.7% of rejected PRs reflected clear agentic failures, while 31.2% were driven by workflow constraints and 33.1% lacked observable decision rationale. Among merged PRs, 15.4% required explicit reviewer involvement through feedback or direct commits, and 5.5% showed no visible interaction trace. We further observed systematic differences across agents, with Copilot and Devin more often embedded in reviewer-mediated workflows, while Codex and Cursor PRs were typically merged with minimal interaction. These results reject the assumption that PR outcomes alone capture agent performance and demonstrate the need for interaction-aware evaluation grounded in review behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical analysis of 11,048 closed Agentic Pull Requests, filtering to 9,799 human-reviewed cases and manually inspecting 717 representative examples to recover decision rationales from review interaction artifacts. It reports that only 35.7% of rejected PRs reflect clear agentic failures, while 31.2% stem from workflow constraints and 33.1% lack observable rationale; among merged PRs, 15.4% required explicit reviewer involvement and 5.5% showed no interaction trace. Systematic differences are noted across agents (e.g., Copilot and Devin more often involve reviewer mediation). The central claim is that PR merge/reject outcomes alone overstate agent error and that evaluation must be interaction-aware.

Significance. If the manual classification proves reliable, the work offers a useful corrective to simplistic outcome-based metrics for AI coding agents. The scale of the GitHub dataset and the breakdown by agent type provide concrete empirical grounding for rethinking evaluation practices. Credit is due for grounding claims in real review threads rather than synthetic benchmarks and for surfacing the substantial share of non-failure rejections.

major comments (2)
  1. [Methodology section (around the description of the 717-case inspection)] The manual coding of the 717 cases that produces the headline figures (35.7% clear failures, 31.2% workflow constraints, 33.1% no rationale) is described without reported inter-rater reliability, a full coding protocol, or justification for the sampling frame and exclusion criteria. Because these percentages directly support the claim that rejection outcomes substantially overstate agent error, the absence of these methodological safeguards is load-bearing.
  2. [Data collection and filtering subsection] The filtering step from 11,048 to 9,799 human-reviewed PRs and the selection of the 717 representative cases require more explicit documentation of exclusion rules and sampling strategy to establish that the analyzed subset remains representative of the broader population of agentic PRs.
minor comments (2)
  1. [Results section on agent differences] Table or figure presenting the per-agent breakdowns would benefit from explicit sample sizes per agent to allow readers to assess the stability of the reported differences.
  2. [Abstract] A brief statement in the abstract or introduction clarifying that the 717 cases were drawn from the filtered 9,799 would improve immediate readability.
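
To illustrate why per-agent sample sizes matter, the sketch below computes a Wilson score interval for the same observed share at two sample sizes. The counts are hypothetical, since this page does not report per-agent counts, and the Wilson interval is an assumed choice of method rather than one the paper is known to use.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical counts: roughly 35% of cases labelled as clear failures.
wide = wilson_interval(7, 20)      # small per-agent sample
narrow = wilson_interval(71, 200)  # larger per-agent sample

print(f"n=20:  {wide[0]:.3f} to {wide[1]:.3f}")
print(f"n=200: {narrow[0]:.3f} to {narrow[1]:.3f}")
```

The interval for the smaller sample is several times wider, which is why differences between agents are hard to assess without the underlying counts.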

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for acknowledging the potential value of our empirical analysis in correcting simplistic evaluations of AI coding agents. We appreciate the opportunity to clarify and strengthen the methodological aspects of our study.

Point-by-point responses
  1. Referee: [Methodology section (around the description of the 717-case inspection)] The manual coding of the 717 cases that produces the headline figures (35.7% clear failures, 31.2% workflow constraints, 33.1% no rationale) is described without reported inter-rater reliability, a full coding protocol, or justification for the sampling frame and exclusion criteria. Because these percentages directly support the claim that rejection outcomes substantially overstate agent error, the absence of these methodological safeguards is load-bearing.

    Authors: We acknowledge the importance of these methodological details for establishing the reliability of our manual classification. In the revised manuscript, we will add a dedicated subsection describing the coding protocol in full, including the codebook used for categorizing rationales. We will also report inter-rater reliability statistics based on a double-coding of a subset of the 717 cases. Regarding the sampling frame, we will justify the selection of 717 cases as a representative sample by explaining the stratified sampling approach across different agents and PR outcomes, and detail the exclusion criteria applied during the initial filtering to 9,799 human-reviewed PRs. revision: yes

  2. Referee: [Data collection and filtering subsection] The filtering step from 11,048 to 9,799 human-reviewed PRs and the selection of the 717 representative cases require more explicit documentation of exclusion rules and sampling strategy to establish that the analyzed subset remains representative of the broader population of agentic PRs.

    Authors: We agree that greater transparency in data collection and filtering is necessary. We will revise the relevant subsection to provide a step-by-step account of the exclusion rules, such as the criteria for identifying human-reviewed PRs (e.g., presence of at least one human comment or review). For the sampling of the 717 cases, we will describe the random sampling strategy employed to ensure representativeness and include any statistical checks performed to verify that the subsample mirrors the population in terms of key metrics like merge rates and agent distribution. revision: yes
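
The filtering and sampling steps described in the two responses can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the `is_bot` flag, and the helper names are hypothetical, the human-review rule follows the example criterion given above (at least one human comment or review), and the sampler draws randomly within (agent, outcome) strata. It is not the authors' actual pipeline.

```python
import random
from collections import defaultdict

def is_human_reviewed(pr):
    """Assumed rule: at least one comment or review from a non-bot account."""
    return any(not event["is_bot"] for event in pr["events"])

def stratified_sample(prs, per_stratum, seed=0):
    """Draw up to `per_stratum` PRs at random from each (agent, outcome) stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for pr in prs:
        strata[(pr["agent"], pr["outcome"])].append(pr)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical toy records (not data from the paper).
prs = [
    {"id": 1, "agent": "Copilot", "outcome": "merged", "events": [{"is_bot": False}]},
    {"id": 2, "agent": "Copilot", "outcome": "merged", "events": [{"is_bot": False}]},
    {"id": 3, "agent": "Copilot", "outcome": "rejected", "events": [{"is_bot": True}]},
    {"id": 4, "agent": "Codex", "outcome": "rejected", "events": [{"is_bot": False}]},
    {"id": 5, "agent": "Codex", "outcome": "merged", "events": []},
    {"id": 6, "agent": "Devin", "outcome": "rejected",
     "events": [{"is_bot": True}, {"is_bot": False}]},
]

reviewed = [pr for pr in prs if is_human_reviewed(pr)]
sample = stratified_sample(reviewed, per_stratum=1)
print([pr["id"] for pr in reviewed], len(sample))
```

Comparing merge rates and agent distribution between `sample` and `reviewed` would be one way to run the representativeness check the authors mention.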

Circularity Check

0 steps flagged

No circularity: empirical results from direct observation of external artifacts

Full rationale

The paper conducts an empirical study by collecting 11,048 closed Agentic Pull Requests from GitHub, refining to 9,799 human-reviewed cases, and manually inspecting 717 representative samples to classify decision rationales. The reported percentages (e.g., 35.7% clear agentic failures) are produced by human coders reading review threads and interaction artifacts. No equations, fitted parameters, self-definitional constructs, or self-citation chains are used to derive these outcomes. The central claim follows directly from external data labeling without reduction to inputs by construction, satisfying the self-contained criterion against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the reliability of human-coded labels from review comments; no free parameters or new entities are introduced, but the domain assumption that interaction traces reveal decision rationale is load-bearing.

axioms (1)
  • domain assumption: Review interaction artifacts (comments, commits, timelines) contain sufficient observable information to classify the primary reason for merge or rejection decisions.
    This premise is required to assign the 35.7%, 31.2%, and 33.1% categories from the 717 inspected cases.


