Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study
Pith reviewed 2026-05-22 03:49 UTC · model grok-4.3
The pith
Rejection outcomes substantially overstate AI agent errors in pull requests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed ones, together with manual inspection of 717 representative cases, found that only 35.7 percent of rejected PRs reflected clear agentic failures, while 31.2 percent were driven by workflow constraints and 33.1 percent lacked an observable decision rationale. Among merged PRs, 15.4 percent required explicit reviewer involvement and 5.5 percent showed no visible interaction trace.
What carries the argument
Classification of decision rationales recovered from review interaction artifacts, separating agent failures from workflow constraints and unclear cases.
If this is right
- PR merge and reject labels alone do not capture agent performance.
- Evaluation of agents requires interaction-aware methods grounded in review behavior.
- Agents such as Copilot and Devin appear more often in reviewer-mediated workflows than Codex or Cursor.
- Rejection rates inflate perceived agent error by a factor of nearly three.
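The "factor of nearly three" in the last bullet can be checked with a few lines of arithmetic. The sketch below is a reading aid, not part of the paper: it assumes the raw rejection rate treats every rejected PR as an agent error, and it adds a hypothetical lower bound in which every no-rationale rejection is also counted as a hidden agent failure.

```python
# Shares of rejected PRs reported in the paper (percent).
CLEAR_FAILURE = 35.7
WORKFLOW_CONSTRAINT = 31.2
NO_RATIONALE = 33.1


def inflation_factor(true_failure_share: float) -> float:
    """Ratio of 'all rejections counted as errors' to the assumed true failure share."""
    return 100.0 / true_failure_share


# Upper reading: only clear agentic failures count as agent error.
upper = inflation_factor(CLEAR_FAILURE)

# Lower reading (assumption, not from the paper): every rejection without an
# observable rationale is treated as an undetected agent failure.
lower = inflation_factor(CLEAR_FAILURE + NO_RATIONALE)

print(f"upper: {upper:.2f}")  # upper: 2.80
print(f"lower: {lower:.2f}")  # lower: 1.45
```

Under the first reading the factor is about 2.8, which matches "nearly three". How the 33.1 percent of unexplained rejections are treated moves the figure considerably, so the bullet is best read as an upper estimate.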
Where Pith is reading between the lines
- Project maintainers could reduce unnecessary rejections by clarifying workflow rules for agent submissions.
- Agent developers might prioritize designs that require less reviewer intervention to improve merge rates.
- Automated tools could eventually classify rationales at scale to monitor agent behavior over time.
Load-bearing premise
The manual inspection and classification of the 717 representative cases accurately recovers the true underlying decision rationale from the available review interaction artifacts without substantial coder bias or missing context.
What would settle it
An independent reclassification of the same 717 cases by different reviewers that yields substantially different shares of agent failures versus workflow constraints would challenge the reported breakdown.
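One way to make "substantially different" measurable is a chance-corrected agreement statistic between the original labels and an independent recoding. The sketch below uses Cohen's kappa for two coders. This is an illustrative choice, not necessarily the statistic the authors use, and the category names and toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each coder's marginals.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six rejected PRs (not data from the paper).
original = ["failure", "failure", "workflow", "unclear", "workflow", "failure"]
recoded = ["failure", "workflow", "workflow", "unclear", "workflow", "failure"]

kappa = cohen_kappa(original, recoded)
print(f"{kappa:.3f}")  # 0.739
```

A low kappa on the real 717 cases would signal that the reported shares depend heavily on who did the coding, which is the failure mode this section describes.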
Original abstract
AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone. We hypothesized that these outcome labels do not reliably reflect agent capability without considering review interactions. To test this, we conducted a decision-oriented analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed PRs, and manually inspected 717 representative cases to recover decision rationale from interaction artifacts. We found that rejection outcomes substantially overstate agent error: only 35.7% of rejected PRs reflected clear agentic failures, while 31.2% were driven by workflow constraints and 33.1% lacked observable decision rationale. Among merged PRs, 15.4% required explicit reviewer involvement through feedback or direct commits, and 5.5% showed no visible interaction trace. We further observed systematic differences across agents, with Copilot and Devin more often embedded in reviewer-mediated workflows, while Codex and Cursor PRs were typically merged with minimal interaction. These results reject the assumption that PR outcomes alone capture agent performance and demonstrate the need for interaction-aware evaluation grounded in review behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical analysis of 11,048 closed Agentic Pull Requests, filtering to 9,799 human-reviewed cases and manually inspecting 717 representative examples to recover decision rationales from review interaction artifacts. It reports that only 35.7% of rejected PRs reflect clear agentic failures, while 31.2% stem from workflow constraints and 33.1% lack observable rationale; among merged PRs, 15.4% required explicit reviewer involvement and 5.5% showed no interaction trace. Systematic differences are noted across agents (e.g., Copilot and Devin more often involve reviewer mediation). The central claim is that PR merge/reject outcomes alone overstate agent error and that evaluation must be interaction-aware.
Significance. If the manual classification proves reliable, the work offers a useful corrective to simplistic outcome-based metrics for AI coding agents. The scale of the GitHub dataset and the breakdown by agent type provide concrete empirical grounding for rethinking evaluation practices. Credit is due for grounding claims in real review threads rather than synthetic benchmarks and for surfacing the substantial share of non-failure rejections.
Major comments (2)
- [Methodology section (around the description of the 717-case inspection)] The manual coding of the 717 cases that produces the headline figures (35.7% clear failures, 31.2% workflow constraints, 33.1% no rationale) is described without reported inter-rater reliability, a full coding protocol, or justification for the sampling frame and exclusion criteria. Because these percentages directly support the claim that rejection outcomes substantially overstate agent error, the absence of these methodological safeguards is load-bearing.
- [Data collection and filtering subsection] The filtering step from 11,048 to 9,799 human-reviewed PRs and the selection of the 717 representative cases require more explicit documentation of exclusion rules and sampling strategy to establish that the analyzed subset remains representative of the broader population of agentic PRs.
Minor comments (2)
- [Results section on agent differences] The table or figure presenting the per-agent breakdowns should report explicit sample sizes per agent so that readers can assess the stability of the reported differences.
- [Abstract] A brief statement in the abstract or introduction clarifying that the 717 cases were drawn from the filtered 9,799 would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for acknowledging the potential value of our empirical analysis in correcting simplistic evaluations of AI coding agents. We appreciate the opportunity to clarify and strengthen the methodological aspects of our study.
Point-by-point responses
- Referee: [Methodology section (around the description of the 717-case inspection)] The manual coding of the 717 cases that produces the headline figures (35.7% clear failures, 31.2% workflow constraints, 33.1% no rationale) is described without reported inter-rater reliability, a full coding protocol, or justification for the sampling frame and exclusion criteria. Because these percentages directly support the claim that rejection outcomes substantially overstate agent error, the absence of these methodological safeguards is load-bearing.
  Authors: We acknowledge the importance of these methodological details for establishing the reliability of our manual classification. In the revised manuscript, we will add a dedicated subsection describing the coding protocol in full, including the codebook used for categorizing rationales. We will also report inter-rater reliability statistics based on a double-coding of a subset of the 717 cases. Regarding the sampling frame, we will justify the selection of 717 cases as a representative sample by explaining the stratified sampling approach across different agents and PR outcomes, and detail the exclusion criteria applied during the initial filtering to 9,799 human-reviewed PRs.
  Revision: yes
- Referee: [Data collection and filtering subsection] The filtering step from 11,048 to 9,799 human-reviewed PRs and the selection of the 717 representative cases require more explicit documentation of exclusion rules and sampling strategy to establish that the analyzed subset remains representative of the broader population of agentic PRs.
  Authors: We agree that greater transparency in data collection and filtering is necessary. We will revise the relevant subsection to provide a step-by-step account of the exclusion rules, such as the criteria for identifying human-reviewed PRs (e.g., presence of at least one human comment or review). For the sampling of the 717 cases, we will describe the random sampling strategy employed to ensure representativeness and include any statistical checks performed to verify that the subsample mirrors the population in terms of key metrics like merge rates and agent distribution.
  Revision: yes
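The filtering, sampling, and representativeness check promised in the responses above can be sketched as a small pipeline. Everything here is hypothetical: the record fields (`agent`, `merged`, `human_comments`, `human_reviews`), the 25 percent sampling fraction, and the toy data are assumptions for illustration, and the exclusion rule is only the example the rebuttal itself gives (at least one human comment or review).

```python
import random
from collections import defaultdict


def is_human_reviewed(pr):
    # Assumed exclusion rule: keep PRs with at least one human comment or review.
    return pr["human_comments"] + pr["human_reviews"] >= 1


def stratified_sample(prs, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(pr)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pr in prs:
        strata[key(pr)].append(pr)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def merge_rate(prs):
    return sum(pr["merged"] for pr in prs) / len(prs)


# Deterministic toy population (not data from the paper).
toy_prs = [
    {"agent": agent, "merged": i % 3 != 0, "human_comments": i % 5, "human_reviews": 0}
    for agent in ("agent_a", "agent_b")
    for i in range(100)
]

reviewed = [pr for pr in toy_prs if is_human_reviewed(pr)]
sample = stratified_sample(
    reviewed, key=lambda pr: (pr["agent"], pr["merged"]), fraction=0.25
)

# Representativeness check: the sample's merge rate should track the population's.
gap = abs(merge_rate(sample) - merge_rate(reviewed))
print(len(reviewed), len(sample), round(gap, 4))
```

Stratifying on agent and outcome keeps both distributions close to the population by construction. A plain random sample would need an explicit statistical check to show the same thing, which is why the choice between the two strategies should be stated in the revision.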
Circularity Check
No circularity: empirical results from direct observation of external artifacts
The paper conducts an empirical study by collecting 11,048 closed Agentic Pull Requests from GitHub, refining to 9,799 human-reviewed cases, and manually inspecting 717 representative samples to classify decision rationales. The reported percentages (e.g., 35.7% clear agentic failures) are produced by human coders reading review threads and interaction artifacts. No equations, fitted parameters, self-definitional constructs, or self-citation chains are used to derive these outcomes. The central claim therefore rests on the labeling of external data and does not reduce to its own inputs by construction, so the analysis meets the self-containment criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Review interaction artifacts (comments, commits, timelines) contain sufficient observable information to classify the primary reason for merge or rejection decisions.