arxiv: 2601.17581 · v3 · submitted 2026-01-24 · 💻 cs.SE · cs.AI

How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests

Pith reviewed 2026-05-16 11:02 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI coding agents · pull requests · GitHub · empirical study · code modification · agentic contributions · description similarity

The pith

AI coding agents generate pull requests with substantially more commits than human developers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how AI coding agents modify code by comparing thousands of their pull requests to those made by humans on GitHub. It finds large differences in the number of commits per PR and moderate differences in the number of files touched and lines deleted, with agentic PRs showing slightly higher similarity between their descriptions and the actual code changes. This characterization helps assess the role and reliability of autonomous AI in open source development workflows.

Core claim

Using the AIDev dataset, analysis of 24,014 merged agentic PRs and 5,081 merged human PRs reveals that agentic PRs differ substantially from human PRs in commit count with a Cliff's delta of 0.5429, exhibit moderate differences in files touched and deleted lines, and display slightly higher description-to-diff similarity using both lexical and semantic measures.
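For readers unfamiliar with the effect size, Cliff's delta is the probability that a value drawn from one group exceeds a value drawn from the other, minus the probability of the reverse. The following is a minimal Python sketch of that definition, not the authors' analysis code; the function name and the toy commit counts are hypothetical.

```python
from bisect import bisect_left, bisect_right


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    ys_sorted = sorted(ys)
    n_y = len(ys_sorted)
    greater = 0  # number of (x, y) pairs with x > y
    less = 0     # number of (x, y) pairs with x < y
    for x in xs:
        greater += bisect_left(ys_sorted, x)        # y values strictly below x
        less += n_y - bisect_right(ys_sorted, x)    # y values strictly above x
    return (greater - less) / (len(xs) * n_y)


# Hypothetical toy commit counts, NOT data from the paper.
agentic_commits = [3, 5, 8, 12, 20]
human_commits = [1, 2, 2, 4, 6]
delta = cliffs_delta(agentic_commits, human_commits)
print(delta)
```

Values range from -1 to 1, with 0 meaning neither group tends to be larger. Under the thresholds commonly attributed to Romano et al. (0.147, 0.33, 0.474), the reported 0.5429 would fall in the large band, although this page does not state which thresholds the paper applies.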

What carries the argument

Empirical comparison of modification metrics including commit counts, files touched, line additions and deletions, alongside lexical and semantic similarity scores between PR descriptions and code diffs.
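To make the lexical side of the description-to-diff comparison concrete, here is a minimal sketch that scores a PR description against a diff with bag-of-words cosine similarity. This page does not name the specific lexical or semantic metrics the paper uses, so the tokenizer, the choice of raw term counts, and the example strings are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifier-like runs, lowercased. Not taken from the paper.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with the other side
    return dot / (norm_a * norm_b)


# Hypothetical PR description and diff, NOT drawn from the dataset.
description = "Fix null check in parse_config"
diff = "+ if config is None:\n+     return parse_config(default)"
score = cosine_similarity(description, diff)
print(round(score, 3))
```

A semantic measure would replace the term-count vectors with embeddings from a pretrained model and keep the same cosine step; that variant is omitted here because it depends on a model choice this page does not specify.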

If this is right

  • Agentic PRs involve more commits, suggesting finer-grained change steps.
  • Moderate differences appear in the scope of files modified and lines removed.
  • Descriptions of agentic changes align slightly better with the diffs than human ones.
  • These patterns provide a baseline for how AI agents contribute to open source projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI agents may approach tasks by committing more frequently to manage complexity.
  • Improved description alignment could make agentic PRs easier to review if quality holds.
  • Similar studies on rejected PRs could reveal acceptance biases.
  • Developers might adapt review processes based on whether a PR is agent-generated.

Load-bearing premise

The dataset accurately identifies agentic versus human PRs without misclassification, and the similarity measures properly capture how well descriptions match the code diffs.

What would settle it

A manual review of a sample of PRs finding frequent incorrect labels in the dataset, or computation with different similarity metrics showing lower alignment for agentic PRs.
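One way such a manual review could be summarized is a confidence interval on label precision. The sketch below computes a Wilson score interval from a hand-checked sample; the sample size and the count of correct labels are hypothetical, and nothing here is taken from the paper or the AIDev dataset.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical audit: 100 PRs labeled agentic, 90 confirmed by hand.
low, high = wilson_interval(90, 100)
print(round(low, 4), round(high, 4))
```

A lower bound well below the level needed to trust the partition would count as evidence against the load-bearing premise; where that level sits is a judgment call the paper would need to argue.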

Figures

Figures reproduced from arXiv: 2601.17581 by Daniel Ogenrwot, John Businge.

Figure 1: Four-step workflow: dataset collection, commit-data extension, PR filtering, and structural and similarity analysis.
Figure 2: Distribution of LOC added and deleted in agentic …
Figure 3: Distribution of commits and files touched across …
Figure 5: Distribution of similarity scores across lexical and semantic metrics for both Agentic and Human PRs.
Original abstract

AI coding agents are increasingly acting as autonomous contributors by generating and submitting pull requests (PRs). However, we lack empirical evidence on how these agent-generated PRs differ from human contributions, particularly in how they modify code and describe their changes. Understanding these differences is essential for assessing their reliability and impact on development workflows. Using the MSR 2026 Mining Challenge version of the AIDev dataset, we analyze 24,014 merged Agentic PRs (440,295 commits) and 5,081 merged Human PRs (23,242 commits). We examine additions, deletions, commits, and files touched, and evaluate the consistency between PR descriptions and their diffs using lexical and semantic similarity. Agentic PRs differ substantially from Human PRs in commit count (Cliff's $\delta = 0.5429$) and show moderate differences in files touched and deleted lines. They also exhibit slightly higher description-to-diff similarity across all measures. These findings provide a large-scale empirical characterization of how AI coding agents contribute to open source development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that agentic PRs from the AIDev dataset differ substantially from human PRs, with a large effect on commit count (Cliff's δ = 0.5429), moderate differences in files touched and deleted lines, and slightly higher description-to-diff similarity across lexical and semantic measures, based on 24,014 merged agentic PRs (440k commits) versus 5,081 human PRs (23k commits).

Significance. If the agentic/human labeling is reliable, the work supplies a large-scale empirical baseline on how AI coding agents modify code and describe changes in open-source projects, with direct relevance to assessing their reliability and integration into workflows.

major comments (1)
  1. [Methods / Dataset description] The paper takes the agentic vs. human PR split directly from the MSR 2026 AIDev dataset and performs all distributional comparisons (commit count, files touched, deleted lines, description-diff similarity) on this binary partition without reporting the labeling heuristics, any held-out validation set, or precision/recall figures. This is load-bearing: modest label noise would render the reported effect sizes uninterpretable.
minor comments (1)
  1. [Abstract] Abstract omits any mention of data-cleaning steps or edge-case handling, which would help readers gauge robustness of the reported differences.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the careful review and for emphasizing the centrality of the agentic/human labeling to our results. We address the concern directly below and will make corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: The paper takes the agentic vs. human PR split directly from the MSR 2026 AIDev dataset and performs all distributional comparisons (commit count, files touched, deleted lines, description-diff similarity) on this binary partition without reporting the labeling heuristics, any held-out validation set, or precision/recall figures. This is load-bearing: modest label noise would render the reported effect sizes uninterpretable.

    Authors: We agree that the labeling process requires explicit description. The agentic versus human partition is taken directly from the publicly released MSR 2026 AIDev dataset, which applies automated heuristics based on commit authorship patterns, message content, and PR metadata. We will add a dedicated subsection in the Methods section that summarizes these heuristics and cites the original dataset paper for complete specification. The dataset release does not include a held-out validation set or precision/recall metrics, as the labels are rule-derived rather than manually annotated. We will also expand the Limitations section to discuss the implications of possible label noise, noting that the largest reported effect size (Cliff's δ = 0.5429 on commit count) would require substantial noise to reverse the primary conclusions. These changes will be incorporated in the revised manuscript.

    Revision: partial

standing simulated objections not resolved
  • The AIDev dataset does not provide a held-out validation set or precision/recall figures for the agentic/human labeling.
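The label-noise concern can be made concrete under one simplifying assumption: mislabeling is random, that is, unrelated to the metric being compared. In that case, if a fraction a of PRs labeled agentic are really human and a fraction b of PRs labeled human are really agentic, the observed Cliff's delta equals the true delta times (1 - a - b). The sketch below checks this identity on hypothetical toy data; the function names, values, and noise rates are illustrative and not from the paper.

```python
def cliffs_delta(xs, ys):
    """Naive O(n*m) Cliff's delta: (#pairs x > y minus #pairs x < y) / (n*m)."""
    net = sum((x > y) - (x < y) for x in xs for y in ys)
    return net / (len(xs) * len(ys))


def attenuated_delta(true_delta, frac_a, frac_b):
    """Expected observed delta when mislabeling is independent of the metric."""
    return true_delta * (1.0 - frac_a - frac_b)


# Hypothetical "true" groups, NOT data from the paper.
true_agentic = [3, 5, 8, 12, 20]
true_human = [1, 2, 2, 4, 6]
true_delta = cliffs_delta(true_agentic, true_human)

# Build contaminated groups by exact replication so the mixture is deterministic:
# 20% of the "agentic" label is really human (4 copies agentic + 1 copy human),
# 25% of the "human" label is really agentic (3 copies human + 1 copy agentic).
labeled_agentic = true_agentic * 4 + true_human
labeled_human = true_human * 3 + true_agentic

observed = cliffs_delta(labeled_agentic, labeled_human)
predicted = attenuated_delta(true_delta, 0.20, 0.25)
print(true_delta, observed, predicted)
```

Under this assumption, noise shrinks the observed effect toward zero rather than inflating it, and a sign reversal would need the two error rates to sum to more than 1. The identity does not hold if mislabeling correlates with the metric itself, which is a possibility when labeling heuristics draw on commit-level signals.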

Circularity Check

0 steps flagged

No circularity: direct empirical comparisons on external dataset labels

full rationale

The paper performs straightforward statistical comparisons (Cliff's δ, similarity metrics) between agentic and human PRs using the pre-labeled AIDev dataset split as input. No equations derive predictions from fitted parameters, no self-definitional steps exist, and no load-bearing self-citations or ansatzes are invoked. The central claims are measurements on an external data partition rather than reductions to the paper's own constructs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the correctness of the dataset's agentic/human labeling and on the assumption that lexical/semantic similarity scores are appropriate proxies for description-diff consistency.

axioms (2)
  • [standard math] Cliff's delta is a suitable non-parametric effect size for comparing commit counts and related metrics between groups.
    Applied directly to quantify the reported differences.
  • [domain assumption] Lexical and semantic similarity metrics provide a valid measure of consistency between PR text and code diffs.
    Used as the basis for the similarity evaluation.

pith-pipeline@v0.9.0 · 5481 in / 1249 out tokens · 71096 ms · 2026-05-16T11:02:49.781190+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub

    cs.SE 2026-04 accept novelty 7.0

    AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.

  2. AIRA: AI-Induced Risk Audit: A Structured Inspection Framework for AI-Generated Code

    cs.SE 2026-04 unverdicted novelty 5.0

    AIRA is a 15-check audit framework that finds AI-generated code has 1.8 times more high-severity failure-untruthful patterns than human-written code in a matched replication study.
