pith. machine review for the scientific record.

arxiv: 2603.27130 · v2 · submitted 2026-03-28 · 💻 cs.SE

Recognition: 2 theorem links

· Lean Theorem

A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:45 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI-generated code · empirical study · software repositories · LLM detection · code complexity · commit patterns · development practices

The pith

AI-generated code in real-world repositories differs from human-written code in complexity, structure, and post-commit evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a large-scale empirical analysis of AI-generated code drawn from actual public software repositories. It measures both code-level traits such as complexity, structural features, and defect indicators, and commit-level traits such as size, activity timing, and how the code changes after the initial commit. The dataset is built at scale by a detection pipeline that first applies heuristic filters and then LLM classification. This approach addresses the limitation of prior studies that relied on small or controlled settings, providing a view of how AI assistance actually operates in ongoing development work. If the differences are real, they supply an empirical basis for understanding the practical effects of AI tools on software quality and team practices.
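The staging described above — cheap heuristics first, an LLM only on the survivors — can be sketched in a few lines. Everything here is illustrative: the marker patterns and the stubbed classifier are stand-ins, not the paper's actual filters or prompts.

```python
import re

# Hypothetical marker patterns -- the paper's real heuristics are not
# spelled out in the abstract, so these are stand-ins for illustration.
AI_HINT_PATTERNS = [
    re.compile(r"(?i)generated (?:by|with) (?:chatgpt|copilot|claude|an? llm)"),
    re.compile(r"(?i)co-authored-by: .*copilot"),
]

def heuristic_filter(commit_message: str, diff_text: str) -> bool:
    """Stage 1: cheap textual filter that nominates candidate commits."""
    text = commit_message + "\n" + diff_text
    return any(p.search(text) for p in AI_HINT_PATTERNS)

def llm_classify(diff_text: str) -> bool:
    """Stage 2: stand-in for an LLM call that confirms or rejects a candidate.
    A real pipeline would prompt a model with the diff; here it is stubbed."""
    return True  # stub: accept every candidate for illustration

def detect(commit_message: str, diff_text: str) -> bool:
    """Two-stage pipeline: only heuristic hits ever reach the expensive LLM."""
    return heuristic_filter(commit_message, diff_text) and llm_classify(diff_text)
```

The point of the staging is cost: the heuristic pass prunes the corpus so the LLM classifier only sees a small fraction of all commits.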

Core claim

By constructing a large dataset through a heuristic-plus-LLM detection pipeline applied to real repositories, the study establishes that AI-generated code exhibits distinct measurable characteristics relative to conventional human-driven development, including differences in complexity and structural properties at the code level and in size, activity patterns, and evolutionary trajectories at the commit level.

What carries the argument

The detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code and enable large-scale comparative analysis against human-written code.

If this is right

  • AI-assisted code displays different complexity and structural characteristics than human-written code.
  • Commits involving AI-generated code show distinct size and activity patterns.
  • Post-commit evolution of AI code follows different trajectories than human code.
  • Overall development practices shift measurably when AI assistance is present at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed patterns could be used to calibrate future AI coding models so they better align with human-like structures and maintenance needs.
  • Repository maintainers and code reviewers may require new processes tailored to the distinct defect and evolution profiles of AI-generated contributions.
  • Longitudinal tracking of the same repositories could reveal whether the differences grow or shrink as AI tools improve over time.

Load-bearing premise

The heuristic filtering combined with LLM classification accurately identifies AI-generated code at scale with error rates low enough to support valid comparisons of characteristics.
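This premise can be made concrete: if the labels are noisy, each labeled group is a mixture of true AI and true human code, and any measured difference is attenuated toward zero. A minimal sketch with invented numbers, not the paper's data:

```python
def observed_group_means(mu_ai: float, mu_human: float,
                         precision: float, leak: float):
    """With imperfect labels, each labeled group is a mixture of true classes.

    precision : fraction of 'AI'-labeled commits that truly are AI-generated
    leak      : fraction of 'human'-labeled commits that are actually AI
    Returns the group means an analyst would observe under this label noise.
    """
    obs_ai = precision * mu_ai + (1 - precision) * mu_human
    obs_human = leak * mu_ai + (1 - leak) * mu_human
    return obs_ai, obs_human

# Invented example: true mean complexity 10 (AI) vs 14 (human), a detector
# with 80% precision, and 5% AI code leaking into the 'human' pool.
obs_ai, obs_human = observed_group_means(10.0, 14.0, precision=0.8, leak=0.05)
print(round(obs_human - obs_ai, 2))  # true gap is 4.0; the observed gap shrinks to 3.0
```

Mixing alone only shrinks a real difference. The sharper risk is that the heuristics select on the very traits being compared — for example, flagging verbose comment styles — which simple attenuation does not capture.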

What would settle it

A manual review of a statistically meaningful random sample from the classified set that reveals a high rate of false positives, or a replication using an independent detection method that eliminates the reported differences, would falsify the central comparisons.
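The manual review this calls for reduces to estimating a binomial proportion from the audited sample. A sketch using the standard Wilson score interval; the 870-of-1,000 audit outcome is hypothetical:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Invented audit outcome: 870 of 1,000 sampled 'AI' labels confirmed by hand.
lo, hi = wilson_interval(870, 1000)
print(f"estimated precision: 0.870, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval like this, reported alongside the group comparisons, is what would let a reader judge whether the detector is accurate enough for the load-bearing premise to hold.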

Figures

Figures reproduced from arXiv: 2603.27130 by Dongfang Zhao, Haixu Tang, Hang Zhang, Tianhao Mao, Xiaofeng Wang.

Figure 1
Figure 1: Measurement Pipeline.
Figure 2
Figure 2: Distribution of AI-generated code records by tools.
read the original abstract

Large language models (LLMs) are increasingly used in software development, generating code that ranges from short snippets to substantial project components. As AI-generated code becomes more common in real-world repositories, it is important to understand how it differs from human-written code and how AI assistance may influence development practices. However, existing studies have largely relied on small-scale or controlled settings, leaving a limited understanding of AI-generated code in the wild. In this work, we present a large-scale empirical study of AI-generated code collected from real-world repositories. We examine both code-level properties, including complexity, structural characteristics, and defect-related indicators, and commit-level characteristics, such as commit size, activity patterns, and post-commit evolution. To support this study, we develop a detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code and construct a large-scale dataset for analysis. Our study provides a comprehensive view of the characteristics of AI-generated code in practice and highlights how AI-assisted development differs from conventional human-driven development. These findings contribute to a better understanding of the real-world impact of AI-assisted programming and offer an empirical basis for future research on AI-generated software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a large-scale empirical study of AI-generated code in real-world repositories. It develops a detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code, constructs a corresponding dataset, and compares code-level properties (complexity, structural characteristics, defect indicators) and commit-level properties (size, activity patterns, post-commit evolution) against human-written code to highlight differences from conventional development.

Significance. If the detection pipeline proves reliable, the work would offer a valuable large-scale, observational view of AI-assisted coding in production repositories, extending beyond the small-scale or controlled settings of prior studies and supplying an empirical foundation for understanding AI's impact on software development practices.

major comments (2)
  1. [Methods / Detection Pipeline] The detection pipeline (described in the abstract and presumably detailed in the Methods section) is presented as combining heuristic filtering with LLM-based classification, yet no precision, recall, inter-annotator agreement, or error analysis on real commits is supplied. Because every downstream comparison of complexity, defects, commit size, and evolution rests on the fidelity of this labeling, the absence of validation metrics leaves the central observational claims unsupported.
  2. [Results / Dataset Construction] No dataset size, sampling strategy, or statistical details (error bars, confidence intervals, or hypothesis tests) appear in the abstract or summary. Without these, it is impossible to evaluate whether reported differences in code and commit characteristics are robust or could be artifacts of detection errors or selection bias.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating the scale of the constructed dataset and one or two headline quantitative findings.
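The validation metrics the referee asks for all reduce to four confusion-matrix counts from a manually labeled gold sample. A sketch with hypothetical counts, not figures from the paper:

```python
def validation_metrics(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, F1, and Cohen's kappa from one confusion matrix
    (pipeline labels vs. a manually annotated gold standard)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginals, as Cohen's kappa defines it.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return precision, recall, f1, kappa

# Invented counts for a 1,000-commit audit -- not figures from the paper.
p, r, f1, k = validation_metrics(tp=430, fp=70, fn=50, tn=450)
print(round(p, 2), round(r, 2), round(f1, 2), round(k, 2))  # 0.86 0.9 0.88 0.76
```

Reporting kappa in addition to precision and recall matters here because a commit corpus can be heavily imbalanced, and raw agreement rates look deceptively high when one class dominates.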

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in validation and statistical reporting that we will address in the revision to strengthen the reliability of our claims.

read point-by-point responses
  1. Referee: [Methods / Detection Pipeline] The detection pipeline (described in the abstract and presumably detailed in the Methods section) is presented as combining heuristic filtering with LLM-based classification, yet no precision, recall, inter-annotator agreement, or error analysis on real commits is supplied. Because every downstream comparison of complexity, defects, commit size, and evolution rests on the fidelity of this labeling, the absence of validation metrics leaves the central observational claims unsupported.

    Authors: We agree that explicit validation metrics are necessary to support the labeling fidelity and all downstream comparisons. The Methods section describes the pipeline components, but we did not include quantitative validation on real commits in the initial submission. In the revised version, we will add a dedicated validation subsection reporting precision, recall, and F1 on a manually annotated sample of 1,000 real commits (with inter-annotator agreement via Cohen's kappa), plus a detailed error analysis categorizing false positives and negatives. This will be accompanied by a new table of metrics. revision: yes

  2. Referee: [Results / Dataset Construction] No dataset size, sampling strategy, or statistical details (error bars, confidence intervals, or hypothesis tests) appear in the abstract or summary. Without these, it is impossible to evaluate whether reported differences in code and commit characteristics are robust or could be artifacts of detection errors or selection bias.

    Authors: We acknowledge the need for these details to assess robustness. While the full manuscript (Section 4) describes the overall scale of the dataset and repository sampling, we will revise the Results section to explicitly report exact dataset sizes (repositories, commits, and AI-generated instances), the sampling strategy (random stratified sampling by language and repository size), and statistical details including 95% confidence intervals, error bars on figures, and hypothesis test results (e.g., Mann-Whitney U tests with p-values) for all reported differences. This will mitigate concerns about selection bias or detection artifacts. revision: yes
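The Mann-Whitney U test the rebuttal proposes is simple enough to sketch with the standard library. The rank-sum construction below uses the usual normal approximation; the complexity samples are invented, and in practice a library routine such as scipy.stats.mannwhitneyu would be used instead.

```python
import math

def mann_whitney_u(xs, ys):
    """Mann-Whitney U with average ranks for ties and a two-sided p-value
    from the normal approximation (no tie correction applied)."""
    combined = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank over the tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(xs), len(ys)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# Invented per-commit cyclomatic complexity samples -- not the paper's data.
u, p = mann_whitney_u([3, 4, 4, 5, 6, 7], [6, 7, 8, 9, 10, 12])
print(u, round(p, 4))
```

A rank-based test is a sensible default for these comparisons because complexity and commit-size distributions are typically skewed, so a t-test's normality assumption would be hard to defend.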

Circularity Check

0 steps flagged

No circularity: purely observational empirical study with no derivations or self-referential reductions

full rationale

This paper is an empirical observational study that collects and measures code properties and commit characteristics directly from external real-world repositories. No derivation chain, equations, fitted parameters presented as predictions, or first-principles results exist. The detection pipeline is a methodological tool for dataset construction, not a self-defining or fitted input that is then renamed as a prediction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. All claims reduce to direct measurements from the constructed dataset rather than to the paper's own inputs by construction. Limitations around pipeline validation affect data reliability but do not constitute circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unvalidated accuracy of the LLM-based detector and on standard assumptions of empirical software engineering studies; no free parameters, new entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption LLM-based classifiers can be combined with heuristics to produce reliable labels for AI-generated code at repository scale
    Invoked to justify the detection pipeline that underpins the entire dataset

pith-pipeline@v0.9.0 · 5516 in / 1097 out tokens · 37063 ms · 2026-05-14T22:45:00.905965+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  1. [1] Maurício Aniche. 2026. CK. https://github.com/mauricioaniche/ck Accessed: 2026-03-26.
  2. [2] Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, and Norbert Tihanyi. 2026. I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution. In Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security (A... ISBN 979-8-4007-1895-3.
  3. [3] Gavin S. Black, Bhaskar P. Rimal, and Varghese Mathew Vaidyan. 2025. Balancing Security and Correctness in Code Generation: An Empirical Study on Commercial Large Language Models. IEEE Transactions on Emerging Topics in Computational Intelligence 9, 1 (2025), 419–430. doi:10.1109/TETCI.2024.3446695
  4. [4] Hongbo Chen, Yifan Zhang, Xing Han, Tianhao Mao, Huanyao Rong, Yuheng Zhang, XiaoFeng Wang, Luyi Xing, Xun Chen, and Hang Zhang. 2025. LineBreaker: Finding Token-Inconsistency Bugs with Large Language Models. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 893–905. doi:10.1109/ASE63991.2025.00079
  5. [5] Pygments contributors. 2026. Pygments. https://pygments.org/ Accessed: 2026-03-26.
  6. [6] Albert Danial. 2026. cloc: v2.08. doi:10.5281/zenodo.5760077
  7. [7] Simone Daniotti, Johannes Wachs, Xiangnan Feng, and Frank Neffke. 2026. Who is using AI to code? Global diffusion and impact of generative AI. Science 391, 6787 (2026), 831–835. doi:10.1126/science.adz9311
  8. [8] Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, Jiaxin Yu, and Jinfu Chen. 2025. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Trans. Softw. Eng. Methodol. 34, 8, Article 218 (Oct. 2025), 34 pages. doi:10.1145/3716848
  9. [9] GitHub. 2025. CodeQL. https://github.com/github/codeql
  10. [10] Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu, and Yi Zhang.
  11. [11] Code Fingerprints: Disentangled Attribution of LLM-Generated Code. arXiv:2603.04212 [cs.SE] https://arxiv.org/abs/2603.04212
  12. [12] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference. 11–15. doi:10.25080/TCWV9851
  13. [13] S M Mahedy Hasan, Md Fazle Rabbi, and Minhaz Zibran. 2026. The Quiet Contributions: Insights into AI-Generated Silent Pull Requests. arXiv:2601.21102 [cs.SE] https://arxiv.org/abs/2601.21102 Mining Challenge track of the 23rd International Conference on Mining Software Repositories (MSR 2026).
  14. [14] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (Copenhagen, Denmark) (CCS '23). Association for Computing Machinery, New York, NY, USA, 1865–1879. doi:10.1145/3576915.3623175
  15. [15] joern.io. 2026. Joern: The Bug Hunter's Workbench. https://github.com/joernio/joern
  16. [16] jscpd contributors. 2026. jscpd. https://github.com/kucherenko/jscpd Accessed: 2026-03-26.
  17. [17] Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, and Baba Mamadou Camara.
  18. [18] How Secure is Code Generated by ChatGPT? In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2445–2451. doi:10.1109/SMC53992.2023.10394237
  19. [19] Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv:2507.15003 [cs.SE] https://arxiv.org/abs/2507.15003
  20. [20] Shuang Li, Yuntao Cheng, Jinfu Chen, Jifeng Xuan, Sen He, and Weiyi Shang.
  21. [21] Performance analysis of AI-generated code: A case study of Copilot, Copilot Chat, CodeLlaMa, and DeepSeek-Coder models. Empirical Softw. Engg. 31, 3 (Jan. 2026), 52 pages. doi:10.1007/s10664-025-10776-1
  22. [22] Jie Lin and David Mohaisen. 2025. From Large to Mammoth: A Comparative Evaluation of Large Language Models in Vulnerability Detection. In 32nd Annual Network and Distributed System Security Symposium, NDSS 2025, San Diego, California, USA, February 24-28, 2025. The Internet Society. https://www.ndss-symposium.org/ndss-paper/from-large-to-mammoth-a-com...
  23. [23] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (1947), 50–60. doi:10.1214/aoms/1177730491
  24. [24] Quinn McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika 12, 2 (1947), 153–157. doi:10.1007/BF02295996
  25. [25] Ahmad Mohsin, Helge Janicke, Adrian Wood, Iqbal H. Sarker, Leandros Maglaras, and Naeem Janjua. 2024. Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs. arXiv:2406.12513 [cs.CR] https://arxiv.org/abs/2406.12513
  26. [26] Alfred Santa Molison, Marcia Moraes, Glaucia Melo, Fabio Santos, and Wesley K. G. Assunção. 2025. Is LLM-Generated Code More Maintainable & Reliable Than Human-Written Code? In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 151–162. doi:10.1109/ESEM64174.2025.00036
  27. [27] Daniil Orel, Dilshod Azizov, and Preslav Nakov. 2025. CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics...
  28. [28] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 754–768.
  29. [29] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175. doi:10.1080/14786440009463897
  30. [30] Musfiqur Rahman, SayedHassan Khatoonabadi, Ahmad Abdellatif, and Emad Shihab. 2025. Automatic Detection of LLM-Generated Code: A Comparative Case Study of Contemporary Models Across Function and Class Granularities. arXiv:2409.01382 [cs.SE] https://arxiv.org/abs/2409.01382
  31. [31] Romain Robbes, Théo Matricon, Thomas Degueule, Andre Hora, and Stefano Zacchiroli. 2026. Agentic Much? Adoption of Coding Agents on GitHub. arXiv:2601.18341 [cs.SE] https://arxiv.org/abs/2601.18341
  32. [32] Amirali Sajadi, Kostadin Damevski, and Preetha Chatterjee. 2025. How Safe Are AI-Generated Patches? A Large-scale Study on Security Risks in LLM and Agentic Automated Program Repair on SWE-bench. arXiv:2507.02976 [cs.CR] https://arxiv.org/abs/2507.02976
  33. [33] Andreas Schaad, Stefan Götz, and Dominik Binder. 2025. You Still have to Study: On the Security of LLM Generated Code. In ICT Systems Security and Privacy Protection, Lili Nemec Zlatolas, Kai Rannenberg, Tatjana Welzer, and Joaquin Garcia-Alfaro (Eds.). Springer Nature Switzerland, Cham, 111–124.
  34. [34] Maximilian Schreiber and Pascal Tippe. 2025. Security Vulnerabilities in AI-Generated Code: A Large-Scale Analysis of Public GitHub Repositories. Springer Nature Singapore, 153–172. doi:10.1007/978-981-95-3537-8_9
  35. [35] SciTools. 2026. Understand. https://scitools.com/ Accessed: 2026-03-26.
  36. [36] Mohammed Latif Siddiq, Joanna Cecilia da Silva Santos, Sajith Devareddy, and Anna Muller. 2024. SALLM: Security Assessment of Generated Code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW '24). ACM, 54–65. doi:10.1145/3691621.3694934
  37. [37] Mohammed Latif Siddiq, Xinye Zhao, Vinicius Carvalho Lopes, Beatrice Casey, and Joanna C. S. Santos. 2026. Security in the Age of AI Teammates: An Empirical Study of Agentic Pull Requests on GitHub. arXiv:2601.00477 [cs.CR] https://arxiv.org/abs/2601.00477
  38. [38] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. 2025. Calibration and Correctness of Language Models for Code. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 540–552. doi:10....
  39. [39] Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, and Iftekhar Ahmed. 2025. An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 859–871. doi:10.1109/ICSE55347.2025.00064
  40. [40] tree-sitter contributors. 2025. tree-sitter. https://github.com/tree-sitter/tree-sitter
  41. [41] Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, and Yi Cai. 2024. Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval. arXiv:2407.02395 [cs.SE] https://arxiv.org/abs/2407.02395
  42. [42] Frank Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biometrics Bulletin 1, 6 (1945), 80–83. http://www.jstor.org/stable/3001968
  43. [43] Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, and Sebastian Baltes. 2026. Self-Admitted GenAI Usage in Open-Source Software. arXiv:2507.10422 [cs.SE] https://arxiv.org/abs/2507.10422
  44. [44] Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, and Dongping Chen. 2026. code-transformed: The Influence of Large Language Models on Code. In Findings of the Association for Computational Linguistics: EACL 2026, Vera Demberg, Kentaro Inui, and Lluís Marquez (Eds.). Association for Computational Linguistics, Rabat, Morocco, 5462–5490. doi:1...
  45. [45] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pretrained Models. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 428–439. doi:10.1145/...
  46. [46] Beiqi Zhang, Peng Liang, Qiong Feng, Yujia Fu, and Zengyang Li. 2024. Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE '24). Association for Computing Machinery, New York, NY, USA, 2230–2234. doi:...