arxiv: 2605.21453 · v1 · pith:3VM5M5PAnew · submitted 2026-05-20 · 💻 cs.SE · cs.AI

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Pith reviewed 2026-05-21 02:57 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-generated code · refactoring · pull requests · code quality · security analysis · Python · static analysis · empirical study

The pith

AI-generated Python refactoring pull requests improve quality attributes in 22.5 percent of changes while introducing new issues in 24 percent of files, yet 73.5 percent are still merged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies real GitHub pull requests where AI agents perform Python refactoring to measure their effects on maintainability, quality, and security. It tracks five quality attributes before and after each edit and checks for new code style violations plus security problems. Improvements occur in roughly one in five cases with usability seeing the largest share, but new violations appear in nearly a quarter of the modified files. Most of these pull requests are accepted by developers even when the changes add issues. The work concludes that current agentic refactoring shows both gains and risks, pointing to the value of added checks in AI-driven workflows.

Core claim

Agentic commits improve a quality attribute in 22.5 percent of the studied changes, with usability improving most frequently at 36.5 percent. At the same time, 24.17 percent of modified files introduce new issues from static analysis, mostly convention-level violations such as long lines, while 4.7 percent introduce new security findings. A taxonomy of 24 recurring change operations is derived from the diffs and mapped to the lint and security findings they most commonly affect. Despite these mixed results, 73.5 percent of the analyzed pull requests are merged, including cases that introduce new findings alongside removal of existing ones.

What carries the argument

The before-and-after comparison of quality attributes and static-analysis issues across refactoring pull requests, combined with the taxonomy of 24 recurring change operations that links specific edits to the violations they tend to create or remove.
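
As an illustration of the before-and-after comparison, the following minimal sketch counts which lint findings a change introduces or removes. It assumes each finding is a dict with a `symbol` key, as in Pylint's JSON output. The function names and the choice to compare per-symbol counts (rather than line positions, which shift during refactoring) are assumptions for illustration, not the paper's documented procedure.

```python
from collections import Counter


def introduced_findings(before, after):
    """Count symbols that occur more often after the change than before."""
    before_counts = Counter(msg["symbol"] for msg in before)
    after_counts = Counter(msg["symbol"] for msg in after)
    # Counter subtraction keeps only strictly positive differences.
    return after_counts - before_counts


def removed_findings(before, after):
    """Count symbols that occur less often after the change than before."""
    return introduced_findings(after, before)


# Hypothetical findings for one file, shaped like Pylint JSON messages.
before = [
    {"type": "convention", "symbol": "missing-function-docstring", "line": 3},
    {"type": "warning", "symbol": "unused-import", "line": 1},
]
after = [
    {"type": "convention", "symbol": "line-too-long", "line": 10},
    {"type": "convention", "symbol": "missing-function-docstring", "line": 3},
]

new_issues = introduced_findings(before, after)
fixed_issues = removed_findings(before, after)
```

In this made-up example the change adds one `line-too-long` finding and removes one `unused-import` finding, which is the kind of mixed outcome the paper reports for merged pull requests.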

If this is right

  • Quality gains remain limited and concentrated on certain attributes such as usability.
  • New convention violations appear more often than security problems in the modified files.
  • Developers accept many changes that add issues when other problems are also removed.
  • The derived taxonomy shows which common refactoring steps most frequently affect lint and security signals.
  • Current outcomes motivate adding quality and security checks into AI-assisted development processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed pattern suggests agents may favor surface-level refactors that touch usability metrics more readily than deeper structural improvements.
  • Over repeated merges, the introduced convention issues could accumulate into larger technical debt if review processes do not catch them.
  • Applying similar before-and-after tracking to non-refactoring AI edits or other programming languages would test whether the same balance of gains and new issues appears.
  • Integrating the static-analysis checks directly into the agent loop could lower the rate of new issues without reducing the observed merge acceptance.
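
A tool-in-the-loop gate of the kind suggested in the last bullet might look like the sketch below. It takes the findings a change newly introduces and returns a verdict plus reasons that could be fed back to the agent. The field names follow Pylint's and Bandit's JSON reports (`type`, `symbol`, `issue_severity`, `test_id`); the blocking policy and the `gate` function are hypothetical and not drawn from the paper.

```python
# Hypothetical policy: which newly introduced findings block a change.
BLOCKING_PYLINT_TYPES = {"error", "fatal"}
BLOCKING_BANDIT_SEVERITIES = {"MEDIUM", "HIGH"}


def gate(new_pylint_messages, new_bandit_results):
    """Return (allowed, reasons) for a set of newly introduced findings."""
    reasons = []
    for msg in new_pylint_messages:
        if msg["type"] in BLOCKING_PYLINT_TYPES:
            reasons.append(f"pylint {msg['type']}: {msg['symbol']}")
    for result in new_bandit_results:
        if result["issue_severity"] in BLOCKING_BANDIT_SEVERITIES:
            reasons.append(
                f"bandit {result['issue_severity']}: {result['test_id']}"
            )
    return (not reasons, reasons)


# A convention-level finding alone passes under this policy.
style_only = gate([{"type": "convention", "symbol": "line-too-long"}], [])

# A new medium-severity security finding blocks the change.
security_hit = gate([], [{"test_id": "B301", "issue_severity": "MEDIUM"}])
```

Letting convention-level findings pass reflects the paper's observation that most new issues are of that kind; a stricter policy would simply add `"convention"` to the blocking set.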

Load-bearing premise

That the quality attributes and static-analysis flags used in the study serve as valid proxies for actual maintainability, code quality, and security once the changes enter production codebases.

What would settle it

A follow-up measurement of actual maintenance effort, bug rates, or security incidents in the same repositories after the merged AI-refactored changes have been in use for several months.

Figures

Figures reproduced from arXiv: 2605.21453 by Anwar Ghammam, Hua Ming, Mohamed Almukhtar.

Figure 1: Enhancement Rates by Agent and Quality Attribute.

Original abstract

As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues predominantly convention level violations such as long lines-while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of AI-generated Python refactoring pull requests drawn from the AIDev dataset. It applies the ML-based PyQu tool to quantify before/after changes across five quality attributes and augments this with Pylint and Bandit static analysis to detect new code-quality and security issues. The key reported results are: agentic commits improve a quality attribute in 22.5 % of changes on average (usability most often, at 36.5 %); 24.17 % of modified files introduce new Pylint issues (predominantly convention-level violations such as long lines); 4.7 % introduce new Bandit findings; a taxonomy of 24 recurring change operations is derived from the diffs; and 73.5 % of the PRs are merged, including cases in which new issues are present.

Significance. If the chosen tool-based signals are accepted as reasonable proxies, the study supplies concrete, large-scale empirical evidence on the mixed quality and security effects of agentic refactoring in real open-source Python projects. The taxonomy that links specific edit operations to the lint and security findings they most frequently trigger is a useful contribution that can inform future tool design. The high merge rate despite detectable regressions also supplies a practical baseline for discussions of AI-assisted development workflows.

major comments (2)
  1. [§4 (Results)] §4 (Results) and abstract: the headline statistics (22.5 % quality-attribute improvement, 24.17 % new Pylint issues, 4.7 % new Bandit findings, 73.5 % merge rate) are presented without the total number of PRs or files examined, confidence intervals, or any statistical tests. This information is necessary to judge the precision and robustness of the reported percentages.
  2. [Discussion] Discussion: the central narrative of 'mixed outcomes' and the call for stronger tool-in-the-loop gating rest on the assumption that deltas produced by PyQu, Pylint, and Bandit track real maintainability and security impact after merge. No post-merge defect analysis, longitudinal study, or developer survey is supplied to test this proxy assumption, which is load-bearing for the interpretive claims.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'long lines-while 4.7%' is missing a comma or space and should read 'long lines, while 4.7%'.
  2. [Methodology] Methodology: the five quality attributes scored by PyQu are not defined or justified in the text; a brief description of what each attribute measures would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the emphasis on statistical rigor and the validity of our proxy measures. We address each major comment below and have revised the manuscript accordingly.

  1. Referee: [§4 (Results)] §4 (Results) and abstract: the headline statistics (22.5 % quality-attribute improvement, 24.17 % new Pylint issues, 4.7 % new Bandit findings, 73.5 % merge rate) are presented without the total number of PRs or files examined, confidence intervals, or any statistical tests. This information is necessary to judge the precision and robustness of the reported percentages.

    Authors: We agree that the headline figures require accompanying sample sizes and uncertainty estimates for proper interpretation. The full analysis is based on the AIDev dataset, and we have revised both the abstract and §4 to explicitly report the total number of PRs and modified files examined. We have also added 95% confidence intervals (using the Wilson method for proportions) and results from appropriate paired statistical tests (e.g., McNemar’s test for before/after issue counts) with p-values to support the robustness of the reported percentages.

    revision: yes

  2. Referee: [Discussion] Discussion: the central narrative of 'mixed outcomes' and the call for stronger tool-in-the-loop gating rest on the assumption that deltas produced by PyQu, Pylint, and Bandit track real maintainability and security impact after merge. No post-merge defect analysis, longitudinal study, or developer survey is supplied to test this proxy assumption, which is load-bearing for the interpretive claims.

    Authors: This comment correctly identifies a scope limitation of the study. Our work examines observable pre-merge signals using established static-analysis tools as proxies; we do not claim these directly measure post-merge outcomes. We have expanded the Discussion to explicitly state this assumption, discuss threats to validity arising from the proxy nature of the metrics, and outline future work involving longitudinal defect tracking and developer surveys. These additions clarify the interpretive boundaries without altering the core empirical contributions.

    revision: partial
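
The Wilson interval mentioned in the first response can be computed directly from a count and a sample size. The sketch below is a plain implementation of the standard Wilson score formula; the function name is an assumption, and since the page does not report the number of files examined, the totals in the usage lines are placeholders, not the paper's data.

```python
import math


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# Placeholder counts: NOT taken from the paper, which does not state them here.
hypothetical_total_files = 1000
hypothetical_files_with_new_issues = 242

low, high = wilson_interval(
    hypothetical_files_with_new_issues, hypothetical_total_files
)
print(f"95% Wilson interval: {low:.3f} to {high:.3f}")
```

The width of the interval depends strongly on the sample size, which is why the referee asks for the total number of PRs and files alongside each percentage.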

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external dataset

The paper's results consist of straightforward before/after counts and percentages obtained by running PyQu, Pylint, and Bandit on PRs from the external AIDev dataset, plus merge-rate statistics from GitHub metadata. The taxonomy of 24 change operations is extracted directly from observed diffs in the data. No equations, fitted parameters, or derivations are presented as predictions; no self-citations or uniqueness theorems are invoked to support the central claims. The analysis is self-contained against the independent outputs of the static-analysis tools and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The reported quality and security signals rest on the assumption that PyQu, Pylint, and Bandit outputs meaningfully capture maintainability and risk; the study adds no new entities or fitted parameters beyond the observed counts.

axioms (1)
  • domain assumption: PyQu, Pylint, and Bandit outputs meaningfully capture maintainability, code quality, and security impacts
    These tools are used to quantify changes before and after each refactoring edit and to derive the taxonomy of effects.

pith-pipeline@v0.9.0 · 5829 in / 1423 out tokens · 45095 ms · 2026-05-21T02:57:48.877409+00:00 · methodology

