
arxiv: 2605.04835 · v1 · submitted 2026-05-06 · 💻 cs.SE · cs.HC

Patterns of Developer Adoption of LLM-Generated Code Refactoring Suggestions

Pith reviewed 2026-05-08 16:10 UTC · model grok-4.3

classification 💻 cs.SE cs.HC
keywords LLM · code refactoring · developer adoption · ChatGPT · GitHub commits · software engineering · AI-assisted development · refactoring patterns

The pith

Developers mostly accept LLM refactoring suggestions unmodified; when they do edit, the changes are mostly major and fall into five patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how developers actually apply suggestions from large language models for code refactoring by examining real GitHub commits. It shows that acceptance without changes is the dominant behavior, while edited suggestions tend to involve substantial rewrites grouped into five recurring patterns. These patterns vary with the specific refactoring goal, the wording of the developer's prompt to the model, and whether the model's output is technically valid. A sympathetic reader would care because this reveals the practical fit of LLM tools inside existing development routines rather than just measuring suggestion quality in isolation.

Core claim

We analyze 169 GitHub commits where developers refactor their code based on a ChatGPT conversation linked in the commit message. Developers mostly accept and use the suggestions without modifications. When changes are made, they are mostly major and fall into five different patterns that depend on the refactoring activity, the developer's prompt, and the validity of the response from ChatGPT.

What carries the argument

Analysis of 169 GitHub commits that contain explicit links to ChatGPT conversations in their commit messages, used to classify developer adoption into direct acceptance versus five categories of modification.

If this is right

  • LLM refactoring suggestions are treated as ready-to-apply in the majority of observed cases.
  • Modification effort concentrates in a small set of predictable patterns rather than arbitrary tweaks.
  • Pattern type correlates with refactoring kind, prompt clarity, and output correctness.
  • Commit-message links to model conversations can serve as a traceable signal for studying AI-assisted edits.
  • Tool builders could prioritize handling invalid responses and ambiguous prompts to raise direct-acceptance rates.
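The commit-link signal mentioned in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the URL pattern and the helper names `find_chatgpt_links` and `select_linked_commits` are assumptions introduced here, and the exact link format should be checked against the data at hand.

```python
import re

# Assumed shape of a shared ChatGPT conversation URL (hypothetical pattern;
# the paper summary does not specify it, so adapt it to your own data).
SHARE_LINK = re.compile(
    r"https://(?:chat\.openai\.com|chatgpt\.com)/share/[0-9a-fA-F-]{36}"
)


def find_chatgpt_links(commit_message: str) -> list[str]:
    """Return every ChatGPT share link found in a commit message, in order."""
    return SHARE_LINK.findall(commit_message)


def select_linked_commits(commits: dict[str, str]) -> dict[str, list[str]]:
    """Map commit SHA -> links, keeping only commits that cite a conversation."""
    linked = {}
    for sha, message in commits.items():
        links = find_chatgpt_links(message)
        if links:
            linked[sha] = links
    return linked
```

A filter like this only finds disclosed conversations, which is exactly the selection effect the referee report raises: commits that used an LLM without linking to it stay invisible.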

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed high acceptance rate may indicate that current LLMs already handle common refactoring requests at a usable baseline level.
  • Similar commit-link methods could be applied to study LLM use in other tasks such as bug fixing or test generation.
  • The five patterns might be turned into targeted training signals to improve future models on the cases where developers currently intervene most.
  • Practitioners could build lightweight checklists around prompt formulation and response validation to reduce downstream editing work.

Load-bearing premise

GitHub commits that mention ChatGPT links in their messages give a representative picture of how developers adopt LLM refactoring suggestions in general.

What would settle it

A broad sample of refactoring commits that use LLMs but omit ChatGPT links, or direct developer surveys, showing substantially lower direct-acceptance rates or different modification patterns.

Figures

Figures reproduced from arXiv: 2605.04835 by David Schön, Faiza Amjad, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner, Ranim Khojah, Tehreem Asif.

Figure 1. Overview of the process we followed in our study. On a high level, we pre-process the DevGPT dataset ...
Figure 2. Relative position of the adopted prompt (e.g., 1st, 2nd, etc.) in the ...
Figure 3. The distribution of similarity scores using Jaccard 3-gram similarity, normalized Levenshtein similarity, the rate of token matched in the refactored ...
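
Two of the similarity measures named in the Figure 3 caption can be illustrated with a short Python sketch. It rests on assumptions that the caption does not settle: 3-grams are taken over whitespace-separated tokens, and the Levenshtein distance is normalized by the length of the longer string. The paper may tokenize or normalize differently, and the function names are introduced here for illustration only.

```python
def ngrams(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over token n-grams (whitespace tokens: an assumption)."""
    grams_a, grams_b = ngrams(a.split(), n), ngrams(b.split(), n)
    if not grams_a and not grams_b:
        return 1.0  # both inputs too short to form any n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max length (a simple normalization, assumed here)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Comparing a ChatGPT suggestion with the committed code under both measures gives a rough view of how much of the suggestion survived: values near 1 suggest adoption without changes, and lower values point to edits.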

Large language models (LLMs) have gained widespread popularity and have steadily improved over time, enabling software developers to use them for various code-related tasks. One common task is code refactoring, where the LLM suggests changes for the developer to apply to their code to improve quality attributes such as readability or maintainability. While current research focuses on evaluating LLM-generated refactoring suggestions, there is a limited understanding of how developers apply these suggestions in practice. To explore this, we analyze 169 GitHub commits where developers refactor their code based on a ChatGPT conversation linked in the commit message. We found that developers mostly accept and use the suggestions without modifications. When changes are made, they are mostly major and fall into five different patterns that depend on the refactoring activity, the developer's prompt, and the validity of the response from ChatGPT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes 169 GitHub commits that include links to ChatGPT conversations for code refactoring tasks. It reports that developers predominantly accept and apply the LLM-generated suggestions without any modifications. When modifications occur, they are typically major and fall into five distinct patterns, which the authors link to the specific refactoring activity, the content of the developer's prompt, and the validity of ChatGPT's response.

Significance. If the core findings hold after addressing methodological details, this work contributes empirical evidence on real-world developer interactions with LLM refactoring suggestions, an area with limited prior observational data. The use of actual public commits provides ecological validity beyond lab studies or surveys, and the identification of modification patterns could inform tool design and prompting strategies in software engineering.

major comments (2)
  1. [Methods] Methods section (data collection): Commits were selected solely via explicit ChatGPT links in commit messages. This criterion filters for public repos, disclosed interactions, and cases developers deemed noteworthy, likely biasing toward successful adoptions. This directly affects the central claim that developers 'mostly accept and use the suggestions without modifications' and the distribution of the five patterns; a limitations discussion or control comparison (e.g., refactoring commits without LLM mentions) is required to assess representativeness.
  2. [Findings] Findings / Data Analysis: The derivation of the five modification patterns lacks sufficient detail on the qualitative process (e.g., open coding procedure, codebook, inter-rater reliability metrics, or how patterns were validated against the 169 commits). Without this, the reliability and reproducibility of the pattern classification cannot be evaluated, which is load-bearing for the claim that patterns 'depend on the refactoring activity, the developer's prompt, and the validity of the response.'
minor comments (2)
  1. [Abstract] Abstract: The abstract states the sample size and high-level findings but could briefly note the data source (public GitHub commits with ChatGPT links) to set expectations for generalizability.
  2. [Results] Ensure that any tables summarizing the five patterns include example commit excerpts or prompt/response pairs to illustrate each pattern concretely.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

  1. Referee: [Methods] Methods section (data collection): Commits were selected solely via explicit ChatGPT links in commit messages. This criterion filters for public repos, disclosed interactions, and cases developers deemed noteworthy, likely biasing toward successful adoptions. This directly affects the central claim that developers 'mostly accept and use the suggestions without modifications' and the distribution of the five patterns; a limitations discussion or control comparison (e.g., refactoring commits without LLM mentions) is required to assess representativeness.

    Authors: We agree that selecting commits via explicit ChatGPT links in messages introduces selection bias toward public repositories, disclosed interactions, and cases developers found noteworthy enough to document. This is an inherent limitation of our data collection approach, which prioritizes ecological validity through real GitHub commits over broader sampling. We will revise the manuscript to include an expanded Limitations section that explicitly discusses this bias, its potential impact on the reported acceptance rates and pattern distributions, and the boundaries of our claims (i.e., they apply to documented LLM-assisted refactorings). A full control comparison with non-LLM refactoring commits would require new data collection outside the current study's scope, but we will clarify how readers should interpret the findings in light of the sampling method. revision: yes

  2. Referee: [Findings] Findings / Data Analysis: The derivation of the five modification patterns lacks sufficient detail on the qualitative process (e.g., open coding procedure, codebook, inter-rater reliability metrics, or how patterns were validated against the 169 commits). Without this, the reliability and reproducibility of the pattern classification cannot be evaluated, which is load-bearing for the claim that patterns 'depend on the refactoring activity, the developer's prompt, and the validity of the response.'

    Authors: We appreciate the call for greater transparency in the qualitative analysis. The five patterns emerged from an iterative thematic analysis of the 169 commits and linked ChatGPT conversations. In the revised manuscript, we will expand the Data Analysis subsection to detail the open coding procedure, codebook development and refinement, the validation steps (including cross-checking patterns against the full dataset), and any inter-coder agreement processes used by the authors. This addition will improve reproducibility and allow better evaluation of the reliability of the classification and its links to refactoring activity, prompts, and response validity. revision: yes
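
The inter-coder agreement discussed in this exchange is often reported as Cohen's kappa. The sketch below shows that calculation for two coders labeling the same commits. It is an assumption that this metric fits the study: the source does not say which agreement measure, if any, the authors used, and the labels in the usage note are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's label frequencies, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, with hypothetical labels `["accept", "accept", "modify", "modify"]` from one coder and `["accept", "accept", "modify", "accept"]` from the other, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.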

Circularity Check

0 steps flagged

No circularity: purely empirical observational study


The paper conducts a qualitative analysis of 169 GitHub commits containing explicit links to ChatGPT conversations. All claims (acceptance rates, modification patterns) are stated as direct observations from manual review of the selected commits and their linked conversations. There are no equations, fitted parameters, predictions derived from models, self-citations used as load-bearing premises, or ansatzes. The derivation chain consists of data selection followed by categorization; these steps do not reduce to their inputs by construction and contain no self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical analysis of public repository data; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: GitHub commits with linked ChatGPT conversations represent genuine developer adoption of refactoring suggestions
    The study relies on this to interpret the commits as evidence of usage.

pith-pipeline@v0.9.0 · 5460 in / 1270 out tokens · 30778 ms · 2026-05-08T16:10:47.926992+00:00 · methodology

