pith. sign in

arxiv: 2605.22526 · v1 · pith:CDKNRR2Hnew · submitted 2026-05-21 · 💻 cs.SE

"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution

Pith reviewed 2026-05-22 03:54 UTC · model grok-4.3

classification 💻 cs.SE
keywords tangled refactoringscoding agentsissue resolutioncompilabilityrefactoringLLM agentssoftware patchesempirical study
0
0 comments X

The pith

Coding agents introduce tangled refactorings less often than humans, but those they add reduce patch compilability; a targeted refinement step more than doubles the compilability rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes thousands of patches generated by coding agents on real software issues and compares them to human patches. Agents produce tangled refactorings at lower frequency and intensity than humans, yet these refactorings still correlate strongly with failures to compile. The authors introduce a refinement method that checks each tangled refactoring for necessity and safety, then removes or repairs the problematic ones. Applying this method raises compilability from 19.34 percent to 38.33 percent and resolves a small additional share of previously unsolved issues.

Core claim

Tangled refactorings appear less frequently and with lower intensity in agent patches than in human ones, but they remain associated with sharply reduced compilability while showing no link to functional correctness. A refactoring-aware refinement approach that assesses necessity and safety of each tangled refactoring and then selectively removes or repairs the unsafe ones improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.

What carries the argument

The refactoring-aware refinement approach, which identifies tangled refactorings in generated patches and judges their necessity and safety before deciding whether to remove or repair them.

If this is right

  • Tangled refactorings reduce compilability of agent-generated patches.
  • Agents display a wider range of refactoring types than human developers.
  • The refinement step can be applied after any agent run to raise the share of compilable patches.
  • Logistic regression shows the negative effect on compilability is independent of functional correctness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents could be prompted or trained to limit refactoring to only what is required for the fix.
  • The same necessity-and-safety check might help in other code-change tasks such as feature implementation.
  • Training data drawn from open-source repositories may explain why agents still produce some tangled refactorings.

Load-bearing premise

That tangled refactorings found in agent patches can be judged for necessity and safety by the refinement method without missing important context or creating fresh problems.

What would settle it

Running the refinement on a new collection of agent patches and observing no rise in compilability or the appearance of new compilation errors.

Figures

Figures reproduced from arXiv: 2605.22526 by Dong Wang, Junjie Chen, Masanari Kondo, Tao Xiao, Yasutaka Kamei, Zhao Tian, Zifan Zhang.

Figure 1
Figure 1. Figure 1: Real-world examples of unintentional refactorings performed by coding agents during issue resolutions. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The dataset preparation process for empirical analysis of tangled refactorings in coding agents for [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of the top-10 tangled refactoring types by agent framework (aggregated across 3 agent [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of the top-10 tangled refactoring types by base LLM (aggregated across 12 LLMs). Each [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
read the original abstract

Recent advances in coding agents have shown remarkable progress in software issue resolution. In practice, real-world issues are typically bug fixes or feature requests in which human developers naturally incorporate refactoring as part of the resolution process, resulting in tangled refactoring. Since LLMs are trained on large-scale open-source repositories, coding agents may inherit such behaviors. In this paper, we conduct an empirical study on Multi-SWE-bench, analyzing 3,691 valid patches generated by three agent frameworks with 12 LLMs. We find that coding agents introduce tangled refactorings less frequently (21.43% vs. 36.72%) and with lower intensity (0.66 vs. 1.75) than human developers, although they exhibit a broader diversity of refactoring types. Logistic regression analysis further shows that tangled refactorings are strongly associated with reduced compilability, while exhibiting no significant association with functional correctness. Based on these findings, we propose a refactoring-aware refinement approach that assesses the necessity and safety of tangled refactorings and selectively removes or repairs problematic operations. Our approach improves compilability from 19.34% to 38.33%, and additionally resolves 2.79% previously unresolved issues. Overall, this work presents the first step towards understanding tangled refactoring practices in agentic issue resolution and opens up avenues for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts an empirical study on 3,691 valid patches generated by three coding agent frameworks using 12 LLMs on the Multi-SWE-bench dataset. It reports that agents introduce tangled refactorings less frequently (21.43% vs. 36.72%) and with lower intensity (0.66 vs. 1.75) than human developers, though with broader diversity of refactoring types. Logistic regression shows tangled refactorings are strongly associated with reduced compilability but not with functional correctness. The authors propose an LLM-based refactoring-aware refinement approach to assess necessity and safety of tangled refactorings and selectively remove or repair them, claiming this raises compilability from 19.34% to 38.33% while resolving an additional 2.79% of previously unresolved issues.

Significance. If the refinement method can be shown to correctly distinguish necessary from unnecessary refactorings without discarding functional fixes or introducing new errors, the work supplies a concrete, deployable technique for improving the reliability of agent-generated patches in real issue-resolution workflows. The comparative statistics on agent versus human refactoring practices add to the growing literature on how LLM training corpora shape code-editing behavior. The study is a useful first step in characterizing side-effects of agentic code generation.

major comments (1)
  1. [Refinement approach section] Description of the refactoring-aware refinement approach: the necessity and safety assessment is described as LLM-based judgment performed on the patch. No explicit mechanism (prompt template, retrieval step, or rule set) is provided for incorporating the original issue description, failing test cases, or resolution intent. Because the reported compilability gain (19.34% → 38.33%) and the additional 2.79% resolutions rest directly on the correctness of these removals/repairs, the absence of grounding context is load-bearing and requires either a detailed prompt appendix or an ablation that measures functional regression after refinement.
minor comments (2)
  1. [Abstract and §3] The abstract states logistic regression results but supplies no model specification, variable list, or goodness-of-fit statistics; these details should appear in the main text or an appendix to support reproducibility.
  2. [Empirical study section] Table or figure reporting the 3,691-patch breakdown by framework and LLM is missing; adding it would allow readers to judge whether the frequency and intensity differences are driven by particular model families.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment below and will incorporate the suggested clarifications and additional analysis in the revised manuscript.

read point-by-point responses
  1. Referee: [Refinement approach section] Description of the refactoring-aware refinement approach: the necessity and safety assessment is described as LLM-based judgment performed on the patch. No explicit mechanism (prompt template, retrieval step, or rule set) is provided for incorporating the original issue description, failing test cases, or resolution intent. Because the reported compilability gain (19.34% → 38.33%) and the additional 2.79% resolutions rest directly on the correctness of these removals/repairs, the absence of grounding context is load-bearing and requires either a detailed prompt appendix or an ablation that measures functional regression after refinement.

    Authors: We agree that the current description of the refactoring-aware refinement approach is insufficiently detailed regarding the incorporation of grounding context. In the revised manuscript we will add the complete prompt template to an appendix. The template explicitly includes the original issue description, the failing test cases, and the resolution intent as part of the input provided to the LLM for necessity and safety assessment. We will also add an ablation study that reports functional correctness (pass rates on the original test suite) before and after refinement to confirm that the observed gains in compilability do not introduce regressions or discard functional fixes. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical observational study with independent analysis

full rationale

The paper is an empirical study that analyzes 3,691 patches from coding agents on Multi-SWE-bench, applies logistic regression to report associations between tangled refactorings and compilability/functional correctness, and then proposes a refinement approach whose performance gains are measured directly on the data. No mathematical derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. The logistic regression serves as standard statistical reporting of observed associations rather than a tautological reduction, and the reported improvements (19.34% to 38.33% compilability) are presented as empirical outcomes of the proposed method rather than forced by construction from the inputs. The work is self-contained against external benchmarks with no evident circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review conducted on abstract only; no explicit free parameters, invented entities, or non-standard axioms are described in the provided text. Logistic regression implicitly relies on standard statistical assumptions.

axioms (1)
  • domain assumption Logistic regression assumptions hold, including linearity of the logit and independence of observations.
    Invoked for analyzing association between tangled refactorings and compilability/functional correctness.

pith-pipeline@v0.9.0 · 5790 in / 1211 out tokens · 50326 ms · 2026-05-22T03:54:00.845261+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 3 internal anchors

  1. [1]

    Pouria Alikhanifard and Nikolaos Tsantalis. 2025. A Novel Refactoring and Semantic Aware Abstract Syntax Tree Differencing Tool and a Benchmark for Evaluating the Accuracy of Diff Tools.ACM Transactions on Software Engineering and Methodology34, 2, Article 40 (Jan. 2025), 63 pages. doi:10.1145/3696002

  2. [2]

    Amirreza Bagheri and Péter Hegedüs. 2022. Is refactoring always a good egg? exploring the interconnection between bugs and refactorings. InProceedings of the 19th International Conference on Mining Software Repositories(Pittsburgh, J. ACM, Vol. 1, No. 1, Article 1. Publication date: March 2026. 1:26 Zhao Tian, Zifan Zhang, Tao Xiao, Dong Wang, Masanari Ko...

  3. [3]

    Gabriele Bavota, Bernardino De Carluccio, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, and Orazio Strollo

  4. [4]

    InProceedings of the 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation (SCAM ’12)

    When Does a Refactoring Induce Bugs? An Empirical Study. InProceedings of the 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation (SCAM ’12). IEEE Computer Society, USA, 104–113. doi:10.1109/SCAM.2012.20

  5. [5]

    Dennis Bijlsma, Miguel Alexandre Ferreira, Bart Luijten, and Joost Visser. 2012. Faster issue resolution with higher technical quality of software.Software quality journal20, 2 (2012), 265–285

  6. [6]

    Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, and Baishakhi Ray. 2025. Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study. arXiv:2506.08311 [cs.SE] https://arxiv.org/ abs/2506.08311

  7. [7]

    Zhi Chen and Lingxiao Jiang. 2025. Evaluating software development agents: Patch patterns, code quality, and issue complexity in real-world github scenarios. In2025 IEEE international conference on software analysis, evolution and reengineering (SANER). IEEE, 657–668

  8. [8]

    Zhi Chen, Wei Ma, and Lingxiao Jiang. 2026. Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios. InProceedings of the IEEE/ACM 48th International Conference on Software Engineering

  9. [9]

    Steve Counsell, Vesna Nowack, Tracy Hall, David Bowes, Saemundur Haraldsson, Emily Winter, and John Woodward

  10. [10]

    In2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)

    An 80-20 Analysis of Buggy and Non-buggy Refactorings in Open-Source Commits. In2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 197–200. doi:10.1109/SEAA56994.2022.00038

  11. [11]

    Massimiliano Di Penta, Gabriele Bavota, and Fiorella Zampetti. 2020. On the relationship between refactoring actions and bugs: a differentiated replication. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Virtual Event, USA)(ESEC/FSE 2020). Association for Com...

  12. [12]

    Martín Dias, Alberto Bacchelli, Georgios Gousios, Damien Cassou, and Stéphane Ducasse. 2015. Untangling fine- grained code changes. In2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). 341–350. doi:10.1109/SANER.2015.7081844

  13. [13]

    Isabella Ferreira, Eduardo Fernandes, Diego Cedrim, Anderson Uchôa, Ana Carla Bibiano, Alessandro Garcia, João Lu- cas Correia, Filipe Santos, Gabriel Nunes, Caio Barbosa, Baldoino Fonseca, and Rafael de Mello. 2018. The buggy side of code refactoring: understanding the relationship between refactorings and bugs. InProceedings of the 40th International Co...

  14. [14]

    2018.Refactoring: improving the design of existing code

    Martin Fowler. 2018.Refactoring: improving the design of existing code. Addison-Wesley Professional

  15. [15]

    Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, et al. 2025. Trae agent: An llm-based agent for software engineering with test-time scaling.arXiv preprint arXiv:2507.23370(2025)

  16. [16]

    Yuan Gao, Hui Liu, Xiaozhong Fan, Zhendong Niu, and Bridget Nyirongo. 2015. Analyzing Refactorings’ Impact on Regression Test Cases. InProceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference - Volume 02 (COMPSAC ’15). IEEE Computer Society, USA, 222–231. doi:10.1109/COMPSAC.2015.16

  17. [17]

    Ahmed E Hassan. 2009. Predicting faults using the complexity of code changes. In2009 IEEE 31st international conference on software engineering. IEEE, 78–88

  18. [18]

    Steffen Herbold, Alexander Trautsch, Benjamin Ledel, Alireza Aghamohammadi, Taher A. Ghaleb, Kuljit Kaur Chahal, Tim Bossenmaier, Bhaveet Nagaria, Philip Makedonski, Matin Nili Ahmadabadi, Kristof Szabados, Helge Spieker, Matej Madeja, Nathaniel Hoy, Valentina Lenarduzzi, Shangwen Wang, Gema Rodríguez-Pérez, Ricardo Colomo- Palacios, Roberto Verdecchia, P...

  19. [19]

    Kim Herzig and Andreas Zeller. 2013. The impact of tangled code changes. InProceedings of the 10th Working Conference on Mining Software Repositories(San Francisco, CA, USA)(MSR ’13). IEEE Press, 121–130

  20. [20]

    Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, and Ahmed E. Hassan. 2025. Agentic Refactoring: An Empirical Study of AI Coding Agents. arXiv:2511.04824 [cs.SE] https://arxiv.org/abs/2511.04824

  21. [21]

    2013.Applied logistic regression

    David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. 2013.Applied logistic regression. John Wiley & Sons

  22. [22]

    Refactoring Runaway

    Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014.An Introduction to Statistical Learning: with Applications in R. Springer Publishing Company, Incorporated. J. ACM, Vol. 1, No. 1, Article 1. Publication date: March 2026. “Refactoring Runaway”: Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution 1:27

  23. [23]

    Zhonghao Jiang, David Lo, and Zhongxin Liu. 2025. Agentic Software Issue Resolution with Large Language Models: A Survey.arXiv preprint arXiv:2512.22256(2025)

  24. [24]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66

  25. [25]

    Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From llms to llm-based agents for software engineering: A survey of current, challenges and future.arXiv preprint arXiv:2408.02479(2024)

  26. [26]

    Yutaro Kashiwa, Kazuki Shimizu, Bin Lin, Gabriele Bavota, Michele Lanza, Yasutaka Kamei, and Naoyasu Ubayashi

  27. [27]

    In2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)

    Does Refactoring Break Tests and to What Extent?. In2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). 171–182. doi:10.1109/ICSME52107.2021.00022

  28. [28]

    Jong Hae Kim. 2019. Multicollinearity and misleading statistical results.Korean journal of anesthesiology72, 6 (2019), 558–569

  29. [29]

    Hiroyuki Kirinuki, Yoshiki Higo, Keisuke Hotta, and Shinji Kusumoto. 2014. Hey! are you committing tangled changes?. InProceedings of the 22nd International Conference on Program Comprehension(Hyderabad, India)(ICPC 2014). Association for Computing Machinery, New York, NY, USA, 262–265. doi:10.1145/2597008.2597798

  30. [30]

    Zengyang Li, Guangzong Cai, Qinyi Yu, Peng Liang, Ran Mo, and Hui Liu. 2024. Bug priority change: An empirical study on Apache projects.Journal of Systems and Software212 (2024), 112019

  31. [31]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.ACM Comput. Surv.55, 9, Article 195 (Jan. 2023), 35 pages. doi:10.1145/3560815

  32. [32]

    Mehran Mahmoudi, Sarah Nadi, and Nikolaos Tsantalis. 2019. Are Refactorings to Blame? An Empirical Study of Refactorings in Merge Conflicts. In2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 151–162. doi:10.1109/SANER.2019.8668012

  33. [33]

    Matias Martinez and Xavier Franch. 2026. What’s in a Benchmark? The Case of SWE-Bench in Automated Program Repair. InProceedings of the 48th International Conference on Software Engineering: Software Engineering in Practice. IEEE

  34. [34]

    Davood Mazinanian, Ameya Ketkar, Nikolaos Tsantalis, and Danny Dig. 2017. Understanding the use of lambda expressions in Java.Proc. ACM Program. Lang.1, OOPSLA, Article 85 (Oct. 2017), 31 pages. doi:10.1145/3133909

  35. [35]

    Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black. 2009. How we refactor, and how we know it. InProceedings of the 31st International Conference on Software Engineering (ICSE ’09). IEEE Computer Society, USA, 287–297. doi:10. 1109/ICSE.2009.5070529

  36. [36]

    Nachiappan Nagappan and Thomas Ball. 2005. Use of relative code churn measures to predict system defect density. InProceedings of the 27th international conference on Software engineering. 284–292

  37. [37]

    Stas Negara, Nicholas Chen, Mohsen Vakilian, Ralph E Johnson, and Danny Dig. 2013. A comparative study of manual and automated refactorings. InEuropean Conference on Object-Oriented Programming. Springer, 552–576

  38. [38]

    Feifei Niu, Junqian Shao, Christoph Mayr-Dorn, Liguo Huang, KG Wesley Assunção, Chuanyi Li, Jidong Ge, and Alexander Egyed. 2025. Refactoring ≠ bug-inducing: Improving defect prediction with code change tactics analysis. In 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 527–538

  39. [39]

    Lukas Ottenhof, Daniel Penner, Abram Hindle, and Thibaud Lutellier. 2026. How do Agents Refactor: An Empirical Study.arXiv preprint arXiv:2601.20160(2026)

  40. [40]

    Anvith Pabba, Simin Chen, Alex Mathai, Anindya Chakraborty, and Baishakhi Ray. 2025. REFINE: Enhancing Program Repair Agents through Context-Aware Patch Refinement. arXiv:2510.03588 [cs.SE] https://arxiv.org/abs/2510.03588

  41. [41]

    Decker, and Christian D

    Anthony Peruma, Mohamed Wiem Mkaouer, Michael J. Decker, and Christian D. Newman. 2018. An empirical investigation of how and why developers rename identifiers. InProceedings of the 2nd International Workshop on Refactoring(Montpellier, France)(IWoR 2018). Association for Computing Machinery, New York, NY, USA, 26–33. doi:10.1145/3242163.3242169

  42. [42]

    Amirali Sajadi, Kostadin Damevski, and Preetha Chatterjee. 2025. How Safe Are AI-Generated Patches? A Large-scale Study on Security Risks in LLM and Agentic Automated Program Repair on SWE-bench.arXiv preprint arXiv:2507.02976 (2025)

  43. [43]

    Danilo Silva, João Paulo da Silva, Gustavo Santos, Ricardo Terra, and Marco Tulio Valente. 2021. RefDiff 2.0: A Multi-Language Refactoring Detection Tool.IEEE Transactions on Software Engineering47, 12 (2021), 2786–2802. doi:10.1109/TSE.2020.2968072

  44. [44]

    Xin Sun, Daniel Ståhl, Kristian Sandahl, and Christoph Kessler. 2025. Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics.arXiv preprint arXiv:2511.10271(2025)

  45. [45]

    Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. 2024. MAGIS: LLM-based multi-agent framework for GitHub issue ReSolution. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article J. ACM, Vol. 1, No. 1, Art...

  46. [46]

    Yida Tao and Sunghun Kim. 2015. Partitioning composite code changes to facilitate code review. InProceedings of the 12th Working Conference on Mining Software Repositories(Florence, Italy)(MSR ’15). IEEE Press, 180–190

  47. [47]

    Refactoring Runaway

    Zhao Tian, Zifan Zhang, Tao Xiao, Dong Wang, Manasanari Kondo, Junjie Chen, and Kamei Yasutaka. [n. d.]. Research artifact – "Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution. https://github.com/zifan-zhang/llm-refactoring-research [Online; accessed 2026-05-21]

  48. [48]

    Nikolaos Tsantalis, Ameya Ketkar, and Danny Dig. 2022. RefactoringMiner 2.0.IEEE Transactions on Software Engineering48, 3 (2022), 930–950. doi:10.1109/TSE.2020.3007722

  49. [49]

    Dong Wang, Hanmo You, Lingwei Zhu, Kaiwei Lin, Zheng Chen, Chen Yang, Junji Yu, Zan Wang, and Junjie Chen

  50. [50]

    A Survey of Reinforcement Learning for Software Engineering.arXiv preprint arXiv:2507.12483(2025)

  51. [51]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

  52. [52]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-Based Software Engineering Agents.Proc. ACM Softw. Eng.2, FSE, Article FSE037 (June 2025), 24 pages. doi:10.1145/3715754

  53. [53]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: agent-computer interfaces enable automated software engineering. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, ...

  54. [54]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling Data for Software Engineering Agents. arXiv:2504.21798 [cs.SE] https://arxiv.org/abs/2504.21798

  55. [55]

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. arXiv:2504.02605 [cs.SE] https://arxiv.org/abs/2504.02605

  56. [56]

    Refactoring Runaway

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria)(ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 1592–1604. doi:10.1145/ 3650212.3680384 J. ACM, Vol. 1, No...

  57. [57]

    Correctness first: the fix must resolve the reported issue

  58. [58]

    Minimize change scope: avoid unnecessary modifications

  59. [59]

    Refactoring discipline: keep only refactorings that directly support the fix or are tightly coupled to it

  60. [60]

    Maintain backward compatibility: do not introduce breaking changes to public APIs unless absolutely necessary

  61. [61]

    A.2 Component 1: Refactoring Assessment

    Preserve test interfaces: avoid changes that break existing test contracts Note: Output format requirements are specified in the phase-specific system prompts. A.2 Component 1: Refactoring Assessment

  62. [62]

    Your goal is to assess each refactoring detected in a bug-fix patch and recommend appropriate actions

    Component-specific system prompt template: You are a senior Java engineer specializing in patch analysis and refactoring evaluation. Your goal is to assess each refactoring detected in a bug-fix patch and recommend appropriate actions. Evaluation Dimensions:

  63. [63]

    **Necessity** - Is this refactoring necessary for the fix to work? - NECESSARY: The fix cannot work without this refactoring (e.g., must add parameter to pass critical data) - UNNECESSARY: The refactoring is cosmetic cleanup unrelated to the fix

  64. [64]

    ACM, Vol

    **Safety** - Is the refactoring implementation correct? - SAFE: The refactoring is correctly implemented with no issues * All API changes are consistently applied across all call sites * Type signatures are compatible or properly updated * Behavior is preserved (same logic, proper wiring, equivalent control flow) - UNSAFE: The refactoring has implementati...

  65. [65]

    unsafe", action MUST be

    **Recommended Action** - What should be done with this refactoring? - KEEP: Necessary + safe -> keep as-is - REMOVE: Unnecessary (even if safe) -> remove entirely to minimize patch - FIX: Necessary but unsafe -> fix the safety issues while preserving functionality Decision Process: For EACH detected refactoring, evaluate: a. Locate the exact diff hunks th...

  66. [66]

    <issue> {issue_description} </issue> <code> {code_sections} </code> <patch> {patch_diff} </patch> <detected_refactorings> J

    Case-specific user prompt template: You will be provided with an issue description, optional code context, the candidate patch diff, and a list of detected refactorings. <issue> {issue_description} </issue> <code> {code_sections} </code> <patch> {patch_diff} </patch> <detected_refactorings> J. ACM, Vol. 1, No. 1, Article 1. Publication date: March 2026. 1...

  67. [67]

    Locate the corresponding diff hunks (cite file names and line numbers) b

    For EACH refactoring in the list above: a. Locate the corresponding diff hunks (cite file names and line numbers) b. Assess NECESSITY: Is it necessary/unnecessary for the bug fix? c. Check SAFETY: Are there any implementation issues? - Missing call site updates? - Type incompatibilities? - Behavior changes? d. If unsafe, list specific safety issues e. Rec...

  68. [68]

    overall_verdict

    Provide overall assessment: - "overall_verdict": "all_safe" (all are safe and necessary) | "has_issues" (some need action) | "uncertain" - "actions_needed": Count how many refactorings need each action type

  69. [69]

    all_safe

    Be OBJECTIVE: - Necessary refactorings with implementation issues should be marked for FIX , not REMOVE - Unnecessary refactorings should be marked for REMOVE, even if they are safe - Focus on minimizing the patch while preserving the fix functionality - Keep reasoning consistent with safety/action (no contradictions) - If a refactoring is fully supported...

  70. [70]

    Refactoring Runaway

    Component-specific system prompt template: You are a senior Java engineer tasked with refining a bug-fix patch. Your goal is to apply the specified actions to each refactoring while keeping the core bug fix intact. J. ACM, Vol. 1, No. 1, Article 1. Publication date: March 2026. “Refactoring Runaway”: Understanding and Mitigating Tangled Refactorings in Co...

  71. [71]

    **Preserve the bug fix**: The primary fix must remain functional after refinement

  72. [72]

    **Apply actions precisely**: - KEEP: Leave the refactoring changes as-is - REMOVE: Revert only the refactoring-related changes, preserving the bug fix - FIX: Apply the fix_suggestion to make the refactoring safe

  73. [73]

    **Minimize changes**: Only modify what is necessary to apply the actions

  74. [74]

    git apply

    **Maintain compilability**: Ensure the refined patch will compile successfully Output requirements: - Output ONLY a unified diff for the refined patch - Use standard Git diff format with the following structure: * diff --git a/path b/path * (optional) index line * --- a/path * +++ b/path * @@ ... @@ hunks - Each file change must include at minimum: diff -...

  75. [75]

    Case-specific user prompt template: You will be given:

  76. [76]

    The original patch diff

  77. [77]

    remove": revert the refactoring-related changes while preserving the bug fix - If action is

    A list of refactoring assessments with recommended actions <patch> {patch_diff} </patch> <assessments> {refactoring_assessments} </assessments> Apply the recommended actions: For each refactoring assessment: - If action is "keep": leave those changes unchanged J. ACM, Vol. 1, No. 1, Article 1. Publication date: March 2026. 1:34 Zhao Tian, Zifan Zhang, Tao...