pith. machine review for the scientific record.

arxiv: 2603.16791 · v2 · submitted 2026-03-17 · 💻 cs.SE

Recognition: 2 theorem links · Lean Theorem

Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers


Pith reviewed 2026-05-15 09:44 UTC · model grok-4.3

classification 💻 cs.SE
keywords refactoring, novice programmers, code comprehension, cognitive load, Cognitive-Driven Development, Cyclomatic complexity, automated refactoring

The pith

Cognitively guided refactoring improves novice programmers' code comprehension by reducing control-flow complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to automatically refactor code so that novice programmers understand it better, using principles from Cognitive-Driven Development. The approach, called CDDRefactorER, constrains transformations to reduce nesting and complexity metrics without altering what the code does. Evaluations on two standard programming benchmarks (MBPP and APPS) with two AI models show large drops in failed refactorings and in unintended complexity increases. In a study with actual novices, the refactored code led to better identification of functions and easier reading of structure. If the approach holds up, novices could learn from code examples more readily without extra explanations.
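The paper's own transformation rules are not reproduced on this page. As a hypothetical illustration of the kind of constrained, behavior-preserving refactoring it describes (reduced nesting plus a direct boolean return), consider:

```python
# Hypothetical before/after in the spirit of the paper's constraints --
# not the authors' code or prompts.

def is_valid_score_before(score):
    # Nested conditionals and a verbose boolean pattern: harder for novices.
    if score is not None:
        if 0 <= score <= 100:
            return True
        else:
            return False
    else:
        return False

def is_valid_score_after(score):
    # Guard clause plus a direct boolean return: same behavior, less nesting.
    if score is None:
        return False
    return 0 <= score <= 100

# Behavior is preserved across representative inputs.
for s in (None, -1, 0, 50, 100, 101):
    assert is_valid_score_before(s) == is_valid_score_after(s)
```

The "after" version has strictly fewer decision points, which is what the complexity metrics in the paper's evaluation would register as an improvement.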

Core claim

The central claim is that cognitively guided refactoring, operationalized in CDDRefactorER, provides a practical mechanism for enhancing novice code comprehension by constraining transformations to reduce control-flow complexity while preserving behavior and structural similarity, as evidenced by reduced refactoring failures and improved human comprehension metrics.

What carries the argument

CDDRefactorER, the automated refactoring approach that applies constrained transformations from Cognitive-Driven Development to lower Cyclomatic and Cognitive complexity.

Load-bearing premise

That the specific constrained transformations reliably preserve original behavior and structural similarity while reducing cognitive load as measured by complexity metrics.
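One of the two metrics this premise leans on, Cyclomatic complexity, is mechanical to compute: decision points plus one. A minimal sketch over Python ASTs (the paper's actual measurement tooling is not specified in the visible text; production tools such as radon handle more constructs):

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe cyclomatic complexity: 1 + number of decision points.

    A minimal sketch, not the paper's instrumentation.
    """
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp,
                             ast.ExceptHandler, ast.Assert)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of `and`/`or` adds a short-circuit branch.
            decisions += len(node.values) - 1
    return decisions + 1

nested = (
    "def f(x):\n"
    "    if x is not None:\n"
    "        if 0 <= x <= 100:\n"
    "            return True\n"
    "        else:\n"
    "            return False\n"
    "    return False\n"
)
flat = (
    "def f(x):\n"
    "    if x is None:\n"
    "        return False\n"
    "    return 0 <= x <= 100\n"
)
# Flattening the nesting lowers the metric the premise relies on.
assert cyclomatic_complexity(nested) > cyclomatic_complexity(flat)
```

Whether such metric reductions track actual novice cognitive load is exactly what the human-subject study is needed to establish.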

What would settle it

Repeating the human-subject study and finding no statistically significant improvement in novice comprehension scores for the refactored code would undermine the central claim.

Figures

Figures reproduced from arXiv: 2603.16791 by Alif Al Hasan, Fariha Tanjim Shifat, Mia Mohammad Imran, Subarna Saha.

Figure 1. Examples from Reddit where novice programmers …

Figure 2. Overview of the Methodology. Boolean Returns replaces verbose conditional patterns with direct boolean expressions [69]. Descriptive Naming improves identifier clarity [59, 61], and Sequential Flow encourages chronological ordering and grouping of statements to support comprehension [64]. Each strategy is defined through explicit transformation rules and illustrated with concrete examples in the prompt. …

Figure 3. CodeBLEU similarity distributions after refactoring.

Figure 4. Original code (top), erroneous baseline refactoring …
read the original abstract

Novice programmers often struggle to comprehend code due to vague naming, deep nesting, and poor structural organization. While explanations may offer partial support, they typically do not restructure the code itself. We propose code refactoring as cognitive scaffolding, where cognitively guided refactoring automatically restructures code to improve clarity. We operationalize this in CDDRefactorER, an automated approach grounded in Cognitive-Driven Development that constrains transformations to reduce control-flow complexity while preserving behavior and structural similarity. We evaluate CDDRefactorER using two benchmark datasets (MBPP and APPS) against two models (gpt-5-nano and kimi-k2), and a controlled human-subject study with novice programmers. Across datasets and models, CDDRefactorER reduces refactoring failures by 54-71% and substantially lowers the likelihood of increased Cyclomatic and Cognitive complexity during refactoring, compared to unconstrained prompting. Results from the human study show consistent improvements in novice code comprehension, with function identification increasing by 31.3% and structural readability by 22.0%. The findings suggest that cognitively guided refactoring offers a practical and effective mechanism for enhancing novice code comprehension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that CDDRefactorER, an LLM-based automated refactoring approach grounded in Cognitive-Driven Development, constrains transformations to reduce control-flow complexity (Cyclomatic and Cognitive) while preserving behavior and structural similarity. On MBPP and APPS benchmarks it reports 54-71% fewer refactoring failures and lower rates of complexity increase versus unconstrained prompting; a controlled human study with novices reports 31.3% higher function identification and 22.0% higher structural readability.

Significance. If the behavior-preservation claim holds, the work supplies a practical, cognitively grounded mechanism for improving novice code comprehension that could be integrated into programming tools and education platforms. The use of independent public benchmarks plus a separate human study is a strength.

major comments (1)
  1. [Evaluation] Evaluation section: the central claim requires that constrained transformations preserve original semantics, yet the manuscript provides no description of post-refactoring test execution, output-equivalence checks, or structural-diff metrics on MBPP/APPS. Without these, the reported reductions in failures and complexity cannot be interpreted as evidence of safe refactoring.
minor comments (1)
  1. [Human study] The abstract and visible sections omit sample size, task design details, and statistical controls for the human study; these should be added for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment regarding the evaluation of semantic preservation below and agree that additional details are needed.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim requires that constrained transformations preserve original semantics, yet the manuscript provides no description of post-refactoring test execution, output-equivalence checks, or structural-diff metrics on MBPP/APPS. Without these, the reported reductions in failures and complexity cannot be interpreted as evidence of safe refactoring.

    Authors: We agree that the manuscript should explicitly describe how behavior preservation was verified to support the central claims. Although the CDD constraints are intended to ensure equivalence by restricting transformations to behavior-preserving operations (such as renaming and restructuring without logic changes), we acknowledge the absence of verification details. In the revised version, we will add a dedicated subsection to the Evaluation section detailing: post-refactoring execution of test cases from MBPP and APPS to confirm output equivalence; use of structural-diff metrics (e.g., AST similarity) to quantify structural preservation; and any manual or automated equivalence checks performed. This will allow the reported failure reductions and complexity improvements to be interpreted as evidence of safe refactoring. revision: yes
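The verification the rebuttal promises can be sketched generically: run both versions of a program on the benchmark's test inputs and compare outcomes. A minimal, hypothetical harness (the authors' actual protocol is not described in the visible text; `orig`, `refac`, and `broken` are illustrative functions, not paper artifacts):

```python
# Hypothetical sketch of post-refactoring output-equivalence checking,
# in the spirit of the rebuttal's promised protocol -- not the authors' harness.

def outputs_equivalent(original, refactored, test_inputs):
    """True iff both callables agree (return value, or exception type) on all inputs."""
    for args in test_inputs:
        try:
            expected = ("ok", original(*args))
        except Exception as e:
            expected = ("err", type(e))
        try:
            actual = ("ok", refactored(*args))
        except Exception as e:
            actual = ("err", type(e))
        if expected != actual:
            return False
    return True

# A flattening refactor preserves behavior...
def orig(x):
    if x >= 0:
        if x % 2 == 0:
            return True
        else:
            return False
    return False

def refac(n):
    return n >= 0 and n % 2 == 0

assert outputs_equivalent(orig, refac, [(i,) for i in range(-3, 6)])

# ...while a logic change is caught.
def broken(n):
    return n % 2 == 0  # drops the sign check

assert not outputs_equivalent(orig, broken, [(i,) for i in range(-3, 6)])
```

For MBPP/APPS this would run each dataset's bundled test cases against the pre- and post-refactoring program, which is what would let the reported failure reductions be read as evidence of safe refactoring.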

Circularity Check

0 steps flagged

Minor reliance on established CDD framework with independent benchmarks and human evaluation

full rationale

The paper grounds CDDRefactorER in the existing Cognitive-Driven Development framework and evaluates refactoring success, complexity metrics, and novice comprehension gains on external public datasets (MBPP, APPS) plus a separate controlled human study. No equations, fitted parameters, or self-citations reduce the central claims to inputs defined by the same data. Behavior preservation is enforced via prompt constraints rather than derived from the evaluation itself, so the derivation chain remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that cognitive complexity metrics can guide refactoring rules that improve human comprehension without introducing new fitted parameters or invented entities.

axioms (1)
  • domain assumption Cognitive-Driven Development principles can be operationalized as constraints on code transformations that reduce control-flow complexity while preserving behavior
    Invoked to justify the design of CDDRefactorER

pith-pipeline@v0.9.0 · 5510 in / 1110 out tokens · 24317 ms · 2026-05-15T09:44:33.064147+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  1. [1]

    2017. Code Transformation. ScienceDirect Topics, Computer Science. https://www.sciencedirect.com/topics/computer-science/code-transformation (accessed January 11, 2026)

  2. [2]

    2025. CDDRefactorER. https://chatgpt.com/g/g-6803de5d95fc81919a4cdbcb210b8200-cddrefactorgpt

  3. [3]

    2025. Replication Package. https://zenodo.org/records/18153415

  4. [4]

    Felix Adler, Gordon Fraser, Eva Grundinger, et al. 2021. Improving Readability of Scratch Programs with Search-based Refactoring. In 2021 IEEE 21st International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE Computer Society, Los Alamitos, CA, USA, 120–130

  5. [5]

    Eman Abdullah AlOmar, Mohamed Wiem Mkaouer, and Ali Ouni. 2024. Automating Source Code Refactoring in the Classroom. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1 (Portland, OR, USA) (SIGCSE 2024). Association for Computing Machinery, New York, NY, USA, 60–66

  6. [6]

    Eman Abdullah AlOmar, Luo Xu, Sofia Martinez, et al. 2025. ChatGPT for Code Refactoring: Analyzing Topics, Interaction, and Effective Prompts. 35th IEEE International Conference on Collaborative Advances in Software and Computing (CASCON) (2025)

  7. [7]

    Jacob Austin, Augustus Odena, Maxwell Nye, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  8. [8]

    Leonardo Ferreira Barbosa, Victor Hugo Pinto, Alberto Luiz Oliveira Tavares de Souza, et al. 2022. To What Extent Cognitive-Driven Development Improves Code Readability?. In Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (Helsinki, Finland) (ESEM ’22). Association for Computing Machinery, New York...

  9. [9]

    Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111

  10. [10]

    Arie Bennett and Cruz Izu. 2025. Replicating a SOLO approach to Measure Students’ Ability to Improve Code Efficiency. In Proceedings of the ACM Global on Computing Education Conference 2025 Vol 1 (Gaborone, Botswana) (CompEd 2025). Association for Computing Machinery, New York, NY, USA, 43–49

  11. [11]

    João Henrique Berssanette and Antonio Carlos de Francisco. 2021. Cognitive load theory in the context of teaching and learning computer programming: A systematic literature review. IEEE Transactions on Education 65, 3 (2021), 440–449

  12. [12]

    Teresa Busjahn, Carsten Schulte, and Andreas Busjahn. 2011. Analysis of code reading to gain more insight in program comprehension. In Proceedings of the 11th Koli Calling International Conference on Computing Education Research (Koli, Finland) (Koli Calling ’11). Association for Computing Machinery, New York, NY, USA, 1–9

  13. [13]

    G Ann Campbell. 2018. Cognitive complexity: An overview and evaluation. In Proceedings of the 2018 International Conference on Technical Debt (Gothenburg, Sweden) (TechDebt ’18). Association for Computing Machinery, New York, NY, USA, 57–58

  14. [14]

    Eduardo Carneiro Oliveira, Hieke Keuning, and Johan Jeuring. 2024. Investigating student reasoning in method-level code refactoring: A think-aloud study. In Proceedings of the 24th Koli Calling International Conference on Computing Education Research. 1–11

  15. [15]

    Eduardo Carneiro Oliveira, Hieke Keuning, and Johan Jeuring. 2025. Uncovering Behavioral Patterns in Student–LLM Conversations during Code Refactoring Tasks. In Proceedings of the 25th Koli Calling International Conference on Computing Education Research (Koli Calling ’25). Association for Computing Machinery, New York, NY, USA, Article 39, 11 pages

  16. [16]

    Gary Charness, Uri Gneezy, and Michael A Kuhn. 2012. Experimental methods: Between-subject and within-subject design. Journal of Economic Behavior & Organization 81, 1 (2012), 1–8

  17. [17]

    Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]

  18. [18]

    Norman Cliff. 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114, 3 (1993), 494

  19. [19]

    Bart Du Bois, Serge Demeyer, and Jan Verelst. 2005. Does the "Refactor to Understand" Reverse Engineering Pattern Improve Program Comprehension?. In Proceedings of the Ninth European Conference on Software Maintenance and Reengineering (CSMR ’05). IEEE Computer Society, USA, 334–343

  20. [20]

    Rodrigo Duran, Albina Zavgorodniaia, and Juha Sorva. 2022. Cognitive load theory in computing education research: A review. ACM Transactions on Computing Education (TOCE) 22, 4 (2022), 1–27

  21. [21]

    Emma Ericsson. 2023. Evaluating Similarity-Based Refactoring Recommendations. Student Paper

  22. [22]

    Matteo Esposito, Andrea Janes, Terhi Kilamo, et al. 2025. Early Career Developers’ Perceptions of Code Understandability: A Study of Complexity Metrics. IEEE Access 13 (2025), 135027–135042

  23. [23]

    Sarah Fakhoury, Yuzhan Ma, Venera Arnaoudova, et al. 2018. The effect of poor source code lexicon and readability on developers’ cognitive load. In Proceedings of the 26th Conference on Program Comprehension (Gothenburg, Sweden) (ICPC ’18). Association for Computing Machinery, New York, NY, USA, 286–296

  24. [24]

    Sarah Fakhoury, Devjeet Roy, Adnan Hassan, et al. 2019. Improving source code readability: Theory and practice. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 2–12

  25. [25]

    Zhangyin Feng, Daya Guo, Duyu Tang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, 1536–1547

  26. [26]

    Ronivaldo Ferreira, Victor Hugo Santiago C. Pinto, Cleidson R. B. de Souza, et al. 2024. Assisting Novice Developers Learning in Flutter Through Cognitive-Driven Development. In Proceedings of the 38th Brazilian Symposium on Software Engineering, SBES 2024, Curitiba, Brazil, September 30 - October 4, 2024. SBC, 367–376

  27. [27]

    Martin Fowler. 2018. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional

  28. [28]

    Lucian José Gonçales, Kleinner Farias, and Bruno C da Silva. 2021. Measuring the cognitive load of software developers: An extended Systematic Mapping Study. Information and Software Technology 136 (2021), 106563

  29. [29]

    Dan Gopstein, Jake Iannacone, Yu Yan, et al. 2017. Understanding misunderstandings in source code. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany) (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 129–139

  30. [30]

    Anthony G Greenwald. 1976. Within-subjects designs: To use or not to use? Psychological Bulletin 83, 2 (1976), 314

  31. [31]

    Gao Hao, Haytham Hijazi, João Durães, et al. 2023. On the accuracy of code complexity metrics: A neuroscience-based guideline for improvement. Frontiers in Neuroscience 16 (2023), 1065366

  32. [32]

    Alif Al Hasan, Subarna Saha, and Mia Mohammad Imran. 2026. Learning Programming in Informal Spaces: Using Emotion as a Lens to Understand Novice Struggles on r/learnprogramming. In Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET ’26). ACM, Rio de Janeiro, Brazil, 1–12

  33. [33]

    Dan Hendrycks, Steven Basart, Saurav Kadavath, et al. 2021. Measuring Coding Challenge Competence With APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J. Vanschoren and S. Yeung (Eds.), Vol. 1

  34. [34]

    Felienne Hermans and Efthimia Aivaloglou. 2016. Do code smells hamper novice programming? A controlled experiment on Scratch programs. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). IEEE Computer Society, Los Alamitos, CA, USA, 1–10

  35. [35]

    John Johnson, Sergio Lubo, Nishitha Yedla, et al. 2019. An Empirical Study Assessing Source Code Readability in Comprehension. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 513–523

  36. [36]

    Shahedul Huq Khandkar. 2009. Open coding. University of Calgary 23, 2009 (2009), 2009

  37. [37]

    Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65

  38. [38]

    Stephen MacNeil, Andrew Tran, Arto Hellas, et al. 2023. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (Toronto ON, Canada) (SIGCSE 2023). Association for Computing Machinery, New York, NY, USA, 931–937

  39. [39]

    Philomena Marfo and G.A. Okyere. 2019. The accuracy of effect-size estimates under normals and contaminated normals in meta-analysis. Heliyon 5, 6 (2019), e01838

  40. [40]

    T.J. McCabe. 1976. A Complexity Measure. IEEE Transactions on Software Engineering 2, 04 (Dec. 1976), 308–320

  41. [41]

    Flavio Medeiros, Marcio Ribeiro, Rohit Gheyi, et al. 2018. Discipline Matters: Refactoring of Preprocessor Directives in the #ifdef Hell. IEEE Transactions on Software Engineering 44, 05 (May 2018), 453–469

  42. [42]

    G. A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 2 (1956), 81–97

  43. [43]

    Rodrigo Morales, Foutse Khomh, and Giuliano Antoniol. 2020. RePOR: Mimicking humans on refactoring tasks. Are we there yet? Empirical Software Engineering 25, 4 (2020), 2960–2996

  44. [44]

    Marvin Muñoz Barón, Marvin Wyrich, and Stefan Wagner. 2020. An Empirical Validation of Cognitive Complexity as a Measure of Source Code Understandability. In Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Bari, Italy) (ESEM ’20). Association for Computing Machinery, New York, NY, USA, ...

  45. [45]

    Sara Nurollahian, Hieke Keuning, and Eliane Wiese. 2025. Teaching Well-Structured Code: A Literature Review of Instructional Approaches. In 2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE Computer Society, Los Alamitos, CA, USA, 205–216

  46. [46]

    Indranil Palit and Tushar Sharma. 2025. Reinforcement Learning vs Supervised Learning: A tug of war to generate refactored code accurately. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE ’25). Association for Computing Machinery, New York, NY, USA, 429–440

  47. [47]

    Kang-il Park, Jack Johnson, Cole S. Peterson, et al. 2024. An eye tracking study assessing source code readability rules for program comprehension. Empirical Softw. Engg. 29, 6 (Oct. 2024), 60 pages

  48. [48]

    Norman Peitek, Sven Apel, Chris Parnin, et al. 2021. Program Comprehension and Code Complexity Metrics: An fMRI Study. In Proceedings of the 43rd International Conference on Software Engineering (Madrid, Spain) (ICSE ’21). IEEE Press, NJ, USA, 524–536

  49. [49]

    Anthony Peruma, Steven Simmons, Eman Abdullah AlOmar, et al. 2022. How do I refactor this? An empirical study on refactoring trends and topics in Stack Overflow. Empirical Software Engineering 27, 1 (2022), 11

  50. [50]

    Yonnel Chen Kuang Piao, Jean Carlors Paul, Leuson Da Silva, et al. 2025. Refactoring with LLMs: Bridging Human Expertise and Machine Understanding. arXiv:2510.03914 [cs.SE]

  51. [51]

    Gustavo Pinto and Alberto de Souza. 2023. Cognitive Driven Development helps software teams to keep code units under the limit! Journal of Systems and Software 206 (2023), 111830

  52. [52]

    Victor Hugo Santiago C. Pinto and Alberto Luiz Oliveira Tavares De Souza. 2022. Effects of Cognitive-driven Development in the Early Stages of the Software Development Life Cycle. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 2: ICEIS

  53. [53]

    Victor Hugo Santiago C. Pinto, Alberto Luiz Oliveira Tavares de Souza, Yuri Matheus Barboza de Oliveira, et al. 2021. Cognitive-Driven Development: Preliminary Results on Software Refactorings. In Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering - ENASE. INSTICC, SciTePress, 92–102. doi:10.5220/00...

  54. [54]

    James Prather, Brent N Reeves, Paul Denny, et al. 2023. “It’s weird that it knows what I want”: Usability and interactions with copilot for novice programmers. ACM Transactions on Computer-Human Interaction 31, 1 (2023), 1–31

  55. [55]

    Raluca Budiu. 2023. Between-Subjects vs. Within-Subjects Study Design. https://www.nngroup.com/articles/between-within-subjects/. Accessed: 2026-01-10

  56. [56]

    Shuo Ren, Daya Guo, Shuai Lu, et al. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE]

  57. [57]

    Devjeet Roy, Sarah Fakhoury, John Lee, et al. 2020. A Model to Detect Readability Improvements in Incremental Changes. In Proceedings of the 28th International Conference on Program Comprehension (Seoul, Republic of Korea) (ICPC ’20). Association for Computing Machinery, New York, NY, USA, 25–36

  58. [58]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, et al. 2024. Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL]

  59. [59]

    Simone Scalabrino, Mario Linares-Vasquez, Denys Poshyvanyk, et al. 2016. Improving code readability models with textual features. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). IEEE Computer Society, Los Alamitos, CA, USA, 1–10

  60. [60]

    Sandro Schulze, Jörg Liebig, Janet Siegmund, et al. 2013. Does the discipline of preprocessor annotations matter? A controlled experiment. In Proceedings of the 12th International Conference on Generative Programming: Concepts & Experiences (Indianapolis, Indiana, USA) (GPCE ’13). Association for Computing Machinery, New York, NY, USA, 65–74

  61. [61]

    Giulia Sellitto, Emanuele Iannone, Zadia Codabux, et al. 2022. Toward Understanding the Impact of Refactoring on Program Comprehension. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA, March 15-18, 2022. IEEE, 731–742

  62. [62]

    Janet Siegmund, Norman Peitek, Chris Parnin, et al. 2017. Measuring neural efficiency of program comprehension. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany) (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 140–150

  63. [63]

    José Aldo Silva Da Costa and Rohit Gheyi. 2023. Evaluating the Code Comprehension of Novices with Eye Tracking. In Proceedings of the XXII Brazilian Symposium on Software Quality (Brasília, Brazil) (SBQS ’23). Association for Computing Machinery, New York, NY, USA, 332–341

  64. [64]

    John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285

  65. [65]

    Alberto Luiz Oliveira Tavares de Souza and Victor Hugo Santiago Costa Pinto

  66. [66]

    Toward a Definition of Cognitive-Driven Development. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 776–778

  67. [67]

    Kimi Team, Yifan Bai, Yiping Bao, et al. 2025. Kimi K2: Open Agentic Intelligence. arXiv:2507.20534 [cs.LG]

  68. [68]

    Peeratham Techapalokul and Eli Tilevich. 2019. Position: Manual Refactoring (by Novice Programmers) Considered Harmful. In 2019 IEEE Blocks and Beyond Workshop (B&B). IEEE Computer Society, Los Alamitos, CA, USA, 79–80

  69. [69]

    Garry L White and Marcos P Sivitanides. 2002. A theory of the relationships between cognitive requirements of computer programming languages and programmers’ cognitive characteristics. Journal of Information Systems Education 13, 1 (2002), 59–66

  70. [70]

    Eliane S. Wiese, Anna N. Rafferty, and Armando Fox. 2019. Linking code readability, structure, and comprehension among novices: it’s complicated. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering Education and Training (Montreal, Quebec, Canada) (ICSE-SEET ’19). IEEE Press, NJ, USA, 84–94

  71. [71]

    Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1, 6 (1945), 80–83

  72. [72]

    Yisen Xu, Feng Lin, Jinqiu Yang, et al. 2025. MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration. arXiv:2503.14340 [cs.SE]

  73. [73]

    Albert Ziegler, Eirini Kalliamvakou, X Alice Li, et al. 2024. Measuring GitHub Copilot’s impact on productivity. Commun. ACM 67, 3 (2024), 54–63