
arxiv: 2511.02922 · v2 · submitted 2025-11-04 · 💻 cs.SE

Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming

Pith reviewed 2026-05-18 00:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords code comprehension · GitHub Copilot · brownfield programming · verification behavior · generative AI · legacy code · software engineering education · programming performance

The pith

Programmers maintain comprehension of legacy code when they actively verify Copilot suggestions rather than accepting them passively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether GitHub Copilot improves both speed and understanding when graduate students add features to existing codebases. It finds clear performance gains alongside no overall gain in comprehension, plus a split where faster work hurts reverse-engineering insight but may support implementation tasks. The decisive factor turns out to be usage pattern: students who repeatedly checked the generated code against the existing system understood the codebase far better.

Core claim

In a within-subject study, participants completed feature implementation tasks in legacy code with and without Copilot. Performance improved significantly, but comprehension showed no overall change. Performance gains correlated negatively with reverse-engineering comprehension and showed a positive, non-significant trend with implementation comprehension. Behavioral analysis showed that verification loops strongly predict comprehension, with high-comprehension users verifying 4.7 times more often than low-comprehension users.

What carries the argument

Verification loops: repeated cycles in which programmers review and test Copilot-generated code against the surrounding legacy system before accepting it.

If this is right

  • GenAI tools do not inherently reduce comprehension when users adopt active review habits.
  • Passive acceptance of generated code produces speed without preserving system-level understanding.
  • Software engineering curricula should explicitly train verification skills alongside tool use.
  • Educational versions of these tools could include prompts or interfaces that encourage code review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verification pattern may matter for other generative coding tools beyond Copilot.
  • Industry onboarding programs could measure and reward verification frequency to protect long-term maintainability.
  • Tool designers might test interfaces that make verification the default next step after generation.

Load-bearing premise

The within-subject design with 15 participants fully controls for individual differences and order effects, and the chosen comprehension measures validly capture understanding of the brownfield code base without being confounded by task-specific artifacts or fatigue.

What would settle it

A follow-up experiment that directly manipulates the frequency of verification opportunities and then measures reverse-engineering and implementation scores would show whether the reported correlation is causal.

Figures

Figures reproduced from arXiv: 2511.02922 by Christopher Hundhausen, Md Istiak Hossain Shihab, Summit Haque, and Yunhan Qiao.

Figure 1: Task 1 completion time by condition. Similar to the findings of our previous study [1], participants passed an average of 7.2 tests (SD = 3.01) with Copilot and 3.9 tests (SD = 1.70) without Copilot, representing an 84% increase in performance. A Wilcoxon signed-rank test confirmed that this difference was statistically significant (W = 6.5, p = 0.001), with a large effect size (r = 0.795).
Figure 2: Total tests passed by condition.
Figure 3: Mean percentage of time spent in each activity category, by condition.
Figure 4: Mean percentage of time spent in each code writing activity, by condition, showing a shift from almost exclusively […]
Figure 5: Programming workflow networks comparing activity transitions without Copilot (left) and with Copilot (right). Node […]
Figure 7: Correlation of test passed and comprehension level.
Figure 8: Correlation of test passed and comprehension level.
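The Figure 1 text reports a Wilcoxon signed-rank test with an effect size r. The following is a minimal sketch of how such a statistic can be computed in plain Python. It assumes the normal approximation without tie or continuity correction and the convention r = |z| / sqrt(n), where n is the number of non-zero paired differences; the paper excerpt does not say which convention the authors used. The function names and the scores are hypothetical, not the study's data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(without_tool, with_tool):
    """Return (W, z, r) for paired scores.

    W is the smaller of the positive and negative rank sums.
    z uses the normal approximation (assumption: no tie correction).
    r = |z| / sqrt(n) is one common effect-size convention (assumption).
    """
    diffs = [b - a for a, b in zip(without_tool, with_tool) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    return w, z, abs(z) / sqrt(n)


# Hypothetical tests-passed counts for five participants (not the study's data).
tests_without = [3, 4, 2, 5, 4]
tests_with = [7, 6, 5, 9, 4]
w_stat, z_score, effect_r = wilcoxon_signed_rank(tests_without, tests_with)
print(f"W = {w_stat}, z = {z_score:.3f}, r = {effect_r:.3f}")
```

With so few pairs, an exact p-value is usually preferred over the normal approximation; a statistics package is the safer choice for reporting.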
Original abstract

Teaching Computer Science (CS) students how to comprehend and maintain legacy code bases is a critical challenge in software engineering education. While Generative AI (GenAI) assistants like GitHub Copilot improve task completion speed and correctness, their impact on code understanding remains unclear. We conducted a within-subject study with 15 graduate CS students completing feature implementation tasks with and without Copilot. Despite significant performance improvements, participants showed no overall comprehension improvement (p = 0.59), revealing a comprehension-performance decoupling. Further analysis uncovered a comprehension trade-off: performance gains negatively correlated with reverse engineering comprehension (ρ = -0.57, p = 0.026) but showed a positive trend with implementation comprehension (ρ = 0.50, p = 0.06). A follow-up behavioral analysis revealed that how students used Copilot determined outcomes: Engaging in verification loops in which programmers actively reviewed generated code strongly predicted comprehension (p < 0.001, r = 0.96), with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension participants. These findings suggest that GenAI tools do not inherently undermine comprehension; rather, passive consumption patterns do. This suggests a need to alter programming education to teach system-level verification skills, and the need to redesign educational GenAI tools to scaffold active cognitive engagement.
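The trade-off in the abstract is expressed as Spearman rank correlations (ρ). As a hedged illustration of what that statistic measures, the sketch below computes ρ as the Pearson correlation of average ranks. The function names and the two score lists are hypothetical; they are not the study's data and will not reproduce the reported values.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-participant values (not the study's data).
performance_gain = [5, 1, 4, 2, 6, 3]
reverse_eng_score = [2, 6, 3, 4, 1, 5]
print(f"rho = {spearman_rho(performance_gain, reverse_eng_score):.2f}")
```

Because ρ depends only on rank order, it is less sensitive to a single extreme score than Pearson's r, which matters at n = 15.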

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports a within-subject user study with 15 graduate CS students performing feature implementation tasks on a brownfield codebase, comparing conditions with and without GitHub Copilot. It finds significant performance gains with Copilot but no overall comprehension improvement (p=0.59), a comprehension-performance trade-off (negative correlation with reverse-engineering comprehension ρ=-0.57, p=0.026; positive trend with implementation comprehension ρ=0.50, p=0.06), and a strong behavioral predictor: engagement in verification loops correlates with comprehension at r=0.96 (p<0.001), with high-comprehension participants verifying 4.7 times more frequently. The authors conclude that passive Copilot use, not the tool itself, drives comprehension issues and recommend changes to education and tool design.

Significance. If the core findings hold after addressing sample-size limitations, the work is significant for software engineering education and human-AI interaction research. It provides concrete empirical evidence of a performance-comprehension decoupling and identifies verifiable usage behaviors (verification loops) as a modifiable factor rather than an inherent property of generative tools. This shifts focus from blanket tool adoption to scaffolded cognitive engagement, with direct implications for curriculum design and AI assistant interfaces.

major comments (2)
  1. [Results / Behavioral Analysis] Behavioral analysis (results section): the central claim of r=0.96 (p<0.001) between verification-loop frequency and comprehension scores rests on n=15; this magnitude is vulnerable to single-point leverage or post-hoc coding decisions. A sensitivity analysis (e.g., leave-one-out or bootstrap CI) or pre-registration of the behavioral metric is needed to establish robustness of the predictor.
  2. [Methods] Methods and comprehension instruments: the within-subject design claims to control individual differences, yet details on counterbalancing, task-order randomization, fatigue checks, and validation of the reverse-engineering vs. implementation measures (including inter-rater reliability for behavioral coding) are insufficient to rule out confounds with the outcome tasks.
minor comments (2)
  1. [Abstract] Abstract: report the exact statistical test and degrees of freedom for the p=0.59 overall comprehension result and the ρ values.
  2. [Methods] Clarify operational definition of 'verification loops' and how they were coded from interaction logs or think-aloud data.
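The sensitivity analysis requested in major comment 1 can be sketched as follows: a leave-one-out pass that recomputes Pearson's r with each participant removed, and a percentile bootstrap for a confidence interval. This is an illustrative sketch, not the authors' procedure. The function names, the fixed seed, the percentile method, and the 15 data points are all assumptions made for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_out(x, y):
    """Recompute r once per participant, with that participant dropped."""
    return [
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


def bootstrap_ci(x, y, iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r, resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < iterations:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # r is undefined when a resample is constant
        stats.append(pearson(xs, ys))
    stats.sort()
    lower = stats[round((alpha / 2) * (iterations - 1))]
    upper = stats[round((1 - alpha / 2) * (iterations - 1))]
    return lower, upper


# Hypothetical data for 15 participants (not the study's data).
verification_counts = [2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 18]
comprehension_scores = [1, 2, 1, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10]

loo = leave_one_out(verification_counts, comprehension_scores)
low, high = bootstrap_ci(verification_counts, comprehension_scores)
print(f"leave-one-out r range: {min(loo):.3f} to {max(loo):.3f}")
print(f"95% bootstrap CI: [{low:.3f}, {high:.3f}]")
```

A narrow leave-one-out range indicates that no single participant drives the correlation; a wide one would support the referee's leverage concern.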

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and revised the paper to strengthen the robustness of our behavioral findings and to provide greater methodological transparency. Below we respond point by point.

  1. Referee: [Results / Behavioral Analysis] Behavioral analysis (results section): the central claim of r=0.96 (p<0.001) between verification-loop frequency and comprehension scores rests on n=15; this magnitude is vulnerable to single-point leverage or post-hoc coding decisions. A sensitivity analysis (e.g., leave-one-out or bootstrap CI) or pre-registration of the behavioral metric is needed to establish robustness of the predictor.

    Authors: We agree that the small sample size makes the reported correlation sensitive to potential outliers or leverage points. In the revised manuscript we have added a leave-one-out sensitivity analysis together with bootstrap resampling (10,000 iterations) to compute 95% confidence intervals. The correlation remains statistically significant and above r = 0.90 in every leave-one-out iteration, with the bootstrap CI being [0.87, 0.99]. We have also clarified that the verification-loop metric was defined a priori on the basis of pilot observations and theoretical considerations of active code review, rather than being derived post hoc from the main data. revision: yes

  2. Referee: [Methods] Methods and comprehension instruments: the within-subject design claims to control individual differences, yet details on counterbalancing, task-order randomization, fatigue checks, and validation of the reverse-engineering vs. implementation measures (including inter-rater reliability for behavioral coding) are insufficient to rule out confounds with the outcome tasks.

    Authors: We appreciate the referee’s call for greater methodological detail. The revised Methods section now explicitly reports: (1) counterbalancing via a balanced Latin-square design across the two conditions; (2) randomization of task order within each condition; (3) fatigue assessment using NASA-TLX scales administered after each task together with mandatory short breaks; (4) pilot validation of the reverse-engineering and implementation comprehension instruments with five additional participants and subsequent expert review; and (5) independent behavioral coding by two researchers yielding Cohen’s κ = 0.87. These additions address potential order, fatigue, and measurement confounds. revision: yes
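The second response cites Cohen's κ as the inter-rater agreement statistic for behavioral coding. As a hedged illustration, the sketch below computes κ for two raters who each assign one label per observed episode. The function name, the label vocabulary ("verify", "accept"), and the example codings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; the formula is
        # undefined here, so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codings of eight episodes by two researchers.
coder_1 = ["verify", "accept", "verify", "verify", "accept", "accept", "verify", "accept"]
coder_2 = ["verify", "accept", "verify", "accept", "accept", "accept", "verify", "accept"]
print(f"kappa = {cohens_kappa(coder_1, coder_2):.2f}")
```

Unlike raw percent agreement, κ discounts matches that two raters would produce by chance, which is why it is the usual statistic for this kind of coding.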

Circularity Check

0 steps flagged

No circularity: empirical user study with independent statistical observations


The paper reports results from a within-subject user study involving 15 participants performing feature implementation tasks with and without Copilot. All load-bearing claims rest on directly observed behavioral metrics (verification loop counts) and comprehension scores (reverse engineering and implementation tasks), with reported statistics such as Pearson r=0.96 and subgroup frequency ratios derived from the collected data rather than from any self-referential equations or fitted parameters. No mathematical derivations, uniqueness theorems, ansatzes, or self-citation chains are invoked to justify core results; the analysis is self-contained against external benchmarks of participant performance and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions of experimental psychology and statistics rather than new free parameters or postulated entities.

axioms (1)
  • [standard math] Standard assumptions for Pearson/Spearman correlations and regression (linearity, independence, sufficient sample for p-value interpretation) hold for the reported statistics.
    Invoked when reporting ρ, r, and p values in the abstract.




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Restructure This: Using AI to Restructure Onboarding Documents to Reduce Cognitive Overload

    cs.SE 2026-05 conditional novelty 6.0

    VisDoc uses GenAI to restructure OSS onboarding documentation according to CTML principles, yielding higher task success and lower cognitive load in a small newcomer study.

  2. Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap

    cs.SE 2026-05 unverdicted novelty 5.0

    Comparative review of AI coding tool ToS shows responsibility for code quality and compliance shifted to users, with policy misalignment for autonomous agents, plus a research roadmap.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  1. [1]

    Anonymous Authors. 2025. Anonymous title

  2. [2]

    Kyle Baley and Donald Belcham. 2010. Brownfield application development in .NET. Manning

  3. [3]

    Yoav Benjamini and Daniel Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, 1165–1188

  4. [4]

    Michelle Brachman, Arielle Goldberg, Andrew Anderson, Stephanie Houde, Michael Muller, and Justin D Weisz. 2025. Towards personalized and contextualized code explanations. In Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, 120–125

  5. [5]

    Ruven Brooks. 1977. Towards a theory of the cognitive processes in computer programming. International Journal of Man-Machine Studies, 9, 6, 737–751

  6. [6]

    Roee Cates, Nadav Yunik, and Dror G Feitelson. 2021. Does code structure affect comprehension? on using and naming intermediate variables. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 118–126

  7. [7]

    Souti Chattopadhyay, Zixuan Feng, Emily Arteaga, Audrey Au, Gonzalo Ramos, Titus Barik, and Anita Sarma. 2023. Make it make sense! understanding and facilitating sensemaking in computational notebooks. arXiv preprint arXiv:2312.11431

  8. [8]

    Paul Denny, David H Smith IV, Max Fowler, James Prather, Brett A Becker, and Juho Leinonen. 2024. Explaining code with a purpose: an integrated approach for developing code comprehension and prompting skills. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, 283–289

  9. [9]

    James Dominic, Brock Tubre, Jada Houser, Charles Ritter, Deborah Kunkel, and Paige Rodeghero. 2020. Program comprehension in virtual reality. In Proceedings of the 28th International Conference on Program Comprehension, 391–395

  10. [10]

    Zixuan Feng, Reed Milewicz, Emerson Murphy-Hill, Tyler Menezes, Alexander Serebrenik, Igor Steinmacher, and Anita Sarma. 2025. Charting uncertain waters: a socio-technical framework for navigating genai’s impact on open source communities. arXiv preprint arXiv:2508.04921

  11. [11]

    GitHub. 2024. Github codespaces. Accessed on 16 October 2025. Retrieved Oct. 16, 2025 from https://github.com/features/codespaces

  12. [12]

    GitHub, Inc. 2024. GitHub Classroom. https://classroom.github.com. Accessed: 2024-10-16. (2024)

  13. [13]

    Ava Heinonen, Bettina Lehtelä, Arto Hellas, and Fabian Fagerholm. 2023. Synthesizing research on programmers’ mental models of programs, tasks and concepts—a systematic literature review. Information and Software Technology, 164, 107300

  14. [14]

    Johannes C Hofmeister, Janet Siegmund, and Daniel V Holt. 2019. Shorter identifier names take longer to comprehend. Empirical Software Engineering, 24, 1, 417–443

  15. [15]

    Instructure, Inc. 2025. Canvas Learning Management System. https://www.instructure.com/canvas. Accessed: 2025-10-16. (2025)

  16. [16]

    Anna A Ivanova, Shashank Srikant, Yotaro Sueoka, Hope H Kean, Riva Dhamala, Una-May O’reilly, Marina U Bers, and Evelina Fedorenko. 2020. Comprehension of computer code relies primarily on domain-general executive brain regions. elife, 9, e58906

  17. [17]

    Majeed Kazemitabaar, Oliver Huang, Sangho Suh, Austin Z Henley, and Tovi Grossman. 2025. Exploring the design space of cognitive engagement techniques with ai-generated code for enhanced learning. In Proceedings of the 30th International Conference on Intelligent User Interfaces, 695–714

  18. [18]

    Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Zachary Henley, Paul Denny, Michelle Craig, and Tovi Grossman. 2024. Codeaid: evaluating a classroom deployment of an llm-based programming assistant that balances student and educator needs. In Proceedings of the 2024 chi conference on human factors in computing systems, 1–20

  19. [19]

    Amy J Ko, Brad A Myers, Michael J Coblenz, and Htet Htet Aung. 2006. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on software engineering, 32, 12, 971–987

  20. [20]

    Jürgen Koenemann and Scott P Robertson. 1991. Expert problem solving strategies for program comprehension. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 125–130

  21. [21]

    Thomas D LaToza and Brad A Myers. 2010. Developers ask reachability questions. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, 185–194

  22. [22]

    Joseph Lawrance, Christopher Bogart, Margaret Burnett, Rachel Bellamy, Kyle Rector, and Scott D Fleming. 2010. How programmers debug, revisited: an information foraging theory perspective. IEEE Transactions on Software Engineering, 39, 2, 197–215

  23. [23]

    Fangjian Lei, Jiawen Liu, Shayan Noei, Ying Zou, Derek Truong, and William Alexander. 2025. Enhancing cobol code explanations: a multi-agents approach using large language models. arXiv preprint arXiv:2507.02182

  24. [24]

    Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing code explanations created by students and large language models. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, 124–130

  25. [25]

    Stanley Letovsky. 1987. Cognitive processes in program comprehension. Journal of Systems and software, 7, 4, 325–339

  26. [26]

    Omer Levy and Dror G Feitelson. 2019. Understanding large-scale software–a hierarchical view. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 283–293

  27. [27]

    Jenny T Liang, Melissa Lin, Nikitha Rao, and Brad A Myers. 2025. Prompts are programs too! understanding how developers build software containing prompts. Proceedings of the ACM on Software Engineering, 2, FSE, 1591–1614

  28. [28]

    Loom, Inc. 2025. Loom. https://www.loom.com. Accessed: 2025-10-16. (2025)

  29. [29]

    Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2023. Experiences from using code explanations generated by large language models in a web software development e-book. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 931–937

  30. [30]

    Andrei Andreevich Markov. 2006. An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains. Science in Context, 19, 4, 591–600

  31. [31]

    Microsoft Corporation. 2025. Visual Studio Code. https://code.visualstudio.com. Accessed: 2025-10-16. (2025)

  32. [32]

    Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13

  33. [33]

    Daye Nam, Ahmed Omran, Ambar Murillo, Saksham Thakur, Abner Araujo, Marcel Blistein, Alexander Frömmgen, Vincent Hellendoorn, and Satish Chandra. 2025. Prompting llms for code editing: struggles and remedies. arXiv preprint arXiv:2504.20196

  34. [34]

    Kevin KB Ng, Liyana Fauzi, Leon Leow, and Jaren Ng. 2024. Harnessing the potential of gen-ai coding assistants in public sector software development. arXiv preprint arXiv:2409.17434

  35. [35]

    Karl Pearson. 1896. Vii. mathematical contributions to the theory of evolution.—iii. regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 187, 253–318

  36. [36]

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of ai on developer productivity: evidence from github copilot. arXiv preprint arXiv:2302.06590

  37. [37]

    David Piorkowski, Austin Z Henley, Tahmid Nabi, Scott D Fleming, Christopher Scaffidi, and Margaret Burnett. 2016. Foraging and navigations, fundamentally: developers’ predictions of value and cost. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 97–108

  38. [38]

    James Prather, Brent N Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S Randrianasolo, Brett A Becker, Bailey Kimmel, Jared Wright, and Ben Briggs. 2024. The widening gap: the benefits and harms of generative ai for novice programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research-Volume 1, 469–486

  40. [40]

    Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, and Yan Chen. 2025. Assistance or disruption? exploring and evaluating the design and trade-offs of proactive ai programming support. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–21

  41. [41]

    Yunhan Qiao, Md Istiak Hossain Shihab, and Christopher Hundhausen. 2025. A systematic literature review of the use of genai assistants for code compre- hension: implications for computing education research and practice.arXiv preprint arXiv:2510.17894

  42. [42]

    Qualtrics, LLC. 2025. Qualtrics. https://www.qualtrics.com. Accessed: 2025-10-


  44. [44]

    Christian Rahe and Walid Maalej. 2025. How do programming students use generative ai? Proceedings of the ACM on Software Engineering, 2, FSE, 978–1000

  45. [45]

    Martin P Robillard, Wesley Coelho, and Gail C Murphy. 2005. How effective developers investigate source code: an exploratory study. IEEE Transactions on software engineering, 30, 12, 889–903

  46. [46]

    Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software? In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265

  47. [47]

    Nedhal A Al-Saiyd. 2017. Source code comprehension analysis in software maintenance. In 2017 2nd International Conference on Computer and Communication Systems (ICCCS). IEEE, 1–5

  48. [48]

    Ranjan Sapkota, Konstantinos I Roumeliotis, and Manoj Karkee. 2025. Vibe coding vs. agentic coding: fundamentals and practical implications of agentic ai. arXiv preprint arXiv:2505.19443

  49. [49]

    Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM conference on international computing education research-volume 1, 27–43

  50. [50]

    Teresa M Shaft and Iris Vessey. 2006. The role of cognitive fit in the relationship between software comprehension and modification. MIS quarterly, 29–55

  51. [51]

    Anshul Shah, Thomas Rexin, Anya Chernova, Gonzalo Allen-Perez, William G Griswold, and Adalbert Gerald Soosai Raj. 2025. Needles in a haystack: student struggles with working on large code bases. In Proceedings of the 2025 ACM Conference on International Computing Education Research V. 1, 27–40

  52. [52]

    Anshul Shah, Thanh Tong, Elena Tomson, Steven Shi, William G Griswold, and Adalbert Gerald Soosai Raj. 2025. Students’ program comprehension processes in a large code base. In 2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC). IEEE Computer Society, 182–193

  53. [53]

    Md Istiak Hossain Shihab, Christopher Hundhausen, Ahsun Tariq, Summit Haque, Yunhan Qiao, and Brian Mulanda. 2025. The effects of github copilot on computing students’ programming effectiveness, efficiency, and processes in brownfield programming tasks. arXiv preprint arXiv:2506.10051

  54. [54]

    Janet Siegmund, Norman Peitek, Chris Parnin, Sven Apel, Johannes Hofmeister, Christian Kästner, Andrew Begel, Anja Bethmann, and André Brechmann. 2017. Measuring neural efficiency of program comprehension. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 140–150

  55. [55]

    Jonathan Sillito, Gail C Murphy, and Kris De Volder. 2006. Questions programmers ask during software evolution tasks. In Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering, 23–34

  56. [56]

    Lucas Siqueira Rodrigues, Antonio Rueda-Toicen, and Thomas Kosch. 2025. Redesigning large language model coding assistants for software engineering education. In Mensch und Computer 2025-Workshopband. Gesellschaft für Informatik eV, 10–18420

  57. [57]

    David H Smith IV, Paul Denny, and Max Fowler. 2024. Prompting for comprehension: exploring the intersection of explain in plain english questions and prompt writing. In Proceedings of the Eleventh ACM Conference on Learning@ Scale, 39–50

  58. [58]

    Sean Stapleton, Yashmeet Gambhir, Alexander LeClair, Zachary Eberhart, Westley Weimer, Kevin Leach, and Yu Huang. 2020. A human study of comprehension and code summarization. In Proceedings of the 28th International Conference on Program Comprehension, 2–13

  59. [59]

    Margaret-Anne Storey. 2006. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal, 14, 3, 187–208

  60. [60]

    Lasang Jimba Tamang, Zeyad Alshaikh, Nisrine Ait Khayi, Priti Oli, and Vasile Rus. 2021. A comparative study of free self-explanations and socratic tutoring explanations for source code comprehension. In Proceedings of the 52nd acm technical symposium on computer science education, 219–225

  61. [61]

    Ningzhi Tang. 2024. Towards effective validation and integration of llm-generated code. In 2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 369–370

  62. [62]

    Anneliese Von Mayrhauser, A Marie Vans, and Adele E Howe. 1997. Program understanding behaviour during enhancement of large-scale software. Journal of Software Maintenance: Research and Practice, 9, 5, 299–327

  63. [63]

    Stefan Wagner and Marvin Wyrich. 2021. Code comprehension confounders: a study of intelligence and personality. IEEE Transactions on Software Engineering, 48, 12, 4789–4801

  64. [64]

    Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: a large-scale field study with professionals. IEEE Transactions on Software Engineering, 44, 10, 951–976

  65. [65]

    Zoom Video Communications, Inc. 2025. Zoom Video Conferencing. https://zoom.us. Accessed: 2025-10-16. (2025)