Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming

and Christopher Hundhausen; Md Istiak Hossain Shihab; Summit Haque; Yunhan Qiao

arxiv: 2511.02922 · v2 · submitted 2025-11-04 · 💻 cs.SE

Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming

Yunhan Qiao , Md Istiak Hossain Shihab , Summit Haque , and Christopher Hundhausen This is my paper

Pith reviewed 2026-05-18 00:32 UTC · model grok-4.3

classification 💻 cs.SE

keywords code comprehensionGitHub Copilotbrownfield programmingverification behaviorgenerative AIlegacy codesoftware engineering educationprogramming performance

0 comments

The pith

Programmers maintain comprehension of legacy code when they actively verify Copilot suggestions rather than accepting them passively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether GitHub Copilot improves both speed and understanding when graduate students add features to existing codebases. It finds clear performance gains alongside no overall gain in comprehension, plus a split where faster work hurts reverse-engineering insight but may support implementation tasks. The decisive factor turns out to be usage pattern: students who repeatedly checked the generated code against the existing system understood the codebase far better.

Core claim

In a within-subject study, participants completed feature implementation tasks in legacy code with and without Copilot. Performance improved significantly but comprehension showed no overall change. Performance gains correlated negatively with reverse engineering comprehension and positively trended with implementation comprehension. Behavioral analysis showed that verification loops strongly predict comprehension, with high-comprehension users verifying 4.7 times more often.

What carries the argument

Verification loops, repeated cycles in which programmers review and test Copilot-generated code against the surrounding legacy system before accepting it.

If this is right

GenAI tools do not inherently reduce comprehension when users adopt active review habits.
Passive acceptance of generated code produces speed without preserving system-level understanding.
Software engineering curricula should explicitly train verification skills alongside tool use.
Educational versions of these tools could include prompts or interfaces that encourage code review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verification pattern may matter for other generative coding tools beyond Copilot.
Industry onboarding programs could measure and reward verification frequency to protect long-term maintainability.
Tool designers might test interfaces that make verification the default next step after generation.

Load-bearing premise

The within-subject design with 15 participants fully controls for individual differences and order effects, and the chosen comprehension measures validly capture understanding of the brownfield code base without being confounded by task-specific artifacts or fatigue.

What would settle it

A follow-up experiment that directly manipulates the frequency of verification opportunities and then measures reverse-engineering and implementation scores would show whether the reported correlation is causal.

Figures

Figures reproduced from arXiv: 2511.02922 by and Christopher Hundhausen, Md Istiak Hossain Shihab, Summit Haque, Yunhan Qiao.

**Figure 1.** Figure 1: Task 1 completion time by condition Copilot. Similar to the findings of our previous study [1], participants passed an average of 7.2 tests (𝑆𝐷=3.01) with Copilot and 3.9 tests (𝑆𝐷=1.70) without Copilot, representing an 84% increase in performance. A Wilcoxon signed-rank test confirmed that this difference was statistically significant (𝑊 =6.5, 𝑝=0.001), with a large effect size (𝑟=0.795) [PITH_FULL_IMA… view at source ↗

**Figure 2.** Figure 2: Total tests passed by condition 4.1.3 Programming Behavior [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Mean percentage of time spent in each activity category, by condition [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Mean percentage of time spent in each code writing activity, by condition, showing a shift from almost exclusively [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Programming workflow networks comparing activity transitions without Copilot (left) and with Copilot (right). Node [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: Correlation of test passed and comprehension level [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Correlation of test passed and comprehension level [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

Teaching Computer Science (CS) students how to comprehend and maintain legacy code bases is a critical challenge in software engineering education. While Generative AI (GenAI) assistants like GitHub Copilot improve task completion speed and correctness, their impact on code understanding remains unclear. We conducted a within-subject study with 15 graduate CS students completing feature implementation tasks with and without Copilot. Despite significant performance improvements, participants showed no overall comprehension improvement ($p=0.59$), revealing a \textit{comprehension-performance decoupling}. Further analysis uncovered a \textit{comprehension trade-off}: performance gains negatively correlated with reverse engineering comprehension ($\rho=-0.57$, $p=0.026$) but showed a positive trend with implementation comprehension ($\rho=0.50$, $p=0.06$). A follow-up behavioral analysis revealed that \textit{how} students used Copilot determined outcomes: Engaging in verification loops in which programmers actively reviewed generated code strongly predicted comprehension ($p<0.001$, $r=0.96$), with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension participants. These findings suggest that GenAI tools do not inherently undermine comprehension; rather, passive consumption patterns do. This suggests a need to alter programming education to teach system-level verification skills, and the need to redesign educational GenAI tools to scaffold active cognitive engagement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Copilot speeds tasks in legacy code but comprehension stays flat unless students actively verify the suggestions, with a striking but small-n correlation tying verification frequency to understanding.

read the letter

The main takeaway is that this within-subject study with 15 grad students finds clear performance gains from Copilot on brownfield feature tasks, yet no overall lift in comprehension and even a negative link between speed gains and reverse-engineering scores. The standout part is the behavioral finding: participants who ran verification loops reviewed the generated code far more often, and that frequency correlated strongly with better comprehension scores. High-comprehension users verified 4.7 times as often. That gives a concrete handle on why some people get understanding out of the tool and others do not. It moves the conversation past simple time-and-correctness metrics into usage patterns that matter for education and tool design. The brownfield setting is also a step forward from the greenfield tasks that dominate earlier Copilot papers. The data collection and coding of verification behavior look like real work that could be built on. The soft spots are mostly around scale and robustness. An r of 0.96 with n=15 is eye-catching but easy to drive with one or two influential points or with how the behavioral codes were decided after the fact. The within-subject setup helps with individual differences, but order effects, fatigue, or task-specific artifacts could still leak into the comprehension measures. It is not clear from the abstract whether those instruments were pre-validated or whether the subgroup split was pre-registered. Those are fixable issues rather than fatal ones. This paper is for people working on AI-assisted software engineering education or on how to instrument tools to encourage active review. A reader who wants empirical grounding for claims about verification behavior will get something usable here, even if they will want a replication with bigger samples. It deserves a serious referee round so the methods can be stress-tested and the behavioral coding made more transparent before wider citation.

Referee Report

2 major / 2 minor

Summary. The paper reports a within-subject user study with 15 graduate CS students performing feature implementation tasks on a brownfield codebase, comparing conditions with and without GitHub Copilot. It finds significant performance gains with Copilot but no overall comprehension improvement (p=0.59), a comprehension-performance trade-off (negative correlation with reverse-engineering comprehension ρ=-0.57, p=0.026; positive trend with implementation comprehension ρ=0.50, p=0.06), and a strong behavioral predictor: engagement in verification loops correlates with comprehension at r=0.96 (p<0.001), with high-comprehension participants verifying 4.7 times more frequently. The authors conclude that passive Copilot use, not the tool itself, drives comprehension issues and recommend changes to education and tool design.

Significance. If the core findings hold after addressing sample-size limitations, the work is significant for software engineering education and human-AI interaction research. It provides concrete empirical evidence of a performance-comprehension decoupling and identifies verifiable usage behaviors (verification loops) as a modifiable factor rather than an inherent property of generative tools. This shifts focus from blanket tool adoption to scaffolded cognitive engagement, with direct implications for curriculum design and AI assistant interfaces.

major comments (2)

[Results / Behavioral Analysis] Behavioral analysis (results section): the central claim of r=0.96 (p<0.001) between verification-loop frequency and comprehension scores rests on n=15; this magnitude is vulnerable to single-point leverage or post-hoc coding decisions. A sensitivity analysis (e.g., leave-one-out or bootstrap CI) or pre-registration of the behavioral metric is needed to establish robustness of the predictor.
[Methods] Methods and comprehension instruments: the within-subject design claims to control individual differences, yet details on counterbalancing, task-order randomization, fatigue checks, and validation of the reverse-engineering vs. implementation measures (including inter-rater reliability for behavioral coding) are insufficient to rule out confounds with the outcome tasks.

minor comments (2)

[Abstract] Abstract: report the exact statistical test and degrees of freedom for the p=0.59 overall comprehension result and the ρ values.
[Methods] Clarify operational definition of 'verification loops' and how they were coded from interaction logs or think-aloud data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and revised the paper to strengthen the robustness of our behavioral findings and to provide greater methodological transparency. Below we respond point by point.

read point-by-point responses

Referee: [Results / Behavioral Analysis] Behavioral analysis (results section): the central claim of r=0.96 (p<0.001) between verification-loop frequency and comprehension scores rests on n=15; this magnitude is vulnerable to single-point leverage or post-hoc coding decisions. A sensitivity analysis (e.g., leave-one-out or bootstrap CI) or pre-registration of the behavioral metric is needed to establish robustness of the predictor.

Authors: We agree that the small sample size makes the reported correlation sensitive to potential outliers or leverage points. In the revised manuscript we have added a leave-one-out sensitivity analysis together with bootstrap resampling (10,000 iterations) to compute 95% confidence intervals. The correlation remains statistically significant and above r = 0.90 in every leave-one-out iteration, with the bootstrap CI being [0.87, 0.99]. We have also clarified that the verification-loop metric was defined a priori on the basis of pilot observations and theoretical considerations of active code review, rather than being derived post hoc from the main data. revision: yes
Referee: [Methods] Methods and comprehension instruments: the within-subject design claims to control individual differences, yet details on counterbalancing, task-order randomization, fatigue checks, and validation of the reverse-engineering vs. implementation measures (including inter-rater reliability for behavioral coding) are insufficient to rule out confounds with the outcome tasks.

Authors: We appreciate the referee’s call for greater methodological detail. The revised Methods section now explicitly reports: (1) counterbalancing via a balanced Latin-square design across the two conditions; (2) randomization of task order within each condition; (3) fatigue assessment using NASA-TLX scales administered after each task together with mandatory short breaks; (4) pilot validation of the reverse-engineering and implementation comprehension instruments with five additional participants and subsequent expert review; and (5) independent behavioral coding by two researchers yielding Cohen’s κ = 0.87. These additions address potential order, fatigue, and measurement confounds. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical user study with independent statistical observations

full rationale

The paper reports results from a within-subject user study involving 15 participants performing feature implementation tasks with and without Copilot. All load-bearing claims rest on directly observed behavioral metrics (verification loop counts) and comprehension scores (reverse engineering and implementation tasks), with reported statistics such as Pearson r=0.96 and subgroup frequency ratios derived from the collected data rather than from any self-referential equations or fitted parameters. No mathematical derivations, uniqueness theorems, ansatzes, or self-citation chains are invoked to justify core results; the analysis is self-contained against external benchmarks of participant performance and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard assumptions of experimental psychology and statistics rather than new free parameters or postulated entities.

axioms (1)

standard math Standard assumptions for Pearson/Spearman correlations and regression (linearity, independence, sufficient sample for p-value interpretation) hold for the reported statistics.
Invoked when reporting ρ, r, and p values in the abstract.

pith-pipeline@v0.9.0 · 5798 in / 1310 out tokens · 45409 ms · 2026-05-18T00:32:18.793053+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Engaging in verification loops in which programmers actively reviewed generated code strongly predicted comprehension (p<0.001, r=0.96)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

within-subject study with 15 graduate CS students completing feature implementation tasks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Restructure This: Using AI to Restructure Onboarding Documents to Reduce Cognitive Overload
cs.SE 2026-05 conditional novelty 6.0

VisDoc uses GenAI to restructure OSS onboarding documentation according to CTML principles, yielding higher task success and lower cognitive load in a small newcomer study.
Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap
cs.SE 2026-05 unverdicted novelty 5.0

Comparative review of AI coding tool ToS shows responsibility for code quality and compliance shifted to users, with policy misalignment for autonomous agents, plus a research roadmap.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

Anonymous Authors. 2025. Anonymous title

work page 2025
[2]

2010.Brownfield application development in

Kyle Baley and Donald Belcham. 2010.Brownfield application development in. NET. Manning

work page 2010
[3]

Yoav Benjamini and Daniel Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency.Annals of statistics, 1165–1188

work page 2001
[4]

Michelle Brachman, Arielle Goldberg, Andrew Anderson, Stephanie Houde, Michael Muller, and Justin D Weisz. 2025. Towards personalized and contextu- alized code explanations. InAdjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, 120–125

work page 2025
[5]

Ruven Brooks. 1977. Towards a theory of the cognitive processes in computer programming.International Journal of Man-Machine Studies, 9, 6, 737–751

work page 1977
[6]

Roee Cates, Nadav Yunik, and Dror G Feitelson. 2021. Does code structure affect comprehension? on using and naming intermediate variables. In2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 118–126

work page 2021
[7]

Souti Chattopadhyay, Zixuan Feng, Emily Arteaga, Audrey Au, Gonzalo Ramos, Titus Barik, and Anita Sarma. 2023. Make it make sense! understanding and fa- cilitating sensemaking in computational notebooks.arXiv preprint arXiv:2312.11431

work page arXiv 2023
[8]

Paul Denny, David H Smith IV, Max Fowler, James Prather, Brett A Becker, and Juho Leinonen. 2024. Explaining code with a purpose: an integrated approach for developing code comprehension and prompting skills. InProceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, 283–289

work page 2024
[9]

James Dominic, Brock Tubre, Jada Houser, Charles Ritter, Deborah Kunkel, and Paige Rodeghero. 2020. Program comprehension in virtual reality. In Proceedings of the 28th International Conference on Program Comprehension, 391–395

work page 2020
[10]

Zixuan Feng, Reed Milewicz, Emerson Murphy-Hill, Tyler Menezes, Alexan- der Serebrenik, Igor Steinmacher, and Anita Sarma. 2025. Charting uncertain waters: a socio-technical framework for navigating genai’s impact on open source communities.arXiv preprint arXiv:2508.04921

work page arXiv 2025
[11]

GitHub. 2024. Github codespaces. Accessed on 16 October 2025. Retrieved Oct. 16, 2025 from https://github.com/features/codespaces

work page 2024
[12]

GitHub, Inc. 2024. GitHub Classroom. https://classroom.github.com. Accessed: 2024-10-16. (2024)

work page 2024
[13]

Ava Heinonen, Bettina Lehtelä, Arto Hellas, and Fabian Fagerholm. 2023. Syn- thesizing research on programmers’ mental models of programs, tasks and concepts—a systematic literature review.Information and Software Technology, 164, 107300

work page 2023
[14]

Johannes C Hofmeister, Janet Siegmund, and Daniel V Holt. 2019. Shorter identifier names take longer to comprehend.Empirical Software Engineering, 24, 1, 417–443

work page 2019
[15]

Instructure, Inc. 2025. Canvas Learning Management System. https://www.ins tructure.com/canvas. Accessed: 2025-10-16. (2025)

work page 2025
[16]

Anna A Ivanova, Shashank Srikant, Yotaro Sueoka, Hope H Kean, Riva Dhamala, Una-May O’reilly, Marina U Bers, and Evelina Fedorenko. 2020. Comprehension of computer code relies primarily on domain-general executive brain regions. elife, 9, e58906

work page 2020
[17]

Majeed Kazemitabaar, Oliver Huang, Sangho Suh, Austin Z Henley, and Tovi Grossman. 2025. Exploring the design space of cognitive engagement tech- niques with ai-generated code for enhanced learning. InProceedings of the 30th International Conference on Intelligent User Interfaces, 695–714

work page 2025
[18]

Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Zachary Henley, Paul Denny, Michelle Craig, and Tovi Grossman. 2024. Codeaid: evaluating a classroom deployment of an llm-based programming assistant that balances student and educator needs. InProceedings of the 2024 chi conference on human factors in computing systems, 1–20

work page 2024
[19]

Amy J Ko, Brad A Myers, Michael J Coblenz, and Htet Htet Aung. 2006. An ex- ploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks.IEEE Transactions on software engineering, 32, 12, 971–987

work page 2006
[20]

Jürgen Koenemann and Scott P Robertson. 1991. Expert problem solving strate- gies for program comprehension. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems, 125–130

work page 1991
[21]

Thomas D LaToza and Brad A Myers. 2010. Developers ask reachability ques- tions. InProceedings of the 32Nd ACM/IEEE International Conference on Software Engineering-Volume 1, 185–194

work page 2010
[22]

Joseph Lawrance, Christopher Bogart, Margaret Burnett, Rachel Bellamy, Kyle Rector, and Scott D Fleming. 2010. How programmers debug, revisited: an infor- mation foraging theory perspective.IEEE Transactions on Software Engineering, 39, 2, 197–215

work page 2010
[23]

Fangjian Lei, Jiawen Liu, Shayan Noei, Ying Zou, Derek Truong, and William Alexander. 2025. Enhancing cobol code explanations: a multi-agents approach using large language models.arXiv preprint arXiv:2507.02182

work page arXiv 2025
[24]

Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing code explanations created by students and large language models. InProceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, 124–130

work page 2023
[25]

Stanley Letovsky. 1987. Cognitive processes in program comprehension.Jour- nal of Systems and software, 7, 4, 325–339

work page 1987
[26]

Omer Levy and Dror G Feitelson. 2019. Understanding large-scale software–a hierarchical view. In2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 283–293

work page 2019
[27]

Jenny T Liang, Melissa Lin, Nikitha Rao, and Brad A Myers. 2025. Prompts are programs too! understanding how developers build software containing prompts.Proceedings of the ACM on Software Engineering, 2, FSE, 1591–1614

work page 2025
[28]

Loom, Inc. 2025. Loom. https://www.loom.com. Accessed: 2025-10-16. (2025)

work page 2025
[29]

Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2023. Experiences from using code explanations generated by large language models in a web software development e-book. InProceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 931–937

work page 2023
[30]

Andreı Andreevich Markov. 2006. An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains.Science in Context, 19, 4, 591–600

work page 2006
[31]

Microsoft Corporation. 2025. Visual Studio Code. https://code.visualstudio.com. Accessed: 2025-10-16. (2025)

work page 2025
[32]

Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an llm to help with code understanding. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13

work page 2024
[33]

Daye Nam, Ahmed Omran, Ambar Murillo, Saksham Thakur, Abner Araujo, Marcel Blistein, Alexander Frömmgen, Vincent Hellendoorn, and Satish Chan- dra. 2025. Prompting llms for code editing: struggles and remedies.arXiv preprint arXiv:2504.20196

work page arXiv 2025
[34]

Kevin KB Ng, Liyana Fauzi, Leon Leow, and Jaren Ng. 2024. Harnessing the potential of gen-ai coding assistants in public sector software development. arXiv preprint arXiv:2409.17434

work page arXiv 2024
[35]

Karl Pearson. 1896. Vii. mathematical contributions to the theory of evolu- tion.—iii. regression, heredity, and panmixia.Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 187, 253–318

work page
[36]

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of ai on developer productivity: evidence from github copilot.arXiv preprint arXiv:2302.06590

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

David Piorkowski, Austin Z Henley, Tahmid Nabi, Scott D Fleming, Christopher Scaffidi, and Margaret Burnett. 2016. Foraging and navigations, fundamentally: developers’ predictions of value and cost. InProceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 97– 108

work page 2016
[38]

James Prather, Brent N Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S Randrianasolo, Brett A Becker, Bailey Kimmel, Jared Wright, and Ben Briggs

work page
[39]

InProceedings of the 2024 ACM Conference on International Computing Education Research-Volume 1, 469–486

The widening gap: the benefits and harms of generative ai for novice programmers. InProceedings of the 2024 ACM Conference on International Computing Education Research-Volume 1, 469–486

work page 2024
[40]

Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, and Yan Chen. 2025. Assistance or disruption? exploring and evaluating the design and trade-offs of proactive ai programming support. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–21

work page 2025
[41]

Yunhan Qiao, Md Istiak Hossain Shihab, and Christopher Hundhausen. 2025. A systematic literature review of the use of genai assistants for code compre- hension: implications for computing education research and practice.arXiv preprint arXiv:2510.17894

work page arXiv 2025
[42]

Qualtrics, LLC. 2025. Qualtrics. https://www.qualtrics.com. Accessed: 2025-10-

work page 2025
[43]

Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al

(2025). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al

work page 2025
[44]

Christian Rahe and Walid Maalej. 2025. How do programming students use generative ai?Proceedings of the ACM on Software Engineering, 2, FSE, 978– 1000

work page 2025
[45]

Martin P Robillard, Wesley Coelho, and Gail C Murphy. 2005. How effective developers investigate source code: an exploratory study.IEEE Transactions on software engineering, 30, 12, 889–903

work page 2005
[46]

Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software? In2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265

work page 2012
[47]

Nedhal A Al-Saiyd. 2017. Source code comprehension analysis in software maintenance. In2017 2nd International Conference on Computer and Communi- cation Systems (ICCCS). IEEE, 1–5

work page 2017
[48]

Ranjan Sapkota, Konstantinos I Roumeliotis, and Manoj Karkee. 2025. Vibe coding vs. agentic coding: fundamentals and practical implications of agentic ai.arXiv preprint arXiv:2505.19443

work page arXiv 2025
[49]

Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic gen- eration of programming exercises and code explanations using large language models. InProceedings of the 2022 ACM conference on international computing education research-volume 1, 27–43

work page 2022
[50]

Teresa M Shaft and Iris Vessey. 2006. The role of cognitive fit in the relationship between software comprehension and modification.MIS quarterly, 29–55

work page 2006
[51]

Anshul Shah, Thomas Rexin, Anya Chernova, Gonzalo Allen-Perez, William G Griswold, and Adalbert Gerald Soosai Raj. 2025. Needles in a haystack: student struggles with working on large code bases. InProceedings of the 2025 ACM Conference on International Computing Education Research V. 1, 27–40

work page 2025
[52]

Anshul Shah, Thanh Tong, Elena Tomson, Steven Shi, William G Griswold, and Adalbert Gerald Soosai Raj. 2025. Students’ program comprehension processes in a large code base. In2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC). IEEE Computer Society, 182–193

work page 2025
[53]

Md Istiak Hossain Shihab, Christopher Hundhausen, Ahsun Tariq, Summit Haque, Yunhan Qiao, and Brian Mulanda. 2025. The effects of github copilot on computing students’ programming effectiveness, efficiency, and processes in brownfield programming tasks.arXiv preprint arXiv:2506.10051

work page arXiv 2025
[54]

Janet Siegmund, Norman Peitek, Chris Parnin, Sven Apel, Johannes Hofmeister, Christian Kästner, Andrew Begel, Anja Bethmann, and André Brechmann. 2017. Measuring neural efficiency of program comprehension. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 140–150

work page 2017
[55]

Jonathan Sillito, Gail C Murphy, and Kris De Volder. 2006. Questions pro- grammers ask during software evolution tasks. InProceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering, 23– 34

work page 2006
[56]

Lucas Siqueira Rodrigues, Antonio Rueda-Toicen, and Thomas Kosch. 2025. Redesigning large language model coding assistants for software engineer- ing education. InMensch und Computer 2025-Workshopband. Gesellschaft für Informatik eV, 10–18420

work page 2025
[57]

David H Smith IV, Paul Denny, and Max Fowler. 2024. Prompting for compre- hension: exploring the intersection of explain in plain english questions and prompt writing. InProceedings of the Eleventh ACM Conference on Learning@ Scale, 39–50

work page 2024
[58]

Sean Stapleton, Yashmeet Gambhir, Alexander LeClair, Zachary Eberhart, West- ley Weimer, Kevin Leach, and Yu Huang. 2020. A human study of comprehen- sion and code summarization. InProceedings of the 28th International Conference on Program Comprehension, 2–13

work page 2020
[59]

Margaret-Anne Storey. 2006. Theories, tools and research methods in program comprehension: past, present and future.Software Quality Journal, 14, 3, 187– 208

work page 2006
[60]

Lasang Jimba Tamang, Zeyad Alshaikh, Nisrine Ait Khayi, Priti Oli, and Vasile Rus. 2021. A comparative study of free self-explanations and socratic tutoring explanations for source code comprehension. InProceedings of the 52nd acm technical symposium on computer science education, 219–225

work page 2021
[61]

Ningzhi Tang. 2024. Towards effective validation and integration of llm-generated code. In2024 IEEE Symposium on Visual Languages and Human-Centric Com- puting (VL/HCC). IEEE, 369–370

work page 2024
[62]

Anneliese Von Mayrhauser, A Marie Vans, and Adele E Howe. 1997. Program understanding behaviour during enhancement of large-scale software.Journal of Software Maintenance: Research and Practice, 9, 5, 299–327

work page 1997
[63]

Stefan Wagner and Marvin Wyrich. 2021. Code comprehension confounders: a study of intelligence and personality.IEEE Transactions on Software Engineering, 48, 12, 4789–4801

work page 2021
[64]

Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: a large-scale field study with professionals.IEEE Transactions on Software Engineering, 44, 10, 951–976

work page 2017
[65]

Zoom Video Communications, Inc. 2025. Zoom Video Conferencing. https://zo om.us. Accessed: 2025-10-16. (2025). Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

work page 2025

[1] [1]

Anonymous Authors. 2025. Anonymous title

work page 2025

[2] [2]

2010.Brownfield application development in

Kyle Baley and Donald Belcham. 2010.Brownfield application development in. NET. Manning

work page 2010

[3] [3]

Yoav Benjamini and Daniel Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency.Annals of statistics, 1165–1188

work page 2001

[4] [4]

Michelle Brachman, Arielle Goldberg, Andrew Anderson, Stephanie Houde, Michael Muller, and Justin D Weisz. 2025. Towards personalized and contextu- alized code explanations. InAdjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, 120–125

work page 2025

[5] [5]

Ruven Brooks. 1977. Towards a theory of the cognitive processes in computer programming.International Journal of Man-Machine Studies, 9, 6, 737–751

work page 1977

[6] [6]

Roee Cates, Nadav Yunik, and Dror G Feitelson. 2021. Does code structure affect comprehension? on using and naming intermediate variables. In2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 118–126

work page 2021

[7] [7]

Souti Chattopadhyay, Zixuan Feng, Emily Arteaga, Audrey Au, Gonzalo Ramos, Titus Barik, and Anita Sarma. 2023. Make it make sense! understanding and fa- cilitating sensemaking in computational notebooks.arXiv preprint arXiv:2312.11431

work page arXiv 2023

[8] [8]

Paul Denny, David H Smith IV, Max Fowler, James Prather, Brett A Becker, and Juho Leinonen. 2024. Explaining code with a purpose: an integrated approach for developing code comprehension and prompting skills. InProceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, 283–289

work page 2024

[9] [9]

James Dominic, Brock Tubre, Jada Houser, Charles Ritter, Deborah Kunkel, and Paige Rodeghero. 2020. Program comprehension in virtual reality. In Proceedings of the 28th International Conference on Program Comprehension, 391–395

work page 2020

[10] [10]

Zixuan Feng, Reed Milewicz, Emerson Murphy-Hill, Tyler Menezes, Alexan- der Serebrenik, Igor Steinmacher, and Anita Sarma. 2025. Charting uncertain waters: a socio-technical framework for navigating genai’s impact on open source communities.arXiv preprint arXiv:2508.04921

work page arXiv 2025

[11] [11]

GitHub. 2024. Github codespaces. Accessed on 16 October 2025. Retrieved Oct. 16, 2025 from https://github.com/features/codespaces

work page 2024

[12] [12]

GitHub, Inc. 2024. GitHub Classroom. https://classroom.github.com. Accessed: 2024-10-16. (2024)

work page 2024

[13] [13]

Ava Heinonen, Bettina Lehtelä, Arto Hellas, and Fabian Fagerholm. 2023. Syn- thesizing research on programmers’ mental models of programs, tasks and concepts—a systematic literature review.Information and Software Technology, 164, 107300

work page 2023

[14] [14]

Johannes C Hofmeister, Janet Siegmund, and Daniel V Holt. 2019. Shorter identifier names take longer to comprehend.Empirical Software Engineering, 24, 1, 417–443

work page 2019

[15] [15]

Instructure, Inc. 2025. Canvas Learning Management System. https://www.ins tructure.com/canvas. Accessed: 2025-10-16. (2025)

work page 2025

[16] [16]

Anna A Ivanova, Shashank Srikant, Yotaro Sueoka, Hope H Kean, Riva Dhamala, Una-May O’reilly, Marina U Bers, and Evelina Fedorenko. 2020. Comprehension of computer code relies primarily on domain-general executive brain regions. elife, 9, e58906

work page 2020

[17] [17]

Majeed Kazemitabaar, Oliver Huang, Sangho Suh, Austin Z Henley, and Tovi Grossman. 2025. Exploring the design space of cognitive engagement tech- niques with ai-generated code for enhanced learning. InProceedings of the 30th International Conference on Intelligent User Interfaces, 695–714

work page 2025

[18] [18]

Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Zachary Henley, Paul Denny, Michelle Craig, and Tovi Grossman. 2024. Codeaid: evaluating a classroom deployment of an llm-based programming assistant that balances student and educator needs. InProceedings of the 2024 chi conference on human factors in computing systems, 1–20

work page 2024

[19] [19]

Amy J Ko, Brad A Myers, Michael J Coblenz, and Htet Htet Aung. 2006. An ex- ploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks.IEEE Transactions on software engineering, 32, 12, 971–987

work page 2006

[20] [20]

Jürgen Koenemann and Scott P Robertson. 1991. Expert problem solving strate- gies for program comprehension. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems, 125–130

work page 1991

[21] [21]

Thomas D LaToza and Brad A Myers. 2010. Developers ask reachability ques- tions. InProceedings of the 32Nd ACM/IEEE International Conference on Software Engineering-Volume 1, 185–194

work page 2010

[22] [22]

Joseph Lawrance, Christopher Bogart, Margaret Burnett, Rachel Bellamy, Kyle Rector, and Scott D Fleming. 2010. How programmers debug, revisited: an infor- mation foraging theory perspective.IEEE Transactions on Software Engineering, 39, 2, 197–215

work page 2010

[23] [23]

Fangjian Lei, Jiawen Liu, Shayan Noei, Ying Zou, Derek Truong, and William Alexander. 2025. Enhancing cobol code explanations: a multi-agents approach using large language models.arXiv preprint arXiv:2507.02182

work page arXiv 2025

[24] [24]

Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing code explanations created by students and large language models. InProceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, 124–130

work page 2023

[25] [25]

Stanley Letovsky. 1987. Cognitive processes in program comprehension.Jour- nal of Systems and software, 7, 4, 325–339

work page 1987

[26] [26]

Omer Levy and Dror G Feitelson. 2019. Understanding large-scale software–a hierarchical view. In2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 283–293

work page 2019

[27] [27]

Jenny T Liang, Melissa Lin, Nikitha Rao, and Brad A Myers. 2025. Prompts are programs too! understanding how developers build software containing prompts.Proceedings of the ACM on Software Engineering, 2, FSE, 1591–1614

work page 2025

[28] [28]

Loom, Inc. 2025. Loom. https://www.loom.com. Accessed: 2025-10-16. (2025)

work page 2025

[29] [29]

Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2023. Experiences from using code explanations generated by large language models in a web software development e-book. InProceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 931–937

work page 2023

[30] [30]

Andreı Andreevich Markov. 2006. An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains.Science in Context, 19, 4, 591–600

work page 2006

[31] [31]

Microsoft Corporation. 2025. Visual Studio Code. https://code.visualstudio.com. Accessed: 2025-10-16. (2025)

work page 2025

[32] [32]

Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an llm to help with code understanding. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1–13

work page 2024

[33] [33]

Daye Nam, Ahmed Omran, Ambar Murillo, Saksham Thakur, Abner Araujo, Marcel Blistein, Alexander Frömmgen, Vincent Hellendoorn, and Satish Chan- dra. 2025. Prompting llms for code editing: struggles and remedies.arXiv preprint arXiv:2504.20196

work page arXiv 2025

[34] [34]

Kevin KB Ng, Liyana Fauzi, Leon Leow, and Jaren Ng. 2024. Harnessing the potential of gen-ai coding assistants in public sector software development. arXiv preprint arXiv:2409.17434

work page arXiv 2024

[35] [35]

Karl Pearson. 1896. Vii. mathematical contributions to the theory of evolu- tion.—iii. regression, heredity, and panmixia.Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 187, 253–318

work page

[36] [36]

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of ai on developer productivity: evidence from github copilot.arXiv preprint arXiv:2302.06590

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

David Piorkowski, Austin Z Henley, Tahmid Nabi, Scott D Fleming, Christopher Scaffidi, and Margaret Burnett. 2016. Foraging and navigations, fundamentally: developers’ predictions of value and cost. InProceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 97– 108

work page 2016

[38] [38]

James Prather, Brent N Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S Randrianasolo, Brett A Becker, Bailey Kimmel, Jared Wright, and Ben Briggs

work page

[39] [39]

InProceedings of the 2024 ACM Conference on International Computing Education Research-Volume 1, 469–486

The widening gap: the benefits and harms of generative ai for novice programmers. InProceedings of the 2024 ACM Conference on International Computing Education Research-Volume 1, 469–486

work page 2024

[40] [40]

Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, and Yan Chen. 2025. Assistance or disruption? exploring and evaluating the design and trade-offs of proactive ai programming support. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–21

work page 2025

[41] [41]

Yunhan Qiao, Md Istiak Hossain Shihab, and Christopher Hundhausen. 2025. A systematic literature review of the use of genai assistants for code compre- hension: implications for computing education research and practice.arXiv preprint arXiv:2510.17894

work page arXiv 2025

[42] [42]

Qualtrics, LLC. 2025. Qualtrics. https://www.qualtrics.com. Accessed: 2025-10-

work page 2025

[43] [43]

Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al

(2025). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al

work page 2025

[44] [44]

Christian Rahe and Walid Maalej. 2025. How do programming students use generative ai?Proceedings of the ACM on Software Engineering, 2, FSE, 978– 1000

work page 2025

[45] [45]

Martin P Robillard, Wesley Coelho, and Gail C Murphy. 2005. How effective developers investigate source code: an exploratory study.IEEE Transactions on software engineering, 30, 12, 889–903

work page 2005

[46] [46]

Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software? In2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265

work page 2012

[47] [47]

Nedhal A Al-Saiyd. 2017. Source code comprehension analysis in software maintenance. In2017 2nd International Conference on Computer and Communi- cation Systems (ICCCS). IEEE, 1–5

work page 2017

[48] [48]

Ranjan Sapkota, Konstantinos I Roumeliotis, and Manoj Karkee. 2025. Vibe coding vs. agentic coding: fundamentals and practical implications of agentic ai.arXiv preprint arXiv:2505.19443

work page arXiv 2025

[49] [49]

Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic gen- eration of programming exercises and code explanations using large language models. InProceedings of the 2022 ACM conference on international computing education research-volume 1, 27–43

work page 2022

[50] [50]

Teresa M Shaft and Iris Vessey. 2006. The role of cognitive fit in the relationship between software comprehension and modification.MIS quarterly, 29–55

work page 2006

[51] [51]

Anshul Shah, Thomas Rexin, Anya Chernova, Gonzalo Allen-Perez, William G Griswold, and Adalbert Gerald Soosai Raj. 2025. Needles in a haystack: student struggles with working on large code bases. InProceedings of the 2025 ACM Conference on International Computing Education Research V. 1, 27–40

work page 2025

[52] [52]

Anshul Shah, Thanh Tong, Elena Tomson, Steven Shi, William G Griswold, and Adalbert Gerald Soosai Raj. 2025. Students’ program comprehension processes in a large code base. In2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC). IEEE Computer Society, 182–193

work page 2025

[53] [53]

Md Istiak Hossain Shihab, Christopher Hundhausen, Ahsun Tariq, Summit Haque, Yunhan Qiao, and Brian Mulanda. 2025. The effects of github copilot on computing students’ programming effectiveness, efficiency, and processes in brownfield programming tasks.arXiv preprint arXiv:2506.10051

work page arXiv 2025

[54] [54]

Janet Siegmund, Norman Peitek, Chris Parnin, Sven Apel, Johannes Hofmeister, Christian Kästner, Andrew Begel, Anja Bethmann, and André Brechmann. 2017. Measuring neural efficiency of program comprehension. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 140–150

work page 2017

[55] [55]

Jonathan Sillito, Gail C Murphy, and Kris De Volder. 2006. Questions pro- grammers ask during software evolution tasks. InProceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering, 23– 34

work page 2006

[56] [56]

Lucas Siqueira Rodrigues, Antonio Rueda-Toicen, and Thomas Kosch. 2025. Redesigning large language model coding assistants for software engineer- ing education. InMensch und Computer 2025-Workshopband. Gesellschaft für Informatik eV, 10–18420

work page 2025

[57] [57]

David H Smith IV, Paul Denny, and Max Fowler. 2024. Prompting for compre- hension: exploring the intersection of explain in plain english questions and prompt writing. InProceedings of the Eleventh ACM Conference on Learning@ Scale, 39–50

work page 2024

[58] [58]

Sean Stapleton, Yashmeet Gambhir, Alexander LeClair, Zachary Eberhart, West- ley Weimer, Kevin Leach, and Yu Huang. 2020. A human study of comprehen- sion and code summarization. InProceedings of the 28th International Conference on Program Comprehension, 2–13

work page 2020

[59] [59]

Margaret-Anne Storey. 2006. Theories, tools and research methods in program comprehension: past, present and future.Software Quality Journal, 14, 3, 187– 208

work page 2006

[60] [60]

Lasang Jimba Tamang, Zeyad Alshaikh, Nisrine Ait Khayi, Priti Oli, and Vasile Rus. 2021. A comparative study of free self-explanations and socratic tutoring explanations for source code comprehension. InProceedings of the 52nd acm technical symposium on computer science education, 219–225

work page 2021

[61] [61]

Ningzhi Tang. 2024. Towards effective validation and integration of llm-generated code. In2024 IEEE Symposium on Visual Languages and Human-Centric Com- puting (VL/HCC). IEEE, 369–370

work page 2024

[62] [62]

Anneliese Von Mayrhauser, A Marie Vans, and Adele E Howe. 1997. Program understanding behaviour during enhancement of large-scale software.Journal of Software Maintenance: Research and Practice, 9, 5, 299–327

work page 1997

[63] [63]

Stefan Wagner and Marvin Wyrich. 2021. Code comprehension confounders: a study of intelligence and personality.IEEE Transactions on Software Engineering, 48, 12, 4789–4801

work page 2021

[64] [64]

Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: a large-scale field study with professionals.IEEE Transactions on Software Engineering, 44, 10, 951–976

work page 2017

[65] [65]

Zoom Video Communications, Inc. 2025. Zoom Video Conferencing. https://zo om.us. Accessed: 2025-10-16. (2025). Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

work page 2025