Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming
Pith reviewed 2026-05-18 00:32 UTC · model grok-4.3
The pith
Programmers maintain comprehension of legacy code when they actively verify Copilot suggestions rather than accepting them passively.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a within-subject study, participants completed feature implementation tasks in legacy code with and without Copilot. Performance improved significantly, but comprehension showed no overall change. Performance gains correlated negatively with reverse-engineering comprehension and showed a positive trend (p=0.06) with implementation comprehension. Behavioral analysis showed that verification loops strongly predict comprehension, with high-comprehension users verifying 4.7 times more often than low-comprehension users.
What carries the argument
Verification loops: repeated cycles in which programmers review and test Copilot-generated code against the surrounding legacy system before accepting it.
If this is right
- GenAI tools do not inherently reduce comprehension when users adopt active review habits.
- Passive acceptance of generated code produces speed without preserving system-level understanding.
- Software engineering curricula should explicitly train verification skills alongside tool use.
- Educational versions of these tools could include prompts or interfaces that encourage code review.
Where Pith is reading between the lines
- The same verification pattern may matter for other generative coding tools beyond Copilot.
- Industry onboarding programs could measure and reward verification frequency to protect long-term maintainability.
- Tool designers might test interfaces that make verification the default next step after generation.
Load-bearing premise
The within-subject design with 15 participants fully controls for individual differences and order effects, and the chosen comprehension measures validly capture understanding of the brownfield code base without being confounded by task-specific artifacts or fatigue.
What would settle it
A follow-up experiment that directly manipulates the frequency of verification opportunities and then measures reverse-engineering and implementation scores would show whether the reported correlation is causal.
Original abstract
Teaching Computer Science (CS) students how to comprehend and maintain legacy code bases is a critical challenge in software engineering education. While Generative AI (GenAI) assistants like GitHub Copilot improve task completion speed and correctness, their impact on code understanding remains unclear. We conducted a within-subject study with 15 graduate CS students completing feature implementation tasks with and without Copilot. Despite significant performance improvements, participants showed no overall comprehension improvement ($p=0.59$), revealing a comprehension-performance decoupling. Further analysis uncovered a comprehension trade-off: performance gains negatively correlated with reverse engineering comprehension ($\rho=-0.57$, $p=0.026$) but showed a positive trend with implementation comprehension ($\rho=0.50$, $p=0.06$). A follow-up behavioral analysis revealed that how students used Copilot determined outcomes: Engaging in verification loops in which programmers actively reviewed generated code strongly predicted comprehension ($p<0.001$, $r=0.96$), with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension participants. These findings suggest that GenAI tools do not inherently undermine comprehension; rather, passive consumption patterns do. This suggests a need to alter programming education to teach system-level verification skills, and the need to redesign educational GenAI tools to scaffold active cognitive engagement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a within-subject user study with 15 graduate CS students performing feature implementation tasks on a brownfield codebase, comparing conditions with and without GitHub Copilot. It finds significant performance gains with Copilot but no overall comprehension improvement (p=0.59), a comprehension-performance trade-off (negative correlation with reverse-engineering comprehension ρ=-0.57, p=0.026; positive trend with implementation comprehension ρ=0.50, p=0.06), and a strong behavioral predictor: engagement in verification loops correlates with comprehension at r=0.96 (p<0.001), with high-comprehension participants verifying 4.7 times more frequently. The authors conclude that passive Copilot use, not the tool itself, drives comprehension issues and recommend changes to education and tool design.
Significance. If the core findings hold after addressing sample-size limitations, the work is significant for software engineering education and human-AI interaction research. It provides concrete empirical evidence of a performance-comprehension decoupling and identifies verifiable usage behaviors (verification loops) as a modifiable factor rather than an inherent property of generative tools. This shifts focus from blanket tool adoption to scaffolded cognitive engagement, with direct implications for curriculum design and AI assistant interfaces.
major comments (2)
- [Results / Behavioral Analysis] Behavioral analysis (results section): the central claim of r=0.96 (p<0.001) between verification-loop frequency and comprehension scores rests on n=15; this magnitude is vulnerable to single-point leverage or post-hoc coding decisions. A sensitivity analysis (e.g., leave-one-out or bootstrap CI) or pre-registration of the behavioral metric is needed to establish robustness of the predictor.
- [Methods] Methods and comprehension instruments: the within-subject design claims to control individual differences, yet details on counterbalancing, task-order randomization, fatigue checks, and validation of the reverse-engineering vs. implementation measures (including inter-rater reliability for behavioral coding) are insufficient to rule out confounds with the outcome tasks.
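The sensitivity analysis requested in the first major comment can be sketched as follows. This is a minimal illustration, not the authors' analysis: it assumes the reported r is a Pearson coefficient, uses a percentile bootstrap over participants, and runs on made-up data. The variable names (`verification_counts`, `comprehension_scores`), the data values, the seed, and the default of 10,000 iterations are all assumptions chosen for the example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out(xs, ys):
    """Recompute r with each participant dropped in turn."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


def bootstrap_ci(xs, ys, iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < iterations:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in idx]
        sample_y = [ys[i] for i in idx]
        # Skip degenerate resamples in which either variable is constant,
        # because r is undefined there.
        if len(set(sample_x)) < 2 or len(set(sample_y)) < 2:
            continue
        estimates.append(pearson_r(sample_x, sample_y))
    estimates.sort()
    lower = estimates[int((alpha / 2) * iterations)]
    upper = estimates[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


# Hypothetical data for 15 participants; NOT the study's measurements.
verification_counts = [1, 2, 2, 3, 4, 5, 5, 6, 8, 9, 10, 12, 13, 14, 16]
comprehension_scores = [20, 28, 25, 33, 38, 41, 47, 50, 58, 66, 64, 75, 80, 83, 92]

full_r = pearson_r(verification_counts, comprehension_scores)
loo_rs = leave_one_out(verification_counts, comprehension_scores)
ci = bootstrap_ci(verification_counts, comprehension_scores)

print(f"full-sample r = {full_r:.3f}")
print(f"leave-one-out range = [{min(loo_rs):.3f}, {max(loo_rs):.3f}]")
print(f"95% bootstrap CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

A narrow leave-one-out range would indicate that no single participant drives the correlation. It would not address the separate concern about post-hoc coding decisions.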
minor comments (2)
- [Abstract] Abstract: report the exact statistical test and degrees of freedom for the p=0.59 overall comprehension result and the ρ values.
- [Methods] Clarify the operational definition of 'verification loops' and how they were coded from interaction logs or think-aloud data.
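To show what reporting a test statistic and degrees of freedom could look like, the sketch below works through the reported r = 0.96 with n = 15. It assumes that r is a Pearson coefficient tested against zero with the usual t statistic on n - 2 degrees of freedom, and it uses the Fisher z-transform for an approximate interval. Neither assumption is confirmed by the abstract, and the function names are illustrative.

```python
import math


def pearson_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def fisher_z_interval(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z-transform.

    Assumes roughly bivariate-normal data; the standard error of
    atanh(r) is taken as 1 / sqrt(n - 3).
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r_reported = 0.96       # value quoted in the abstract
n_participants = 15     # sample size quoted in the abstract

t_stat = pearson_t_statistic(r_reported, n_participants)
ci_low, ci_high = fisher_z_interval(r_reported, n_participants)

print(f"t({n_participants - 2}) = {t_stat:.2f}")
print(f"approx. 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Under these assumptions the result is t(13) of about 12.36 and an interval of roughly [0.88, 0.99]. The same pattern applies to the other reported coefficients once the underlying test is known.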
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and revised the paper to strengthen the robustness of our behavioral findings and to provide greater methodological transparency. Below we respond point by point.
- Referee: [Results / Behavioral Analysis] Behavioral analysis (results section): the central claim of r=0.96 (p<0.001) between verification-loop frequency and comprehension scores rests on n=15; this magnitude is vulnerable to single-point leverage or post-hoc coding decisions. A sensitivity analysis (e.g., leave-one-out or bootstrap CI) or pre-registration of the behavioral metric is needed to establish robustness of the predictor.
  Authors: We agree that the small sample size makes the reported correlation sensitive to potential outliers or leverage points. In the revised manuscript we have added a leave-one-out sensitivity analysis together with bootstrap resampling (10,000 iterations) to compute 95% confidence intervals. The correlation remains statistically significant and above r = 0.90 in every leave-one-out iteration, and the bootstrap CI is [0.87, 0.99]. We have also clarified that the verification-loop metric was defined a priori on the basis of pilot observations and theoretical considerations of active code review, rather than being derived post hoc from the main data.
  Revision: yes
- Referee: [Methods] Methods and comprehension instruments: the within-subject design claims to control individual differences, yet details on counterbalancing, task-order randomization, fatigue checks, and validation of the reverse-engineering vs. implementation measures (including inter-rater reliability for behavioral coding) are insufficient to rule out confounds with the outcome tasks.
  Authors: We appreciate the referee’s call for greater methodological detail. The revised Methods section now explicitly reports: (1) counterbalancing via a balanced Latin-square design across the two conditions; (2) randomization of task order within each condition; (3) fatigue assessment using NASA-TLX scales administered after each task together with mandatory short breaks; (4) pilot validation of the reverse-engineering and implementation comprehension instruments with five additional participants and subsequent expert review; and (5) independent behavioral coding by two researchers yielding Cohen’s κ = 0.87. These additions address potential order, fatigue, and measurement confounds.
  Revision: yes
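For readers unfamiliar with the inter-rater statistic cited in the second response, the sketch below computes Cohen's κ for two raters from first principles. The category labels (`"verify"`, `"accept"`) and the ratings are hypothetical and are not taken from the study's coding scheme.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1.")
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes assigned by two raters to ten logged episodes.
rater_1 = ["verify", "verify", "accept", "accept", "verify",
           "accept", "verify", "accept", "accept", "verify"]
rater_2 = ["verify", "verify", "accept", "accept", "verify",
           "accept", "accept", "accept", "accept", "verify"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on nine of ten episodes and chance agreement is 0.5, so κ is 0.8. Because κ discounts chance agreement, it can be noticeably lower than raw percent agreement.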
Circularity Check
No circularity: empirical user study with independent statistical observations
The paper reports results from a within-subject user study involving 15 participants performing feature implementation tasks with and without Copilot. All load-bearing claims rest on directly observed behavioral metrics (verification loop counts) and comprehension scores (reverse engineering and implementation tasks), with reported statistics such as Pearson r=0.96 and subgroup frequency ratios derived from the collected data rather than from any self-referential equations or fitted parameters. No mathematical derivations, uniqueness theorems, ansatzes, or self-citation chains are invoked to justify core results; the analysis is self-contained against external benchmarks of participant performance and does not reduce any prediction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions for Pearson/Spearman correlations and regression (linearity, independence, sufficient sample for p-value interpretation) hold for the reported statistics.
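The two coefficients named in this ledger entry rest on different assumptions. The sketch below computes Spearman's ρ as a Pearson correlation on average ranks, so it responds to monotonic rather than strictly linear association. The helper names and the data are illustrative assumptions and are not from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r applied to average ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical monotonic but non-linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 1000]

print(f"Pearson r    = {pearson_r(xs, ys):.3f}")
print(f"Spearman rho = {spearman_rho(xs, ys):.3f}")
```

On perfectly monotonic data like this, ρ equals 1 while r falls below 1. This is one reason to state explicitly which coefficient each reported value refers to.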
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Engaging in verification loops in which programmers actively reviewed generated code strongly predicted comprehension (p<0.001, r=0.96)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  within-subject study with 15 graduate CS students completing feature implementation tasks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Restructure This: Using AI to Restructure Onboarding Documents to Reduce Cognitive Overload
  VisDoc uses GenAI to restructure OSS onboarding documentation according to CTML principles, yielding higher task success and lower cognitive load in a small newcomer study.
- Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap
  Comparative review of AI coding tool ToS shows responsibility for code quality and compliance shifted to users, with policy misalignment for autonomous agents, plus a research roadmap.