On the Reliability of Code Comprehension Proxies
Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3
The pith
Proxies from input-output questions that measure response time align best with expert rankings of code comprehensibility, while syntax-based proxies do not.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By first creating an expert consensus ranking of eight code snippets via the Delphi protocol and then correlating that ranking with fourteen literature-derived proxies collected from forty-four students, the study concludes that input-output proxies measured by response time are especially reliable while syntax proxies are especially unreliable regardless of measurement strategy.
What carries the argument
Correlation of student-derived comprehension proxies against an expert Delphi consensus ranking of the same code snippets.
If this is right
- Empirical studies of code comprehension should favor input-output questions measured by response time over other common proxies.
- Existing findings that rest on syntax-question proxies should be treated as less reliable.
- The choice of proxy affects whether a study can claim to approximate how comprehensible code is to practicing engineers.
- Future replication studies can test the same set of proxies on new code snippets to confirm the pattern.
Where Pith is reading between the lines
- Tool builders who want to predict which code will be hard to read could incorporate quick input-output questions with timing.
- Education research on teaching programming might shift emphasis away from syntax-only quiz formats when measuring student understanding.
- The Delphi approach itself could be applied to other software-engineering judgment tasks where ground truth is hard to obtain.
Load-bearing premise
The ground-truth comprehensibility ranking produced by the five-expert Delphi consensus accurately reflects how comprehensible the code snippets are to software engineers in general.
What would settle it
A new panel of professional software engineers, following the same Delphi protocol on the same eight snippets, produces a ranking that differs substantially from the original five-expert ranking.
Figures
read the original abstract
Prior work on code comprehension uses different comprehension proxies-for example, Likert-scale ratings or answers to input-output questions about program snippets, usually collected from students, to approximate whether code is comprehensible to software engineers, but the relative reliability of these proxies is not known. This paper investigates the relative reliability of a collection of proxies common in the extant literature with a pair of human studies. First, we conducted an expert-consensus study with a panel of five professional software engineers to establish a ground-truth comprehensibility ranking of eight code snippets by adapting the Delphi expert-consensus protocol. The Delphi protocol is widely used for expert consensus under conditions of uncertainty in other domains, such as medicine and national-security forecasting, but to our knowledge, this is its first application in software engineering. Second, we conducted a study with 44 student participants who completed tasks, allowing us to measure 14 comprehension proxies derived from the literature on the same set of eight code snippets. Finally, we conducted a correlation analysis on the results, concluding that proxies 1) derived from input-output questions and 2) that measure response time rather than accuracy are especially reliable. We also found that proxies derived from questions about program syntax (rather than semantics) are especially unreliable, regardless of measurement strategy, which draws into question the reliability of parts of the existing comprehensibility literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports two human studies on code comprehension proxies: an expert Delphi consensus with five professional software engineers to rank eight code snippets by comprehensibility, a study with 44 students measuring 14 proxies from the literature, and a correlation analysis concluding that input-output question proxies and response-time measures are more reliable while syntax-based proxies are less reliable.
Significance. If the Delphi-derived ground truth is representative of software engineers in general, the findings offer practical guidance for selecting reliable proxies in future code comprehension research, addressing a gap in the literature regarding proxy reliability. The application of the Delphi protocol is noted as novel in SE.
major comments (3)
- [Expert-consensus study] Expert-consensus study: The ground-truth ranking rests on a Delphi consensus from only five experts with no reported validation against larger or more diverse panels of practicing engineers; because all 14 proxy reliability conclusions are derived solely from rank correlations against this single ordering, any instability or bias in the ground truth directly undermines the central claims about which proxies are 'especially reliable.'
- [Correlation analysis] Correlation analysis: No statistical details (e.g., Spearman or Kendall coefficients, p-values, confidence intervals), sample-size justification, or handling of confounds (e.g., order effects, fatigue) are supplied for the 44-student study or the subsequent correlations, so it is not possible to verify whether the data support the stated conclusions on proxy reliability.
- [Expert-consensus study] Delphi protocol: The description states that the Delphi protocol was adapted but supplies no specifics on number of rounds, feedback mechanisms, anonymity procedures, or convergence criteria, leaving unclear whether the five-expert consensus meets the standards used in other domains where Delphi is established.
minor comments (1)
- The manuscript would benefit from a table or appendix reporting the raw proxy scores, the expert ranking, and the full set of correlation results to allow independent evaluation.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below, indicating revisions where appropriate to strengthen the paper while maintaining the integrity of our findings.
read point-by-point responses
-
Referee: [Expert-consensus study] The ground-truth ranking rests on a Delphi consensus from only five experts with no reported validation against larger or more diverse panels of practicing engineers; because all 14 proxy reliability conclusions are derived solely from rank correlations against this single ordering, any instability or bias in the ground truth directly undermines the central claims about which proxies are 'especially reliable.'
Authors: We acknowledge the small panel size as a limitation of the study. The Delphi method is designed for small expert groups to achieve consensus under uncertainty, and five professional engineers is consistent with applications in other fields. However, we agree that lack of external validation is a concern for generalizability. In revision, we will add an expanded limitations subsection discussing potential instability in the ground truth, report any available inter-expert agreement metrics from the process, and explicitly recommend future validation with larger panels. This does not change our core claims but contextualizes them appropriately. revision: partial
-
Referee: [Correlation analysis] No statistical details (e.g., Spearman or Kendall coefficients, p-values, confidence intervals), sample-size justification, or handling of confounds (e.g., order effects, fatigue) are supplied for the 44-student study or the subsequent correlations, so it is not possible to verify whether the data support the stated conclusions on proxy reliability.
Authors: We will revise the methods and results sections to include all requested details. This includes reporting Spearman rank correlation coefficients with p-values and confidence intervals for the proxy correlations, a sample-size justification based on prior code comprehension studies and power considerations for detecting moderate correlations, and explicit description of confound mitigation (snippet order was randomized across participants to address order effects; sessions were limited in duration to reduce fatigue, though fatigue was not directly measured). These additions will allow verification of the conclusions. revision: yes
-
Referee: [Expert-consensus study] Delphi protocol: The description states that the Delphi protocol was adapted but supplies no specifics on number of rounds, feedback mechanisms, anonymity procedures, or convergence criteria, leaving unclear whether the five-expert consensus meets the standards used in other domains where Delphi is established.
Authors: We will expand the methods section with full protocol details: two rounds were conducted; after round 1, anonymized aggregate rankings and rationales were shared as feedback; experts remained anonymous to each other throughout; convergence was reached when no participant changed their ranking in round 2. These specifics align with standard Delphi practices in other domains and will be added to clarify the adaptation. revision: yes
Circularity Check
No significant circularity; empirical study relies on independent data collection
full rationale
The paper's derivation consists of two separate human-subject studies (Delphi expert consensus for ground-truth ranking of eight snippets; student tasks yielding 14 proxies) followed by rank correlation analysis. No equations, fitted parameters, or self-citations are present that reduce any claimed result to its own inputs by construction. The reliability conclusions are statistical outcomes of fresh data against an externally elicited consensus; the central claims therefore remain independent of the measurement process itself.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1]
-
[2]
2025. libGDX. https://github.com/libgdx/libgdx
work page 2025
-
[3]
Amine Abbad-Andaloussi, Thierry Sorg, and Barbara Weber. 2022. Estimating developers’ cognitive load at a fine-grained level using eye-tracking measures. InICPC. 111–121
work page 2022
-
[4]
Youssef Abdelsalam, Norman Peitek, Annabelle Bergum, and Sven Apel. 2026. The effect of comments on program comprehension: an eye-tracking study. Empir. Softw. Eng.31, 4 (2026), 94
work page 2026
-
[5]
Tarek Alakmeh, David Reich, Lena Jäger, and Thomas Fritz. 2024. Predicting code comprehension: a novel approach to align human gaze with code using deep neural networks.Proc. ACM Softw. Eng.1, FSE (2024), 1982–2004
work page 2024
-
[6]
Ahmed S. Alardawi and Agil M. Agil. 2015. Novice comprehension of object- oriented OO programs: an empirical study. InWCITCA. 1–4
work page 2015
-
[7]
Aljehane, Bonita Sharif, and Jonathan I
Salwa D. Aljehane, Bonita Sharif, and Jonathan I. Maletic. 2023. Studying developer eye movements to measure cognitive workload and visual effort for expertise assessment.Proc. ACM Hum.-Comput. Interact.7, ETRA (2023), 1–18
work page 2023
-
[8]
On the Reliability of Code Com- prehension Proxies
Anonymous. 2026. Replication package for “On the Reliability of Code Com- prehension Proxies”. https://doi.org/10.5281/zenodo.19348389. Zenodo. DOI: 10.5281/zenodo.19348389
-
[9]
Dimitar Asenov, Otmar Hilliges, and Peter Müller. 2016. The effect of richer visualizations on code comprehension. InCHI. 5040–5045
work page 2016
-
[10]
Ronald Baecker. 1988. Enhancing program readability and comprehensibility with tools for program visualization. InICSE. 356–357
work page 1988
-
[11]
Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. 2015. Are test smells really harmful? an empirical study.Empir. Softw. Eng.20 (2015), 1052–1094
work page 2015
-
[12]
Roman Bednarik, Carsten Schulte, Lea Budde, Birte Heinemann, and Hana Vrza- kova. 2018. Eye-movement modeling examples in source code comprehension: a classroom study. InKoli Calling. 1–8
work page 2018
-
[13]
Annabelle Bergum, Norman Peitek, Maurice Rekrut, Janet Siegmund, and Sven Apel. 2026. On the influence of the baseline in neuroimaging experiments on program comprehension.ACM Trans. Softw. Eng. Methodol. (TOSEM)35, 3 (2026), 1–27
work page 2026
-
[14]
Maletic, Christopher Morrell, and Bonita Sharif
Dave Binkley, Marcia Davis, Dawn Lawrie, Jonathan I. Maletic, Christopher Morrell, and Bonita Sharif. 2013. The impact of identifier style on effort and comprehension.Empir. Softw. Eng.18, 2 (2013), 219–276
work page 2013
-
[15]
Scott Blinman and Andy Cockburn. 2005. Program comprehension: investigating the effects of naming style and documentation. InAUIC. 73–78
work page 2005
-
[16]
Jürgen Börstler and Barbara Paech. 2016. The role of method chains and com- ments in software readability and comprehension: an experiment.IEEE Trans. Softw. Eng. (TSE)42, 9 (2016), 886–898
work page 2016
-
[17]
Jean-Marie Burkhardt, Françoise Détienne, and Susan Wiedenbeck. 2002. Object- oriented program comprehension: effect of expertise, task, and phase.Empir. Softw. Eng.7, 2 (2002), 115–156
work page 2002
-
[18]
Raymond P. L. Buse and Westley R. Weimer. 2009. Learning a metric for code readability.IEEE Trans. Softw. Eng. (TSE)36, 4 (2009), 546–558
work page 2009
-
[19]
Pa- terson, Carsten Schulte, Bonita Sharif, and Sascha Tamm
Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H. Pa- terson, Carsten Schulte, Bonita Sharif, and Sascha Tamm. 2015. Eye movements in code reading: relaxing the linear order. InICPC. 255–265
work page 2015
-
[20]
Celia Chen, Reem Alfayez, Kamonphop Srisopha, Lin Shi, and Barry Boehm
-
[21]
Evaluating human-assessed software maintainability metrics. InNASAC. 120–132
-
[22]
code” back in “code comprehension study
Kyle D. Chin and Reid Holmes. 2026. Put the “code” back in “code comprehension study”. (2026)
work page 2026
-
[23]
2013.Statistical power analysis for the behavioral sciences
Jacob Cohen. 2013.Statistical power analysis for the behavioral sciences. Rout- ledge
work page 2013
-
[24]
Ricardo Couceiro, Raul Barbosa, João Duráes, Gonçalo Duarte, João Castelhano, Catarina Duarte, Cesar Teixeira, Nuno Laranjeiro, Júlio Medeiros, and Paulo Car- valho. 2019. Spotting problematic code lines using nonintrusive programmers’ biofeedback. InISSRE. 93–103
work page 2019
-
[25]
Igor Crk, Timothy Kluthe, and Andreas Stefik. 2015. Understanding program- ming expertise: an empirical study of phasic brain wave changes.ACM Trans. Comput.-Hum. Interact. (TOCHI)23, 1 (2015), 1–29
work page 2015
-
[26]
Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling projects in GitHub for MSR studies. InMSR. 560–564
work page 2021
-
[27]
Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer
- [28]
-
[29]
Norman Dalkey and Olaf Helmer. 1963. An experimental application of the Delphi method to the use of experts.Manage. Sci.9, 3 (1963), 458–467
work page 1963
-
[30]
Norman C. Dalkey. 1969.The Delphi method: an experimental study of group opinion. RAND Corp., Santa Monica, CA. https://doi.org/10.7249/RM5888
- [31]
-
[32]
WPM De Silva et al . 2025. Circular economic strategies for maximising the end-of-life value of modular buildings: a Delphi study.Smart Sustain. Built Environ.(2025)
work page 2025
-
[33]
Bart Du Bois, Serge Demeyer, and Jan Verelst. 2005. Does the “refactor to understand” reverse engineering pattern improve program comprehension?. In CSMR. 334–343
work page 2005
-
[34]
Aruna Duraisingam, Ramaswamy Palaniappan, and Samraj Andrews. 2017. Cognitive task difficulty analysis using EEG and data mining. InICEDSS. 52–57
work page 2017
- [35]
-
[36]
Sarah Fakhoury, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2018. The effect of poor source code lexicon and readability on developers’ cognitive load. InICPC. 286–296
work page 2018
-
[37]
Sarah Fakhoury, Devjeet Roy, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2020. Measuring the impact of lexical and structural inconsistencies on developers’ cognitive load during bug localization.Empir. Softw. Eng. (ESE) 25 (2020), 2140–2178
work page 2020
-
[38]
Janet Feigenspan, Christian Kästner, Jörg Liebig, Sven Apel, and Stefan Hanen- berg. 2012. Measuring programming experience. InProc. IEEE/ACM Int. Conf. Program Comprehension (ICPC). 73–82
work page 2012
-
[39]
Flint, Robert Dyer, and Bonita Sharif
Samuel W. Flint, Robert Dyer, and Bonita Sharif. 2026. Do developers read type in- formation? An eye-tracking study on TypeScript.arXiv(2026). arXiv:2602.04824
-
[40]
Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance.J. Am. Stat. Assoc.32, 200 (1937), 675–701
work page 1937
-
[41]
Hao Gao, Haytham Hijazi, Júlio Medeiros, João Durães, Chan Tong Lam, Paulo de Carvalho, and Henrique Madeira. 2025. NRevisit: a cognitive behavioral metric for code understandability assessment. InProc. Int. Conf. Evaluation Assessment Softw. Eng. (EASE). 908–918
work page 2025
-
[42]
Ileana Gefaell Larrondo et al . 2026. Strengthening Primary Health Care in Europe: A Delphi study towards accessibility, equity and continuity of care.Eur. J. Gen. Pract.32, 1 (2026), 2619226
work page 2026
-
[43]
David J. Gilmore and Thomas R. G. Green. 1984. Comprehension and recall of miniature programs.Int. J. Man-Mach. Stud.21, 1 (1984), 31–48
work page 1984
-
[44]
Google. 2024. Google Java Formatter. https://github.com/google/google-java- format. Accessed: 2024-11-20
work page 2024
- [45]
-
[46]
Dean Hendrix, James H. Cross, and Saeed Maghsoodloo. 2002. The effectiveness of control structure diagrams in source code comprehension activities.IEEE Trans. Softw. Eng. (TSE)28, 5 (2002), 463–477
work page 2002
-
[47]
Hofmeister, Janet Siegmund, and Daniel V
Johannes C. Hofmeister, Janet Siegmund, and Daniel V. Holt. 2019. Shorter identifier names take longer to comprehend.Empir. Softw. Eng.24 (2019), 417– 443
work page 2019
-
[48]
Errol R. Iselin. 1988. Conditional statements, looping constructs, and program comprehension: an experimental study.Int. J. Man-Mach. Stud.28, 1 (1988), 45–66. 11 Conference 2026, 1 - 4 January, 2026, City, Country Erfan Arvan, Nadeeshan de Silva, Oscar Chaparro, and Martin Kellogg
work page 1988
-
[49]
Oleksandra Ishchenko et al . 2025. Barriers and opportunities for Demand Response Aggregation in Ukraine and Norway: A Delphi-based study.Energy 328 (2025), 136296
work page 2025
-
[50]
Toyomi Ishida and Hidetake Uwano. 2019. Synchronized analysis of eye move- ment and EEG during program comprehension. InEMIP. 26–32
work page 2019
- [51]
-
[52]
John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif
-
[53]
An empirical study assessing source code readability in comprehension. InICSME. 513–523
-
[54]
Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, and Yu Huang. 2024. A tale of two comprehensions? analyzing student programmer attention during code summarization.ACM Trans. Softw. Eng. Methodol. (TOSEM) 33, 7 (2024), 1–37
work page 2024
-
[55]
Nadia Kasto and Jacqueline Whalley. 2013. Measuring the difficulty of code comprehension tasks using software metrics. InProc. Australas. Comput. Educ. Conf. (ACE). 59–65
work page 2013
-
[56]
Maurice G. Kendall. 1945. The treatment of ties in ranking problems.Biometrika 33, 3 (1945), 239–251
work page 1945
-
[57]
Maurice G. Kendall, Sheila F. H. Kendall, and B. Babington Smith. 1939. The distribution of Spearman’s coefficient of rank correlation in a universe in which all rankings occur an equal number of times.Biometrika(1939), 251–273
work page 1939
-
[58]
2023.RAND methodological guidance for conducting and critically appraising Delphi panels
Dmitry Khodyakov, Sean Grant, Jack Kroger, and Melissa Bauman. 2023.RAND methodological guidance for conducting and critically appraising Delphi panels. RAND Corp., Santa Monica, CA. https://doi.org/10.7249/TLA3082-1
-
[59]
George Kinnear, Ian Jones, and Ben Davies. 2025. Comparative judgement as a research tool: A meta-analysis of application and reliability.Behavior Research Methods57 (2025), 222. https://doi.org/10.3758/s13428-025-02744-w
-
[60]
Walter Kintsch. 1988. The role of knowledge in discourse comprehension: a construction-integration model.Psychol. Rev.95, 2 (1988), 163–182
work page 1988
- [61]
-
[62]
Luigi Lavazza, Sandro Morasca, and Marco Gatto. 2023. An empirical study on software understandability and its dependence on code characteristics.Empir. Softw. Eng.28, 6 (2023), 155
work page 2023
-
[63]
Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2007. Ef- fective identifier names for comprehension and memory.Innov. Syst. Softw. Eng. 3, 4 (2007), 303–318
work page 2007
-
[64]
SeolHwa Lee, Andrew Matteson, Danial Hooshyar, SongHyun Kim, JaeBum Jung, GiChun Nam, and HeuiSeok Lim. 2016. Comparing programming language comprehension between novice and expert programmers using EEG analysis. InBIBE. 350–355
work page 2016
-
[65]
Danielle R Lombardi et al. 2025. The increased role of advanced technology and automation in audit: A delphi study.Int. J. Account. Inf. Syst.56 (2025), 100733
work page 2025
-
[66]
Brady D Lund. 2020. Review of the Delphi method in library and information science research.J. Doc.76, 4 (2020), 929–960
work page 2020
-
[67]
Sarah B Maness, Stacey B Griner, and Erika L Thompson. 2025. Expert Consensus on Indicators of Social Determinants of Health: A Modified Delphi Study.J. Prim. Care Community Health16 (2025)
work page 2025
-
[68]
Jean Melo, Fabricio Batista Narcizo, Dan Witzner Hansen, Claus Brabrand, and Andrzej Wasowski. 2017. Variability through the eyes of the programmer. In Proc. IEEE/ACM Int. Conf. Program Comprehension (ICPC). 34–44
work page 2017
- [69]
-
[70]
Program indentation and comprehensibility.Commun. ACM26, 11 (1983), 861–867
work page 1983
-
[71]
Roberto Minelli, Andrea Mocci, and Michele Lanza. 2015. I know what you did last summer: an investigation of how developers spend their time. InICPC. 25–35
work page 2015
-
[72]
Russell Mosemann and Susan Wiedenbeck. 2001. Navigation and comprehension of programs by novice programmers. InIWPC. 79–88
work page 2001
-
[73]
Sebastian Nielebock, Dariusz Krolikowski, Jacob Krüger, Thomas Leich, and Frank Ortmeier. 2019. Commenting source code: is it worth it for small pro- gramming tasks?Empir. Softw. Eng.24, 3 (2019), 1418–1457
work page 2019
-
[74]
Pavel A. Orlov and Roman Bednarik. 2017. The role of extrafoveal vision in source code comprehension.Perception46, 5 (2017), 541–565
work page 2017
-
[75]
Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif
Kang-il Park, Jack Johnson, Cole S. Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif. 2024. An eye tracking study assessing source code readability rules for program comprehension.Empir. Softw. Eng.29, 6 (2024), 160
work page 2024
-
[76]
Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund
-
[77]
Program comprehension and code complexity metrics: an fMRI study. In ICSE. 524–536
-
[78]
Norman Peitek, Janet Siegmund, and Sven Apel. 2020. What drives the reading order of programmers? an eye tracking study. InICPC. 342–353
work page 2020
-
[79]
Norman Peitek, Janet Siegmund, Sven Apel, Christian Kästner, Chris Parnin, Anja Bethmann, Thomas Leich, Gunter Saake, and André Brechmann. 2018. A look into programmers’ heads.IEEE Trans. Softw. Eng. (TSE)46, 4 (2018), 442–462
work page 2018
-
[80]
Norman Peitek, Janet Siegmund, Chris Parnin, Sven Apel, and André Brechmann
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.