On the Reliability of Code Comprehension Proxies

Erfan Arvan; Martin Kellogg; Nadeeshan De Silva; Oscar Chaparro

arxiv: 2605.23008 · v1 · pith:WGZB573Gnew · submitted 2026-05-21 · 💻 cs.SE

Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3

classification 💻 cs.SE

keywords: code comprehension, proxies, reliability, Delphi method, software engineering, input-output questions, response time, syntax questions

The pith

Proxies from input-output questions that measure response time align best with expert rankings of code comprehensibility, while syntax-based proxies do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a ground-truth ranking of how comprehensible eight code snippets are by having five professional engineers reach consensus through a Delphi protocol. It then measures fourteen different proxies on the same snippets using responses from forty-four students and checks which proxies match the expert ranking. Proxies built from questions about what a program does, especially when they record how long participants take to answer, track the expert view most closely. Proxies built from questions about program syntax match poorly no matter whether they record accuracy or time. This matters because many existing studies of code readability and maintainability rely on one or another of these proxies, so the choice directly affects how much trust to place in their conclusions.

Core claim

By first creating an expert consensus ranking of eight code snippets via the Delphi protocol and then correlating that ranking with fourteen literature-derived proxies collected from forty-four students, the study concludes that input-output proxies measured by response time are especially reliable while syntax proxies are especially unreliable regardless of measurement strategy.

What carries the argument

Correlation of student-derived comprehension proxies against an expert Delphi consensus ranking of the same code snippets.
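
As a minimal sketch of that mechanism, the snippet below computes a rank correlation between one proxy and an expert ranking. Every number is invented for illustration and is not the study's data; the names `expert_rank` and `median_io_time` are hypothetical, and the abstract does not name the coefficient the authors use, so Spearman's rho is an assumption here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for eight snippets (1 = most comprehensible).
expert_rank = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical median seconds to answer an input-output question per snippet.
median_io_time = [12.0, 15.5, 14.0, 20.1, 22.3, 21.0, 30.2, 28.4]

rho = spearman(expert_rank, median_io_time)
print(f"Spearman rho = {rho:.3f}")
```

In this toy setup a rho near 1 would mean that snippets the experts rank as harder also take students longer to answer, which is the sense in which a proxy "tracks" the expert ranking.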

If this is right

Empirical studies of code comprehension should favor input-output questions measured by response time over other common proxies.
Existing findings that rest on syntax-question proxies should be treated as less reliable.
The choice of proxy affects whether a study can claim to approximate how comprehensible code is to practicing engineers.
Future replication studies can test the same set of proxies on new code snippets to confirm the pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tool builders who want to predict which code will be hard to read could incorporate quick input-output questions with timing.
Education research on teaching programming might shift emphasis away from syntax-only quiz formats when measuring student understanding.
The Delphi approach itself could be applied to other software-engineering judgment tasks where ground truth is hard to obtain.

Load-bearing premise

The ground-truth comprehensibility ranking produced by the five-expert Delphi consensus accurately reflects how comprehensible the code snippets are to software engineers in general.

What would settle it

A new panel of professional software engineers, following the same Delphi protocol on the same eight snippets, produces a ranking that differs substantially from the original five-expert ranking.
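
One way to quantify how far a replication ranking departs from the original is Kendall's tau, which counts snippet pairs that the two panels order the same way versus the opposite way. The sketch below uses invented rankings; `original_panel` and `replication_panel` are hypothetical names, and the page does not state what threshold would count as "substantially" different, so none is assumed.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a for two rankings of the same items, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1  # both panels order this pair the same way
        elif sign < 0:
            discordant += 1  # the panels disagree on this pair
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical rankings of the same eight snippets by two panels.
original_panel = [1, 2, 3, 4, 5, 6, 7, 8]
replication_panel = [2, 1, 3, 4, 6, 5, 7, 8]

tau = kendall_tau(original_panel, replication_panel)
print(f"Kendall tau = {tau:.3f}")
```

A tau of 1 means identical orderings and a tau of -1 means fully reversed orderings; two swapped adjacent pairs, as in this example, leave tau close to 1.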

Figures

Figures reproduced from arXiv: 2605.23008 by Erfan Arvan, Martin Kellogg, Nadeeshan De Silva, Oscar Chaparro.

**Figure 2.** Two example code snippets used in the study

**Figure 3.** Correlation between proxies and expert-determined rank

**Figure 5.** Distribution of correlations between aggregated student

**Figure 9.** Distribution of correlations between per-student proxy

**Figure 7.** Distribution of correlations between aggregated student

**Figure 4.** Correlation between aggregated student proxies and expert

**Figure 11.** Distribution of correlations between per-student proxy

Original abstract

Prior work on code comprehension uses different comprehension proxies (for example, Likert-scale ratings or answers to input-output questions about program snippets), usually collected from students, to approximate whether code is comprehensible to software engineers, but the relative reliability of these proxies is not known. This paper investigates the relative reliability of a collection of proxies common in the extant literature with a pair of human studies. First, we conducted an expert-consensus study with a panel of five professional software engineers to establish a ground-truth comprehensibility ranking of eight code snippets by adapting the Delphi expert-consensus protocol. The Delphi protocol is widely used for expert consensus under conditions of uncertainty in other domains, such as medicine and national-security forecasting, but to our knowledge, this is its first application in software engineering. Second, we conducted a study with 44 student participants who completed tasks, allowing us to measure 14 comprehension proxies derived from the literature on the same set of eight code snippets. Finally, we conducted a correlation analysis on the results, concluding that proxies 1) derived from input-output questions and 2) that measure response time rather than accuracy are especially reliable. We also found that proxies derived from questions about program syntax (rather than semantics) are especially unreliable, regardless of measurement strategy, which draws into question the reliability of parts of the existing comprehensibility literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Delphi is new here but the small expert panel makes the reliability rankings hard to trust without more validation.

The main takeaway is that the paper applies the Delphi method to software engineering, for the first time to the authors' knowledge, and runs a direct comparison of 14 comprehension proxies on the same set of snippets. The expert study and the student study are both new data. The authors handle the Delphi adaptation reasonably and collect the proxy data cleanly from 44 participants. The results flag input-output questions and time measures as stronger, and syntax questions as weaker. That part is useful because it questions some existing work that relies on syntax proxies.

The concern is the ground-truth ranking. Five experts is a small number for establishing a stable ordering, and without reported agreement levels or a follow-up with more people, it's hard to know how much the proxy correlations depend on this particular panel. Eight snippets is also a narrow base. The stress test note is right to flag this. The abstract doesn't give statistical details either, so the full paper will need to show the actual correlation coefficients and any adjustments for multiple comparisons. The student sample size of 44 is reasonable for this kind of study, but the reliance on students rather than professionals for the proxies is another layer that could be discussed.

Readers working on measurement in code comprehension studies will find the proxy comparison interesting. It is worth sending to peer review so the statistical details and robustness can be checked properly. Even with the limitations, the approach is honest and the novelty of Delphi here makes it worth a look from referees.

Referee Report

3 major / 1 minor

Summary. The manuscript reports two human studies on code comprehension proxies: an expert Delphi consensus in which five professional software engineers rank eight code snippets by comprehensibility, and a study in which 44 students complete tasks that yield 14 proxies from the literature. A correlation analysis across the two studies concludes that input-output question proxies and response-time measures are more reliable, while syntax-based proxies are less reliable.

Significance. If the Delphi-derived ground truth is representative of software engineers in general, the findings offer practical guidance for selecting reliable proxies in future code comprehension research, addressing a gap in the literature regarding proxy reliability. The application of the Delphi protocol is noted as novel in SE.

major comments (3)

[Expert-consensus study] The ground-truth ranking rests on a Delphi consensus from only five experts with no reported validation against larger or more diverse panels of practicing engineers; because all 14 proxy reliability conclusions are derived solely from rank correlations against this single ordering, any instability or bias in the ground truth directly undermines the central claims about which proxies are 'especially reliable.'
[Correlation analysis] No statistical details (e.g., Spearman or Kendall coefficients, p-values, confidence intervals), sample-size justification, or handling of confounds (e.g., order effects, fatigue) are supplied for the 44-student study or the subsequent correlations, so it is not possible to verify whether the data support the stated conclusions on proxy reliability.
[Expert-consensus study] Delphi protocol: The description states that the Delphi protocol was adapted but supplies no specifics on number of rounds, feedback mechanisms, anonymity procedures, or convergence criteria, leaving unclear whether the five-expert consensus meets the standards used in other domains where Delphi is established.
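
To illustrate the kind of statistical detail the correlation-analysis comment asks for: with only eight snippets, a p-value for a rank correlation can be computed exactly by enumerating all 8! = 40,320 orderings instead of relying on a large-sample approximation. The sketch below is one possible approach under the assumption of untied ranks, not the paper's reported analysis; the data and the names `expert_rank` and `proxy_rank` are hypothetical.

```python
from itertools import permutations


def sum_sq_diff(rank_x, rank_y):
    """Sum of squared rank differences between two rankings."""
    return sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))


def spearman_no_ties(rank_x, rank_y):
    """Spearman's rho via the closed form that holds when no ranks are tied."""
    n = len(rank_x)
    return 1 - 6 * sum_sq_diff(rank_x, rank_y) / (n * (n * n - 1))


def exact_two_sided_p(rank_x, rank_y):
    """Share of all orderings of rank_y whose |rho| is at least the observed |rho|."""
    n = len(rank_x)
    scale = n * (n * n - 1)
    # |scale - 6 * d2| is proportional to |rho|; integers avoid float comparisons.
    observed = abs(scale - 6 * sum_sq_diff(rank_x, rank_y))
    hits = total = 0
    for perm in permutations(rank_y):
        total += 1
        if abs(scale - 6 * sum_sq_diff(rank_x, perm)) >= observed:
            hits += 1
    return hits / total


# Hypothetical rankings of eight snippets (not the study's data).
expert_rank = [1, 2, 3, 4, 5, 6, 7, 8]
proxy_rank = [1, 3, 2, 4, 6, 5, 8, 7]

rho = spearman_no_ties(expert_rank, proxy_rank)
p = exact_two_sided_p(expert_rank, proxy_rank)
print(f"rho = {rho:.3f}, exact two-sided p = {p:.5f}")
```

Because the study compares 14 proxies against the same ranking, any such p-values would still need a multiple-comparison adjustment, which this sketch does not perform.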

minor comments (1)

The manuscript would benefit from a table or appendix reporting the raw proxy scores, the expert ranking, and the full set of correlation results to allow independent evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below, indicating revisions where appropriate to strengthen the paper while maintaining the integrity of our findings.

Referee: [Expert-consensus study] The ground-truth ranking rests on a Delphi consensus from only five experts with no reported validation against larger or more diverse panels of practicing engineers; because all 14 proxy reliability conclusions are derived solely from rank correlations against this single ordering, any instability or bias in the ground truth directly undermines the central claims about which proxies are 'especially reliable.'

Authors: We acknowledge the small panel size as a limitation of the study. The Delphi method is designed for small expert groups to achieve consensus under uncertainty, and five professional engineers is consistent with applications in other fields. However, we agree that lack of external validation is a concern for generalizability. In revision, we will add an expanded limitations subsection discussing potential instability in the ground truth, report any available inter-expert agreement metrics from the process, and explicitly recommend future validation with larger panels. This does not change our core claims but contextualizes them appropriately.

Revision: partial

Referee: [Correlation analysis] No statistical details (e.g., Spearman or Kendall coefficients, p-values, confidence intervals), sample-size justification, or handling of confounds (e.g., order effects, fatigue) are supplied for the 44-student study or the subsequent correlations, so it is not possible to verify whether the data support the stated conclusions on proxy reliability.

Authors: We will revise the methods and results sections to include all requested details. This includes reporting Spearman rank correlation coefficients with p-values and confidence intervals for the proxy correlations, a sample-size justification based on prior code comprehension studies and power considerations for detecting moderate correlations, and explicit description of confound mitigation (snippet order was randomized across participants to address order effects; sessions were limited in duration to reduce fatigue, though fatigue was not directly measured). These additions will allow verification of the conclusions.

Revision: yes

Referee: [Expert-consensus study] Delphi protocol: The description states that the Delphi protocol was adapted but supplies no specifics on number of rounds, feedback mechanisms, anonymity procedures, or convergence criteria, leaving unclear whether the five-expert consensus meets the standards used in other domains where Delphi is established.

Authors: We will expand the methods section with full protocol details: two rounds were conducted; after round 1, anonymized aggregate rankings and rationales were shared as feedback; experts remained anonymous to each other throughout; convergence was reached when no participant changed their ranking in round 2. These specifics align with standard Delphi practices in other domains and will be added to clarify the adaptation.

Revision: yes
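
The round structure described in the last response can be sketched in a few lines: aggregate the experts' rankings, measure their agreement, and check whether anyone changed their ranking between rounds. This is an illustrative sketch only. The aggregation rule (ordering by rank sums), the use of Kendall's W as the agreement metric, the three-expert, four-snippet data, and all function names are assumptions, since the page does not specify how the panel's rankings were combined or scored.

```python
def rank_sums(rankings):
    """rankings[e][s] is the rank expert e gives snippet s; returns per-snippet totals."""
    return [sum(column) for column in zip(*rankings)]


def aggregate_order(rankings):
    """Snippet indices ordered by total rank, lowest first (ties broken by index)."""
    sums = rank_sums(rankings)
    return sorted(range(len(sums)), key=lambda s: (sums[s], s))


def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings (0 = none, 1 = full)."""
    m, n = len(rankings), len(rankings[0])
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums(rankings))
    return 12 * s / (m ** 2 * (n ** 3 - n))


def converged(previous_round, current_round):
    """True when no expert changed their ranking between the two rounds."""
    return previous_round == current_round


# Hypothetical rankings from three experts over four snippets.
round_1 = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
round_2 = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]

print("aggregate order after round 1:", aggregate_order(round_1))
print(f"agreement (Kendall's W): {kendalls_w(round_1):.3f} -> {kendalls_w(round_2):.3f}")
print("converged:", converged(round_1, round_2))
```

Reporting an agreement statistic per round alongside the final ordering would give readers a direct view of how stable the five-expert consensus was.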

Circularity Check

0 steps flagged

No significant circularity; empirical study relies on independent data collection

The paper's derivation consists of two separate human-subject studies (Delphi expert consensus for ground-truth ranking of eight snippets; student tasks yielding 14 proxies) followed by rank correlation analysis. No equations, fitted parameters, or self-citations are present that reduce any claimed result to its own inputs by construction. The reliability conclusions are statistical outcomes of fresh data against an externally elicited consensus; the central claims therefore remain independent of the measurement process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical human-subjects study with no mathematical model, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5768 in / 1109 out tokens · 28356 ms · 2026-05-25T05:34:29.683928+00:00 · methodology

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages

[1]

Apache Kafka

2025. Apache Kafka. https://github.com/apache/kafka

work page 2025
[2]

2025. libGDX. https://github.com/libgdx/libgdx

work page 2025
[3]

Amine Abbad-Andaloussi, Thierry Sorg, and Barbara Weber. 2022. Estimating developers’ cognitive load at a fine-grained level using eye-tracking measures. InICPC. 111–121

work page 2022
[4]

Youssef Abdelsalam, Norman Peitek, Annabelle Bergum, and Sven Apel. 2026. The effect of comments on program comprehension: an eye-tracking study. Empir. Softw. Eng.31, 4 (2026), 94

work page 2026
[5]

Tarek Alakmeh, David Reich, Lena Jäger, and Thomas Fritz. 2024. Predicting code comprehension: a novel approach to align human gaze with code using deep neural networks.Proc. ACM Softw. Eng.1, FSE (2024), 1982–2004

work page 2024
[6]

Alardawi and Agil M

Ahmed S. Alardawi and Agil M. Agil. 2015. Novice comprehension of object- oriented OO programs: an empirical study. InWCITCA. 1–4

work page 2015
[7]

Aljehane, Bonita Sharif, and Jonathan I

Salwa D. Aljehane, Bonita Sharif, and Jonathan I. Maletic. 2023. Studying developer eye movements to measure cognitive workload and visual effort for expertise assessment.Proc. ACM Hum.-Comput. Interact.7, ETRA (2023), 1–18

work page 2023
[8]

On the Reliability of Code Com- prehension Proxies

Anonymous. 2026. Replication package for “On the Reliability of Code Com- prehension Proxies”. https://doi.org/10.5281/zenodo.19348389. Zenodo. DOI: 10.5281/zenodo.19348389

work page doi:10.5281/zenodo.19348389 2026
[9]

Dimitar Asenov, Otmar Hilliges, and Peter Müller. 2016. The effect of richer visualizations on code comprehension. InCHI. 5040–5045

work page 2016
[10]

Ronald Baecker. 1988. Enhancing program readability and comprehensibility with tools for program visualization. InICSE. 356–357

work page 1988
[11]

Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. 2015. Are test smells really harmful? an empirical study.Empir. Softw. Eng.20 (2015), 1052–1094

work page 2015
[12]

Roman Bednarik, Carsten Schulte, Lea Budde, Birte Heinemann, and Hana Vrza- kova. 2018. Eye-movement modeling examples in source code comprehension: a classroom study. InKoli Calling. 1–8

work page 2018
[13]

Annabelle Bergum, Norman Peitek, Maurice Rekrut, Janet Siegmund, and Sven Apel. 2026. On the influence of the baseline in neuroimaging experiments on program comprehension.ACM Trans. Softw. Eng. Methodol. (TOSEM)35, 3 (2026), 1–27

work page 2026
[14]

Maletic, Christopher Morrell, and Bonita Sharif

Dave Binkley, Marcia Davis, Dawn Lawrie, Jonathan I. Maletic, Christopher Morrell, and Bonita Sharif. 2013. The impact of identifier style on effort and comprehension.Empir. Softw. Eng.18, 2 (2013), 219–276

work page 2013
[15]

Scott Blinman and Andy Cockburn. 2005. Program comprehension: investigating the effects of naming style and documentation. InAUIC. 73–78

work page 2005
[16]

Jürgen Börstler and Barbara Paech. 2016. The role of method chains and com- ments in software readability and comprehension: an experiment.IEEE Trans. Softw. Eng. (TSE)42, 9 (2016), 886–898

work page 2016
[17]

Jean-Marie Burkhardt, Françoise Détienne, and Susan Wiedenbeck. 2002. Object- oriented program comprehension: effect of expertise, task, and phase.Empir. Softw. Eng.7, 2 (2002), 115–156

work page 2002
[18]

Raymond P. L. Buse and Westley R. Weimer. 2009. Learning a metric for code readability.IEEE Trans. Softw. Eng. (TSE)36, 4 (2009), 546–558

work page 2009
[19]

Pa- terson, Carsten Schulte, Bonita Sharif, and Sascha Tamm

Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H. Pa- terson, Carsten Schulte, Bonita Sharif, and Sascha Tamm. 2015. Eye movements in code reading: relaxing the linear order. InICPC. 255–265

work page 2015
[20]

Celia Chen, Reem Alfayez, Kamonphop Srisopha, Lin Shi, and Barry Boehm

work page
[21]

Evaluating human-assessed software maintainability metrics. InNASAC. 120–132

work page
[22]

code” back in “code comprehension study

Kyle D. Chin and Reid Holmes. 2026. Put the “code” back in “code comprehension study”. (2026)

work page 2026
[23]

2013.Statistical power analysis for the behavioral sciences

Jacob Cohen. 2013.Statistical power analysis for the behavioral sciences. Rout- ledge

work page 2013
[24]

Ricardo Couceiro, Raul Barbosa, João Duráes, Gonçalo Duarte, João Castelhano, Catarina Duarte, Cesar Teixeira, Nuno Laranjeiro, Júlio Medeiros, and Paulo Car- valho. 2019. Spotting problematic code lines using nonintrusive programmers’ biofeedback. InISSRE. 93–103

work page 2019
[25]

Igor Crk, Timothy Kluthe, and Andreas Stefik. 2015. Understanding program- ming expertise: an empirical study of phasic brain wave changes.ACM Trans. Comput.-Hum. Interact. (TOCHI)23, 1 (2015), 1–29

work page 2015
[26]

Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling projects in GitHub for MSR studies. InMSR. 560–564

work page 2021
[27]

Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer

work page
[28]

InESEC/FSE

Modeling readability to improve unit tests. InESEC/FSE. 107–118

work page
[29]

Norman Dalkey and Olaf Helmer. 1963. An experimental application of the Delphi method to the use of experts.Manage. Sci.9, 3 (1963), 458–467

work page 1963
[30]

Norman C. Dalkey. 1969.The Delphi method: an experimental study of group opinion. RAND Corp., Santa Monica, CA. https://doi.org/10.7249/RM5888

work page doi:10.7249/rm5888 1969
[31]

Nadeeshan De Silva, Martin Kellogg, and Oscar Chaparro. 2025. Relative code comprehensibility prediction.arXiv(2025). arXiv:2510.03474

work page arXiv 2025
[32]

WPM De Silva et al . 2025. Circular economic strategies for maximising the end-of-life value of modular buildings: a Delphi study.Smart Sustain. Built Environ.(2025)

work page 2025
[33]

refactor to understand

Bart Du Bois, Serge Demeyer, and Jan Verelst. 2005. Does the “refactor to understand” reverse engineering pattern improve program comprehension?. In CSMR. 334–343

work page 2005
[34]

Aruna Duraisingam, Ramaswamy Palaniappan, and Samraj Andrews. 2017. Cognitive task difficulty analysis using EEG and data mining. InICEDSS. 52–57

work page 2017
[35]

Yasmine Elfares, Gül Çalikli, and Mohamed Khamis. 2025. GazeCopilot: evalu- ating novel gaze-informed prompting for AI-supported code comprehension and readability.arXiv(2025). arXiv:2511.08177

work page arXiv 2025
[36]

Sarah Fakhoury, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2018. The effect of poor source code lexicon and readability on developers’ cognitive load. InICPC. 286–296

work page 2018
[37]

Sarah Fakhoury, Devjeet Roy, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2020. Measuring the impact of lexical and structural inconsistencies on developers’ cognitive load during bug localization.Empir. Softw. Eng. (ESE) 25 (2020), 2140–2178

work page 2020
[38]

Janet Feigenspan, Christian Kästner, Jörg Liebig, Sven Apel, and Stefan Hanen- berg. 2012. Measuring programming experience. InProc. IEEE/ACM Int. Conf. Program Comprehension (ICPC). 73–82

work page 2012
[39]

Flint, Robert Dyer, and Bonita Sharif

Samuel W. Flint, Robert Dyer, and Bonita Sharif. 2026. Do developers read type in- formation? An eye-tracking study on TypeScript.arXiv(2026). arXiv:2602.04824

work page arXiv 2026
[40]

Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance.J. Am. Stat. Assoc.32, 200 (1937), 675–701

work page 1937
[41]

Hao Gao, Haytham Hijazi, Júlio Medeiros, João Durães, Chan Tong Lam, Paulo de Carvalho, and Henrique Madeira. 2025. NRevisit: a cognitive behavioral metric for code understandability assessment. InProc. Int. Conf. Evaluation Assessment Softw. Eng. (EASE). 908–918

work page 2025
[42]

Ileana Gefaell Larrondo et al . 2026. Strengthening Primary Health Care in Europe: A Delphi study towards accessibility, equity and continuity of care.Eur. J. Gen. Pract.32, 1 (2026), 2619226

work page 2026
[43]

Gilmore and Thomas R

David J. Gilmore and Thomas R. G. Green. 1984. Comprehension and recall of miniature programs.Int. J. Man-Mach. Stud.21, 1 (1984), 31–48

work page 1984
[44]

Google. 2024. Google Java Formatter. https://github.com/google/google-java- format. Accessed: 2024-11-20

work page 2024
[45]

Goldstone

Michael Hansen, Andrew Lumsdaine, and Richard L. Goldstone. 2013. An experiment on the cognitive complexity of code. InProc. Annu. Conf. Cogn. Sci. Soc. (CogSci)

work page 2013
[46]

Cross, and Saeed Maghsoodloo

Dean Hendrix, James H. Cross, and Saeed Maghsoodloo. 2002. The effectiveness of control structure diagrams in source code comprehension activities.IEEE Trans. Softw. Eng. (TSE)28, 5 (2002), 463–477

work page 2002
[47]

Hofmeister, Janet Siegmund, and Daniel V

Johannes C. Hofmeister, Janet Siegmund, and Daniel V. Holt. 2019. Shorter identifier names take longer to comprehend.Empir. Softw. Eng.24 (2019), 417– 443

work page 2019
[48]

Errol R. Iselin. 1988. Conditional statements, looping constructs, and program comprehension: an experimental study.Int. J. Man-Mach. Stud.28, 1 (1988), 45–66. 11 Conference 2026, 1 - 4 January, 2026, City, Country Erfan Arvan, Nadeeshan de Silva, Oscar Chaparro, and Martin Kellogg

work page 1988
[49]

Oleksandra Ishchenko et al . 2025. Barriers and opportunities for Demand Response Aggregation in Ukraine and Norway: A Delphi-based study.Energy 328 (2025), 136296

work page 2025
[50]

Toyomi Ishida and Hidetake Uwano. 2019. Synchronized analysis of eye move- ment and EEG during program comprehension. InEMIP. 26–32

work page 2019
[51]

Feitelson

Ahmad Jbara and Dror G. Feitelson. 2017. How programmers read regular code: a controlled experiment using eye tracking.Empir. Softw. Eng.22 (2017), 1440–1477

work page 2017
[52]

John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif

work page
[53]

An empirical study assessing source code readability in comprehension. InICSME. 513–523

work page
[54]

Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, and Yu Huang. 2024. A tale of two comprehensions? analyzing student programmer attention during code summarization.ACM Trans. Softw. Eng. Methodol. (TOSEM) 33, 7 (2024), 1–37

work page 2024
[55]

Nadia Kasto and Jacqueline Whalley. 2013. Measuring the difficulty of code comprehension tasks using software metrics. InProc. Australas. Comput. Educ. Conf. (ACE). 59–65

work page 2013
[56]

Maurice G. Kendall. 1945. The treatment of ties in ranking problems.Biometrika 33, 3 (1945), 239–251

work page 1945
[57]

Kendall, Sheila F

Maurice G. Kendall, Sheila F. H. Kendall, and B. Babington Smith. 1939. The distribution of Spearman’s coefficient of rank correlation in a universe in which all rankings occur an equal number of times.Biometrika(1939), 251–273

work page 1939
[58]

2023.RAND methodological guidance for conducting and critically appraising Delphi panels

Dmitry Khodyakov, Sean Grant, Jack Kroger, and Melissa Bauman. 2023.RAND methodological guidance for conducting and critically appraising Delphi panels. RAND Corp., Santa Monica, CA. https://doi.org/10.7249/TLA3082-1

work page doi:10.7249/tla3082-1 2023
[59]

George Kinnear, Ian Jones, and Ben Davies. 2025. Comparative judgement as a research tool: A meta-analysis of application and reliability.Behavior Research Methods57 (2025), 222. https://doi.org/10.3758/s13428-025-02744-w

work page doi:10.3758/s13428-025-02744-w 2025
[60]

Walter Kintsch. 1988. The role of knowledge in discourse comprehension: a construction-integration model.Psychol. Rev.95, 2 (1988), 163–182

work page 1988
[61]

Van Dijk

Walter Kintsch and Teun A. Van Dijk. 1978. Toward a model of text comprehen- sion and production.Psychol. Rev.85, 5 (1978), 363–394

work page 1978
[62]

Luigi Lavazza, Sandro Morasca, and Marco Gatto. 2023. An empirical study on software understandability and its dependence on code characteristics.Empir. Softw. Eng.28, 6 (2023), 155

work page 2023
[63]

Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2007. Ef- fective identifier names for comprehension and memory.Innov. Syst. Softw. Eng. 3, 4 (2007), 303–318

work page 2007
[64]

SeolHwa Lee, Andrew Matteson, Danial Hooshyar, SongHyun Kim, JaeBum Jung, GiChun Nam, and HeuiSeok Lim. 2016. Comparing programming language comprehension between novice and expert programmers using EEG analysis. InBIBE. 350–355

work page 2016
[65]

Danielle R Lombardi et al. 2025. The increased role of advanced technology and automation in audit: A delphi study.Int. J. Account. Inf. Syst.56 (2025), 100733

work page 2025
[66]

Brady D Lund. 2020. Review of the Delphi method in library and information science research.J. Doc.76, 4 (2020), 929–960

work page 2020
[67]

Sarah B Maness, Stacey B Griner, and Erika L Thompson. 2025. Expert Consensus on Indicators of Social Determinants of Health: A Modified Delphi Study.J. Prim. Care Community Health16 (2025)

work page 2025
[68]

Jean Melo, Fabricio Batista Narcizo, Dan Witzner Hansen, Claus Brabrand, and Andrzej Wasowski. 2017. Variability through the eyes of the programmer. In Proc. IEEE/ACM Int. Conf. Program Comprehension (ICPC). 34–44

work page 2017
[69]

Miara, Joyce A

Richard J. Miara, Joyce A. Musselman, Juan A. Navarro, and Ben Shneiderman

work page
[70]

ACM26, 11 (1983), 861–867

Program indentation and comprehensibility.Commun. ACM26, 11 (1983), 861–867

work page 1983
[71]

Roberto Minelli, Andrea Mocci, and Michele Lanza. 2015. I know what you did last summer: an investigation of how developers spend their time. InICPC. 25–35

work page 2015
[72]

Russell Mosemann and Susan Wiedenbeck. 2001. Navigation and comprehension of programs by novice programmers. InIWPC. 79–88

work page 2001
[73]

Sebastian Nielebock, Dariusz Krolikowski, Jacob Krüger, Thomas Leich, and Frank Ortmeier. 2019. Commenting source code: is it worth it for small pro- gramming tasks?Empir. Softw. Eng.24, 3 (2019), 1418–1457

work page 2019
[74]

Orlov and Roman Bednarik

Pavel A. Orlov and Roman Bednarik. 2017. The role of extrafoveal vision in source code comprehension.Perception46, 5 (2017), 541–565

work page 2017
[75]

Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif

Kang-il Park, Jack Johnson, Cole S. Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif. 2024. An eye tracking study assessing source code readability rules for program comprehension.Empir. Softw. Eng.29, 6 (2024), 160

work page 2024
[76]

Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund

work page
[77]

Program comprehension and code complexity metrics: an fMRI study. In ICSE. 524–536

work page
[78]

Norman Peitek, Janet Siegmund, and Sven Apel. 2020. What drives the reading order of programmers? an eye tracking study. InICPC. 342–353

work page 2020
[79]

Norman Peitek, Janet Siegmund, Sven Apel, Christian Kästner, Chris Parnin, Anja Bethmann, Thomas Leich, Gunter Saake, and André Brechmann. 2018. A look into programmers’ heads.IEEE Trans. Softw. Eng. (TSE)46, 4 (2018), 442–462

work page 2018
[80]

Norman Peitek, Janet Siegmund, Chris Parnin, Sven Apel, and André Brechmann

work page

Showing first 80 references.

[1] [1]

Apache Kafka

2025. Apache Kafka. https://github.com/apache/kafka

work page 2025

[2] [2]

2025. libGDX. https://github.com/libgdx/libgdx

work page 2025

[3] [3]

Amine Abbad-Andaloussi, Thierry Sorg, and Barbara Weber. 2022. Estimating developers’ cognitive load at a fine-grained level using eye-tracking measures. InICPC. 111–121

work page 2022

[4] [4]

Youssef Abdelsalam, Norman Peitek, Annabelle Bergum, and Sven Apel. 2026. The effect of comments on program comprehension: an eye-tracking study. Empir. Softw. Eng.31, 4 (2026), 94

work page 2026

[5] [5]

Tarek Alakmeh, David Reich, Lena Jäger, and Thomas Fritz. 2024. Predicting code comprehension: a novel approach to align human gaze with code using deep neural networks.Proc. ACM Softw. Eng.1, FSE (2024), 1982–2004

work page 2024

[6] [6]

Alardawi and Agil M

Ahmed S. Alardawi and Agil M. Agil. 2015. Novice comprehension of object- oriented OO programs: an empirical study. InWCITCA. 1–4

work page 2015

[7] [7]

Aljehane, Bonita Sharif, and Jonathan I

Salwa D. Aljehane, Bonita Sharif, and Jonathan I. Maletic. 2023. Studying developer eye movements to measure cognitive workload and visual effort for expertise assessment.Proc. ACM Hum.-Comput. Interact.7, ETRA (2023), 1–18

work page 2023

[8] [8]

On the Reliability of Code Com- prehension Proxies

Anonymous. 2026. Replication package for “On the Reliability of Code Com- prehension Proxies”. https://doi.org/10.5281/zenodo.19348389. Zenodo. DOI: 10.5281/zenodo.19348389

work page doi:10.5281/zenodo.19348389 2026

[9] [9]

Dimitar Asenov, Otmar Hilliges, and Peter Müller. 2016. The effect of richer visualizations on code comprehension. InCHI. 5040–5045

work page 2016

[10] Ronald Baecker. 1988. Enhancing program readability and comprehensibility with tools for program visualization. In ICSE. 356–357

[11] Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. 2015. Are test smells really harmful? an empirical study. Empir. Softw. Eng. 20 (2015), 1052–1094

[12] Roman Bednarik, Carsten Schulte, Lea Budde, Birte Heinemann, and Hana Vrzakova. 2018. Eye-movement modeling examples in source code comprehension: a classroom study. In Koli Calling. 1–8

[13] Annabelle Bergum, Norman Peitek, Maurice Rekrut, Janet Siegmund, and Sven Apel. 2026. On the influence of the baseline in neuroimaging experiments on program comprehension. ACM Trans. Softw. Eng. Methodol. (TOSEM) 35, 3 (2026), 1–27

[14] Dave Binkley, Marcia Davis, Dawn Lawrie, Jonathan I. Maletic, Christopher Morrell, and Bonita Sharif. 2013. The impact of identifier style on effort and comprehension. Empir. Softw. Eng. 18, 2 (2013), 219–276

[15] Scott Blinman and Andy Cockburn. 2005. Program comprehension: investigating the effects of naming style and documentation. In AUIC. 73–78

[16] Jürgen Börstler and Barbara Paech. 2016. The role of method chains and comments in software readability and comprehension: an experiment. IEEE Trans. Softw. Eng. (TSE) 42, 9 (2016), 886–898

[17] Jean-Marie Burkhardt, Françoise Détienne, and Susan Wiedenbeck. 2002. Object-oriented program comprehension: effect of expertise, task, and phase. Empir. Softw. Eng. 7, 2 (2002), 115–156

[18] Raymond P. L. Buse and Westley R. Weimer. 2009. Learning a metric for code readability. IEEE Trans. Softw. Eng. (TSE) 36, 4 (2009), 546–558

[19] Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H. Paterson, Carsten Schulte, Bonita Sharif, and Sascha Tamm. 2015. Eye movements in code reading: relaxing the linear order. In ICPC. 255–265

[20] Celia Chen, Reem Alfayez, Kamonphop Srisopha, Lin Shi, and Barry Boehm. Evaluating human-assessed software maintainability metrics. In NASAC. 120–132

[22] Kyle D. Chin and Reid Holmes. 2026. Put the “code” back in “code comprehension study”. (2026)

[23] Jacob Cohen. 2013. Statistical power analysis for the behavioral sciences. Routledge

[24] Ricardo Couceiro, Raul Barbosa, João Durães, Gonçalo Duarte, João Castelhano, Catarina Duarte, Cesar Teixeira, Nuno Laranjeiro, Júlio Medeiros, and Paulo Carvalho. 2019. Spotting problematic code lines using nonintrusive programmers’ biofeedback. In ISSRE. 93–103

[25] Igor Crk, Timothy Kluthe, and Andreas Stefik. 2015. Understanding programming expertise: an empirical study of phasic brain wave changes. ACM Trans. Comput.-Hum. Interact. (TOCHI) 23, 1 (2015), 1–29

[26] Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling projects in GitHub for MSR studies. In MSR. 560–564

[27] Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer. Modeling readability to improve unit tests. In ESEC/FSE. 107–118

[29] Norman Dalkey and Olaf Helmer. 1963. An experimental application of the Delphi method to the use of experts. Manage. Sci. 9, 3 (1963), 458–467

[30] Norman C. Dalkey. 1969. The Delphi method: an experimental study of group opinion. RAND Corp., Santa Monica, CA. https://doi.org/10.7249/RM5888

[31] Nadeeshan De Silva, Martin Kellogg, and Oscar Chaparro. 2025. Relative code comprehensibility prediction. arXiv (2025). arXiv:2510.03474

[32] WPM De Silva et al. 2025. Circular economic strategies for maximising the end-of-life value of modular buildings: a Delphi study. Smart Sustain. Built Environ. (2025)

[33] Bart Du Bois, Serge Demeyer, and Jan Verelst. 2005. Does the “refactor to understand” reverse engineering pattern improve program comprehension?. In CSMR. 334–343

[34] Aruna Duraisingam, Ramaswamy Palaniappan, and Samraj Andrews. 2017. Cognitive task difficulty analysis using EEG and data mining. In ICEDSS. 52–57

[35] Yasmine Elfares, Gül Çalikli, and Mohamed Khamis. 2025. GazeCopilot: evaluating novel gaze-informed prompting for AI-supported code comprehension and readability. arXiv (2025). arXiv:2511.08177

[36] Sarah Fakhoury, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2018. The effect of poor source code lexicon and readability on developers’ cognitive load. In ICPC. 286–296

[37] Sarah Fakhoury, Devjeet Roy, Yuzhan Ma, Venera Arnaoudova, and Olusola Adesope. 2020. Measuring the impact of lexical and structural inconsistencies on developers’ cognitive load during bug localization. Empir. Softw. Eng. (ESE) 25 (2020), 2140–2178

[38] Janet Feigenspan, Christian Kästner, Jörg Liebig, Sven Apel, and Stefan Hanenberg. 2012. Measuring programming experience. In Proc. IEEE/ACM Int. Conf. Program Comprehension (ICPC). 73–82

[39] Samuel W. Flint, Robert Dyer, and Bonita Sharif. 2026. Do developers read type information? An eye-tracking study on TypeScript. arXiv (2026). arXiv:2602.04824

[40] Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 32, 200 (1937), 675–701

[41] Hao Gao, Haytham Hijazi, Júlio Medeiros, João Durães, Chan Tong Lam, Paulo de Carvalho, and Henrique Madeira. 2025. NRevisit: a cognitive behavioral metric for code understandability assessment. In Proc. Int. Conf. Evaluation Assessment Softw. Eng. (EASE). 908–918

[42] Ileana Gefaell Larrondo et al. 2026. Strengthening Primary Health Care in Europe: A Delphi study towards accessibility, equity and continuity of care. Eur. J. Gen. Pract. 32, 1 (2026), 2619226

[43] David J. Gilmore and Thomas R. G. Green. 1984. Comprehension and recall of miniature programs. Int. J. Man-Mach. Stud. 21, 1 (1984), 31–48

[44] Google. 2024. Google Java Formatter. https://github.com/google/google-java-format. Accessed: 2024-11-20

[45] Michael Hansen, Andrew Lumsdaine, and Richard L. Goldstone. 2013. An experiment on the cognitive complexity of code. In Proc. Annu. Conf. Cogn. Sci. Soc. (CogSci)

[46] Dean Hendrix, James H. Cross, and Saeed Maghsoodloo. 2002. The effectiveness of control structure diagrams in source code comprehension activities. IEEE Trans. Softw. Eng. (TSE) 28, 5 (2002), 463–477

[47] Johannes C. Hofmeister, Janet Siegmund, and Daniel V. Holt. 2019. Shorter identifier names take longer to comprehend. Empir. Softw. Eng. 24 (2019), 417–443

[48] Errol R. Iselin. 1988. Conditional statements, looping constructs, and program comprehension: an experimental study. Int. J. Man-Mach. Stud. 28, 1 (1988), 45–66

[49] Oleksandra Ishchenko et al. 2025. Barriers and opportunities for Demand Response Aggregation in Ukraine and Norway: A Delphi-based study. Energy 328 (2025), 136296

[50] Toyomi Ishida and Hidetake Uwano. 2019. Synchronized analysis of eye movement and EEG during program comprehension. In EMIP. 26–32

[51] Ahmad Jbara and Dror G. Feitelson. 2017. How programmers read regular code: a controlled experiment using eye tracking. Empir. Softw. Eng. 22 (2017), 1440–1477

[52] John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif. An empirical study assessing source code readability in comprehension. In ICSME. 513–523

[54] Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, and Yu Huang. 2024. A tale of two comprehensions? analyzing student programmer attention during code summarization. ACM Trans. Softw. Eng. Methodol. (TOSEM) 33, 7 (2024), 1–37

[55] Nadia Kasto and Jacqueline Whalley. 2013. Measuring the difficulty of code comprehension tasks using software metrics. In Proc. Australas. Comput. Educ. Conf. (ACE). 59–65

[56] Maurice G. Kendall. 1945. The treatment of ties in ranking problems. Biometrika 33, 3 (1945), 239–251

[57] Maurice G. Kendall, Sheila F. H. Kendall, and B. Babington Smith. 1939. The distribution of Spearman’s coefficient of rank correlation in a universe in which all rankings occur an equal number of times. Biometrika (1939), 251–273

[58] Dmitry Khodyakov, Sean Grant, Jack Kroger, and Melissa Bauman. 2023. RAND methodological guidance for conducting and critically appraising Delphi panels. RAND Corp., Santa Monica, CA. https://doi.org/10.7249/TLA3082-1

[59] George Kinnear, Ian Jones, and Ben Davies. 2025. Comparative judgement as a research tool: A meta-analysis of application and reliability. Behavior Research Methods 57 (2025), 222. https://doi.org/10.3758/s13428-025-02744-w

[60] Walter Kintsch. 1988. The role of knowledge in discourse comprehension: a construction-integration model. Psychol. Rev. 95, 2 (1988), 163–182

[61] Walter Kintsch and Teun A. Van Dijk. 1978. Toward a model of text comprehension and production. Psychol. Rev. 85, 5 (1978), 363–394

[62] Luigi Lavazza, Sandro Morasca, and Marco Gatto. 2023. An empirical study on software understandability and its dependence on code characteristics. Empir. Softw. Eng. 28, 6 (2023), 155

[63] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2007. Effective identifier names for comprehension and memory. Innov. Syst. Softw. Eng. 3, 4 (2007), 303–318

[64] SeolHwa Lee, Andrew Matteson, Danial Hooshyar, SongHyun Kim, JaeBum Jung, GiChun Nam, and HeuiSeok Lim. 2016. Comparing programming language comprehension between novice and expert programmers using EEG analysis. In BIBE. 350–355

[65] Danielle R Lombardi et al. 2025. The increased role of advanced technology and automation in audit: A delphi study. Int. J. Account. Inf. Syst. 56 (2025), 100733

[66] Brady D Lund. 2020. Review of the Delphi method in library and information science research. J. Doc. 76, 4 (2020), 929–960

[67] Sarah B Maness, Stacey B Griner, and Erika L Thompson. 2025. Expert Consensus on Indicators of Social Determinants of Health: A Modified Delphi Study. J. Prim. Care Community Health 16 (2025)

[68] Jean Melo, Fabricio Batista Narcizo, Dan Witzner Hansen, Claus Brabrand, and Andrzej Wasowski. 2017. Variability through the eyes of the programmer. In Proc. IEEE/ACM Int. Conf. Program Comprehension (ICPC). 34–44

[69] Richard J. Miara, Joyce A. Musselman, Juan A. Navarro, and Ben Shneiderman. Program indentation and comprehensibility. Commun. ACM 26, 11 (1983), 861–867

[71] Roberto Minelli, Andrea Mocci, and Michele Lanza. 2015. I know what you did last summer: an investigation of how developers spend their time. In ICPC. 25–35

[72] Russell Mosemann and Susan Wiedenbeck. 2001. Navigation and comprehension of programs by novice programmers. In IWPC. 79–88

[73] Sebastian Nielebock, Dariusz Krolikowski, Jacob Krüger, Thomas Leich, and Frank Ortmeier. 2019. Commenting source code: is it worth it for small programming tasks? Empir. Softw. Eng. 24, 3 (2019), 1418–1457

[74] Pavel A. Orlov and Roman Bednarik. 2017. The role of extrafoveal vision in source code comprehension. Perception 46, 5 (2017), 541–565

[75] Kang-il Park, Jack Johnson, Cole S. Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif. 2024. An eye tracking study assessing source code readability rules for program comprehension. Empir. Softw. Eng. 29, 6 (2024), 160

[76] Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund. Program comprehension and code complexity metrics: an fMRI study. In ICSE. 524–536

[78] Norman Peitek, Janet Siegmund, and Sven Apel. 2020. What drives the reading order of programmers? an eye tracking study. In ICPC. 342–353

[79] Norman Peitek, Janet Siegmund, Sven Apel, Christian Kästner, Chris Parnin, Anja Bethmann, Thomas Leich, Gunter Saake, and André Brechmann. 2018. A look into programmers’ heads. IEEE Trans. Softw. Eng. (TSE) 46, 4 (2018), 442–462

[80] Norman Peitek, Janet Siegmund, Chris Parnin, Sven Apel, and André Brechmann