Verifier Warnings Do Not Improve Comprehensibility Prediction

Martin Kellogg; Nadeeshan De Silva; Oscar Chaparro

arxiv: 2604.22653 · v1 · submitted 2026-04-24 · 💻 cs.SE

Nadeeshan De Silva, Martin Kellogg, Oscar Chaparro

Pith reviewed 2026-05-08 11:24 UTC · model grok-4.3

classification 💻 cs.SE

keywords code comprehensibility · verifier warnings · machine learning prediction · software verification · empirical study · syntactic features · prediction performance

The pith

Adding verifier warning counts does not improve machine learning models for predicting code comprehensibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the total number of warnings from a formal verifier can serve as a useful input feature to boost the accuracy of machine learning models that predict how understandable humans find code. Researchers took existing models that rely on syntactic code properties and developer information, added the verifier warning sum as an extra feature, and ran a controlled comparison. The results showed no meaningful gain in predictive performance from including the warnings. This outcome indicates that the correlation between warning counts and human comprehensibility judgments does not translate into better discrimination when models already have access to syntactic and developer data.

Core claim

We performed a control-treatment experiment incorporating the verifier warning sum feature into machine learning models from the literature, and conducted a comparative analysis of their performance against models trained only on syntactic and developer features. We found no significant difference in the prediction performance of models with and without the warnings feature. Our findings suggest that while a correlation exists, the verifier warning sum offers limited discriminative power: combining syntactic and developer features is just as effective for predicting human-judged code comprehensibility.

What carries the argument

The control-treatment experiment that adds the sum of verifier warnings as an input feature to existing ML models and measures any change in prediction accuracy compared to models using only syntactic and developer features.
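
As a rough illustration of the control-treatment design described above, the following Python sketch trains the same classifier twice on identical cross-validation folds, once on baseline features only and once with a warning-sum column added. Everything in it is an assumption made for illustration: the synthetic data, the nearest-centroid classifier, and the names `make_synthetic_rows`, `features`, and `paired_cv_accuracy` are hypothetical and are not the paper's models, datasets, or code.

```python
import random
import statistics


def make_synthetic_rows(n, seed=0):
    """Synthetic stand-in data (assumption): the warning sum is a noisy proxy
    of a syntactic metric and carries no information beyond it."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        syntactic = rng.gauss(0, 1)   # hypothetical syntactic metric
        developer = rng.gauss(0, 1)   # hypothetical developer feature
        warning_sum = syntactic + rng.gauss(0, 1)
        label = int(syntactic + 0.5 * developer + rng.gauss(0, 1) > 0)
        rows.append({"base": [syntactic, developer], "warn": warning_sum, "y": label})
    return rows


def features(row, with_warnings):
    """Control arm: baseline features. Treatment arm: baseline + warning sum."""
    return row["base"] + ([row["warn"]] if with_warnings else [])


def centroid_fit(X, y):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    centroids = {}
    for cls in set(y):
        members = [x for x, label in zip(X, y) if label == cls]
        centroids[cls] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


def paired_cv_accuracy(rows, k=5, seed=0):
    """Return per-fold accuracy for both arms, evaluated on the same folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    control, treatment = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        for scores, with_warnings in ((control, False), (treatment, True)):
            model = centroid_fit(
                [features(rows[i], with_warnings) for i in train],
                [rows[i]["y"] for i in train],
            )
            hits = sum(
                centroid_predict(model, features(rows[i], with_warnings)) == rows[i]["y"]
                for i in held_out
            )
            scores.append(hits / len(held_out))
    return control, treatment


rows = make_synthetic_rows(400)
control_scores, treatment_scores = paired_cv_accuracy(rows)
print("control mean accuracy:  ", statistics.fmean(control_scores))
print("treatment mean accuracy:", statistics.fmean(treatment_scores))
```

Holding the folds fixed across both arms makes the per-fold scores paired, so a difference between arms reflects the added feature rather than a different split. The numbers this sketch prints say nothing about the paper's result; only the protocol is the point.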

If this is right

Machine learning models for code comprehensibility can rely on syntactic and developer features alone without loss of predictive power.
The total count of verifier warnings does not supply enough unique information to justify its inclusion in comprehensibility predictors.
Empirical studies can treat verifier warning sums as optional rather than required when building or evaluating such models.
Correlation between warnings and comprehensibility does not guarantee that the warnings will improve downstream predictive tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Verifier tools might need to produce different kinds of warnings or summaries if their outputs are to help with human-focused code quality predictions.
The limited value of warning sums could stem from overlap with existing syntactic metrics, suggesting future work to test alternative aggregations of verifier output.
This result may affect how software teams decide whether to run formal verifiers primarily for comprehensibility assessment versus other goals.
Replication on code from different domains or languages could reveal whether the finding holds beyond the datasets used here.

Load-bearing premise

The machine learning models and datasets drawn from prior literature are appropriate and sufficient to detect any real contribution from the verifier warning feature if one exists.

What would settle it

Re-running the same models on a new dataset where the version that includes verifier warning sums achieves statistically significantly higher accuracy than the version without them.

Figures

Figures reproduced from arXiv: 2604.22653 by Martin Kellogg, Nadeeshan De Silva, Oscar Chaparro.

**Figure 1.** Overview of our methodology for evaluating the impact of verifier warnings on code comprehensibility prediction.

Original abstract

Proponents of software verification suggest that code simplicity is linked to the effort to verify code, hypothesizing that formal verifiers produce fewer false positive warnings and require less manual intervention when analyzing simpler code. A recent meta-analysis study found empirical support for this hypothesis: a small correlation between the sum of verifier warnings and human-derived code comprehensibility metrics. Based on this finding, we conjectured that using the sum of verifier tool (verifier) warnings to represent program semantic information as an input feature to machine learning (ML) models for code comprehensibility prediction can enhance their performance, when combined with traditional syntactic and developer features. To test this conjecture, we performed a control-treatment experiment incorporating the verifier warning sum feature into machine learning models from the literature, and conducted a comparative analysis of their performance against models trained only on syntactic and developer features. We found no significant difference in the prediction performance of models with and without the warnings feature. Our findings suggest that while a correlation exists, the verifier warning sum offers limited discriminative power: combining syntactic and developer features is just as effective for predicting human-judged code comprehensibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper runs a direct test of whether verifier warning counts improve ML models for code comprehensibility and reports no gain, which is a narrow but useful negative result that still needs power analysis to land cleanly.

The core finding is that adding the sum of verifier warnings as a feature produces no detectable improvement in predicting human-judged code comprehensibility over models that already use syntactic and developer features. The authors start from a meta-analysis correlation, turn it into a concrete conjecture about ML performance, and run the control-treatment comparison to check it. That step is the paper's real contribution: it moves from correlation to a falsifiable prediction test on an established task instead of just adding another feature study. Reporting the null is also worthwhile because it can stop people from chasing this particular signal in future modeling work. The experiment itself is straightforward and stays close to prior literature setups, which keeps the comparison fair.

The main soft spot is exactly the one the stress-test note flags. The abstract and available details give no effect sizes, no confidence intervals around the performance delta, no power calculation, and no clear statement of dataset size or variance. A small correlation from the meta-analysis could easily produce a lift too weak for this setup to catch, so the null result is harder to interpret without those numbers. If the full paper supplies them and shows the study was powered for the expected effect, the claim strengthens; otherwise it stays at the level of 'we did not observe a difference here.'

This work is aimed at researchers doing feature selection for code quality ML models. It will mainly interest people already working on comprehensibility prediction who want to know which signals are worth including. It is not a broad methodological advance, but the targeted negative result is the kind of incremental check that keeps the literature honest.

I would send it to peer review. The question is well-posed and the design is simple enough that referees can focus on tightening the statistical reporting and confirming the methods details rather than rethinking the whole approach.
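
The power concern raised in the letter can be explored with a simulation. The following is a minimal sketch under stated assumptions: the true performance gain (`true_delta`), the spread of paired differences (`delta_sd`), and the numbers of paired evaluations are invented for illustration, and the significance test is a two-sided z-test using a normal approximation, which is somewhat optimistic for small samples. None of these values come from the paper.

```python
import random
import statistics


def detects_gain(deltas, alpha=0.05):
    """Two-sided z-test (normal approximation) on the mean paired difference."""
    se = statistics.stdev(deltas) / len(deltas) ** 0.5
    if se == 0:
        return False
    z = statistics.fmean(deltas) / se
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return p_value < alpha


def simulated_power(true_delta, delta_sd, n_pairs, trials=1000, seed=0):
    """Fraction of simulated experiments in which the gain is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One simulated experiment: n_pairs paired score differences
        # (treatment minus control), drawn from an assumed normal model.
        deltas = [rng.gauss(true_delta, delta_sd) for _ in range(n_pairs)]
        hits += detects_gain(deltas)
    return hits / trials


# Hypothetical scenario: a true lift of 0.005 with a per-pair SD of 0.03.
low_n = simulated_power(true_delta=0.005, delta_sd=0.03, n_pairs=10)
high_n = simulated_power(true_delta=0.005, delta_sd=0.03, n_pairs=500)
print("estimated power with 10 pairs: ", low_n)
print("estimated power with 500 pairs:", high_n)
```

Under these made-up numbers, a small true lift is rarely detected with a handful of paired evaluations and almost always detected with several hundred, which is why a null result is hard to read without knowing the sample size and variance.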

Referee Report

2 major / 1 minor

Summary. The paper reports a control-treatment experiment testing whether adding the sum of verifier warnings as an input feature improves the performance of machine learning models (drawn from prior literature) for predicting human-judged code comprehensibility. The authors find no significant difference in predictive performance between models using only syntactic and developer features versus those that also include the verifier-warning-sum feature, and conclude that the warning sum offers limited discriminative power beyond the baseline features.

Significance. A robust null result would indicate that the small correlation between verifier warnings and comprehensibility metrics identified in prior meta-analyses does not yield practically useful gains in ML prediction tasks. This would help bound the utility of verification artifacts for comprehensibility modeling and reinforce that syntactic plus developer features are sufficient, potentially guiding future work away from incorporating verifier outputs in this domain.

major comments (2)

[Methods] Methods section: The manuscript provides no information on dataset size, number of code samples, how the data were split for training/testing, or the specific ML models and hyperparameters employed. These omissions prevent evaluation of whether the experiment was adequately powered to detect a small performance lift consistent with the meta-analytic correlation cited in the introduction.
[Results] Results section: The claim of 'no significant difference' is reported without effect sizes, confidence intervals on the performance delta, or a power analysis. Given that the motivating meta-analysis reports only a small correlation, the absence of these statistics leaves open the possibility of a type-II error and undermines the stronger conclusion that the verifier warning sum 'offers limited discriminative power.'

minor comments (1)

[Abstract] The abstract and introduction refer to 'models from the literature' without naming the specific models or citing the exact prior papers; adding these references would improve reproducibility and context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important omissions in the reporting of our experimental design and results. We agree that additional details are needed to allow proper evaluation of statistical power and the strength of the null finding. We will revise the manuscript to address both points.

Referee: [Methods] Methods section: The manuscript provides no information on dataset size, number of code samples, how the data were split for training/testing, or the specific ML models and hyperparameters employed. These omissions prevent evaluation of whether the experiment was adequately powered to detect a small performance lift consistent with the meta-analytic correlation cited in the introduction.

Authors: We agree that these methodological details are essential for reproducibility and for assessing whether the study was adequately powered. In the revised manuscript we will expand the Methods section to report the total number of code samples, the exact train/test split procedure and ratios, the specific machine-learning models used, and all hyperparameter values. We will also add an a-priori or post-hoc power analysis that references the small correlation size reported in the motivating meta-analysis. revision: yes

Referee: [Results] Results section: The claim of 'no significant difference' is reported without effect sizes, confidence intervals on the performance delta, or a power analysis. Given that the motivating meta-analysis reports only a small correlation, the absence of these statistics leaves open the possibility of a type-II error and undermines the stronger conclusion that the verifier warning sum 'offers limited discriminative power.'

Authors: We concur that effect sizes, confidence intervals around the performance difference, and a power analysis should be reported to support the interpretation of the null result. The revised Results section will include these quantities (e.g., Cohen’s d or AUC differences with 95% CIs) together with the power calculation. This will allow readers to judge both statistical and practical significance and will temper the language of the conclusion if the power analysis indicates the study may have been under-powered for a small effect. revision: yes
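
The quantities named in this response can be computed from paired per-fold scores. The sketch below is one possible way to do so, not the authors' procedure: it reports a paired-samples variant of Cohen's d (mean difference divided by the standard deviation of the differences) and a percentile bootstrap 95% confidence interval on the mean AUC difference. The AUC values are invented placeholders, and the function names `cohens_d_paired` and `bootstrap_ci` are hypothetical.

```python
import random
import statistics


def paired_deltas(control, treatment):
    """Per-fold differences: treatment score minus control score."""
    return [t - c for c, t in zip(control, treatment)]


def cohens_d_paired(control, treatment):
    """Paired-samples effect size: mean difference / SD of the differences."""
    deltas = paired_deltas(control, treatment)
    return statistics.fmean(deltas) / statistics.stdev(deltas)


def bootstrap_ci(control, treatment, level=0.95, resamples=5000, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    deltas = paired_deltas(control, treatment)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(resamples)
    )
    lower = means[int((1 - level) / 2 * resamples)]
    upper = means[int((1 + level) / 2 * resamples) - 1]
    return lower, upper


# Placeholder per-fold AUC scores (invented for illustration only).
control_auc = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.67, 0.70, 0.71]
treatment_auc = [0.72, 0.67, 0.74, 0.71, 0.68, 0.73, 0.73, 0.66, 0.70, 0.72]

effect_size = cohens_d_paired(control_auc, treatment_auc)
ci_low, ci_high = bootstrap_ci(control_auc, treatment_auc)
print("paired Cohen's d:", round(effect_size, 3))
print("95% CI on mean AUC difference:", (round(ci_low, 4), round(ci_high, 4)))
```

An interval that is narrow and centered near zero supports a claim of negligible gain far better than a non-significant p-value alone; a wide interval that includes zero only says the data cannot distinguish a small gain from none.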

Circularity Check

0 steps flagged

Empirical comparison of ML models with/without verifier warnings feature

The paper performs a control-treatment experiment: it takes existing ML models and datasets from the literature, adds the verifier-warning-sum feature as an additional input, and reports that performance metrics show no statistically significant difference versus the syntactic+developer baseline. This is a direct empirical measurement, not a derivation, equation, or fitted parameter that reduces to its own inputs by construction. The meta-analysis citation supplies the motivating hypothesis but is not invoked as a uniqueness theorem or ansatz that forces the result; the experiment tests and rejects the performance implication. No self-citation chain, self-definitional loop, or renaming of a known result is present in the load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is a straightforward empirical test that relies on standard machine learning evaluation practices and statistical comparison methods already established in the literature.

axioms (1)

[standard math] Standard assumptions of statistical significance testing for model performance comparison. Used to conclude there is no significant difference between the two model sets.

pith-pipeline@v0.9.0 · 5491 in / 1074 out tokens · 43295 ms · 2026-05-08T11:24:55.595232+00:00 · methodology

[1] [1]

K Fold Cross Validation

2024. K Fold Cross Validation. https://scikit-learn.org/stable/modules/generated/ sklearn.model_selection.KFold.html/

work page 2024

[2] [2]

Java Language Specification

2026. Java Language Specification. https://docs.oracle.com/javase/specs/jls/se8/ html/index.html

work page 2026

[3] [3]

SpotBugs

2026. SpotBugs. https://spotbugs.github.io/

work page 2026

[4] [4]

Amine Abbad-Andaloussi, Thierry Sorg, and Barbara Weber. 2022. Estimating Developers’ Cognitive Load at a Fine-grained Level Using Eye-tracking Measures. InIntl. Conf. on Prog. Compr. (ICPC). 111–121

work page 2022

[5] [5]

Herve Abdi, Lynne J Williams, et al . 2010. Normalizing data.Encyclopedia of research design1 (2010), 935–938

work page 2010

[6] Shulamyt Ajami, Yonatan Woodbridge, and Dror G. Feitelson. 2019. Syntax, predicates, idioms — what really affects code complexity? Emp. Soft. Eng. 24, 1 (2019), 287–328

[7] Vard Antinyan. 2020. Evaluating Essential and Accidental Code Complexity Triggers by Practitioners’ Perception. IEEE Soft. 37, 6 (2020), 86–93

[8] Vard Antinyan, Miroslaw Staron, and Anna Sandberg. 2017. Evaluating code complexity triggers, use of complexity measures and the influence of code complexity on maintenance time. Emp. Soft. Eng. 22, 6 (2017), 3057–3087

[9] Dave Binkley, Marcia Davis, Dawn Lawrie, Jonathan I. Maletic, Christopher Morrell, and Bonita Sharif. 2013. The impact of identifier style on effort and comprehension. Emp. Soft. Eng. 18, 2 (2013), 219–276

[10] Jürgen Börstler, Kwabena E Bennin, Sara Hooshangi, Johan Jeuring, Hieke Keuning, Carsten Kleiner, Bonnie MacKellar, Rodrigo Duran, Harald Störrle, Daniel Toll, et al. 2023. Developers talking about code quality. Empirical Software Engineering 28, 6 (2023), 128

[11] Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32

[12] Frederick Brooks and H Kugler. 1987. No silver bullet. April

[13] Raymond Buse and Westley Weimer. 2009. Learning a metric for code readability. Trans. on Soft. Eng. (TSE) 36, 4 (2009), 546–558

[14] Cristiano Calcagno, Dino Distefano, Jérémy Dubreil, Dominik Gabi, Pieter Hooimeijer, Martino Luca, Peter O’Hearn, Irene Papakonstantinou, Jim Purbrick, and Dulma Rodriguez. 2015. Moving fast with software verification. In NASA Formal Methods Symp. Springer, 3–11

[15] Cristiano Calcagno, Dino Distefano, Peter O’Hearn, and Hongseok Yang. 2009. Compositional shape analysis by means of bi-abduction. In Principles of Programming Languages (POPL). 289–300

[16] Gavin C Cawley and Nicola LC Talbot. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research 11 (2010), 2079–2107

[17] S le Cessie and JC Van Houwelingen. 1992. Ridge estimators in logistic regression. Journal of the Royal Statistical Society Series C: Applied Statistics 41, 1 (1992), 191–201

[18] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357

[20] Thomas M Cover. 1999. Elements of information theory. John Wiley & Sons

[21] Carlos Dantas, Adriano Rocha, and Marcelo Maia. 2023. Assessing the readability of chatgpt code snippet recommendations: A comparative study. In Proceedings of the XXXVII Brazilian Symposium on Software Engineering. 283–292

[22] Nadeeshan De Silva, Martin Kellogg, and Oscar Chaparro. 2025. Relative Code Comprehensibility Prediction. arXiv preprint arXiv:2510.03474 (2025)

[23] Nadeeshan De Silva, Martin Kellogg, and Oscar Chaparro. 2026. Online replication package. https://github.com/sea-lab-wm/warning-comprehensibility

[24] Pablo Del Moral, Sławomir Nowaczyk, and Sepideh Pashami. 2022. Why is multiclass classification hard? IEEE Access 10 (2022), 80448–80462

[25] Jonathan Dorn. 2012. A general software readability model. MCS Thesis available from (http://www.cs.virginia.edu/weimer/students/dorn-mcs-paper.pdf) 5 (2012), 11–14

[26] Stephen G Eick, Todd L Graves, Alan F Karr, J Steve Marron, and Audris Mockus. Does code decay? assessing the evidence from change management data. IEEE transactions on software engineering 27, 1 (2002), 1–12

[28] Janet Feigenspan, Sven Apel, Jorg Liebig, and Christian Kastner. 2011. Exploring Software Measures to Assess Program Comprehension. In Intl. Symp. on Emp. Soft. Eng. and Meas. (ESEM). 127–136

[29] Dror G Feitelson. 2021. Considerations and pitfalls in controlled experiments on code comprehension. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 106–117

[30] Kobi Feldman, Martin Kellogg, and Oscar Chaparro. 2023. On the Relationship between Code Verifiability and Understandability. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 211–223

[31] Rudolf Flesch. 1979. How to write plain English: A book for lawyers and consumers. Vol. 76026225. Harper & Row New York

[32] George Forman and Martin Scholz. 2010. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. Acm Sigkdd Explorations Newsletter 12, 1 (2010), 49–57

[33] Thomas Fritz, Andrew Begel, Sebastian C. Müller, Serap Yigit-Elliott, and Manuela Züger. 2014. Using psycho-physiological measures to assess task difficulty in software development. In Intl. Conf. on Soft. Eng. (ICSE). 402–413

[34] Davide Fucci, Daniela Girardi, Nicole Novielli, Luigi Quaranta, and Filippo Lanubile. 2019. A Replication Study on Code Comprehension and Expertise using Lightweight Biometric Sensors. In Intl. Conf. on Prog. Compr. (ICPC). 311–322

[35] Amy GrabNGoInfo. 2022. Support Vector Machine (SVM) Hyperparameter Tuning In Python. https://medium.com/grabngoinfo/support-vector-machine-svm-hyperparameter-tuning-in-python-a65586289bcb/

[36] Maurice H. Halstead. 1977. Elements of Soft. Science. Elsevier

[37] James A Hanley and Barbara J McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 1 (1982), 29–36

[38] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their applications 13, 4 (1998), 18–28

[39] Mohammad Hossin and Md Nasir Sulaiman. 2015. A review on evaluation metrics for data classification evaluations. International journal of data mining & knowledge management process 5, 2 (2015), 1

[40] Ahmad Jbara and Dror G. Feitelson. 2017. How programmers read regular code: a controlled experiment using eye tracking. Emp. Soft. Eng. 22, 3 (2017), 1440–1477

[41] Cem Kaner and Walter P. Bond. 2004. Software Engineering Metrics: What Do They Measure and How Do We Know?. In Intl. Soft. Metrics Symp. (METRICS)

[42] Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, and Yu Huang. 2024. A tale of two comprehensions? analyzing student programmer attention during code summarization. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–37

[43] Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1-2 (1938), 81–93

[44] Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81–93

[45] Amy J Ko and Brad A Myers. 2005. A framework and methodology for studying the causes of software errors in programming systems. Journal of Visual Languages & Computing 16, 1-2 (2005), 41–84

[46] Luigi Lavazza, Abedallah Zaid Abualkishik, Geng Liu, and Sandro Morasca. 2023. An empirical evaluation of the “Cognitive Complexity” measure as a predictor of code understandability. Journal of Systems and Software 197 (2023), 111561

[47] Luigi Lavazza, Sandro Morasca, and Marco Gatto. 2023. An empirical study on software understandability and its dependence on code characteristics. Empirical Software Engineering 28, 6 (2023), 155

[48] Gary T Leavens, Albert L Baker, and Clyde Ruby. 1998. JML: a Java modeling language. In Formal Underpinnings of Java Workshop (at OOPSLA 1998). Citeseer, 404–420

[49] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017)

[50] Walid Maalej, Rebecca Tiarks, Tobias Roehm, and Rainer Koschke. 2014. On the Comprehension of Program Comprehension. Trans. on Soft. Eng. and Methodology (TSEM) 23, 4 (2014), 1–37

[51] T.J. McCabe. 1976. A Complexity Measure. Trans. on Soft. Eng. (TSE) SE-2, 4 (1976), 308–320

[52] Patrick E McKnight and Julius Najab. 2010. Mann-Whitney U Test. The Corsini encyclopedia of psychology (2010), 1–1

[53] Qing Mi, Yiqun Hao, Liwei Ou, and Wei Ma. 2022. Towards using visual, semantic and structural features to improve code readability classification. Journal of Systems and Software 193 (2022), 111454

[54] Roberto Minelli, Andrea Mocci, and Michele Lanza. 2015. I Know What You Did Last Summer - An Investigation of How Developers Spend Their Time. In Intl. Conf. on Prog. Compr. (ICPC). 25–35

[55] João Mota, Marco Giunti, and António Ravara. 2021. Java typestate checker. In Intl. Conf. on Coord. Lang. and Models. Springer, 121–133

[56] Gireen Naidu, Tranos Zuva, and Elias Mmbongeni Sibanda. 2023. A review of evaluation metrics in machine learning algorithms. In Computer science on-line conference. Springer, 15–25

[57] Peter O’Hearn, John Reynolds, and Hongseok Yang. 2001. Local reasoning about programs that alter data structures. In Intl. Workshop on Computer Science Logic. Springer, 1–19

[58] Paul Oman and Jack Hagemeister. 1992. Metrics for assessing a software system’s maintainability. In Proceedings Conference on Software Maintenance 1992. IEEE, 337–344

[59] Matthew M Papi, Mahmood Ali, Telmo Luis Correa Jr, Jeff H Perkins, and Michael D Ernst. 2008. Practical pluggable types for Java. In Proceedings of the 2008 international symposium on Software testing and analysis. 201–212

[60] Kang-il Park, Jack Johnson, Cole S Peterson, Nishitha Yedla, Isaac Baysinger, Jairo Aponte, and Bonita Sharif. 2024. An eye tracking study assessing source code readability rules for program comprehension. Empirical Software Engineering 29, 6 (2024), 160

EASE 2026, 9–12 June, 2026, Glasgow, Scotland, United Kingdom Nadeeshan De Silva, Martin Kellogg, and O...

[61] Abhi Patel, Kazi Zakia Sultana, and Bharath K. Samanthula. 2024. A Comparative Analysis between AI Generated Code and Human Written Code: A Preliminary Study. In 2024 IEEE International Conference on Big Data (BigData). 7521–7529. https://doi.org/10.1109/BigData62323.2024.10825958

[62] Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund. Program comprehension and code complexity metrics: An fMRI study. In Intl. Conf. on Soft. Eng. (ICSE). 524–536

[64] Norman Peitek, Janet Siegmund, and Sven Apel. 2020. What Drives the Reading Order of Programmers? An Eye Tracking Study. In Intl. Conf. on Prog. Compr. (ICPC). 342–353

[65] Norman Peitek, Janet Siegmund, Sven Apel, Christian Kästner, Chris Parnin, Anja Bethmann, Thomas Leich, Gunter Saake, and André Brechmann. 2018. A look into programmers’ heads. Trans. on Soft. Eng. (TSE) 46, 4 (2018), 442–462

[66] Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883

[67] Daryl Posnett, Abram Hindle, and Premkumar Devanbu. 2011. A simpler model of software readability. In Proceedings of the 8th working conference on mining software repositories. 73–82

[68] Daryl Posnett, Abram Hindle, and Premkumar Devanbu. 2021. Reflections on: A Simpler Model of Software Readability. ACM SIGSOFT Soft. Eng. Notes 46, 3 (2021), 30–32

[69] Hassan Ramchoun, Youssef Ghanou, Mohamed Ettaouil, and Mohammed Amine Janati Idrissi. 2016. Multilayer perceptron: Architecture optimization and training. (2016)

[70] Steven J Rigatti. 2017. Random forest. Journal of Insurance Medicine 47, 1 (2017), 31–39

[71] Simone Scalabrino, Gabriele Bavota, Christopher Vendome, Mario Linares-Vasquez, Denys Poshyvanyk, and Rocco Oliveto. 2019. Automatically assessing code understandability. Trans. on Soft. Eng. (TSE) 47, 3 (2019), 595–613

[72] Simone Scalabrino, Mario Linares-Vásquez, Rocco Oliveto, and Denys Poshyvanyk. 2018. A comprehensive model for code readability. Journal of Software: Evolution and Process 30, 6 (2018), e1958

[73] Simone Scalabrino, Mario Linares-Vasquez, Denys Poshyvanyk, and Rocco Oliveto. 2016. Improving code readability models with textual features. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). IEEE, 1–10

[74] Agnia Sergeyuk, Olga Lvova, Sergey Titov, Anastasiia Serova, Farid Bagirov, Evgeniia Kirillova, and Timofey Bryksin. 2024. Reassessing java code readability models with a human-centered approach. In Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension. 225–235

[75] Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423

[76] Janet Siegmund. 2016. Program Comprehension: Past, Present, and Future. In Intl. Conf. on Soft. Analysis, Evolution, and ReEng. (SANER), Vol. 5. 13–20

[77] Janet Siegmund, Christian Kästner, Sven Apel, Chris Parnin, Anja Bethmann, Thomas Leich, Gunter Saake, and André Brechmann. 2014. Understanding understanding source code with functional magnetic resonance imaging. In Intl. Conf. on Soft. Eng. (ICSE). 378–389

[78] Dag IK Sjøberg, Jo Erskine Hannay, Ove Hansen, Vigdis By Kampenes, Amela Karahasanovic, N-K Liborg, and Anette C Rekdal. 2005. A survey of controlled experiments in software engineering. IEEE transactions on software engineering 31, 9 (2005), 733–753

[79] Ryo SOGA, Takatomi KUBO, Takashi ISHIO, Yuna NUNOMURA, Takahiro KINOSHITA, Hideyuki KANUKA, and Kenichi MATSUMOTO. 2025. Your heart foretells your performance: Analysis of pre-task heart rate in program comprehension tasks. IEICE Transactions on Information and Systems (2025)

[80] Wayne P. Stevens, Glenford J. Myers, and Larry L. Constantine. 1974. Structured design. IBM systems journal 13, 2 (1974), 115–139