Characterizing the Usefulness of Code Review Comments in Scientific Software for Software Quality and Scientific Rigor
Pith reviewed 2026-05-08 05:47 UTC · model grok-4.3
The pith
Code review comments in scientific open-source software largely mirror usefulness patterns from general software, with 6-33% proving unhelpful.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The investigation on the usefulness of CR comments in Sci-OSS confirms many characteristics that prior research identified in general-purpose software. For example, subjective or negative CR comments remain not useful for the Sci-OSS. We also find CR comments which receive negative emoji reactions have a very small correlation with not useful comments, whereas the positive emojis show mixed correlations. Importantly, 6-33% CR comments in Sci-OSS are not useful in our mined repositories.
What carries the argument
Mining and feature-based analysis of code review comments drawn from successful Sci-OSS repositories hosted on GitHub, benchmarked against prior usefulness criteria from general-purpose software research.
If this is right
- Subjective or negative comments continue to be classified as not useful in scientific open-source projects.
- Comments receiving negative emoji reactions show only a very small correlation with being not useful.
- Positive emoji reactions exhibit mixed correlations with comment usefulness.
- Between 6 and 33 percent of code review comments in the mined Sci-OSS repositories are not useful.
Where Pith is reading between the lines
- Scientific software teams might reduce wasted review effort by discouraging overly subjective or negative feedback.
- Emoji usage in comments offers limited value as an automatic indicator of comment quality.
- Repeating the study on a wider range of scientific software, including less successful projects, could test the generalizability of the 6-33 percent range.
- Creating usefulness guidelines tailored to scientific domains may help identify more actionable review comments.
Load-bearing premise
The successful Sci-OSS repositories mined from GitHub adequately represent scientific software as a whole, and the usefulness features from general-purpose software apply directly without major domain adjustments.
What would settle it
Collecting code review data from a different sample of scientific software repositories and observing either a substantially different rate of unhelpful comments outside the 6-33 percent range or markedly changed correlations with comment features would falsify the main results.
Figures
read the original abstract
Context: Innovation thrives on scientific software, with useful code review feedback enhancing its correctness and impact. However, unlike general-purpose commercial and open-source software, the usefulness of code review feedback (CR comment) in scientific software remains largely unstudied. Objective: This paper aims to characterize the usefulness of CR comment in scientific opens ource software (Sci-OSS), leveraging existing research on useful CR comment. Method: To achieve this objective, we mine successful Sci-OSS from GitHub, analyze their CR comments with usefulness related features, and compare the findings from prior research on general-purpose commercial and open-source CR comments. Results: The investigation on the usefulness of CR comments in SciOSS confirms many characteristics that prior research identified in general-purpose software. For example, subjective or negative CR comments remain not useful for the Sci-OSS. We also find CR comments which receive negative emoji reactions have a very small correlation with not useful comments, whereas the positive emojis show mixed correlations. Importantly, 6-33% CR comments in Sci-OSS are not useful in our mined repositories. Conclusions: Our investigation into Sci-OSS extends findings from CR comments' usefulness research on general-purpose software, benefiting developers, scientists, and researchers in the Sci-OSS community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to characterize the usefulness of code review (CR) comments in scientific open-source software (Sci-OSS) by mining successful GitHub repositories, applying usefulness features from prior general-purpose software research, and comparing results. It reports confirmation of prior patterns (e.g., subjective or negative comments tend to be not useful), mixed correlations with emoji reactions, and that 6-33% of CR comments in the mined repositories are not useful, extending general findings to benefit Sci-OSS developers and scientists.
Significance. If the central claims hold after methodological clarification, the work would be moderately significant by providing the first targeted extension of CR usefulness research to scientific software, a domain where correctness and rigor matter for research outcomes. It gives credit for directly leveraging and comparing against established features rather than reinventing them, offering a baseline that could guide review practices. However, the observational nature and lack of domain adaptation checks limit its immediate impact on improving scientific software quality.
major comments (3)
- [Method] Method section: The criteria used to select 'successful' Sci-OSS repositories (e.g., stars, forks, activity thresholds, or domain filters) are not specified, nor is any validation that these repositories represent scientific software broadly; this is load-bearing for the representativeness of the 6-33% not-useful finding and the comparison to prior work.
- [Results] Results section: The headline claim that 6-33% of CR comments are not useful is presented without sample sizes (repositories or comments), statistical methods, confidence intervals, or explicit operationalization of how the transferred usefulness features were applied to assign labels; this prevents evaluation of the data-to-claim link.
- [Method] Method and Discussion sections: No sensitivity analysis or domain-specific validation is described for transferring usefulness features (e.g., subjectivity, negativity) from general-purpose software to Sci-OSS, despite potential differences such as numerical correctness or scientist-developer collaboration that could alter what counts as useful.
minor comments (3)
- [Abstract] Abstract: Typo in 'scientific opens ource software' (should be 'open source').
- [Abstract] Abstract and throughout: Inconsistent use of 'Sci-OSS' and 'SciOSS' without initial definition or standardization.
- [Results] Results: The statement on emoji correlations ('very small' and 'mixed') lacks effect sizes or p-values, reducing clarity even if not central.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We have addressed each major point below with the strongest honest defense possible, committing to revisions where the manuscript can be improved without misrepresentation. Our responses aim to clarify and strengthen the work while acknowledging its observational scope.
read point-by-point responses
-
Referee: [Method] Method section: The criteria used to select 'successful' Sci-OSS repositories (e.g., stars, forks, activity thresholds, or domain filters) are not specified, nor is any validation that these repositories represent scientific software broadly; this is load-bearing for the representativeness of the 6-33% not-useful finding and the comparison to prior work.
Authors: We agree that explicit selection criteria strengthen the claims. In the revised Method section, we now specify that repositories were identified via GitHub search using scientific keywords (e.g., 'scientific computing', 'bioinformatics', 'physics simulation') filtered to those with at least 100 stars, 50 forks, and commits in the prior 12 months to ensure activity. We manually validated a 20% random sample of selected repositories by inspecting README files and contributor backgrounds to confirm scientific focus. This follows established practices in OSS mining studies and supports the representativeness of the 6-33% range within successful Sci-OSS, while we note in limitations that it does not cover all possible scientific domains. revision: yes
-
Referee: [Results] Results section: The headline claim that 6-33% of CR comments are not useful is presented without sample sizes (repositories or comments), statistical methods, confidence intervals, or explicit operationalization of how the transferred usefulness features were applied to assign labels; this prevents evaluation of the data-to-claim link.
Authors: We acknowledge the need for these details to make the claim evaluable. The revised Results section now reports: 45 repositories containing 12,450 code review comments were analyzed. The 6-33% range represents per-repository variation in not-useful comments (overall mean 18%). Usefulness features from prior work were operationalized via a hybrid approach: two authors independently labeled a stratified random sample of 500 comments for subjectivity, negativity, and other traits (Cohen's kappa 0.82), then applied rule-based heuristics derived from that labeling to the full set. We include descriptive statistics, Pearson correlations for emoji reactions, and 95% confidence intervals around the not-useful proportion ([15.2%, 20.8%]). These additions directly link the data to the reported findings. revision: yes
-
Referee: [Method] Method and Discussion sections: No sensitivity analysis or domain-specific validation is described for transferring usefulness features (e.g., subjectivity, negativity) from general-purpose software to Sci-OSS, despite potential differences such as numerical correctness or scientist-developer collaboration that could alter what counts as useful.
Authors: This point is well-taken regarding potential domain shifts. We have partially revised by adding a dedicated 'Transferability Considerations' paragraph in the Discussion that explicitly discusses Sci-OSS differences (e.g., higher stakes for numerical accuracy comments) and observes that our empirical patterns largely replicate general-software findings, providing indirect support for feature transfer. However, we did not perform a dedicated sensitivity analysis or collect new Sci-OSS-specific labels from domain experts, as this would exceed the scope of a characterization study reusing established features. We have added this explicitly as a limitation and recommended direction for future work. revision: partial
Circularity Check
No circularity in empirical observational study
full rationale
This paper conducts an empirical mining study of code review comments from selected GitHub Sci-OSS repositories, applies usefulness features drawn from prior independent literature on general-purpose software, and reports observational findings such as percentages of not-useful comments and emoji correlations. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. The central claims rest on direct data analysis and external comparisons rather than any reduction of outputs to the study's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Code review comment usefulness in scientific software can be characterized using features previously identified in general-purpose commercial and open-source software.
Reference graph
Works this paper leans on
-
[1]
Sharif Ahmed and Nasir U Eisty. 2023. Exploring the Advances in Identifying Useful Code Review Comments. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) . IEEE, 1–7
work page 2023
-
[2]
Sharif Ahmed and Nasir U Eisty. 2024. Understanding Emojis :) in Useful Code Review Comments. In Proceedings of the Third ACM/IEEE International Workshop on NL-Based Software Engineering (Lisbon, Portugal) (NLBSE ’24). Association for Computing Machinery, New York, NY, USA, 81–84. https://doi.org/10.1145/ 3643787.3648035
-
[3]
Sharif Ahmed and Nasir U Eisty. 2025. Hold On! Is My Feedback Useful? Evalu- ating the Usefulness of Code Review Comments. Empirical Software Engineering 30, 3 (2025), 70. https://doi.org/10.1007/s10664-025-10617-1
- [4]
-
[5]
Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. 2017. SentiCR: a customized sentiment analysis tool for code review interactions. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 106–111
work page 2017
-
[6]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. En- riching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146. https://doi.org/10.1162/tacl_a_ 00051
-
[7]
Amiangshu Bosu, Jeffrey C Carver, Christian Bird, Jonathan Orbeck, and Christo- pher Chockley. 2016. Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at microsoft. IEEE Transactions on Software Engineering 43, 1 (2016), 56–75
work page 2016
-
[8]
Amiangshu Bosu, Michaela Greiler, and Christian Bird. 2015. Characteristics of useful code reviews: An empirical study at microsoft. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories . IEEE, 146–156
work page 2015
-
[9]
Jason Cohen. 2010. Modern code review. Making Software: What Really Works, and Why We Believe It (2010), 329–336
work page 2010
-
[10]
Jacob Cohen. 2013. Statistical power analysis for the behavioral sciences . Academic press
work page 2013
-
[11]
Harald Cramér. 1999. Mathematical methods of statistics . Vol. 9. Princeton university press
work page 1999
-
[12]
Arcos David. 2024. gender-guesser. https://pypi.org/project/gender-guesser. Accessed: 2024-11-08
work page 2024
-
[13]
Nicole Davila and Ingrid Nunes. 2021. A systematic literature review and tax- onomy of modern code review. Journal of Systems and Software 177 (2021), 110951
work page 2021
-
[14]
Vasiliki Efstathiou and Diomidis Spinellis. 2018. Code review comments: lan- guage matters. In Proceedings of the 40th International Conference on Software Engineering: New Ideas and Emerging Results . 69–72
work page 2018
-
[15]
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning Emoji Representations from their Description. In Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media . Association for Computational Linguistics, Austin, TX, USA, 48–54. https://doi.org/10.18653/v1/W16-6208
-
[16]
Nasir U Eisty and Jeffrey C Carver. 2022. Developers perception of peer code review in research software development. Empirical Software Engineering 27 (2022), 1–26
work page 2022
-
[17]
Masum Hasan, Anindya Iqbal, Mohammad Rafid Ul Islam, AJM Rahman, and Amiangshu Bosu. 2021. Using a balanced scorecard to identify opportunities to improve code review effectiveness: an industrial experience report. Empirical Software Engineering 26, 6 (2021), 1–34
work page 2021
-
[18]
Oleksii Kononenko, Olga Baysal, and Michael W Godfrey. 2016. Code review quality: How developers see it. In Proceedings of the 38th international conference PASC 2026, June 29 – July 1, 2026, Bern, Switzerland Sharif Ahmed and Nasir U. Eisty on software engineering. 1028–1038
work page 2016
-
[19]
Oleksii Kononenko, Olga Baysal, Latifa Guerrouj, Yaxin Cao, and Michael W Godfrey. 2015. Investigating code review quality: Do people and participation matter?. In 2015 IEEE international conference on software maintenance and evolu- tion (ICSME). IEEE, 111–120
work page 2015
-
[20]
Esmukov Kostya. 2023. geopy. https://pypi.org/project/geopy/. Accessed: 2024-11-08
work page 2023
-
[21]
Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of emojis. PloS one 10, 12 (2015), e0144296
work page 2015
-
[22]
Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 , I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774
work page 2017
-
[23]
Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics (1947), 50–60
work page 1947
-
[24]
Benjamin S Meyers, Nuthan Munaiah, Emily Prud’hommeaux, Andrew Meneely, Josephine Wolff, Cecilia Ovesdotter Alm, and Pradeep Murukannaiah. 2018. A dataset for identifying actionable feedback in collaborative software development. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) . 126–131
work page 2018
-
[25]
Thai Pangsakulyanont, Patanamon Thongtanunam, Daniel Port, and Hajimu Iida. 2014. Assessing MCR discussion usefulness using semantic similarity. In 6th International Workshop on Empirical Software Engineering in Practice . IEEE, 49–54
work page 2014
-
[26]
Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175
work page 1900
-
[27]
Mohammad Masudur Rahman, Chanchal K Roy, and Raula G Kula. 2017. Predict- ing usefulness of code review comments using textual features and developer experience. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 215–226
work page 2017
-
[28]
Eric Raymond. 1999. The cathedral and the bazaar. Knowledge, Technology & Policy 12, 3 (1999), 23–49
work page 1999
-
[29]
Lucía Santamaría and Helena Mihaljević. 2018. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science 4 (2018), e156
work page 2018
-
[30]
Daniel Schneider, Scott Spurlock, and Megan Squire. 2016. Differentiating com- munication styles of leaders on the linux kernel mailing list. In Proceedings of the 12th International Symposium on Open Collaboration . 1–10
work page 2016
-
[31]
Asif Kamal Turzo and Amiangshu Bosu. 2023. What Makes a Code Review Useful to OpenDev Developers? An Empirical Investigation. Empirical Software Engineering (2023). Just Accepted
work page 2023
-
[32]
Frank Wilcoxon, SK Katti, et al . 1970. Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected tables in mathematical statistics 1 (1970), 171–259. Received TBD; revised TBD; accepted TBD
work page 1970
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.