Automated Classification of Human Code Review Comments with Large Language Models
Pith reviewed 2026-05-08 06:03 UTC · model grok-4.3
The pith
LLMs can classify code review comments into a nine-label taxonomy of smells and intents using only the comment text and its diff hunk.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that comment text together with its unified diff hunk supplies enough information for LLMs to classify certain review-comment issues and intents in zero-shot and one-shot settings. Zero-shot macro-F1 ranges from 0.360 to 0.374, and evidence-sensitive smells remain difficult to classify accurately even when an exemplar is supplied.
What carries the argument
A nine-label taxonomy of six review comment smells and three common useful intents, used to label comment-diff pairs for zero-shot and one-shot LLM classification.
If this is right
- Automated flagging of redundant or vague comments could be added to review platforms without needing full thread history.
- One-shot prompting helps boundary cases between intents, suggesting targeted exemplar selection can improve specific labels.
- Evidence-sensitive smells will require extra context such as review threads to reach usable accuracy.
- The same comment-diff input format can be reused to test other LLMs or prompting strategies on the same taxonomy.
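The comment-diff input format described above could be assembled into zero-shot and one-shot prompts roughly as follows. This is a minimal sketch, not the paper's prompt: the label names are hypothetical placeholders (this page does not list the nine labels), and `ReviewItem`, `build_prompt`, and `parse_label` are illustrative names, not part of any published artifact.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical placeholder names: six smells plus three intents.
# The paper's actual label wording is not reproduced here.
LABELS = [f"smell_{i}" for i in range(1, 7)] + [f"intent_{i}" for i in range(1, 4)]


@dataclass
class ReviewItem:
    comment: str    # the human review comment text
    diff_hunk: str  # the unified diff hunk the comment is attached to


def _render(item: ReviewItem) -> str:
    """Format one comment-diff pair as plain text."""
    return f"Diff hunk:\n{item.diff_hunk}\n\nComment:\n{item.comment}"


def build_prompt(item: ReviewItem,
                 exemplar: Optional[Tuple[ReviewItem, str]] = None) -> str:
    """Return a zero-shot prompt, or a one-shot prompt if an exemplar is given."""
    parts = [
        "Classify the code review comment into exactly one label.",
        "Labels: " + ", ".join(LABELS),
    ]
    if exemplar is not None:
        ex_item, ex_label = exemplar
        parts += ["Example:", _render(ex_item), f"Label: {ex_label}"]
    parts += ["Now classify:", _render(item), "Label:"]
    return "\n\n".join(parts)


def parse_label(response: str) -> Optional[str]:
    """Map a raw model response to a known label, or None if nothing matches."""
    text = response.strip().lower()
    for label in LABELS:
        if label in text:
            return label
    return None
```

Keeping the prompt builder separate from the model call means the same inputs can be sent to different LLMs, and unparseable responses can be counted explicitly rather than silently mapped to a label.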
Where Pith is reading between the lines
- Review platforms could run the classifier in real time to coach reviewers before they post.
- The labeled dataset could serve as training material for smaller, specialized models that run locally.
- If the taxonomy proves stable, it could become a shared benchmark for measuring progress in automated code-review assistance.
Load-bearing premise
The 448 manually labeled comments are a representative sample of real-world code reviews and the nine categories capture the main issue types without significant overlap or omission.
What would settle it
A larger, multi-platform collection of labeled code review comments that produces substantially different macro-F1 scores or changes which labels count as evidence-sensitive.
Original abstract
Context: Code reviews are essential for maintaining software quality, yet many human review comments suffer from issues such as redundancy, vagueness, or lack of constructiveness. These types of comments may slow down feedback and obscure important insights. Prior work on code review comments mostly explores the detection and categorization of useful comments, while fine-grained categorization of comment issues remains underexplored. Objective: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. Methodology: We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three common useful intents, and manually labeled 448 comments from a publicly available dataset. We benchmarked zero-shot and one-shot single-label classification over each comment and its associated unified diff hunk, comparing GPT-5-mini, LLaMA-3.3, and DeepSeek-R1. We reported macro-F1 as the primary metric. Results: Zero-shot performance was moderate under class imbalance (macro-F1 0.360 to 0.374). One-shot exemplar conditioning had model-dependent effects: GPT-5-mini and DeepSeek-R1 macro-F1 scores improved, but LLaMA-3.3 suffered a slight decrease. Exemplars most consistently helped intent-boundary labels, whereas classification of evidence-sensitive labels remains challenging. Conclusion: Our results indicate that comment-diff evidence is sufficient for some labels but limited for evidence-sensitive smells. Future work includes adding thread context, improving intent-preserving rewrites, and validating robustness across platforms.
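To see why macro-F1 is a demanding metric under class imbalance, the sketch below computes it from scratch: per-label F1 scores are averaged with equal weight, so a rare label that is never predicted drags the score down regardless of overall accuracy. The function name and the toy labels are illustrative assumptions; the paper's own evaluation code is not shown on this page.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-label F1 scores (single-label classification)."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced case: always predicting the majority label "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8, while macro-F1 is only about 0.44, because the minority label contributes an F1 of zero.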
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a nine-label taxonomy (six review comment smells and three useful intents), manually labels 448 comments from a public dataset, and benchmarks zero-shot and one-shot single-label classification performance of three LLMs (GPT-5-mini, LLaMA-3.3, DeepSeek-R1) using each comment plus its unified diff hunk. It reports moderate zero-shot macro-F1 scores (0.360–0.374) under class imbalance, notes model-dependent effects of one-shot exemplars, and concludes that comment-diff evidence suffices for some labels but is limited for evidence-sensitive smells.
Significance. If the taxonomy and labels prove reliable, the work supplies a needed empirical benchmark for automated classification of code-review comment issues, an area noted as underexplored. The comparative zero/one-shot evaluation across models and the explicit distinction between intent-boundary and evidence-sensitive labels are constructive contributions that could guide future tool-building for review quality.
major comments (4)
- [Methodology] The manual labeling process for the 448 comments (described in the Methodology section) reports no inter-rater agreement metric (e.g., Cohen’s kappa or Fleiss’ kappa). Without this, the reliability of the ground-truth labels that underpin all macro-F1 results cannot be assessed.
- [Experiments] Exact prompt templates, system instructions, and the full wording of the nine label definitions are omitted from the Experiments and Appendix sections. This prevents reproduction and makes it impossible to judge whether observed performance differences arise from prompt engineering choices rather than model capability.
- [Results] No statistical significance tests (e.g., McNemar or bootstrap confidence intervals) are provided for the reported macro-F1 differences between zero-shot and one-shot conditions or across the three models. The claim of “model-dependent effects” therefore rests on point estimates alone.
- [Taxonomy and Dataset] The nine-label taxonomy is presented without empirical validation for label overlap, mutual exclusivity, or coverage of the comment population. The representativeness of the 448-comment sample is also untested, directly affecting the generalizability of the sufficiency/limitation conclusion.
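The agreement metric requested in the first major comment can be illustrated for the two-annotator case with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. This is a minimal sketch under the assumption of two raters labeling the same items; the function name and toy labels are illustrative, and it is not the authors' analysis.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: the raters agree on 3 of 4 items.
kappa = cohen_kappa(["x", "x", "y", "y"], ["x", "y", "y", "y"])
```

Here observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5. With more than two annotators, Fleiss' kappa would be the usual generalization.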
minor comments (2)
- [Abstract] The abstract and results text refer to “GPT-5-mini”; confirm the precise model identifier (e.g., gpt-4o-mini) and include the version date or checkpoint used.
- [Dataset] The publicly available dataset is cited only generically; provide the exact repository URL, commit hash, or DOI so readers can retrieve the identical 448 comments.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Methodology] The manual labeling process for the 448 comments (described in the Methodology section) reports no inter-rater agreement metric (e.g., Cohen’s kappa or Fleiss’ kappa). Without this, the reliability of the ground-truth labels that underpin all macro-F1 results cannot be assessed.
  Authors: We acknowledge the importance of reporting inter-rater reliability. The labeling was performed primarily by the first author using detailed guidelines, with the second author independently annotating a 50-comment overlap subset to check consistency. In the revised manuscript we will compute and report Fleiss’ kappa on this overlap to quantify agreement and discuss any disagreements.
  revision: yes
- Referee: [Experiments] Exact prompt templates, system instructions, and the full wording of the nine label definitions are omitted from the Experiments and Appendix sections. This prevents reproduction and makes it impossible to judge whether observed performance differences arise from prompt engineering choices rather than model capability.
  Authors: We agree that full reproducibility requires the exact wording. The revised version will add a dedicated Appendix containing the complete system instructions, zero-shot and one-shot prompt templates (including how exemplars were selected and formatted), and the verbatim definitions of all nine labels.
  revision: yes
- Referee: [Results] No statistical significance tests (e.g., McNemar or bootstrap confidence intervals) are provided for the reported macro-F1 differences between zero-shot and one-shot conditions or across the three models. The claim of “model-dependent effects” therefore rests on point estimates alone.
  Authors: This is a fair criticism of the current evidence strength. We will augment the Results section with bootstrap confidence intervals (1,000 resamples) for all macro-F1 scores and apply McNemar’s test for paired zero-shot vs. one-shot comparisons per model. These additions will support the “model-dependent effects” claim with statistical grounding.
  revision: yes
- Referee: [Taxonomy and Dataset] The nine-label taxonomy is presented without empirical validation for label overlap, mutual exclusivity, or coverage of the comment population. The representativeness of the 448-comment sample is also untested, directly affecting the generalizability of the sufficiency/limitation conclusion.
  Authors: The taxonomy was derived from prior code-review literature and refined via a pilot study on 100 comments to reduce overlap. The 448 instances were drawn uniformly at random from the public dataset. While a separate large-scale validation study lies outside the scope of this benchmarking paper, we will expand the Discussion to explicitly address potential label ambiguities, coverage limitations, and threats to generalizability.
  revision: partial
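The statistical additions promised in the third response could look roughly like the sketch below: a percentile bootstrap interval for macro-F1 and an exact two-sided McNemar test on paired predictions (for example, zero-shot versus one-shot output of one model). All function names are illustrative assumptions, and choices such as the percentile method, the fixed label set, and the exact binomial form of the test are ours, not details confirmed by the paper.

```python
import random
from collections import Counter
from math import comb


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 over a fixed label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for macro-F1."""
    rng = random.Random(seed)
    labels = sorted(set(y_true))  # fixed so rare labels still count when unsampled
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx], labels))
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on items where only one system is correct."""
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Two-sided binomial tail under H0: discordant pairs split 50/50.
    p_value = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p_value)
```

McNemar's test compares per-item correctness rather than macro-F1 itself, so it complements the bootstrap interval instead of replacing it.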
Circularity Check
No significant circularity detected
The paper is a standard empirical classification study: authors define a 9-label taxonomy, manually annotate 448 comments from a public dataset, then measure zero-shot and one-shot LLM performance via macro-F1 against those human labels. No equations, derivations, fitted parameters presented as predictions, or self-citations appear as load-bearing steps. The central claim (comment-diff evidence suffices for some labels but not others) follows directly from the reported F1 scores without reducing to self-definition or prior author work by construction. The methodology is externally falsifiable against the labeled set and uses conventional held-out evaluation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human annotators can consistently assign the nine labels to comments
- domain assumption: Comment text plus unified diff hunk contains sufficient information for the chosen labels
invented entities (1)
- Nine-label taxonomy (six smells + three intents): no independent evidence