Automated Classification of Human Code Review Comments with Large Language Models

Şükrü Eren Gökırmak; Eray Tüzün; Semih Çağlar

arxiv: 2604.23667 · v1 · submitted 2026-04-26 · 💻 cs.SE

Pith reviewed 2026-05-08 06:03 UTC · model grok-4.3

classification 💻 cs.SE

keywords: code review, comment classification, large language models, review smells, software quality, zero-shot classification, LLM prompting

The pith

LLMs can classify code review comments into a nine-label taxonomy of smells and intents using only the comment text and its diff hunk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a taxonomy of six comment smells and three positive intents, then manually labels 448 comments from an existing dataset. It tests whether large language models can assign the correct label when given each comment plus the associated code change, running both zero-shot and one-shot experiments across three models and measuring macro-F1. A sympathetic reader would care because many real code reviews contain redundant, vague, or unconstructive comments that slow development and hide real issues; an automated classifier could surface those problems quickly. The results show moderate overall performance that improves with one example for some labels but stays limited for smells that need evidence beyond the immediate comment and diff.

Core claim

The paper claims that comment text together with its unified diff hunk supplies enough information for LLMs to classify certain review-comment issues and intents in zero-shot and one-shot settings. Zero-shot macro-F1 falls between 0.360 and 0.374, and evidence-sensitive smells remain difficult to classify accurately even when exemplars are supplied.
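To make the metric concrete: macro-F1 is the unweighted mean of per-label F1 scores, so a rare label that is never predicted correctly lowers the score as much as a common one would. The following is a minimal sketch on toy data with made-up label names ("A", "B", "C"); it is not the paper's evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-label F1 scores."""
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the label never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced data: the single "B" comment is misclassified as "A".
gold = ["A", "A", "A", "A", "B", "C"]
pred = ["A", "A", "A", "A", "A", "C"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

On this toy input accuracy is 5/6 while macro-F1 is noticeably lower, because the missed minority label contributes a per-label F1 of zero. That gap is the reason macro-F1 is a common choice under class imbalance.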

What carries the argument

A nine-label taxonomy of six review comment smells and three common useful intents, used to label comment-diff pairs for zero-shot and one-shot LLM classification.
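As an illustration of the input format, a zero-shot or one-shot prompt over a comment-diff pair might be assembled as sketched below. The paper's prompt templates are not reproduced on this page, so the prompt wording and the function names `build_prompt` and `parse_label` are assumptions. Only "Incorrect" is a category name that appears on this page; the other entries in `LABELS` are placeholders standing in for the remaining labels.

```python
# Hypothetical label list: "Incorrect" is named on this page, the rest are placeholders.
LABELS = ["Incorrect", "Placeholder-B", "Placeholder-C"]


def build_prompt(comment, diff_hunk, labels, exemplar=None):
    """Assemble a single-label classification prompt.

    exemplar is an optional (comment, diff_hunk, label) tuple; passing one
    turns the zero-shot prompt into a one-shot prompt.
    """
    lines = [
        "Classify the code review comment into exactly one label.",
        "Labels: " + ", ".join(labels),
        "",
    ]
    if exemplar is not None:
        ex_comment, ex_diff, ex_label = exemplar
        lines += [
            "Example:",
            "Diff hunk:",
            ex_diff,
            "Comment: " + ex_comment,
            "Label: " + ex_label,
            "",
        ]
    lines += ["Diff hunk:", diff_hunk, "Comment: " + comment, "Label:"]
    return "\n".join(lines)


def parse_label(response, labels):
    """Map a raw model reply to a known label, or None if it does not match."""
    cleaned = response.strip().strip(".").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


hunk = "@@ -1,2 +1,2 @@\n-    return False\n+    return True"
print(build_prompt("Should this be True?", hunk, LABELS))
```

Returning `None` for unparseable replies, rather than guessing, keeps malformed model output visible so it can be counted as an error during scoring.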

If this is right

Automated flagging of redundant or vague comments could be added to review platforms without needing full thread history.
One-shot prompting helps boundary cases between intents, suggesting targeted exemplar selection can improve specific labels.
Evidence-sensitive smells will require extra context such as review threads to reach usable accuracy.
The same comment-diff input format can be reused to test other LLMs or prompting strategies on the same taxonomy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Review platforms could run the classifier in real time to coach reviewers before they post.
The labeled dataset could serve as training material for smaller, specialized models that run locally.
If the taxonomy proves stable, it could become a shared benchmark for measuring progress in automated code-review assistance.

Load-bearing premise

The 448 manually labeled comments are a representative sample of real-world code reviews and the nine categories capture the main issue types without significant overlap or omission.

What would settle it

A larger, multi-platform collection of labeled code review comments that produces substantially different macro-F1 scores or changes which labels count as evidence-sensitive.

Figures

Figures reproduced from arXiv: 2604.23667 by Şükrü Eren Gökırmak, Eray Tüzün, Semih Çağlar.

**Figure 1.** Methodology overview.

**Figure 2.** An example of an Incorrect review comment, where the reviewer claims a mistake despite the code being correct under the intended logic.

**Figure 3.** Example review comment screenshots.

Original abstract

Context: Code reviews are essential for maintaining software quality, yet many human review comments suffer from issues such as redundancy, vagueness, or lack of constructiveness. These types of comments may slow down feedback and obscure important insights. Prior work on code review comments mostly explores the detection and categorization of useful comments, while fine-grained categorization of comment issues remains underexplored. Objective: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. Methodology: We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three common useful intents, and manually labeled 448 comments from a publicly available dataset. We benchmarked zero-shot and one-shot single-label classification over each comment and its associated unified diff hunk, comparing GPT-5-mini, LLaMA-3.3, and DeepSeek-R1. We reported macro-F1 as the primary metric. Results: Zero-shot performance was moderate under class imbalance (macro-F1 0.360 to 0.374). One-shot exemplar conditioning had model-dependent effects: GPT-5-mini and DeepSeek-R1 macro-F1 scores improved, whereas LLaMA-3.3 showed a slight decrease. Exemplars most consistently helped intent-boundary labels, whereas classification of evidence-sensitive labels remains challenging. Conclusion: Our results indicate that comment-diff evidence is sufficient for some labels but limited for evidence-sensitive smells. Future work includes adding thread context, improving intent-preserving rewrites, and validating robustness across platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A straightforward empirical benchmark that introduces a nine-label taxonomy for code review comments and tests three LLMs on comment-plus-diff input, with moderate results and clear limits on evidence-sensitive labels.

This paper introduces a nine-label taxonomy for code review comments—six smells plus three intents—and benchmarks zero-shot and one-shot classification with GPT-5-mini, LLaMA-3.3, and DeepSeek-R1 using comment text plus unified diff hunks. The central result is that macro-F1 stays moderate (0.36–0.37) and that some labels improve with exemplars while evidence-sensitive smells stay hard. That matches the practical claim that comment-diff evidence is enough for some categories but not others.

Referee Report

4 major / 2 minor

Summary. The paper introduces a nine-label taxonomy (six review comment smells and three useful intents), manually labels 448 comments from a public dataset, and benchmarks zero-shot and one-shot single-label classification performance of three LLMs (GPT-5-mini, LLaMA-3.3, DeepSeek-R1) using each comment plus its unified diff hunk. It reports moderate macro-F1 scores (0.360–0.374) under class imbalance, notes model-dependent effects of one-shot exemplars, and concludes that comment-diff evidence suffices for some labels but is limited for evidence-sensitive smells.

Significance. If the taxonomy and labels prove reliable, the work supplies a needed empirical benchmark for automated classification of code-review comment issues, an area noted as underexplored. The comparative zero/one-shot evaluation across models and the explicit distinction between intent-boundary and evidence-sensitive labels are constructive contributions that could guide future tool-building for review quality.

major comments (4)

[Methodology] The manual labeling process for the 448 comments (described in the Methodology section) reports no inter-rater agreement metric (e.g., Cohen’s kappa or Fleiss’ kappa). Without this, the reliability of the ground-truth labels that underpin all macro-F1 results cannot be assessed.
[Experiments] Exact prompt templates, system instructions, and the full wording of the nine label definitions are omitted from the Experiments and Appendix sections. This prevents reproduction and makes it impossible to judge whether observed performance differences arise from prompt engineering choices rather than model capability.
[Results] No statistical significance tests (e.g., McNemar or bootstrap confidence intervals) are provided for the reported macro-F1 differences between zero-shot and one-shot conditions or across the three models. The claim of “model-dependent effects” therefore rests on point estimates alone.
[Taxonomy and Dataset] The nine-label taxonomy is presented without empirical validation for label overlap, mutual exclusivity, or coverage of the comment population. The representativeness of the 448-comment sample is also untested, directly affecting the generalizability of the sufficiency/limitation conclusion.
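For reference, the agreement statistic requested in the first major comment can be computed as sketched below for the two-annotator case (Cohen's kappa). The annotations are toy values with hypothetical labels "x" and "y", not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations: the raters disagree on one of four comments.
annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "y", "y", "y"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

Here raw agreement is 0.75 but kappa is 0.50, since half of the agreement would be expected by chance given the two raters' label frequencies.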

minor comments (2)

[Abstract] The abstract and results text refer to “GPT-5-mini”; confirm the precise model identifier (e.g., gpt-4o-mini) and include the version date or checkpoint used.
[Dataset] The publicly available dataset is cited only generically; provide the exact repository URL, commit hash, or DOI so readers can retrieve the identical 448 comments.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Methodology] The manual labeling process for the 448 comments (described in the Methodology section) reports no inter-rater agreement metric (e.g., Cohen’s kappa or Fleiss’ kappa). Without this, the reliability of the ground-truth labels that underpin all macro-F1 results cannot be assessed.

Authors: We acknowledge the importance of reporting inter-rater reliability. The labeling was performed primarily by the first author using detailed guidelines, with the second author independently annotating a 50-comment overlap subset to check consistency. In the revised manuscript we will compute and report Fleiss’ kappa on this overlap to quantify agreement and discuss any disagreements.

revision: yes

Referee: [Experiments] Exact prompt templates, system instructions, and the full wording of the nine label definitions are omitted from the Experiments and Appendix sections. This prevents reproduction and makes it impossible to judge whether observed performance differences arise from prompt engineering choices rather than model capability.

Authors: We agree that full reproducibility requires the exact wording. The revised version will add a dedicated Appendix containing the complete system instructions, zero-shot and one-shot prompt templates (including how exemplars were selected and formatted), and the verbatim definitions of all nine labels.

revision: yes

Referee: [Results] No statistical significance tests (e.g., McNemar or bootstrap confidence intervals) are provided for the reported macro-F1 differences between zero-shot and one-shot conditions or across the three models. The claim of “model-dependent effects” therefore rests on point estimates alone.

Authors: This is a fair criticism of the current evidence strength. We will augment the Results section with bootstrap confidence intervals (1,000 resamples) for all macro-F1 scores and apply McNemar’s test for paired zero-shot vs. one-shot comparisons per model. These additions will support the “model-dependent effects” claim with statistical grounding.

revision: yes

Referee: [Taxonomy and Dataset] The nine-label taxonomy is presented without empirical validation for label overlap, mutual exclusivity, or coverage of the comment population. The representativeness of the 448-comment sample is also untested, directly affecting the generalizability of the sufficiency/limitation conclusion.

Authors: The taxonomy was derived from prior code-review literature and refined via a pilot study on 100 comments to reduce overlap. The 448 instances were drawn uniformly at random from the public dataset. While a separate large-scale validation study lies outside the scope of this benchmarking paper, we will expand the Discussion to explicitly address potential label ambiguities, coverage limitations, and threats to generalizability.

revision: partial
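The two statistical additions discussed in the third response, a bootstrap confidence interval for macro-F1 and McNemar's test on paired predictions, can be sketched as follows. This is an illustrative sketch on toy data, assuming a percentile bootstrap and the exact (binomial) form of McNemar's test; the function names and the labels "A" and "B" are hypothetical, and the rebuttal does not state which variants the authors would use.

```python
import random
from math import comb


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over a fixed label list."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(gold, pred, labels, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-F1, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([gold[i] for i in idx], [pred[i] for i in idx], labels))
    stats.sort()
    lo = stats[int((alpha / 2) * resamples)]
    hi = stats[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lo, hi


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value from the discordant pairs."""
    b = sum(a == g and x != g for g, a, x in zip(gold, pred_a, pred_b))  # only A correct
    c = sum(a != g and x == g for g, a, x in zip(gold, pred_a, pred_b))  # only B correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: the "zero-shot" run misses every "B"; the "one-shot" run is perfect.
gold = ["A"] * 5 + ["B"] * 5
zero_shot = ["A"] * 10
one_shot = list(gold)

lo, hi = bootstrap_ci(gold, zero_shot, ["A", "B"], resamples=200)
p_value = mcnemar_exact(gold, zero_shot, one_shot)
print(f"zero-shot macro-F1 CI: [{lo:.3f}, {hi:.3f}]  McNemar p={p_value:.4f}")
```

Note that McNemar's test compares per-item correctness (in effect, accuracy) rather than macro-F1 itself, and that a label missing from a bootstrap resample scores zero here, which can widen the interval for rare labels.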

Circularity Check

0 steps flagged

No significant circularity detected

The paper is a standard empirical classification study: authors define a 9-label taxonomy, manually annotate 448 comments from a public dataset, then measure zero-shot and one-shot LLM performance via macro-F1 against those human labels. No equations, derivations, fitted parameters presented as predictions, or self-citations appear as load-bearing steps. The central claim (comment-diff evidence suffices for some labels but not others) follows directly from the reported F1 scores without reducing to self-definition or prior author work by construction. The methodology is externally falsifiable against the labeled set and uses conventional held-out evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that human labels are reliable ground truth and that the chosen taxonomy is exhaustive for the domain. No free parameters are fitted; the work is purely empirical benchmarking.

axioms (2)

domain assumption: Human annotators can consistently assign the nine labels to comments.
The methodology section states that 448 comments were manually labeled, but no inter-annotator agreement metric is mentioned in the abstract.

domain assumption: Comment text plus unified diff hunk contains sufficient information for the chosen labels.
This is the core experimental setup; the results section notes it is limited for evidence-sensitive smells.

invented entities (1)

Nine-label taxonomy (six smells + three intents): no independent evidence
purpose: To categorize code review comment issues for automated classification.
New taxonomy introduced in the paper; no independent evidence provided beyond the authors' manual labeling.

[1] [1]

Uğur Can Altun, Ismail Sergen Göçmen, Emre Sülün, Erdem Tuna, and Eray Tüzün. 2025. Process smells in practice: an evaluative case study.Empirical Software Engineering30, 5 (2025), 115. doi:10.1007/s10664-025-10664-8

work page doi:10.1007/s10664-025-10664-8 2025

[2] [2]

Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and chal- lenges of modern code review. InProceedings of the 2013 International Conference on Software Engineering(San Francisco, CA, USA)(ICSE ’13). IEEE Press, 712–721

work page 2013

[3] [3]

Carver, Christian Bird, Jonathan Orbeck, and Christopher Chockley

Amiangshu Bosu, Jeffrey C. Carver, Christian Bird, Jonathan Orbeck, and Christopher Chockley. 2017. Process Aspects and Social Dynamics of Con- temporary Code Review: Insights from Open Source Development and Indus- trial Practice at Microsoft.IEEE Trans. Softw. Eng.43, 1 (Jan. 2017), 56–75. doi:10.1109/TSE.2016.2576451

work page doi:10.1109/tse.2016.2576451 2017

[4] [4]

Amiangshu Bosu, Michaela Greiler, and Christian Bird. 2015. Characteristics of useful code reviews: an empirical study at Microsoft. InProceedings of the 12th Working Conference on Mining Software Repositories(Florence, Italy)(MSR ’15). IEEE Press, 146–156

work page 2015

[5] [5]

Yu, Qiang Yang, and Xing Xie

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models.ACM Trans. Intell. Syst. Technol.15, 3, Article 39 (March 2024), 45 pages. doi:10.1145/3641289

work page doi:10.1145/3641289 2024

[6] [6]

Junkai Chen, Zhenhao Li, Qiheng Mao, Xing Hu, Kui Liu, and Xin Xia. 2025. Understanding Practitioners’ Expectations on Clear Code Review Comments. Proc. ACM Softw. Eng.2, ISSTA, Article ISSTA056 (June 2025), 23 pages. doi:10. 1145/3728931

work page 2025

[7] [7]

Umut Cihan, Vahid Haratian, Arda İçöz, Mert Kaan Gül, Ömercan Devran, Emircan Furkan Bayendur, Baykal Mehmet Uçar, and Eray Tüzün. 2025. Auto- mated Code Review in Practice. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 425–436. doi:10.1109/ICSE-SEIP66354.2025.00043

work page doi:10.1109/icse-seip66354.2025.00043 2025

[8] [8]

Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales.Educa- tional and Psychological Measurement20, 1 (Apr 1960), 37–46. doi:10.1177/ 001316446002000104

work page 1960

[9] [9]

Jacek Czerwonka, Michaela Greiler, and Jack Tilford. 2015. Code Reviews Do Not Find Bugs. How the Current Code Review Best Practice Slows Us Down. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. 27–28. doi:10.1109/ICSE.2015.131

work page doi:10.1109/icse.2015.131 2015

[10] [10]

DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948 [cs.CL] https://arxiv.org/abs/2501. 12948 Accessed: 2026-01-13

work page internal anchor Pith review arXiv 2025

[11] [11]

Emre Doğan and Eray Tüzün. 2022. Towards a taxonomy of code review smells. Information and Software Technology142 (2022), 106737. doi:10.1016/j.infsof.2021. 106737

work page doi:10.1016/j.infsof.2021 2022

[12] [12]

M. E. Fagan. 1976. Design and code inspections to reduce errors in program development.IBM Syst. J.15, 3 (Sept. 1976), 182–211. doi:10.1147/sj.153.0182

work page doi:10.1147/sj.153.0182 1976

[13] [13]

Shut the f**k up

Isabella Ferreira, Jinghui Cheng, and Bram Adams. 2021. The "Shut the f**k up" Phenomenon: Characterizing Incivility in Open Source Code Review Discussions. Proc. ACM Hum.-Comput. Interact.5, CSCW2, Article 353 (Oct. 2021), 35 pages. doi:10.1145/3479497

work page doi:10.1145/3479497 2021

[14] Enrico Fregnan, Fernando Petrulio, Linda Di Geronimo, and Alberto Bacchelli. 2022. What happens in my code reviews? An investigation on automatically classifying review changes. Empirical Softw. Engg. 27, 4 (July 2022), 43 pages. doi:10.1007/s10664-021-10075-5

[16] Masum Hasan, Anindya Iqbal, Mohammad Rafid Ul Islam, A.J.M. Imtiajur Rahman, and Amiangshu Bosu. 2021. Using a balanced scorecard to identify opportunities to improve code review effectiveness: an industrial experience report. Empirical Softw. Engg. 26, 6 (Nov. 2021), 34 pages. doi:10.1007/s10664-021-10038-w

[17] Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo. 2022. AUGER: automatically generating review comments with pre-training models. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association... doi:10.1145/3540250.3549099

[18] Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan. 2022. Automating code review activities by large-scale pre-training. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, 1035–1047. doi:10.1145/3540250.3549081

[20] Zhixing Li, Yue Yu, Gang Yin, Tao Wang, Qiang Fan, and Huaimin Wang. 2017. Automatic Classification of Review Comments in Pull-based Development Model. In Proceedings of The 29th International Conference on Software Engineering and Knowledge Engineering. 572–577. doi:10.18293/SEKE2017-039

[21] Chunhua Liu, Hong Yi Lin, and Patanamon Thongtanunam. 2025. Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). 236–248. doi:10.1109/MSR66628.2025.00043

[22] Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia medica 22, 3 (2012), 276–282

[23] Meta AI. 2024. LLaMA 3.3 70B Instruct. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct. Accessed: 2026-01-13

[24] Linh Nguyen, Chunhua Liu, Hong Yi Lin, and Patanamon Thongtanunam. 2025. Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification. In 2025 IEEE International Conference on Source Code Analysis & Manipulation (SCAM). 43–54. doi:10.1109/SCAM67354.2025.00012

[25] Miroslaw Ochodek, Miroslaw Staron, Wilhelm Meding, and Ola Söder. 2022. Automated Code Review Comment Classification to Improve Modern Code Reviews. In Software Quality: The Next Big Thing in Software Engineering and Quality, Daniel Mendez, Manuel Wimmer, Dietmar Winkler, Stefan Biffl, and Johannes Bergsmann (Eds.). Springer International Publishing, Cham, 23–40

[26] OpenAI. 2025. GPT-5 mini Model | OpenAI API. https://platform.openai.com/docs/models/gpt-5-mini. Accessed: 2026-01-13

[27] OpenAI. 2025. Using GPT-5.2 | OpenAI API. https://platform.openai.com/docs/guides/latest-model. Accessed: 2026-01-24

[28] Luca Pascarella, Davide Spadini, Fabio Palomba, Magiel Bruntink, and Alberto Bacchelli. 2018. Information Needs in Contemporary Code Review. Proc. ACM Hum.-Comput. Interact. 2, CSCW, Article 135 (Nov. 2018), 27 pages. doi:10.1145/3274404

[29] Mohammad Masudur Rahman, Chanchal K. Roy, and Raula G. Kula. 2017. Predicting Usefulness of Code Review Comments Using Textual Features and Developer Experience. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). 215–226. doi:10.1109/MSR.2017.17

[30] Shadikur Rahman. 2024. Enhancing Code Review for Improved Code Quality with Language Model-Driven Approaches. MSc thesis. York University, Toronto, Canada. https://hdl.handle.net/10315/41946 Advisor: Enamul Hoque

[31] Peter C. Rigby and Christian Bird. 2013. Convergent contemporary software peer review practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (Saint Petersburg, Russia) (ESEC/FSE 2013). Association for Computing Machinery, New York, NY, USA, 202–212. doi:10.1145/2491411.2491444

[32] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern code review: a case study at google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (Gothenburg, Sweden) (ICSE-SEIP ’18). Association for Computing Machinery, New York, NY, USA, 181–190. doi:10.1145/3183519.3183525

[33] Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. 2023. Automated Identification of Toxic Code Reviews Using ToxiCR. ACM Trans. Softw. Eng. Methodol. 32, 5, Article 118 (July 2023), 32 pages. doi:10.1145/3583562

[34] Oussama Ben Sghaier, Martin Weyssow, and Houari Sahraoui. 2025. Harnessing Large Language Models for Curated Code Reviews. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). 187–198. doi:10.1109/MSR66628.2025.00039

[35] Rosalia Tufano, Ozren Dabić, Antonio Mastropaolo, Matteo Ciniselli, and Gabriele Bavota. 2024. Code Review Automation: Strengths and Weaknesses of the State of the Art. IEEE Trans. Softw. Eng. 50, 2 (Feb. 2024), 338–353. doi:10.1109/TSE.2023.3348172

[36] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using pre-trained models to boost code review automation. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 2291–2302. doi:10.1145/3510003.3510621

[37] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards Automating Code Review Activities. In Proceedings of the 43rd International Conference on Software Engineering (Madrid, Spain) (ICSE ’21). IEEE Press, 163–174. doi:10.1109/ICSE43902.2021.00027

[38] Asif Kamal Turzo and Amiangshu Bosu. 2023. What makes a code review useful to OpenDev developers? An empirical investigation. Empirical Software Engineering 29, 1 (2023), 6. doi:10.1007/s10664-023-10411-x

[39] Asif Kamal Turzo, Fahim Faysal, Ovi Poddar, Jaydeb Sarker, Anindya Iqbal, and Amiangshu Bosu. 2023. Towards Automated Classification of Code Review Feedback to Support Analytics. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–12. doi:10.1109/ESEM56168.2023.10304851

[40] Lanxin Yang, Jinwei Xu, Yifan Zhang, He Zhang, and Alberto Bacchelli. 2023. EvaCRC: Evaluating Code Review Comments. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA) (ESEC/FSE 2023). Association for Computing Machinery, New York, NY, USA, 275–287. d...

[41] Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023. Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Comput... doi:10.18653/v1/2023.acl-long.411

[42] Zelin Zhao, Zhaogui Xu, Jialong Zhu, Peng Di, Yuan Yao, and Xiaoxing Ma. 2023. The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model. arXiv:2312.17485 [cs.SE] https://arxiv.org/abs/2312.17485