Improving LLM-Based Go Code Review through Issue-List Generation and Context Augmentation

Christoph Treude; Dong Shao; Guoping Rong; He Zhang; Hongyu Kuang; Jiaqi Sun; Kexin Sun; Xiaoxing Ma; Yucong Guan

arxiv: 2606.01859 · v1 · pith:FYQGNIWOnew · submitted 2026-06-01 · 💻 cs.SE

Pith reviewed 2026-06-28 13:52 UTC · model grok-4.3

classification 💻 cs.SE

keywords code review · large language models · Go programming language · issue-list generation · context augmentation · refinement evaluation · candidate integration

The pith

Issue-list generation plus neighboring and co-change context lifts LLM Go code review to 28% refinement exact match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether changing how LLMs generate code review comments and what extra context they receive can make the comments more useful. Instead of asking the model to name only the single most important issue, it asks the model to list every potential issue it can find. Three kinds of context are added to the prompt: nearby code lines, language-server semantic information, and similar past changes found by information retrieval. Outputs from several prompt variants are merged and then pruned by checking which comments would produce the same code edit a human actually made. On 1,438 real Go review cases the best combination reaches 28% exact-match refinement, a clear rise from the 17% baseline.

Core claim

By shifting from primary-issue review to issue-list review and incorporating neighboring and similar co-change context, followed by integrating candidates and applying refinement-guided pruning, the approach achieves 28.00% refinement exact match on 1,438 Go review instances. This represents a statistically significant improvement of 10.85 percentage points over the 17.15% from primary-issue review without context. It also outperforms the specialized CodeReviewer model (15.02%) and approaches the human oracle (36.09%), while pruning reduces the average number of candidates from 7.2 to 3.1 at top-5.
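The headline figures can be related with a few lines of arithmetic. The sketch below uses only the percentages quoted above; the variable names are illustrative, and the relative gain and share-of-oracle ratios are derived here for orientation, not figures reported by the paper.

```python
# Percentages quoted in the core claim (refinement exact match, 1,438 Go instances).
best = 28.00           # issue-list review + neighboring/co-change context + integration
baseline = 17.15       # primary-issue review without context
human_oracle = 36.09   # ground-truth human comments under the same evaluation

# Absolute gain in percentage points, as reported.
gain_pp = round(best - baseline, 2)

# Derived ratios (assumption: computed here, not reported by the paper).
relative_gain = round(100 * (best - baseline) / baseline, 1)   # gain relative to baseline
share_of_oracle = round(100 * best / human_oracle, 1)          # fraction of the ceiling reached

print(gain_pp, relative_gain, share_of_oracle)
```

Read this way, the best configuration recovers roughly three quarters of what the human comments themselves achieve under this metric.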

What carries the argument

The issue-list review paradigm that has LLMs enumerate multiple potential issues, augmented by neighboring code context and IR-based similar co-change context, with refinement-guided pruning of integrated candidate lists.

If this is right

More issues are identified when LLMs are prompted to list all potential problems rather than focusing on one.
Neighboring and similar co-change contexts enhance the discovery of relevant issues in code reviews.
Integrating outputs from context-free and context-enhanced generations improves overall review coverage.
Refinement-guided pruning maintains most of the performance benefit while reducing the number of candidates developers must inspect.
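A minimal sketch of candidate integration followed by pruning, assuming each candidate comment has already been paired with the code refinement it would induce. The `Candidate` type, the `integrate` and `prune` helpers, and the rule of keeping one comment per distinct refinement are illustrative assumptions; this summary does not spell out the paper's actual refinement-guided pruning procedure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    comment: str      # review comment text
    refinement: str   # code the comment would lead to (assumed to be precomputed)
    source: str       # prompt variant that produced it, e.g. "no-context"

def integrate(*candidate_lists):
    """Merge lists from several prompt variants, dropping repeated comments."""
    merged, seen = [], set()
    for candidates in candidate_lists:
        for cand in candidates:
            key = " ".join(cand.comment.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(cand)
    return merged

def prune(candidates, top_k=5):
    """Keep the first comment for each distinct refinement, up to top_k."""
    kept, seen = [], set()
    for cand in candidates:
        if len(kept) >= top_k:
            break
        # Crude normalization (assumption): ignore all whitespace differences.
        key = "".join(cand.refinement.split())
        if key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept

no_context = [
    Candidate("Check the error returned by Close.",
              "if err := f.Close(); err != nil { return err }", "no-context"),
    Candidate("Rename x to count.", "count := len(items)", "no-context"),
]
with_context = [
    Candidate("Handle the Close error.",
              "if err := f.Close(); err != nil {\n    return err\n}", "neighboring"),
    Candidate("Rename x to count.", "count := len(items)", "neighboring"),
]

merged = integrate(no_context, with_context)
pruned = prune(merged, top_k=5)
print(len(merged), len(pruned))
```

In this toy input, two differently worded comments lead to the same edit, so pruning keeps only one of them. That is the intuition behind shrinking the list without losing coverage.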

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting and pruning pattern could be tested on review data from languages other than Go to check whether the gains transfer.
The refinement exact-match signal could be used directly as a reward during fine-tuning of review models.
Embedding the pruned candidate lists inside an IDE might let developers accept or reject suggestions before the change is committed.

Load-bearing premise

That whether a generated comment would produce the same code change a human ultimately made is a sufficient measure of how useful the comment is in practice.

What would settle it

A controlled experiment in which developers are shown both the generated comment lists and the actual human comments, then asked which list they would rather receive, would show whether the refinement metric aligns with developer preference.

Figures

Figures reproduced from arXiv: 2606.01859 by Christoph Treude, Dong Shao, Guoping Rong, He Zhang, Hongyu Kuang, Jiaqi Sun, Kexin Sun, Xiaoxing Ma, Yucong Guan.

**Figure 2.** A motivating example on how issue-list generation, context augmen...

**Figure 3.** Prompt template used for issue-list code review, with optional slots for neighboring, semantic, and similar-code context.

**Figure 4.** The simplified prompt containing only the task description and output...

**Figure 5.** An example illustrating how issue-list review recovers a human...

**Figure 7.** An example of how excessive context causes...

**Figure 8.** An example illustrating how different candidate integration strategies...
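Figure 3 describes a prompt template with optional slots for neighboring, semantic, and similar-code context. A minimal sketch of how such a template might be assembled is shown below; the instruction wording, the section labels, and the `build_prompt` function are hypothetical and do not reproduce the paper's actual prompt.

```python
def build_prompt(diff, neighboring=None, semantic=None, similar=None):
    """Assemble an issue-list review prompt; the three context slots are optional."""
    parts = [
        "You are reviewing a Go code change.",
        "List every potential issue you can find, one per line.",
        "",
        "## Code change",
        diff.strip(),
    ]
    optional_slots = [
        ("Neighboring code", neighboring),
        ("Semantic information", semantic),
        ("Similar past changes", similar),
    ]
    for title, content in optional_slots:
        if content:  # skip slots that were not provided
            parts += ["", f"## {title}", content.strip()]
    return "\n".join(parts)

prompt = build_prompt(
    diff="- x := len(items)\n+ count := len(items)",
    neighboring="func total(items []int) int {",
)
print(prompt)
```

Keeping each slot optional makes it straightforward to generate the no-context and context-enhanced variants from the same template.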

Original abstract

LLMs have shown strong potential for automating code review, yet their practical utility depends heavily on the design of generation and context strategies. In this paper, we investigate how to improve LLM-based code review through generation strategy and contextual augmentation. We first propose an issue-list review paradigm, in which LLMs enumerate all potential issues rather than reporting only the single most important one (i.e., primary-issue review). We then systematically compare three types of code context augmentation -- neighboring, LSP-based semantics, and IR-based similar co-change context -- and study how they influence issue discovery. Finally, we integrate candidates from no-context and context-enhanced generation to improve review coverage, and introduce refinement-guided pruning to keep the candidate list at a practical size. We evaluate our approach on 1,438 Go review instances using downstream code refinement as the main metric, i.e., how often the candidate list contains at least one comment inducing the same code change as the final human revision. For comparison, we evaluate comments by CodeReviewer, a model trained specifically for review comment generation, as well as ground-truth human review comments (as a practical upper bound), under the same refinement-based evaluation. The results show that our best configuration, combining issue-list review, neighboring and similar co-change context, and candidate integration, reaches 28.00% refinement exact match, a statistically significant gain of +10.85 percentage points over primary-issue review without any additional context (17.15%), substantially outperforming CodeReviewer (15.02%) and approaching the human-oracle ceiling of 36.09%. Our refinement-guided pruning reduces the average candidate count from 7.2 to 3.1 at top-5 while retaining nearly the full benefit, making the candidate list easier to inspect.
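The refinement-based metric described in the abstract can be sketched as follows: an instance counts as a hit if at least one candidate comment induces the same code change as the final human revision. The `normalize` step and the sample data are assumptions for illustration; the paper's exact matching rule (for example, how whitespace is treated) is not given in this summary.

```python
def normalize(code):
    """Drop blank lines and trailing spaces (an assumed normalization)."""
    return "\n".join(
        line.rstrip() for line in code.strip().splitlines() if line.strip()
    )

def is_hit(candidate_refinements, human_revision):
    """True if any candidate's refinement exactly matches the human revision."""
    target = normalize(human_revision)
    return any(normalize(r) == target for r in candidate_refinements)

def refinement_exact_match(instances):
    """Percentage of instances whose candidate list contains a matching refinement."""
    hits = sum(is_hit(refs, human) for refs, human in instances)
    return 100.0 * hits / len(instances)

# Each instance: (refinements induced by the candidate comments, human revision).
instances = [
    (["return nil, err", 'return nil, fmt.Errorf("read: %w", err)'],
     'return nil, fmt.Errorf("read: %w", err)'),
    (["count := len(items)"], "n := len(items)"),   # plausible rename, different name: no credit
    ([], "defer f.Close()"),                        # no candidates: no credit
    (["defer f.Close()  "], "defer f.Close()"),     # differs only in trailing spaces
]
print(refinement_exact_match(instances))
```

The second instance shows why the metric is strict: a reasonable fix that differs textually from the human revision earns no credit.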

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a 10+ point lift on refinement exact match for Go LLM reviews via issue-list generation plus context, but the metric's validity as a usefulness proxy is untested.

The headline result is a clear empirical win on their chosen metric: issue-list review with neighboring and co-change context plus candidate integration reaches 28% refinement exact match on 1,438 Go instances, up from 17% for primary-issue review without context and 15% for CodeReviewer, while pruning keeps the list short. That combination of generation strategy and context types is the actual new piece.

They do the comparison systematically, test three context sources, integrate outputs, and add a pruning step that cuts average candidates from 7.2 to 3.1 with little loss. The numbers are reported with claimed statistical significance, and they include the human-oracle ceiling at 36%. The work is straightforward empirical prompting research on a real language and task.

The soft spot is the evaluation. Refinement exact match only credits a comment if it would produce the exact human revision diff. That treats the final human change as ground truth and ignores other valid fixes. The low oracle score already shows the metric is strict; without any separate human rating of comment usefulness or inter-rater check on the generated comments, the +10 point gain could be an artifact of how the proxy is defined rather than evidence of better reviews in practice.

This is for people working on LLM code review pipelines who care about context design and candidate generation. It is narrow to Go and to this one downstream metric, but the experiments are concrete enough that a serious referee should see it. The central claim holds on the numbers they report; the open question is whether those numbers track what reviewers actually need.

I would send it to peer review.

Referee Report

1 major / 1 minor

Summary. The paper claims that shifting from primary-issue to issue-list review, augmenting with neighboring and similar co-change context, integrating candidates across configurations, and applying refinement-guided pruning improves LLM-generated Go code review comments. On 1,438 instances this yields 28.00% refinement exact match (a statistically significant +10.85 pp gain over the 17.15% no-context baseline), outperforming CodeReviewer (15.02%) while approaching the human-oracle ceiling of 36.09%.

Significance. If the central empirical result holds, the work supplies concrete, immediately usable prompting and context strategies that raise the fraction of LLM review comments capable of triggering the same edit as a human revision. The evaluation scale, direct comparison against a fine-tuned baseline, and explicit human upper bound are positive features; the pruning result that reduces candidate volume while preserving most of the gain is also a practical contribution.

major comments (1)

[Evaluation] Evaluation (refinement exact match definition and human-oracle result): the primary success criterion treats 'contains at least one comment inducing the same code change as the final human revision' as the key indicator of usefulness. The reported human-oracle ceiling of only 36.09% already shows that many ground-truth comments fail the metric; without an additional human usefulness rating or inter-rater study on the generated comments, the +10.85 pp gain could be an artifact of the chosen proxy rather than evidence of improved review quality.

minor comments (1)

[Abstract] Abstract: the claim of 'statistically significant gain' is stated without naming the test, exact p-value, or correction method; these details belong in the abstract or a methods footnote for immediate verifiability.
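For paired per-instance outcomes such as these (the same 1,438 instances evaluated under two configurations), McNemar's test is one standard choice. The sketch below implements the exact two-sided version with the standard library only; the discordant counts are made up for illustration, and this summary does not state which test the paper actually used.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: instances solved only by configuration A
    c: instances solved only by configuration B
    Under the null hypothesis, each discordant instance is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, not taken from the paper.
print(mcnemar_exact(0, 10))
print(mcnemar_exact(5, 5))
```

Only the discordant pairs matter here: instances that both configurations solve, or both miss, carry no information about which configuration is better.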

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our evaluation design. Below we address the major comment directly.

Referee: [Evaluation] Evaluation (refinement exact match definition and human-oracle result): the primary success criterion treats 'contains at least one comment inducing the same code change as the final human revision' as the key indicator of usefulness. The reported human-oracle ceiling of only 36.09% already shows that many ground-truth comments fail the metric; without an additional human usefulness rating or inter-rater study on the generated comments, the +10.85 pp gain could be an artifact of the chosen proxy rather than evidence of improved review quality.

Authors: We selected refinement exact match as the primary metric because it supplies an objective, reproducible, and downstream-oriented measure: whether a generated comment would have produced the identical code edit that the human revision ultimately applied. This avoids subjective interpretation while directly quantifying practical utility for the code-review workflow. The 36.09% human-oracle ceiling is reported precisely to make the metric's limitations transparent; it indicates that many human comments address issues orthogonal to the final change (e.g., rejected suggestions or stylistic remarks) rather than invalidating the proxy. All comparisons (our configurations, the no-context baseline, and CodeReviewer) are performed under identical conditions, and the reported +10.85 pp gain is statistically significant. We therefore view the improvement as evidence that the proposed strategies increase the likelihood of surfacing comments aligned with observed human edits. While additional human usefulness ratings would be informative, they are not required to demonstrate relative gains under a consistent, automated criterion already used in prior code-review automation studies.

revision: no

Circularity Check

0 steps flagged

No circularity; purely empirical comparison of strategies

The paper conducts an empirical evaluation of prompting paradigms (issue-list vs primary-issue) and context augmentations on 1,438 Go review instances, measuring refinement exact match against human revisions and baselines (CodeReviewer, human oracle). No equations, parameter fits, derivations, or self-citation chains appear; all claims rest on direct experimental outcomes with external benchmarks. The metric choice is an explicit design decision, not a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study in software engineering. No free parameters, mathematical axioms, or invented entities are involved; all claims rest on experimental comparisons of LLM prompting strategies.


[1] [1]

Convergent contemporary software peer review practices,

P. C. Rigby and C. Bird, “Convergent contemporary software peer review practices,” inProceedings of the 2013 9th joint meeting on foundations of software engineering, 2013, pp. 202–212

2013

[2] [2]

What types of defects are really dis- covered in code reviews?

M. V . M ¨antyl¨a and C. Lassenius, “What types of defects are really dis- covered in code reviews?”IEEE Transactions on Software Engineering, vol. 35, no. 3, pp. 430–448, 2008

2008

[3] [3]

Modern code reviews in open-source projects: which problems do they fix?

M. Beller, A. Bacchelli, A. Zaidman, and E. J ¨urgens, “Modern code reviews in open-source projects: which problems do they fix?” in 11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, P. T. Devanbu, S. Kim, and M. Pinzger, Eds. ACM, 2014, pp. 202–211. [Online]. Available: https://doi.or...

work page doi:10.1145/2597073.2597082 2014

[4] [4]

What makes a code review useful to opendev developers? an empirical investigation,

A. K. Turzo and A. Bosu, “What makes a code review useful to opendev developers? an empirical investigation,”Empirical Software Engineering, vol. 29, no. 1, p. 6, 2024

2024

[5] [5]

An empirical study of the impact of modern code review practices on software quality,

S. McIntosh, Y . Kamei, B. Adams, and A. E. Hassan, “An empirical study of the impact of modern code review practices on software quality,”Empirical Software Engineering, vol. 21, no. 5, pp. 2146–2189, 2016

2016

[6] [6]

Potential technical debt and its resolution in code reviews: An exploratory study of the openstack and qt communities,

L. Fu, P. Liang, Z. Rasheed, Z. Li, A. Tahir, and X. Han, “Potential technical debt and its resolution in code reviews: An exploratory study of the openstack and qt communities,” inProceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2022, pp. 216–226

2022

[7] [7]

Expectations, outcomes, and challenges of modern code review,

A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of modern code review,” in2013 35th international conference on software engineering (ICSE). IEEE, 2013, pp. 712–721

2013

[8] [8]

On the scalability of linux kernel maintainers’ work,

M. Zhou, Q. Chen, A. Mockus, and F. Wu, “On the scalability of linux kernel maintainers’ work,” inProceedings of the 2017 11th joint meeting on foundations of software engineering, 2017, pp. 27–37

2017

[9] [9]

Using static analysis to find bugs,

N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, “Using static analysis to find bugs,”IEEE software, vol. 25, no. 5, pp. 22–29, 2008

2008

[10] [10]

Automating code review activities by large-scale pre-training,

Z. Li, S. Lu, D. Guo, N. Duan, S. Jannu, G. Jenks, D. Majumder, J. Green, A. Svyatkovskiy, S. Fuet al., “Automating code review activities by large-scale pre-training,” inProceedings of the 30th ACM joint European software engineering conference and symposium on the foundations of software engineering, 2022, pp. 1035–1047

2022

[11] [11]

Using pre-trained models to boost code review automa- tion,

R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella, D. Poshyvanyk, and G. Bavota, “Using pre-trained models to boost code review automa- tion,” inProceedings of the 44th international conference on software engineering, 2022, pp. 2291–2302

2022

[12] [12]

Large language models for software engi- neering: A systematic literature review,

X. Hou, Y . Zhao, Y . Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engi- neering: A systematic literature review,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–79, 2024

2024

[13] [13]

Automated code review in practice,

U. Cihan, V . Haratian, A. ˙Ic ¸¨oz, M. K. G ¨ul, ¨O. Devran, E. F. Bayendur, B. M. Uc ¸ar, and E. T¨uz¨un, “Automated code review in practice,” in2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 2025, pp. 425– 436

2025

[14] [14]

Does ai code review lead to code changes? a case study of github actions,

K. Sun, H. Kuang, S. Baltes, X. Zhou, H. Zhang, X. Ma, G. Rong, D. Shao, and C. Treude, “Does ai code review lead to code changes? a case study of github actions,”IEEE Transactions on Software Engi- neering, 2026

2026

[15] C. Y. Chong, P. Thongtanunam, and C. Tantithamthavorn, “Assessing the students’ understanding and their mistakes in code review checklists: an experience report of 1,791 code review checklist questions from 394 students,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET). IEEE, ...

[16] Google, “What to look for in a code review,” https://google.github.io/eng-practices/review/reviewer/looking-for.html, 2024, accessed: 2026-04-15

[17] J. Lu, X. Li, Z. Hua, L. Yu, S. Cheng, L. Yang, F. Zhang, and C. Zuo, “Deepcrceval: Revisiting the evaluation of code review comment generation,” in International Conference on Fundamental Approaches to Software Engineering. Springer, 2025, pp. 43–64

[18] A. Naik, M. Alenius, D. Fried, and C. Rose, “Crscore: Grounding automated evaluation of code review comments in code claims and smells,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 9049–9076

[19] Anonymous, “Replication package for LLMGoCodeReview,” To be published on Zenodo after acceptance. https://github.com/brinnarlyne8585/LLMGoCodeReview, 2026

[20] M. Brunsfeld et al., “Tree-sitter: A parser generator tool and an incremental parsing library,” https://tree-sitter.github.io/tree-sitter/, 2026, accessed: 2026-03

[21] DeepSeek-AI, “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024. [Online]. Available: https://arxiv.org/abs/2412.19437

[22] Go Team, “gopls: The go language server,” https://github.com/golang/tools/tree/master/gopls, 2026

[23] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318

[24] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with BERT,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr

[25] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 7212–7225

[26] Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,” Psychometrika, vol. 12, no. 2, pp. 153–157, 1947

[27] J. Cohen, Statistical power analysis for the behavioral sciences. Routledge, 2013

[28] B. Fluri and H. C. Gall, “Classifying change types for qualifying change couplings,” in 14th IEEE International Conference on Program Comprehension (ICPC’06). IEEE, 2006, pp. 35–45

[29] B. Fluri, M. Würsch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007

[30] C. Pornprasit and C. Tantithamthavorn, “Fine-tuning and prompt engineering for large language models-based code review automation,” Information and Software Technology, vol. 175, p. 107523, 2024

[31] H. Y. Lin, C. Liu, H. Gao, P. Thongtanunam, and C. Treude, “The code review comprehension assessment for large language models,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025

[32] V. J. Hellendoorn, J. Tsay, M. Mukherjee, and M. Hirzel, “Towards automating code review at scale,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1479–1482

[33] L. Li, L. Yang, H. Jiang, J. Yan, T. Luo, Z. Hua, G. Liang, and C. Zuo, “Auger: automatically generating review comments with pre-training models,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1009–1021

[34] H. Y. Lin, P. Thongtanunam, C. Treude, and W. Charoenwet, “Improving automated code reviews: Learning from experience,” in Proceedings of the 21st International Conference on Mining Software Repositories, 2024, pp. 278–283

[35] S. M. Abtahi and A. Azim, “Augmenting large language models with static code analysis for automated code quality improvements,” in 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge). IEEE, 2025, pp. 82–92

[36] J. Lu, L. Yu, X. Li, L. Yang, and C. Zuo, “Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2023, pp. 647–658

[37] Y. Yu, G. Rong, H. Shen, H. Zhang, D. Shao, M. Wang, Z. Wei, Y. Xu, and J. Wang, “Fine-tuning large language models to improve accuracy and comprehensibility of automated code review,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 1, pp. 1–26, 2024

[38] Y. Zhang, Y. Zhang, Z. Sun, Y. Jiang, and H. Liu, “Laura: Enhancing code review generation with context-enriched retrieval-augmented llm,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2025, pp. 2983–2995

[39] Y. Peng, K. Kim, L. Meng, and K. Liu, “icodereviewer: Improving secure code review with mixture of prompts,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2025, pp. 3204–3215

[40] T. Sun, J. Xu, Y. Li, Z. Yan, G. Zhang, L. Xie, L. Geng, Z. Wang, Y. Chen, Q. Lin et al., “Bitsai-cr: Automated code review via llm in practice,” in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, pp. 274–285

[41] K. Pereira, N. Sinha, R. Ghosh, and D. Dutta, “Cr-bench: Evaluating the real-world utility of ai code review agents,” arXiv preprint arXiv:2603.11078, 2026. [Online]. Available: https://arxiv.org/pdf/2603.11078

[42] Y. Zhang, Z. Pan, I. N. B. Yusuf, H. Ruan, R. Shariffdeen, and A. Roychoudhury, “Code review agent benchmark,” arXiv preprint arXiv:2603.23448, 2026. [Online]. Available: https://arxiv.org/abs/2603.23448