arxiv: 2604.24525 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.AI

Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

Pith reviewed 2026-05-08 02:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords automated code review · LLM-as-a-Judge · G-Eval · industrial software development · pull request comments · developer labels · evaluation limits · workflow constraints

The pith

Automated evaluation of code review bot comments reaches only moderate agreement with actual developer labels in industrial settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether two common AI-based methods can replace human judgment when deciding if comments from automated code review bots are useful. It applies these methods to more than 2,600 real pull-request comments that developers at one company had already marked as fixed or ignored. Agreement stays in the 0.44 to 0.62 range across models and scoring styles, and a follow-up interview shows that developer choices depend on timing, priorities, and team processes that static text cannot capture. The work therefore questions whether developer actions can serve as clean ground truth for scaling bot evaluation.

Core claim

On an industrial dataset of 2,604 bot-generated pull-request comments labeled by engineers as fixed or wontFix, both G-Eval and an LLM-as-a-Judge pipeline produce agreement ratios of roughly 0.44 to 0.62 with the human labels. Agreement stays within this range whether the automated judge outputs a binary decision or a 0-4 Likert score, though it varies noticeably across models and between the two formulations. A follow-up interview with a software engineering director corroborates that labeling behavior is shaped by workflow pressures and organizational constraints rather than comment quality alone.
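To make the headline metric concrete, the sketch below computes an agreement ratio as the plain fraction of comments on which a judge verdict matches the developer label. This definition is an assumption: the summary above does not spell out the paper's exact formula. The function name `agreement_ratio` and the toy data are hypothetical and are not drawn from the Beko dataset.

```python
from typing import Sequence


def agreement_ratio(developer_labels: Sequence[str],
                    judge_verdicts: Sequence[str]) -> float:
    """Fraction of comments where the judge verdict equals the developer label.

    Assumption: "agreement ratio" is read here as simple match rate.
    """
    if len(developer_labels) != len(judge_verdicts):
        raise ValueError("label and verdict sequences must have equal length")
    if not developer_labels:
        raise ValueError("at least one labeled comment is required")
    matches = sum(d == j for d, j in zip(developer_labels, judge_verdicts))
    return matches / len(developer_labels)


# Toy data, invented for illustration only (not from the paper's dataset).
developer = ["fixed", "wontFix", "fixed", "wontFix", "fixed"]
judge = ["fixed", "fixed", "fixed", "wontFix", "wontFix"]

ratio = agreement_ratio(developer, judge)
print(f"agreement ratio: {ratio:.2f}")
```

A match rate of this kind does not correct for chance agreement or class imbalance, which is one reason a value near 0.5 on a two-class task is hard to interpret on its own.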

What carries the argument

Direct comparison of G-Eval and LLM-as-a-Judge outputs against developer fixed/wontFix labels on real industrial pull-request comments.

If this is right

  • Developer actions on bot comments cannot be treated as objective ground truth for evaluation.
  • Automated evaluation performance remains sensitive to model choice and to binary versus scale-based scoring.
  • Static comment text alone omits the contextual factors that drive real developer decisions.
  • Industrial adoption of automated code review bots will require evaluation approaches that account for workflow dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation systems for code review bots may need to ingest live project metadata such as deadline pressure or reviewer workload to improve alignment.
  • Similar limits could appear when automating assessment of other developer-assistance tools that produce suggestions in context-dependent settings.
  • Hybrid human-AI loops might remain necessary for reliable quality signals until richer context can be fed to evaluators.

Load-bearing premise

That moderate agreement between the chosen automated evaluators and developer labels on this dataset indicates a general limit of automation rather than a fixable issue of prompt design or missing context.

What would settle it

A new industrial dataset on which the same or improved automated evaluators reach agreement above 0.8 with developer labels after adding workflow metadata or richer prompts would falsify the claim of fundamental limits.

Figures

Figures reproduced from arXiv: 2604.24525 by Baykal Mehmet Uçar, Eray Tüzün, Utku Boran Torun, Veli Karakaya.

Figure 1: This figure illustrates how multiple evaluation ...
Figure 2: Evaluation setup comparing G-Eval and LLM-as-a ...

original abstract

Automated code review (ACR) bots are increasingly used in industrial software development to assist developers during pull request (PR) review. As adoption grows, a key challenge is how to evaluate the usefulness of bot-generated comments reliably and at scale. In practice, such evaluation often relies on developer actions and annotations that are shaped by contextual and organizational factors, complicating their use as objective ground truth. We examine the feasibility and limitations of automating the evaluation of LLM-powered ACR bots in an industrial setting. We analyze an industrial dataset from Beko comprising 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation, enabling a controlled comparison against developer-provided labels. Across Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2, both evaluation strategies achieve only moderate alignment with human labels. Agreement ratios range from approximately 0.44 to 0.62, with noticeable variation across models and between binary and Likert-scale formulations, indicating sensitivity to both model choice and evaluation design. Our findings highlight practical limitations in fully automating the evaluation of ACR bot comments in industrial contexts. Developer actions such as resolving or ignoring comments reflect not only comment quality, but also contextual constraints, prioritization decisions, and workflow dynamics that are difficult to capture through static artifacts. Insights from a follow-up interview with a software engineering director further corroborate that developer labeling behavior is strongly influenced by workflow pressures and organizational constraints, reinforcing the challenges of treating such signals as objective ground truth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical analysis of 2,604 bot-generated pull request comments from an industrial dataset at Beko, each labeled by developers as fixed or wontFix. It applies two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, using both binary decisions and 0-4 Likert-scale formulations across three models: Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2. The study finds only moderate agreement (approximately 0.44 to 0.62) with the developer labels and concludes that this reflects practical limitations of automated evaluation in industrial contexts, because developer actions embed contextual constraints, prioritization, and workflow dynamics that static artifacts do not capture. A follow-up interview with a software engineering director is cited in support.

Significance. If the central empirical findings hold after addressing methodological gaps, the work usefully demonstrates that developer resolution labels cannot be treated as clean ground truth for assessing code review bot quality. This has direct implications for industrial tool evaluation practices and for research on LLM-based judges in software engineering, underscoring the need to incorporate richer context or organizational factors rather than relying on static comment analysis alone.

major comments (2)
  1. [Evaluation Methodology and Results] The claim that moderate agreement indicates inherent limits of automated evaluation (rather than limits of the specific G-Eval and LLM-as-Judge implementations) is load-bearing for the paper's conclusions, yet the manuscript provides no details on prompt construction, no ablation on prompt variants, and no experiments incorporating richer context such as full PR diffs or metadata. Without such tests it remains unclear whether agreement could be substantially raised, weakening the generalization from observed numbers to 'practical limitations in fully automating' evaluation.
  2. [Dataset and Human Labeling] The human labels (fixed/wontFix) are treated as the reference standard, but the paper reports neither inter-annotator agreement among the labeling engineers nor any statistical tests (e.g., confidence intervals or significance of differences across models and binary/Likert formulations). These omissions make it difficult to assess the reliability of the 0.44–0.62 range and the strength of the claim that developer actions are irreducibly contextual.
minor comments (2)
  1. [Results] Exact per-model and per-formulation agreement values, together with any confusion matrices or breakdown tables, should be presented explicitly rather than summarized as a range in the abstract and text.
  2. [Evaluation Approaches] The distinction between binary and Likert formulations is mentioned but the precise mapping of Likert scores to binary decisions (or vice versa) for comparison purposes is not stated clearly.
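The two minor comments can be illustrated together. The sketch below maps 0-4 Likert scores to fixed/wontFix using a cutoff and tabulates confusion counts for each candidate cutoff. The cutoff rule is purely an assumption, since the mapping the paper actually used is exactly what the referee says is unstated. The helper names and the toy scores are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def likert_to_binary(score: int, cutoff: int = 2) -> str:
    """Map a 0-4 Likert score to a label; the cutoff is an assumed rule."""
    if not 0 <= score <= 4:
        raise ValueError("Likert score must be between 0 and 4")
    return "fixed" if score >= cutoff else "wontFix"


def confusion_counts(developer_labels: Sequence[str],
                     judge_labels: Sequence[str]) -> Counter:
    """Count (developer label, judge label) pairs."""
    return Counter(zip(developer_labels, judge_labels))


# Toy data, invented for illustration only.
developer = ["fixed", "wontFix", "fixed", "wontFix", "fixed", "wontFix"]
likert = [4, 3, 1, 0, 2, 2]

agreement_by_cutoff: Dict[int, float] = {}
for cutoff in (1, 2, 3, 4):
    judge = [likert_to_binary(s, cutoff) for s in likert]
    counts = confusion_counts(developer, judge)
    agree = counts[("fixed", "fixed")] + counts[("wontFix", "wontFix")]
    agreement_by_cutoff[cutoff] = agree / len(developer)
    print(cutoff, dict(counts), round(agreement_by_cutoff[cutoff], 3))
```

Even on six toy items the match rate changes with the cutoff, which is why the mapping needs to be reported alongside any Likert-based agreement figure.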

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions we will incorporate to improve the manuscript.

point-by-point responses
  1. Referee: [Evaluation Methodology and Results] The claim that moderate agreement indicates inherent limits of automated evaluation (rather than limits of the specific G-Eval and LLM-as-Judge implementations) is load-bearing for the paper's conclusions, yet the manuscript provides no details on prompt construction, no ablation on prompt variants, and no experiments incorporating richer context such as full PR diffs or metadata. Without such tests it remains unclear whether agreement could be substantially raised, weakening the generalization from observed numbers to 'practical limitations in fully automating' evaluation.

    Authors: We appreciate the referee highlighting the need for greater methodological transparency to support our interpretation. The core empirical result is the moderate agreement (0.44-0.62) observed when applying established G-Eval and LLM-as-Judge pipelines to real industrial data; the director interview is cited to show that developer actions incorporate workflow and organizational factors absent from static comment text. To strengthen the paper, we will add the exact prompt templates for both binary and Likert-scale variants to an appendix. We will also include a limited ablation on prompt phrasing (e.g., variations in instruction specificity and scale anchoring). Experiments that incorporate full PR diffs and metadata were outside the scope of the current study due to data-access and computational constraints. We will revise the discussion and conclusion sections to frame the findings as evidence of practical limitations with current automated approaches on this dataset, rather than an absolute claim that no future implementation could improve agreement. revision: partial

  2. Referee: [Dataset and Human Labeling] The human labels (fixed/wontFix) are treated as the reference standard, but the paper reports neither inter-annotator agreement among the labeling engineers nor any statistical tests (e.g., confidence intervals or significance of differences across models and binary/Likert formulations). These omissions make it difficult to assess the reliability of the 0.44–0.62 range and the strength of the claim that developer actions are irreducibly contextual.

    Authors: We agree that additional statistical reporting will improve rigor. In the revised manuscript we will add bootstrap-derived confidence intervals for all reported agreement metrics and perform statistical significance tests (e.g., McNemar’s test for paired binary decisions and appropriate tests for Likert-scale differences) across models and evaluation formulations. Regarding inter-annotator agreement, the labels were assigned by individual developers as part of their standard PR workflow at Beko; the released dataset contains only these single annotations and does not include redundant ratings. Consequently, inter-annotator agreement cannot be computed. We will explicitly discuss this as a limitation of using authentic workflow labels and note that controlled multi-rater annotation studies could be pursued in future work. revision: partial
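The statistical additions promised in the second response can be sketched with the standard library alone. The code below computes a percentile bootstrap interval for one judge's match rate and an exact two-sided McNemar test comparing two judges on the same comments. It is a minimal illustration under stated assumptions, not the authors' analysis: the 0/1 match vectors are invented, and the percentile method and the exact binomial form of the test are choices made here rather than details given in the rebuttal.

```python
import math
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(matches: Sequence[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 match indicators."""
    if not matches:
        raise ValueError("at least one comment is required")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(matches)
    means: List[float] = sorted(
        sum(rng.choices(matches, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def mcnemar_exact(matches_a: Sequence[int], matches_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value from paired 0/1 match indicators."""
    if len(matches_a) != len(matches_b):
        raise ValueError("paired sequences must have equal length")
    # Discordant pairs: only one of the two judges matched the developer.
    only_a = sum(1 for a, b in zip(matches_a, matches_b) if a and not b)
    only_b = sum(1 for a, b in zip(matches_a, matches_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy indicators (1 = judge verdict matched the developer label); invented.
matches_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
matches_b = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]

observed_a = sum(matches_a) / len(matches_a)
ci_low, ci_high = bootstrap_ci(matches_a)
p_value = mcnemar_exact(matches_a, matches_b)
print(f"judge A agreement {observed_a:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
print(f"McNemar exact p-value: {p_value:.3f}")
```

McNemar's test only uses the discordant pairs, so two judges with very different overall match rates can still be statistically indistinguishable if few comments separate them.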

standing simulated objections not resolved
  • Inter-annotator agreement for the developer labels, which cannot be computed because the dataset contains only single annotations per comment from the industrial workflow.
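If a controlled multi-rater study of the kind mentioned in the rebuttal were run, inter-annotator agreement could be reported with a chance-corrected statistic. The sketch below uses Cohen's kappa for two raters as one common option. The source does not name a statistic, so this choice is an assumption, and the two rating vectors are invented for illustration.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (list(rater_a).count(c) / n) * (list(rater_b).count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical second annotation pass; not available in the released dataset.
rater_a = ["fixed", "fixed", "wontFix", "wontFix",
           "fixed", "wontFix", "fixed", "wontFix"]
rater_b = ["fixed", "wontFix", "wontFix", "wontFix",
           "fixed", "fixed", "fixed", "wontFix"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A human-human kappa would give an upper reference point for the judge-human agreement figures, which the single-annotation workflow labels cannot provide.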

Circularity Check

0 steps flagged

No significant circularity: purely empirical comparison on fixed external labels

full rationale

The paper conducts a direct empirical evaluation by applying G-Eval and LLM-as-Judge pipelines (with specified models, binary/Likert formulations, and prompts) to a fixed dataset of 2,604 developer-labeled comments and measuring agreement (0.44-0.62). No equations, fitted parameters, or derivations exist; the central claim follows from observed moderate alignment plus a separate director interview. No self-citations are load-bearing for any result, and the human labels serve as an independent external benchmark rather than being derived from the automated methods. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating developer fixed/wontFix labels as a usable (if imperfect) reference signal for measuring automated alignment, plus the assumption that the observed mismatch is driven by unmodeled contextual factors rather than evaluation implementation details.

axioms (1)
  • domain assumption: Developer actions on bot comments provide a usable reference signal for evaluating comment quality despite known contextual influences.
    The study uses these labels to compute agreement ratios and then interprets low agreement as evidence of evaluation limits.

pith-pipeline@v0.9.0 · 5618 in / 1394 out tokens · 118621 ms · 2026-05-08T02:58:38.186435+00:00 · methodology
