Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
Pith reviewed 2026-05-10 07:25 UTC · model grok-4.3
The pith
LLM judges for code change their verdicts based on how the prompt presents the same snippet.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-as-a-Judge decisions remain highly sensitive to prompt biases across code generation, code repair, and test generation even when the code artifact itself is unchanged. Several biases systematically move preferences toward the option the prompt favors, raising accuracy when that option aligns with the gold label but substantially lowering accuracy otherwise. These effects can alter task-level conclusions and change relative model rankings, showing that reported judge performance may reflect prompt artifacts rather than stable assessment ability.
What carries the argument
Controlled prompt interventions that isolate one presentation cue at a time, applied within pointwise judging regimes to measure both consistency across repeated runs and sensitivity to bias.
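To make the mechanism concrete, here is a minimal sketch of such a single-cue audit loop, assuming a pointwise PASS/FAIL verdict format; `call_judge`, the prompt template, and the authority cue are hypothetical stand-ins rather than the paper's actual harness.

```python
import random
from collections import Counter

BASE_PROMPT = "Is the following patch correct? Answer PASS or FAIL.\n\n{code}"
# A single presentation cue, toggled on top of an otherwise identical prompt.
BIAS_CUE = "Note: a senior engineer has already approved this patch.\n\n"

def call_judge(prompt: str) -> str:
    # Stand-in for a real LLM call; replace with an API request in practice.
    return random.choice(["PASS", "FAIL"])

def audit_case(code: str, gold: str, runs: int = 10) -> dict:
    neutral = [call_judge(BASE_PROMPT.format(code=code)) for _ in range(runs)]
    biased = [call_judge(BIAS_CUE + BASE_PROMPT.format(code=code)) for _ in range(runs)]
    return {
        # Run-to-run stability: share of repeated runs agreeing with the modal verdict.
        "consistency": max(Counter(neutral).values()) / runs,
        # Bias sensitivity: how far the cue pulls the PASS rate.
        "bias_shift": biased.count("PASS") / runs - neutral.count("PASS") / runs,
        "acc_neutral": neutral.count(gold) / runs,
        "acc_biased": biased.count(gold) / runs,
    }

print(audit_case("def add(a, b): return a + b", gold="PASS"))
```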
If this is right
- Reported performance numbers for LLM judges may capture prompt artifacts instead of genuine evaluation skill.
- Task-level conclusions and relative rankings among code models can reverse depending on the prompt framing chosen.
- Studies that use LLM judges for software engineering tasks should report bias sensitivity in addition to raw accuracy.
- Explicit controls for prompt bias are required before LLM judges can support trustworthy model comparisons.
Where Pith is reading between the lines
- In agentic workflows that rely on these judges to select patches, small prompt wording choices could steer development toward inferior solutions.
- Teams could mitigate the issue by running each judgment multiple times with deliberately varied prompt phrasings and taking the majority vote (see the sketch after this list).
- The same sensitivity pattern may appear in non-code domains where LLM judges are used, suggesting a broader need for bias audits.
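A minimal sketch of the majority-vote mitigation from the second bullet above; the phrasings and the `call_judge` stub are illustrative, and an odd number of binary-verdict phrasings avoids ties.

```python
import random
from collections import Counter

# Deliberately varied phrasings of the same pointwise question.
PHRASINGS = [
    "Does this patch correctly fix the bug? Answer PASS or FAIL.\n\n{code}",
    "Evaluate the following fix. Reply with PASS or FAIL only.\n\n{code}",
    "PASS or FAIL: is this change a correct repair?\n\n{code}",
]

def call_judge(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return random.choice(["PASS", "FAIL"])

def majority_verdict(code: str) -> str:
    votes = [call_judge(p.format(code=code)) for p in PHRASINGS]
    return Counter(votes).most_common(1)[0][0]

print(majority_verdict("def add(a, b): return a + b"))
```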
Load-bearing premise
The gold-standard answers used to calculate accuracy are themselves unbiased and representative of real evaluation needs, and the prompt interventions isolate single cues without introducing other confounds.
What would settle it
Repeating the full set of experiments on the same code artifacts but with gold labels that directly contradict the biased prompt options; if the accuracy shifts disappear, the central sensitivity claim would be falsified.
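One way to operationalize that test, sketched under assumed field names (`verdict`, `gold`, and `favored` are not the paper's schema): condition accuracy on whether the prompt-favored option agrees with the gold label, then check whether the gap survives on golds constructed to contradict the cue.

```python
def accuracy(cases: list[dict]) -> float:
    return sum(c["verdict"] == c["gold"] for c in cases) / len(cases)

def alignment_gap(cases: list[dict]) -> float:
    aligned = [c for c in cases if c["favored"] == c["gold"]]
    contrary = [c for c in cases if c["favored"] != c["gold"]]
    # The sensitivity claim predicts accuracy(aligned) >> accuracy(contrary);
    # a vanishing gap on cue-contradicting golds would falsify it.
    return accuracy(aligned) - accuracy(contrary)

print(alignment_gap([
    {"verdict": "PASS", "gold": "PASS", "favored": "PASS"},
    {"verdict": "PASS", "gold": "FAIL", "favored": "PASS"},
]))
```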
Original abstract
Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-as-a-Judge is increasingly relevant in agentic software engineering workflows, where it can help rank candidate solutions and guide patch selection. While attractive for scale, current practice lacks a principled account of reliability and bias: repeated evaluations of the same case can disagree; small prompt edits can swing outcomes; and seemingly semantics-preserving, human-equivalent perturbations may elicit divergent verdicts. This paper studies LLM-as-a-Judge for code through a measurement-first lens. We analyze two pointwise judging regimes across code generation, code repair, and test generation, and we systematically probe prompt-induced biases. Our study considers difficulty levels, repeated runs, and controlled prompt interventions that isolate one presentation cue at a time, and it evaluates judges using consistency and sensitivity to bias. We find that judge decisions are highly sensitive to prompt biases even when the underlying code snippet is unchanged. Across all three tasks, several biases systematically shift preferences toward the option favored by the prompt, improving accuracy when that option aligns with the gold answer but substantially reducing it otherwise. In some settings, these effects are large enough to change task-level conclusions and alter relative model rankings. These findings show that reported judge performance may reflect prompt artifacts rather than stable assessment ability, posing a direct threat to the validity and reproducibility of code evaluation. We therefore argue that LLM-as-a-Judge studies should report bias sensitivity alongside accuracy and incorporate explicit controls to support more trustworthy model comparison in software engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-as-a-Judge systems for software engineering tasks (code generation, code repair, and test generation) exhibit high sensitivity to prompt-induced biases even when code artifacts are unchanged. Using repeated evaluations and controlled interventions that isolate individual presentation cues, the authors measure consistency and bias sensitivity, finding that biases systematically favor the prompt-preferred option, boosting accuracy when aligned with gold labels but substantially lowering it otherwise, with effects large enough in some cases to alter task conclusions and relative model rankings. They conclude that reported judge performance may reflect prompt artifacts and recommend reporting bias sensitivity alongside accuracy.
Significance. If the empirical findings hold under scrutiny, the work is significant for the growing use of LLM judges in SE evaluation and agentic workflows, where reliable ranking of solutions is critical. It provides a measurement-first audit with repeated runs and isolated bias probes, offering concrete evidence that small prompt changes can undermine validity and reproducibility. This directly supports calls for more robust evaluation practices in the field.
Major comments (2)
- [Evaluation and accuracy measurement (tasks and gold labels)] The central interpretation that biases 'substantially reduce' accuracy and can change task-level conclusions rests on the assumption that gold-standard answers are uniquely correct or canonical. In code generation, repair, and test generation, however, multiple functionally equivalent or passing solutions frequently exist; a bias-induced shift toward another valid artifact would not constitute an accuracy drop in practice. This assumption is load-bearing for the strongest claims about practical harm and altered rankings, yet the manuscript provides no validation (e.g., via multiple gold labels, equivalence checks, or discussion of solution diversity) that the chosen golds are representative or exhaustive. One concrete form such a check could take is sketched after this list.
- [Prompt intervention design and bias isolation] The claim that controlled prompt interventions 'successfully isolate single presentation cues without introducing other confounds' is not fully supported by the reported methods. Without explicit checks (e.g., ablation on prompt length, lexical overlap, or semantic drift introduced by the bias phrasing), it remains possible that observed preference shifts partly reflect correlated changes rather than the isolated cue. This directly affects the attribution of effects to specific biases.
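As flagged in the first major comment, one concrete form the missing equivalence check could take is running both artifacts against the benchmark's executable test suite. The directory layout and test command below are hypothetical, and suite-passing is a proxy for equivalence, not a proof of it.

```python
import subprocess

def passes_suite(repo_dir: str, test_cmd: list[str]) -> bool:
    """True iff the benchmark's full test suite passes in repo_dir."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def equivalent_on_suite(patch_a_dir: str, patch_b_dir: str, test_cmd: list[str]) -> bool:
    # Two patches that both pass the same suite are treated as functionally
    # equivalent for scoring purposes; the suite may still under-specify behavior.
    return passes_suite(patch_a_dir, test_cmd) and passes_suite(patch_b_dir, test_cmd)
```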
Minor comments (3)
- [Abstract and results overview] The abstract and results summary refer to 'several biases' and 'large enough' effects without providing effect sizes, confidence intervals, or per-bias breakdowns in the high-level overview; including a summary table of bias types, direction, and magnitude would improve readability.
- [Experimental setup and analysis] Details on statistical tests for consistency (e.g., agreement metrics across repeated runs) and significance of accuracy shifts are referenced but not fully specified in the provided description; adding exact test names, p-value thresholds, and correction methods would strengthen verifiability. One candidate agreement statistic is sketched after this list.
- [Task difficulty stratification] The manuscript would benefit from explicit discussion of how difficulty levels were operationalized and whether bias sensitivity varies systematically with task difficulty.
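For the consistency metric raised in the second minor comment, one standard candidate is mean pairwise agreement across repeated verdicts per case; this is a sketch of that statistic only, not the test the paper actually ran.

```python
from itertools import combinations

def pairwise_agreement(verdicts: list[str]) -> float:
    pairs = list(combinations(verdicts, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def mean_consistency(runs_per_case: list[list[str]]) -> float:
    return sum(pairwise_agreement(v) for v in runs_per_case) / len(runs_per_case)

# Example: three cases judged five times each.
print(mean_consistency([
    ["PASS"] * 5,                     # fully consistent -> 1.0
    ["PASS", "FAIL"] * 2 + ["PASS"],  # mixed verdicts   -> 0.4
    ["FAIL"] * 5,                     # fully consistent -> 1.0
]))
```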
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and robustness of our findings on LLM-as-a-Judge bias in software engineering tasks. We address each major comment point by point below, providing the strongest honest defense supported by the manuscript while noting where revisions are warranted.
Point-by-point responses
Referee: [Evaluation and accuracy measurement (tasks and gold labels)] The central interpretation that biases 'substantially reduce' accuracy and can change task-level conclusions rests on the assumption that gold-standard answers are uniquely correct or canonical. In code generation, repair, and test generation, however, multiple functionally equivalent or passing solutions frequently exist; a bias-induced shift toward another valid artifact would not constitute an accuracy drop in practice. This assumption is load-bearing for the strongest claims about practical harm and altered rankings, yet the manuscript provides no validation (e.g., via multiple gold labels, equivalence checks, or discussion of solution diversity) that the chosen golds are representative or exhaustive.
Authors: We acknowledge the validity of this point: in SE tasks, functional equivalence among solutions is common, and our use of single gold labels from standard benchmarks (HumanEval for generation, Defects4J for repair, and established test-generation suites) does not exhaustively rule out other valid artifacts. The manuscript does not include multi-gold validation or full equivalence checking across all tasks. However, the core claims focus on systematic preference shifts toward the prompt-favored option, which we measure via consistency and accuracy relative to the benchmark golds; even if some shifts land on equivalents, the directional bias and its impact on reported performance remain. To strengthen this, we will add a dedicated limitations subsection discussing solution diversity, reference prior work on code equivalence, and include partial equivalence analysis (via test-suite validation) for the repair task where feasible. We have also softened the language around 'substantially reduce' accuracy to reflect this caveat. This is a partial revision that preserves the empirical findings while improving interpretability.
Revision: partial
Referee: [Prompt intervention design and bias isolation] The claim that controlled prompt interventions 'successfully isolate single presentation cues without introducing other confounds' is not fully supported by the reported methods. Without explicit checks (e.g., ablation on prompt length, lexical overlap, or semantic drift introduced by the bias phrasing), it remains possible that observed preference shifts partly reflect correlated changes rather than the isolated cue. This directly affects the attribution of effects to specific biases.
Authors: We agree that stronger evidence for isolation would bolster attribution. The interventions were designed as minimal, targeted edits (e.g., adding a single bias phrase while holding all other prompt structure fixed), and the manuscript reports repeated runs to control for stochasticity. However, no explicit ablations for length, lexical overlap, or semantic drift were included. We will revise the methods section to detail the intervention construction process and add an appendix with supporting analyses: (1) prompt-length ablations showing effects persist at matched lengths, (2) lexical-overlap metrics (e.g., Jaccard similarity) confirming minimal unintended changes, and (3) embedding-based semantic-drift checks demonstrating that bias cues do not introduce broader meaning shifts. These additions directly address the concern and confirm that the observed preference shifts are attributable to the isolated cues.
Revision: yes
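A sketch of two of the isolation checks promised in this response: token-level Jaccard overlap and a prompt-length delta between the neutral prompt and its biased variant. The target values (Jaccard near 1.0, small delta) are illustrative, and the embedding-based drift check is omitted here.

```python
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def isolation_report(neutral: str, biased: str) -> dict:
    return {
        "jaccard": jaccard(neutral, biased),  # want this close to 1.0
        "length_delta": abs(len(biased.split()) - len(neutral.split())),
    }

NEUTRAL = "Is the following patch correct? Answer PASS or FAIL."
BIASED = "Note: reviewers liked this patch. " + NEUTRAL
print(isolation_report(NEUTRAL, BIASED))
```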
Circularity Check
No circularity: direct empirical measurements with no derivations or self-referential reductions
Full rationale
This is a measurement study that runs controlled experiments on LLM judges across code generation, repair, and test generation tasks, measuring consistency and bias sensitivity via prompt interventions. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. All reported effects (accuracy shifts, ranking changes) are direct observations from the experimental runs rather than reductions to prior definitions or citations. The central claims rest on the experimental design and gold labels as external benchmarks, with no self-definitional loops or ansatz smuggling.