Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
Pith reviewed 2026-05-10 07:25 UTC · model grok-4.3
The pith
LLM judges for code change their verdicts based on how the prompt presents the same snippet.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-as-a-Judge decisions remain highly sensitive to prompt biases across code generation, code repair, and test generation even when the code artifact itself is unchanged. Several biases systematically move preferences toward the option the prompt favors, raising accuracy when that option aligns with the gold label but substantially lowering accuracy otherwise. These effects can alter task-level conclusions and change relative model rankings, showing that reported judge performance may reflect prompt artifacts rather than stable assessment ability.
What carries the argument
Controlled prompt interventions that isolate one presentation cue at a time, applied within pointwise judging regimes to measure both consistency across repeated runs and sensitivity to bias.
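To make the mechanism concrete, here is a minimal sketch of such a single-cue audit loop, assuming a pointwise PASS/FAIL verdict format; `call_judge`, the prompt template, and the authority cue are hypothetical stand-ins rather than the paper's actual harness.

```python
import random
from collections import Counter

BASE_PROMPT = "Is the following patch correct? Answer PASS or FAIL.\n\n{code}"
# A single presentation cue, toggled on top of an otherwise identical prompt.
BIAS_CUE = "Note: a senior engineer has already approved this patch.\n\n"

def call_judge(prompt: str) -> str:
    # Stand-in for a real LLM call; replace with an API request in practice.
    return random.choice(["PASS", "FAIL"])

def audit_case(code: str, gold: str, runs: int = 10) -> dict:
    neutral = [call_judge(BASE_PROMPT.format(code=code)) for _ in range(runs)]
    biased = [call_judge(BIAS_CUE + BASE_PROMPT.format(code=code)) for _ in range(runs)]
    return {
        # Run-to-run stability: share of repeated runs agreeing with the modal verdict.
        "consistency": max(Counter(neutral).values()) / runs,
        # Bias sensitivity: how far the cue pulls the PASS rate.
        "bias_shift": biased.count("PASS") / runs - neutral.count("PASS") / runs,
        "acc_neutral": neutral.count(gold) / runs,
        "acc_biased": biased.count(gold) / runs,
    }

print(audit_case("def add(a, b): return a + b", gold="PASS"))
```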
If this is right
- Reported performance numbers for LLM judges may capture prompt artifacts instead of genuine evaluation skill.
- Task-level conclusions and relative rankings among code models can reverse depending on the prompt framing chosen.
- Studies that use LLM judges for software engineering tasks should report bias sensitivity in addition to raw accuracy.
- Explicit controls for prompt bias are required before LLM judges can support trustworthy model comparisons.
Where Pith is reading between the lines
- In agentic workflows that rely on these judges to select patches, small prompt wording choices could steer development toward inferior solutions.
- Teams could mitigate the issue by running each judgment multiple times with deliberately varied prompt phrasings and taking the majority vote (see the sketch after this list).
- The same sensitivity pattern may appear in non-code domains where LLM judges are used, suggesting a broader need for bias audits.
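A minimal sketch of the majority-vote mitigation from the second bullet above; the phrasings and the `call_judge` stub are illustrative, and an odd number of binary-verdict phrasings avoids ties.

```python
import random
from collections import Counter

# Deliberately varied phrasings of the same pointwise question.
PHRASINGS = [
    "Does this patch correctly fix the bug? Answer PASS or FAIL.\n\n{code}",
    "Evaluate the following fix. Reply with PASS or FAIL only.\n\n{code}",
    "PASS or FAIL: is this change a correct repair?\n\n{code}",
]

def call_judge(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return random.choice(["PASS", "FAIL"])

def majority_verdict(code: str) -> str:
    votes = [call_judge(p.format(code=code)) for p in PHRASINGS]
    return Counter(votes).most_common(1)[0][0]

print(majority_verdict("def add(a, b): return a + b"))
```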
Load-bearing premise
The gold-standard answers used to calculate accuracy are themselves unbiased and representative of real evaluation needs, and the prompt interventions isolate single cues without introducing other confounds.
What would settle it
Repeating the full set of experiments on the same code artifacts but with gold labels that directly contradict the biased prompt options; if the accuracy shifts disappear, the central sensitivity claim would be falsified.
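One way to operationalize that test, sketched under assumed field names (`verdict`, `gold`, and `favored` are not the paper's schema): condition accuracy on whether the prompt-favored option agrees with the gold label, then check whether the gap survives on golds constructed to contradict the cue.

```python
def accuracy(cases: list[dict]) -> float:
    return sum(c["verdict"] == c["gold"] for c in cases) / len(cases)

def alignment_gap(cases: list[dict]) -> float:
    aligned = [c for c in cases if c["favored"] == c["gold"]]
    contrary = [c for c in cases if c["favored"] != c["gold"]]
    # The sensitivity claim predicts accuracy(aligned) >> accuracy(contrary);
    # a vanishing gap on cue-contradicting golds would falsify it.
    return accuracy(aligned) - accuracy(contrary)

print(alignment_gap([
    {"verdict": "PASS", "gold": "PASS", "favored": "PASS"},
    {"verdict": "PASS", "gold": "FAIL", "favored": "PASS"},
]))
```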
Original abstract
Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-as-a-Judge is increasingly relevant in agentic software engineering workflows, where it can help rank candidate solutions and guide patch selection. While attractive for scale, current practice lacks a principled account of reliability and bias: repeated evaluations of the same case can disagree; small prompt edits can swing outcomes; and seemingly semantics-preserving, human-equivalent perturbations may elicit divergent verdicts. This paper studies LLM-as-a-Judge for code through a measurement-first lens. We analyze two pointwise judging regimes across code generation, code repair, and test generation, and we systematically probe prompt-induced biases. Our study considers difficulty levels, repeated runs, and controlled prompt interventions that isolate one presentation cue at a time, and it evaluates judges using consistency and sensitivity to bias. We find that judge decisions are highly sensitive to prompt biases even when the underlying code snippet is unchanged. Across all three tasks, several biases systematically shift preferences toward the option favored by the prompt, improving accuracy when that option aligns with the gold answer but substantially reducing it otherwise. In some settings, these effects are large enough to change task-level conclusions and alter relative model rankings. These findings show that reported judge performance may reflect prompt artifacts rather than stable assessment ability, posing a direct threat to the validity and reproducibility of code evaluation. We therefore argue that LLM-as-a-Judge studies should report bias sensitivity alongside accuracy and incorporate explicit controls to support more trustworthy model comparison in software engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-as-a-Judge systems for software engineering tasks (code generation, code repair, and test generation) exhibit high sensitivity to prompt-induced biases even when code artifacts are unchanged. Using repeated evaluations and controlled interventions that isolate individual presentation cues, the authors measure consistency and bias sensitivity, finding that biases systematically favor the prompt-preferred option, boosting accuracy when aligned with gold labels but substantially lowering it otherwise, with effects large enough in some cases to alter task conclusions and relative model rankings. They conclude that reported judge performance may reflect prompt artifacts and recommend reporting bias sensitivity alongside accuracy.
Significance. If the empirical findings hold under scrutiny, the work is significant for the growing use of LLM judges in SE evaluation and agentic workflows, where reliable ranking of solutions is critical. It provides a measurement-first audit with repeated runs and isolated bias probes, offering concrete evidence that small prompt changes can undermine validity and reproducibility. This directly supports calls for more robust evaluation practices in the field.
Major comments (2)
- [Evaluation and accuracy measurement (tasks and gold labels)] The central interpretation that biases 'substantially reduce' accuracy and can change task-level conclusions rests on the assumption that gold-standard answers are uniquely correct or canonical. In code generation, repair, and test generation, however, multiple functionally equivalent or passing solutions frequently exist; a bias-induced shift toward another valid artifact would not constitute an accuracy drop in practice. This assumption is load-bearing for the strongest claims about practical harm and altered rankings, yet the manuscript provides no validation (e.g., via multiple gold labels, equivalence checks, or discussion of solution diversity) that the chosen golds are representative or exhaustive. One concrete form such a check could take is sketched after this list.
- [Prompt intervention design and bias isolation] The claim that controlled prompt interventions 'successfully isolate single presentation cues without introducing other confounds' is not fully supported by the reported methods. Without explicit checks (e.g., ablation on prompt length, lexical overlap, or semantic drift introduced by the bias phrasing), it remains possible that observed preference shifts partly reflect correlated changes rather than the isolated cue. This directly affects the attribution of effects to specific biases.
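As flagged in the first major comment, one concrete form the missing equivalence check could take is running both artifacts against the benchmark's executable test suite. The directory layout and test command below are hypothetical, and suite-passing is a proxy for equivalence, not a proof of it.

```python
import subprocess

def passes_suite(repo_dir: str, test_cmd: list[str]) -> bool:
    """True iff the benchmark's full test suite passes in repo_dir."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def equivalent_on_suite(patch_a_dir: str, patch_b_dir: str, test_cmd: list[str]) -> bool:
    # Two patches that both pass the same suite are treated as functionally
    # equivalent for scoring purposes; the suite may still under-specify behavior.
    return passes_suite(patch_a_dir, test_cmd) and passes_suite(patch_b_dir, test_cmd)
```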
Minor comments (3)
- [Abstract and results overview] The abstract and results summary refer to 'several biases' and 'large enough' effects without providing effect sizes, confidence intervals, or per-bias breakdowns in the high-level overview; including a summary table of bias types, direction, and magnitude would improve readability.
- [Experimental setup and analysis] Details on statistical tests for consistency (e.g., agreement metrics across repeated runs) and significance of accuracy shifts are referenced but not fully specified in the provided description; adding exact test names, p-value thresholds, and correction methods would strengthen verifiability. One candidate agreement statistic is sketched after this list.
- [Task difficulty stratification] The manuscript would benefit from explicit discussion of how difficulty levels were operationalized and whether bias sensitivity varies systematically with task difficulty.
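For the consistency metric raised in the second minor comment, one standard candidate is mean pairwise agreement across repeated verdicts per case; this is a sketch of that statistic only, not the test the paper actually ran.

```python
from itertools import combinations

def pairwise_agreement(verdicts: list[str]) -> float:
    pairs = list(combinations(verdicts, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def mean_consistency(runs_per_case: list[list[str]]) -> float:
    return sum(pairwise_agreement(v) for v in runs_per_case) / len(runs_per_case)

# Example: three cases judged five times each.
print(mean_consistency([
    ["PASS"] * 5,                     # fully consistent -> 1.0
    ["PASS", "FAIL"] * 2 + ["PASS"],  # mixed verdicts   -> 0.4
    ["FAIL"] * 5,                     # fully consistent -> 1.0
]))
```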
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and robustness of our findings on LLM-as-a-Judge bias in software engineering tasks. We address each major comment point by point below, providing the strongest honest defense supported by the manuscript while noting where revisions are warranted.
Point-by-point responses
Referee: [Evaluation and accuracy measurement (tasks and gold labels)] The central interpretation that biases 'substantially reduce' accuracy and can change task-level conclusions rests on the assumption that gold-standard answers are uniquely correct or canonical. In code generation, repair, and test generation, however, multiple functionally equivalent or passing solutions frequently exist; a bias-induced shift toward another valid artifact would not constitute an accuracy drop in practice. This assumption is load-bearing for the strongest claims about practical harm and altered rankings, yet the manuscript provides no validation (e.g., via multiple gold labels, equivalence checks, or discussion of solution diversity) that the chosen golds are representative or exhaustive.
Authors: We acknowledge the validity of this point: in SE tasks, functional equivalence among solutions is common, and our use of single gold labels from standard benchmarks (HumanEval for generation, Defects4J for repair, and established test-generation suites) does not exhaustively rule out other valid artifacts. The manuscript does not include multi-gold validation or full equivalence checking across all tasks. However, the core claims focus on systematic preference shifts toward the prompt-favored option, which we measure via consistency and accuracy relative to the benchmark golds; even if some shifts land on equivalents, the directional bias and its impact on reported performance remain. To strengthen this, we will add a dedicated limitations subsection discussing solution diversity, reference prior work on code equivalence, and include partial equivalence analysis (via test-suite validation) for the repair task where feasible. We have also softened the language around 'substantially reduce' accuracy to reflect this caveat. This is a partial revision that preserves the empirical findings while improving interpretability.
Revision: partial
Referee: [Prompt intervention design and bias isolation] The claim that controlled prompt interventions 'successfully isolate single presentation cues without introducing other confounds' is not fully supported by the reported methods. Without explicit checks (e.g., ablation on prompt length, lexical overlap, or semantic drift introduced by the bias phrasing), it remains possible that observed preference shifts partly reflect correlated changes rather than the isolated cue. This directly affects the attribution of effects to specific biases.
Authors: We agree that stronger evidence for isolation would bolster attribution. The interventions were designed as minimal, targeted edits (e.g., adding a single bias phrase while holding all other prompt structure fixed), and the manuscript reports repeated runs to control for stochasticity. However, no explicit ablations for length, lexical overlap, or semantic drift were included. We will revise the methods section to detail the intervention construction process and add an appendix with supporting analyses: (1) prompt-length ablations showing effects persist at matched lengths, (2) lexical-overlap metrics (e.g., Jaccard similarity) confirming minimal unintended changes, and (3) embedding-based semantic-drift checks demonstrating that bias cues do not introduce broader meaning shifts. These additions directly address the concern and confirm that the observed preference shifts are attributable to the isolated cues.
Revision: yes
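A sketch of two of the isolation checks promised in this response: token-level Jaccard overlap and a prompt-length delta between the neutral prompt and its biased variant. The target values (Jaccard near 1.0, small delta) are illustrative, and the embedding-based drift check is omitted here.

```python
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def isolation_report(neutral: str, biased: str) -> dict:
    return {
        "jaccard": jaccard(neutral, biased),  # want this close to 1.0
        "length_delta": abs(len(biased.split()) - len(neutral.split())),
    }

NEUTRAL = "Is the following patch correct? Answer PASS or FAIL."
BIASED = "Note: reviewers liked this patch. " + NEUTRAL
print(isolation_report(NEUTRAL, BIASED))
```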
Circularity Check
No circularity: direct empirical measurements with no derivations or self-referential reductions
Full rationale
This is a measurement study that runs controlled experiments on LLM judges across code generation, repair, and test generation tasks, measuring consistency and bias sensitivity via prompt interventions. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. All reported effects (accuracy shifts, ranking changes) are direct observations from the experimental runs rather than reductions to prior definitions or citations. The central claims rest on the experimental design and gold labels as external benchmarks, with no self-definitional loops or ansatz smuggling.