arxiv: 2605.21748 · v1 · pith:J4DCT3NRnew · submitted 2026-05-20 · 💻 cs.CL

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Pith reviewed 2026-05-22 08:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM-as-a-judge · multi-turn evaluation · synthetic benchmark · flaw injection · Bradley-Terry ranking · conversation pairs · reference grounding

The pith

RankJudge generates paired multi-turn conversations differing by one injected flaw to give LLM judges an unambiguous correctness signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generator that produces reference-grounded conversation pairs for testing LLM judges on multi-turn tasks. One conversation in each pair receives a single targeted flaw in one turn, so the better version is known by construction and the error type is isolated to that turn. This setup supports a strict joint correctness criterion and lets developers rank judges across machine learning, biomedicine, and finance domains using the Bradley-Terry model. Rankings remain stable when judges see only partial context, when correctness is defined more coarsely, or when an alternative random-walk scorer is substituted. The method also assigns difficulty ratings to pairs, which are used to curate lower-noise evaluation slices that human annotators confirm.

Core claim

By injecting exactly one flaw into one turn of an otherwise identical conversation pair grounded in a reference document, RankJudge produces labeled examples that isolate specific failure categories and enable unambiguous better/worse judgments for LLM-as-a-judge evaluation.

What carries the argument

The RankJudge synthetic benchmark generator that creates conversation pairs differing by a single flaw in one turn.
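As a rough illustration of the single-flaw construction, the sketch below represents a conversation as a sequence of turns and builds a pair by replacing the text of exactly one turn. Every name here (`Turn`, `ConversationPair`, `inject_flaw`, `differing_turns`, and the `failure_type` label) is hypothetical and not taken from the paper, and how the flawed text itself is produced is outside the sketch.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Turn:
    role: str  # e.g. "user" or "assistant"
    text: str


@dataclass(frozen=True)
class ConversationPair:
    reference: str            # grounding document shared by both conversations
    good: Tuple[Turn, ...]    # unmodified conversation
    flawed: Tuple[Turn, ...]  # same conversation with one turn altered
    flawed_turn: int          # index of the altered turn
    failure_type: str         # hypothetical category label for the flaw


def inject_flaw(reference: str, turns: Sequence[Turn], turn_index: int,
                flawed_text: str, failure_type: str) -> ConversationPair:
    """Build a pair whose members differ only at `turn_index`."""
    if not 0 <= turn_index < len(turns):
        raise IndexError("turn_index out of range")
    if turns[turn_index].text == flawed_text:
        raise ValueError("flawed text must differ from the original turn")
    flawed = list(turns)
    flawed[turn_index] = replace(turns[turn_index], text=flawed_text)
    return ConversationPair(reference, tuple(turns), tuple(flawed),
                            turn_index, failure_type)


def differing_turns(pair: ConversationPair) -> List[int]:
    """Indices where the two conversations differ; a sanity check on isolation."""
    return [i for i, (a, b) in enumerate(zip(pair.good, pair.flawed)) if a != b]
```

A check such as `differing_turns(pair) == [pair.flawed_turn]` only confirms that the textual difference is confined to one turn. It does not show that the surrounding context leaves the quality gap intact, which is the concern raised under the load-bearing premise.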

If this is right

  • Frontier LLM judges can be ranked by how often they correctly identify the flawless conversation in each pair.
  • Difficulty ratings derived from the same pairs allow dynamic selection of evaluation subsets with reduced label noise.
  • Judge rankings stay consistent under partial conversation visibility and under coarser or alternative scoring rules.
  • The same construction supports precise diagnosis of which error types individual judges fail to detect.
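One way to turn per-pair outcomes into a ranking is to fit a Bradley-Terry model in which each judge "plays" each conversation pair: the judge wins when its verdict is correct and the pair wins otherwise, so a single fit yields both judge ratings and pair difficulty ratings. The sketch below is a minimal illustration under that assumption, not the paper's implementation. The function names, the MM-style update, the virtual-opponent `prior` used for regularization, and the conventional 400/1500 Elo scale are all illustrative choices.

```python
import math
from collections import defaultdict


def fit_bradley_terry(outcomes, iters=200, prior=0.5):
    """Fit strengths for judges and pairs from (judge, pair_id, correct) triples.

    Each player also gets `prior` virtual wins and `prior` virtual losses
    against a fixed opponent of strength 1, which keeps estimates finite for
    players that win or lose every match and anchors the overall scale.
    """
    wins = defaultdict(float)
    games = defaultdict(float)
    players = set()
    for judge, pair_id, correct in outcomes:
        j, p = ("judge", judge), ("pair", pair_id)
        players.update((j, p))
        wins[j if correct else p] += 1.0
        games[(j, p)] += 1.0
        games[(p, j)] += 1.0

    opponents = defaultdict(list)
    for (a, b), n in games.items():
        opponents[a].append((b, n))

    strength = {pl: 1.0 for pl in players}
    for _ in range(iters):
        updated = {}
        for pl in players:
            # Matches against the virtual opponent of strength 1.
            denom = 2.0 * prior / (strength[pl] + 1.0)
            # Real matches against each observed opponent.
            denom += sum(n / (strength[pl] + strength[o]) for o, n in opponents[pl])
            updated[pl] = (wins[pl] + prior) / denom
        strength = updated
    return strength


def to_elo(strength, scale=400.0, base=1500.0):
    """Map Bradley-Terry strengths onto an Elo-style scale."""
    return {pl: base + scale * math.log10(s) for pl, s in strength.items()}
```

Under this framing a higher rating for a pair means that judges fail on it more often, which is one plausible reading of a per-pair difficulty rating.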

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The generator could be extended to inject multiple interacting flaws or to simulate longer conversation histories.
  • Developers could use the per-turn isolation to create targeted training data that improves judges on specific weaknesses.
  • The approach might generalize to non-document-grounded conversations if reference material can be replaced by verifiable external facts.

Load-bearing premise

Injecting a single flaw into one turn produces pairs that are unambiguously better or worse without creating other unintended differences that would confuse the label.

What would settle it

Human annotators systematically disagree on which conversation is better in a large fraction of the generated pairs.

Figures

Figures reproduced from arXiv: 2605.21748 by Jesse C. Cresswell, Keyvan Golestan, Rasa Hosseinzadeh, Tongzi Wu, Zhaoyan Liu, Zhenwei Tang.

Figure 1: Overview of RANKJUDGE, a benchmark generator for multi-turn judge evaluation. … the two apart. Lastly, static accuracy on a fixed pool offers no principled way to identify which items actually separate strong judges from weak ones [5, 6, 7]. In this paper, we introduce RANKJUDGE, a benchmark generator for multi-turn, reference-grounded judge evaluation. Each item is a pair of conversations sampled independen…
Figure 2: (Left) Cumulative fraction of samples annotated by humans as having ambiguous or noisy …
Figure 3: Elo scores of 21 judges on the combined dataset. Black circles give the combined Elo with 95% CI; colored markers show per-domain Elo scores. Tick-label color denotes proprietary (blue) vs. open-source (orange) judges
Figure 4: Judge Elo against per-match compute. Top: mean completion tokens per match (linear). …
Figure 5: Per-class prediction bias for each judge. Each cell gives the difference in percentage points …
Figure 6: (a) Judge Elo with and without failure-type correctness as a criterion. (b) Accuracy vs. …
Figure 7: Per-pair Elo on the combined dataset, grouped by (a) assistant failure type and (b) user …
Figure 8: Same-model preference does not distort the rest of the leaderboard. Each panel is a slope …
Figure 9: Per-class prediction bias for each judge across the three domains. Each cell gives the …
Figure 10: Per-judge confusion of assistant failure type predictions, row-normalized so each …
Figure 11: Per-domain version of Figure 6(a). Judge Elo with the full correctness criterion (x-axis) …
Figure 12: Per-domain version of Figure 6(c). Each panel reports Spearman …
Figure 13: Judge ranks with (x-axis) and without (y-axis) the top-ranked pair removal, one panel per …
Figure 14: Judge ranks under Bradley–Terry Elo (x-axis) and Empirical Interaction Propagation …
Figure 15: Pointwise vs. pairwise judging on a 100-pair stratified sample. (a) Each judge’s pointwise Elo plotted against its pairwise Elo recomputed on the same pair set; the dashed line marks y = x. (b) Per-judge pointwise score-gap: the mean Likert score on the good conversation in each pair minus the mean score on the flawed one, sorted descending.
Figure 16: The pair-audit interface used for the human label-noise audit. The annotator inspects the …
Original abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
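The abstract does not spell out the components of the strict joint correctness criterion. The sketch below assumes, purely for illustration, that a judge's verdict counts as correct only when it names the flawed conversation, the flawed turn, and the failure category all at once; the caption of Figure 6 mentions failure-type correctness as one such component. The `Judgment` type, its field names, and the `require_*` switches for coarser criteria are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    flawed_side: str   # which conversation is worse, e.g. "A" or "B"
    flawed_turn: int   # index of the turn believed to contain the flaw
    failure_type: str  # hypothetical failure category label


def is_correct(label: Judgment, verdict: Judgment,
               require_turn: bool = True, require_type: bool = True) -> bool:
    """Joint criterion: every required component must match the label.

    Turning off `require_turn` or `require_type` gives a coarser criterion
    that only checks the remaining components.
    """
    if verdict.flawed_side != label.flawed_side:
        return False
    if require_turn and verdict.flawed_turn != label.flawed_turn:
        return False
    if require_type and verdict.failure_type != label.failure_type:
        return False
    return True
```

With both switches off, the check reduces to a plain better/worse preference, so the same verdicts can be rescored under strict and coarse criteria without rerunning any judge.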

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. It creates pairs of conversations differing by a single flaw injected into one turn, enabling unambiguous better/worse labels and precise isolation of failure categories to individual turns for a strict joint correctness criterion. The method is implemented across machine learning, biomedicine, and finance domains; 21 frontier LLM judges are evaluated and ranked via the Bradley-Terry model. Difficulty ratings allow dynamic curation of evaluation slices to reduce label noise (confirmed by human annotation), and judge rankings are reported as stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.

Significance. If the single-flaw construction reliably isolates quality differences, RankJudge offers a scalable, low-cost way to benchmark judges on complex conversational tasks where human evaluation is impractical. Strengths include the use of Bradley-Terry ranking, human validation of the curation process, explicit stability tests across multiple conditions, and domain-specific implementations. This directly addresses the gap in existing LLM-judge benchmarks that focus primarily on single-turn Q&A.
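A stability test of the kind mentioned here can be phrased as a rank correlation between two leaderboards, for example Elo under the full criterion versus Elo under a coarser one; the caption of Figure 12 mentions Spearman in this role. The following is a minimal, dependency-free sketch. The function names and the input format (dicts mapping judge name to rating) are assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n in ascending order, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def ranking_stability(ratings_a, ratings_b):
    """Spearman correlation over the judges present in both rating dicts."""
    judges = sorted(set(ratings_a) & set(ratings_b))
    return spearman([ratings_a[j] for j in judges],
                    [ratings_b[j] for j in judges])
```

A value near 1 indicates that the two conditions order the judges almost identically, even if the absolute ratings shift.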

major comments (2)
  1. [§3 (RankJudge Construction)] The central claim that single-flaw injection into one turn produces pairs that can be labeled unambiguously as better or worse, while isolating failure categories, assumes the injected flaw remains the sole source of quality difference without being masked, compensated, or compounded by surrounding turns or the reference document. No post-injection consistency checks, turn-isolation metrics, or verification that context does not alter the net quality delta are described, which is load-bearing for the strict joint correctness criterion.
  2. [§5 (Evaluation Protocols)] The exact prompting templates and decision rules used to apply the joint correctness criterion when scoring judge outputs on the paired conversations are not specified in sufficient detail to reproduce the isolation of failure categories or to confirm that multi-turn context effects do not undermine the labeling.
minor comments (2)
  1. [§4.3] The description of how difficulty ratings are computed from the Bradley-Terry model and then used for dynamic curation could be expanded with a short pseudocode or equation for clarity.
  2. [Table 1] Table 1 (domain statistics) would benefit from an additional column reporting the number of unique reference documents per domain to help readers assess diversity.
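To make the first minor comment concrete, the curation step could be as simple as ranking pairs by difficulty rating and dropping the hardest ones, on the reasoning that pairs almost no judge gets right are the likeliest to carry a noisy label. The caption of Figure 13 refers to removal of top-ranked pairs, but the rule below, the `curate_slice` name, and the `drop_fraction` parameter are hypothetical and are not the paper's procedure.

```python
def curate_slice(pair_ratings, drop_fraction=0.1):
    """Return pair ids after removing the highest-rated (hardest) pairs.

    `pair_ratings` maps pair id -> difficulty rating (higher = harder).
    The number of pairs dropped is len(pair_ratings) * drop_fraction,
    rounded down.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    ranked = sorted(pair_ratings, key=pair_ratings.get, reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    dropped = set(ranked[:n_drop])
    # Preserve the original insertion order of the remaining pairs.
    return [pair_id for pair_id in pair_ratings if pair_id not in dropped]
```

Because the ratings move each time the slice changes, a dynamic version would refit the ratings on the retained pairs and repeat until the slice stops changing; that loop is left out here.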

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where additional detail would strengthen the manuscript. We address each major comment below with clarifications based on the existing construction and indicate revisions that will be incorporated.

Point-by-point responses
  1. Referee: [§3 (RankJudge Construction)] The central claim that single-flaw injection into one turn produces pairs that can be labeled unambiguously as better or worse, while isolating failure categories, assumes the injected flaw remains the sole source of quality difference without being masked, compensated, or compounded by surrounding turns or the reference document. No post-injection consistency checks, turn-isolation metrics, or verification that context does not alter the net quality delta are described, which is load-bearing for the strict joint correctness criterion.

    Authors: The RankJudge construction generates each pair from an identical reference document and base conversation, modifying only a single targeted turn in one member of the pair with a flaw drawn from a predefined category. This design ensures that any quality difference is localized to that turn by construction. We acknowledge that explicit post-injection verification steps were not detailed in the initial submission. In the revision we will add a dedicated subsection describing sampling-based consistency checks (including human review of a subset of pairs) to confirm that surrounding context does not mask or compound the injected flaw, thereby reinforcing the foundation for the strict joint correctness criterion. revision: yes

  2. Referee: [§5 (Evaluation Protocols)] The exact prompting templates and decision rules used to apply the joint correctness criterion when scoring judge outputs on the paired conversations are not specified in sufficient detail to reproduce the isolation of failure categories or to confirm that multi-turn context effects do not undermine the labeling.

    Authors: We agree that full reproducibility requires the precise prompting templates and decision rules. Section 5 presents the joint correctness criterion at the conceptual level, but the concrete implementation details were omitted. In the revised manuscript we will append the complete judge prompts (including how the paired conversations and reference document are formatted) and the exact decision rules for applying the joint correctness criterion, explicitly addressing how multi-turn context is handled during scoring. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RankJudge benchmark construction

The paper introduces RankJudge as an explicit synthetic benchmark generator that creates conversation pairs by injecting a single flaw into one turn of a reference-grounded multi-turn dialogue. This construction is presented as an external process for producing labeled data rather than a derivation from fitted parameters, self-referential equations, or self-citation chains. The subsequent ranking of LLM judges via the standard Bradley-Terry model is applied to the generated pairs without any reduction of the central claims back to the inputs by definition. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the described approach; the method remains self-contained with independent content from the flaw-injection mechanism itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central construction depends on domain assumptions about flaw injection rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Injecting a single flaw into one turn creates pairs that can be labeled unambiguously as better or worse while isolating failure categories.
    This premise is required for the strict joint correctness criterion and precise isolation described in the abstract.

pith-pipeline@v0.9.0 · 5793 in / 1254 out tokens · 46788 ms · 2026-05-22T08:47:08.917275+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

  1. [1]

    Judg- ing LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph Gonzalez, and Ion Stoica. Judg- ing LLM-as-a-Judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, volume 36, pages 46595–46623, 2023

  2. [2]

    Gonzalez, and Ion Stoica

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. InProceedings of the 41st International Conference on Machine Learning, 2024

  3. [3]

    Gonzalez, and Ion Stoica

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why Do Multi-Agent LLM Systems Fail? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  4. [4]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs Get Lost In Multi-Turn Conversation.arXiv:2505.06120, 2025

  5. [5]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. InInternational Conference on Learning Representations, 2021

  6. [6]

    Pervasive label errors in test sets destabilize machine learning benchmarks

    Curtis G Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. InAdvances in Neural Information Processing Systems, volume 34, 2021

  7. [7]

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with MMLU? InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologi...

  8. [8]

    EIP: Weighted ranking of LLMs by quantifying question difficulty

    Xingjian Hu, Ziqian Zhang, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, and Lichao Sun. EIP: Weighted ranking of LLMs by quantifying question difficulty. InThe Fourteenth International Conference on Learning Representations, 2026

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv:2110.14168, 2021

  10. [10]

    MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

    Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. InThe Twelfth International Conference on Learning Representations, 2024

  11. [11]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs

    Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. InFindings of the Association for Computational Linguistics: ACL 2025, pages 18632–...

  12. [12]

    MT-Eval: A multi-turn capabilities evaluation benchmark for large language models

    Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, 2024

  13. [13]

    Hal- luHard: A Hard Multi-Turn Hallucination Benchmark.arXiv:2602.01031, 2026

    Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, and Maksym Andriushchenko. Hal- luHard: A Hard Multi-Turn Hallucination Benchmark.arXiv:2602.01031, 2026

  14. [14]

    MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games.arXiv:2602.24188, 2026

    Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, and Mirella Lapata. MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games.arXiv:2602.24188, 2026. 11

  15. [15]

    Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

    Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv:2504.04717, 2025

  16. [16]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  17. [17]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, 2017

  18. [18]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. InAdvances in Neural Information Processing Systems, volume 33, pages 3008–3021, 2020

  19. [19]

    Ask a Strong LLM Judge when Your Reward Model is Uncertain

    Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, and Tuo Zhao. Ask a Strong LLM Judge when Your Reward Model is Uncertain. InAdvances in Neural Information Processing Systems, volume 38, pages 74639–74664, 2025

  20. [20]

    Is ChatGPT a Good NLG Evaluator? A Preliminary Study

    Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is ChatGPT a Good NLG Evaluator? A Preliminary Study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11. Association for Computational Linguistics, 2023

  21. [21]

    G- Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G- Eval: NLG Evaluation using GPT-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522. Association for Computational Linguistics, 2023

  22. [22]

    RLAIF vs

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. InProceed- ings of the 41st International Conference on Machine Learning, volume 235 ofProceedings o...

  23. [23]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. InThe Twelfth International Conference on Learning Representations, 2024

  24. [24]

    Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

    Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason E Weston, and Tianlu Wang. Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 52565–52583, 2025

  25. [25]

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

    Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian-Guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. InThe Thirteenth International Conference on Learning Representations, 2025

  26. [26]

    MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 6562–6595, 2024

  27. [27]

    MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation.arXiv:2502.12468, 2025

    Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, and Guil- laume Sartoretti. MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation.arXiv:2502.12468, 2025. 12

  28. [28]

    J1: Incentivizing thinking in LLM-as-a-judge via reinforcement learning

    Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason E Weston, Ilia Kulikov, and Swarnadeep Saha. J1: Incentivizing thinking in LLM-as-a-judge via reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026

  29. [29]

    RM-R1: Reward Modeling as Reasoning

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, and Heng Ji. RM-R1: Reward Modeling as Reasoning. InThe Fourteenth International Conference on Learning Representa- tions, 2026

  30. [30]

    Evaluat- ing large language models at evaluating instruction following

    Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluat- ing large language models at evaluating instruction following. InThe Twelfth International Conference on Learning Representations, 2024

  31. [31]

    JudgeBench: A Benchmark for Evaluating LLM- Based Judges

    Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chen- guang Wang, Raluca Popa, and Ion Stoica. JudgeBench: A Benchmark for Evaluating LLM- Based Judges. InThe Thirteenth International Conference on Learning Representations, 2025

  32. [32]

    Does context matter? ContextualJudgeBench for evaluating LLM-based judges in contextual settings

    Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, and Shafiq Joty. Does context matter? ContextualJudgeBench for evaluating LLM-based judges in contextual settings. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2025

  33. [33]

    DHP benchmark: Are LLMs good NLG evaluators? InFindings of the Association for Computational Linguistics: NAACL 2025

    Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. DHP benchmark: Are LLMs good NLG evaluators? InFindings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, April 2025

  34. [34]

    ReIFE: Re-evaluating instruction-following evaluation

    Yixin Liu, Kejian Shi, Alexander Fabbri, Yilun Zhao, PeiFeng Wang, Chien-Sheng Wu, Shafiq Joty, and Arman Cohan. ReIFE: Re-evaluating instruction-following evaluation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (Volume 1: Long Papers). Associat...

  35. [35]

    JuStRank: Benchmarking LLM judges for system ranking

    Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. JuStRank: Benchmarking LLM judges for system ranking. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2025

  36. [36]

    Judge Arena: Benchmarking LLMs as Evaluators, 2024

    Kyle Dai, Maurice Burger, Roman Engeler, Max Bartolo, Clémentine Fourrier, Toby Drane, Mathias Leys, and Jake Golden. Judge Arena: Benchmarking LLMs as Evaluators, 2024. Accessed: 2026-05-01

  37. [37]

    MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

    John Mendonça, Alon Lavie, and Isabel Trancoso. MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators. InFindings of the Association for Computational Linguistics: EACL 2026, 2026

  38. [38]

    MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Associa...

  39. [39]

    AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration.arXiv:2512.20159, 2025

    Ruiqi Wang, Xinchen Wang, Cuiyun Gao, Chun Yong Chong, Xin Xia, and Qing Liao. AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration.arXiv:2512.20159, 2025

  40. [40]

    LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge.arXiv:2506.09443, 2025

    Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, and Shouling Ji. LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge.arXiv:2506.09443, 2025

  41. [41]

    Judging the judges: A systematic study of position bias in LLM-as-a-judge

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush V osoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of 13 the Asia-Pacific Chapter of the Association for Computational Linguistics. The As...

  42. [42]

    Wider and Deeper LLM Networks are Fairer LLM Evaluators.arXiv:2308.01862, 2023

    Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and Deeper LLM Networks are Fairer LLM Evaluators.arXiv:2308.01862, 2023

  43. [43]

    Reasoning Gets Harder for LLMs Inside A Dialogue

    Ivan Kartáˇc, Mateusz Lango, and Ondˇrej Dušek. Reasoning Gets Harder for LLMs Inside A Dialogue.arXiv:2603.20133, 2026

  44. [44]

    Gemini 3.1 pro

    Google. Gemini 3.1 pro. https://gemini.google.com/, May 2026. Large language model

  45. [45]

    Cresswell

    Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, and Jesse C. Cresswell. Classifying and Addressing the Diversity of Errors in Retrieval- Augmented Generation Systems. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3185–...

  46. [46]

    HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

    Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464. Association for Computational Linguistics, 2023

  47. [47]

    Aegis: Automated error generation and attribution for multi-agent systems

    Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems. In The Fourteenth International Conference on Learning Representations, 2026

  48. [48]

    Self-critiquing models for assisting human evaluators

    William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv:2206.05802, 2022

  49. [49]

    Let’s Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. In The Twelfth International Conference on Learning Representations, 2024

  50. [50]

    Variable Selection using MM Algorithms

    David R Hunter and Runze Li. Variable Selection using MM Algorithms. Annals of Statistics, 33(4):1617, 2005

  51. [51]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

  52. [52]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022

  53. [53]

    RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

    Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, and Juanzi Li. RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension. arXiv:2601.14289, 2026

  54. [54]

    PubMedQA: A Dataset for Biomedical Research Question Answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577. Association for Computa...

  55. [55]

    SP500-EDGAR-10K, 2026

    Jerry Loh. SP500-EDGAR-10K, 2026. Accessed: 2026-05-06

  56. [56]

    OpenRouter, 2026

    OpenRouter. OpenRouter, 2026. Accessed: 2026-05-06

  57. [57]

    Think-J: Learning to Think for Generative LLM-as-a-Judge

    Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Jiaheng Liu, and Wenbo Su. Think-J: Learning to Think for Generative LLM-as-a-Judge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31158–31166, 2026.

  58. [58]

    Claude Opus 4.7 System Card

    Anthropic. Claude Opus 4.7 System Card. https://www.anthropic.com/system-cards. Accessed: 2026-05-06

  60. [60]

    Language models with conformal factuality guarantees

    Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, 2024

  61. [61]

    Document summarization with conformal importance guarantees

    Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Yapeter, Ilya Stanevich, Felipe Perez, and Jesse C. Cresswell. Document summarization with conformal importance guarantees. In Advances in Neural Information Processing Systems, volume 38, pages 67107–67152, 2025

  62. [62]

    Textual bayes: Quantifying prompt uncertainty in LLM-based systems

    Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, and Jesse C. Cresswell. Textual bayes: Quantifying prompt uncertainty in LLM-based systems. In The Fourteenth International Conference on Learning Representations, 2026.

    A Additional Results A.1 Detail...

  63. [63]

    Round 2 is mostly clear and foregrounds the main correction, so it does not actually exhibit the required disorganized flaw

    and disorganized (1475) consistently beat judges, while unnecessary_refusal and fabricated_answer (both 849) are the easiest to catch: refusals stand out in context and fabrications are directly checkable against the grounding document. User behavior, by contrast, barely shifts pair difficulty: the seven categories span only 979 to 1247 in median Elo wi...

  64. [64]

    Triage by judge-disagreement signal. Open the Overview tab and scan the per-judge accuracy summary. Pairs that nearly all judges miss, or that split close to 50/50 between the two candidate conversations, are the most likely to be mislabelled and are inspected first; pairs on which the ensemble agrees with the declared verdict are quickly skimmed and moved through
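
The triage rule in this step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the `PairVerdicts` record, the function names, and both thresholds (`miss_threshold`, `split_margin`) are hypothetical, since the text only says "nearly all judges miss" and "close to 50/50" without giving cut-offs.

```python
from dataclasses import dataclass


@dataclass
class PairVerdicts:
    """Hypothetical record: one conversation pair and its judge outcomes."""
    pair_id: str
    # One entry per judge: True if the judge agreed with the declared verdict.
    correct: list[bool]


def triage_priority(pair: PairVerdicts,
                    miss_threshold: float = 0.2,
                    split_margin: float = 0.1) -> tuple[int, float]:
    """Return (bucket, accuracy); lower buckets are inspected first.

    Bucket 0: nearly all judges miss the declared verdict.
    Bucket 1: judges split close to 50/50.
    Bucket 2: the ensemble mostly agrees with the declared verdict.
    Both thresholds are illustrative assumptions.
    """
    if not pair.correct:
        raise ValueError(f"pair {pair.pair_id} has no judge verdicts")
    accuracy = sum(pair.correct) / len(pair.correct)
    if accuracy <= miss_threshold:
        return 0, accuracy
    if abs(accuracy - 0.5) <= split_margin:
        return 1, accuracy
    return 2, accuracy


def triage_order(pairs: list[PairVerdicts]) -> list[PairVerdicts]:
    """Sort pairs so the most likely mislabelled ones come first."""
    return sorted(pairs, key=triage_priority)
```

Sorting on the `(bucket, accuracy)` tuple keeps the two suspicious groups ahead of the pairs where the ensemble agrees, and within each group surfaces the lowest-accuracy pairs first.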

  65. [65]

    Confirm the injected flaw is on-target. On the Plan tab, read plan.bad and verify that the planned weakness is consistent with the declared assistant_behavior_type. The plan is a single narrative paragraph rather than a turn-by-turn script, but it must be specific enough to determine which turn realises the flaw. Hold the planned weakness in mind for the r...

  66. [66]

    If convo_b exhibits a different assistant_behavior_type than the one declared, or if no turn realises the planned flaw, label noise

    Confirm the bad conversation executes the plan. On the Conversations tab, locate the turn at bad_round_index in convo_b and verify that the assistant response realises the planned weakness, neither a different category nor a milder version. If convo_b exhibits a different assistant_behavior_type than the one declared, or if no turn realises the planned ...

  67. [67]

    Cross-check using the judge ensemble. Open the Judges tab. If the strongest judges in the registry (gemini-3.1-pro, gpt-5.5, opus-4.7) consistently miss the verdict while several weaker judges (gpt-oss-20b, gemma-4-31b) get it right, treat the pair as suspicious. The strong judges’ reasoning text usually points at a competing flaw in convo_a, which is the n...

  68. [68]

    those little indicator things,

    Verify the good conversation has no competing flaw. For each strong judge that picked the wrong side: (i) locate the disputed turn in convo_a on the Conversations tab, (ii) retrieve the relevant span from metadata.context (the source document), and (iii) consult an external LLM with a focused fact-check prompt that contains only the disputed turn, the matc...