RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Jesse C. Cresswell; Keyvan Golestan; Rasa Hosseinzadeh; Tongzi Wu; Zhaoyan Liu; Zhenwei Tang

arxiv: 2605.21748 · v1 · pith:J4DCT3NRnew · submitted 2026-05-20 · 💻 cs.CL


Pith reviewed 2026-05-22 08:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM-as-a-judge · multi-turn evaluation · synthetic benchmark · flaw injection · Bradley-Terry ranking · conversation pairs · reference grounding

The pith

RankJudge generates paired multi-turn conversations differing by one injected flaw to give LLM judges an unambiguous correctness signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generator that produces reference-grounded conversation pairs for testing LLM judges on multi-turn tasks. One conversation in each pair receives a single targeted flaw in one turn, so the better version is known by construction and the error type is isolated to that turn. This setup supports a strict joint correctness criterion and lets developers rank judges across machine learning, biomedicine, and finance domains using the Bradley-Terry model. Rankings remain stable when judges see only partial context, when correctness is defined more coarsely, or when an alternative random-walk scorer is substituted. The method also assigns difficulty ratings to pairs, which are used to curate lower-noise evaluation slices that human annotators confirm.

Core claim

By injecting exactly one flaw into one turn of an otherwise identical conversation pair grounded in a reference document, RankJudge produces labeled examples that isolate specific failure categories and enable unambiguous better/worse judgments for LLM-as-a-judge evaluation.
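The construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's generator: the names `Turn`, `LabeledPair`, and `make_pair` are hypothetical, and the flawed text is assumed to arrive from an upstream step (for example an LLM prompted with a failure category) that is not shown here.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass(frozen=True)
class LabeledPair:
    reference: str          # the grounding document
    conversation_a: tuple   # tuple of Turn
    conversation_b: tuple   # tuple of Turn
    flawed: str             # "a" or "b": which side carries the flaw
    flawed_turn: int        # index of the modified turn
    failure_type: str       # injected failure category


def make_pair(reference, clean_turns, turn_index, failure_type, flawed_text, rng):
    """Copy a clean conversation and overwrite exactly one turn (hypothetical helper)."""
    clean = tuple(clean_turns)
    flawed_turn = replace(clean[turn_index], text=flawed_text)
    flawed = clean[:turn_index] + (flawed_turn,) + clean[turn_index + 1:]
    # Randomize presentation order so position carries no label information.
    side = rng.choice(["a", "b"])
    a, b = (flawed, clean) if side == "a" else (clean, flawed)
    return LabeledPair(reference, a, b, side, turn_index, failure_type)
```

Because the label is fixed at construction time, no annotation is needed to know which side is better; whether the single edit really is the only quality difference is the premise examined further down.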

What carries the argument

The RankJudge synthetic benchmark generator that creates conversation pairs differing by a single flaw in one turn.

If this is right

Frontier LLM judges can be ranked by how often they correctly identify the flawless conversation in each pair.
Difficulty ratings derived from the same pairs allow dynamic selection of evaluation subsets with reduced label noise.
Judge rankings stay consistent under partial conversation visibility and under coarser or alternative scoring rules.
The same construction supports precise diagnosis of which error types individual judges fail to detect.
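One way to read the ranking step is as a Bradley-Terry fit in which every judgment is a match between a judge and a conversation pair: the judge wins when its verdict is correct, the pair wins otherwise. The sketch below follows that reading, which is an assumption suggested by the per-pair Elo figures rather than a confirmed detail of the paper. The gradient-ascent fit, the small L2 penalty, and the function names are illustrative choices.

```python
import math


def fit_bradley_terry(outcomes, n_judges, n_pairs, steps=2000, lr=0.1, l2=0.01):
    """Fit judge abilities and pair difficulties on a log-odds scale.

    outcomes: list of (judge_index, pair_index, correct) with correct in {0, 1}.
    Model: P(judge j correct on pair p) = sigmoid(ability[j] - difficulty[p]).
    The L2 penalty keeps estimates finite when a judge or pair wins every match.
    The step size suits small demos; larger datasets need a smaller one.
    """
    ability = [0.0] * n_judges
    difficulty = [0.0] * n_pairs
    for _ in range(steps):
        grad_a = [-l2 * a for a in ability]
        grad_d = [-l2 * d for d in difficulty]
        for j, p, correct in outcomes:
            prob = 1.0 / (1.0 + math.exp(-(ability[j] - difficulty[p])))
            err = correct - prob
            grad_a[j] += err   # a correct verdict pushes the judge up
            grad_d[p] -= err   # and pushes the pair's difficulty down
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


def to_elo(strength, base=1000.0):
    """Convert a log-odds strength to the conventional 400-point Elo scale."""
    return base + 400.0 / math.log(10) * strength
```

Under this reading, the same fit yields both outputs mentioned in the list: a ranking of judges and a difficulty rating per pair.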

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The generator could be extended to inject multiple interacting flaws or to simulate longer conversation histories.
Developers could use the per-turn isolation to create targeted training data that improves judges on specific weaknesses.
The approach might generalize to non-document-grounded conversations if reference material can be replaced by verifiable external facts.

Load-bearing premise

Injecting a single flaw into one turn produces pairs that are unambiguously better or worse without creating other unintended differences that would confuse the label.

What would settle it

Human annotators systematically disagree on which conversation is better in a large fraction of the generated pairs.

Figures

Figures reproduced from arXiv: 2605.21748 by Jesse C. Cresswell, Keyvan Golestan, Rasa Hosseinzadeh, Tongzi Wu, Zhaoyan Liu, Zhenwei Tang.

**Figure 1.** Overview of RANKJUDGE, a benchmark generator for multi-turn judge evaluation.

**Figure 2.** (Left) Cumulative fraction of samples annotated by humans as having ambiguous or noisy…

**Figure 3.** Elo scores of 21 judges on the combined dataset. Black circles give the combined Elo with 95% CI; colored markers show per-domain Elo scores. Tick-label color denotes proprietary (blue) vs. open-source (orange) judges

**Figure 4.** Judge Elo against per-match compute. Top: mean completion tokens per match (linear).

**Figure 5.** Per-class prediction bias for each judge. Each cell gives the difference in percentage points…

**Figure 6.** (a) Judge Elo with and without failure-type correctness as a criterion. (b) Accuracy vs.…

**Figure 7.** Per-pair Elo on the combined dataset, grouped by (a) assistant failure type and (b) user…

**Figure 8.** Same-model preference does not distort the rest of the leaderboard. Each panel is a slope…

**Figure 9.** Per-class prediction bias for each judge across the three domains. Each cell gives the…

**Figure 10.** Per-judge confusion of assistant failure type predictions, row-normalized so each…

**Figure 11.** Per-domain version of Figure 6(a). Judge Elo with the full correctness criterion (x-axis)…

**Figure 12.** Per-domain version of Figure 6(c). Each panel reports Spearman…

**Figure 13.** Judge ranks with (x-axis) and without (y-axis) the top-ranked pair removal, one panel per…

**Figure 14.** Judge ranks under Bradley–Terry Elo (x-axis) and Empirical Interaction Propagation…

**Figure 15.** Pointwise vs. pairwise judging on a 100-pair stratified sample. (a) Each judge’s pointwise Elo plotted against its pairwise Elo recomputed on the same pair set; the dashed line marks y = x. (b) Per-judge pointwise score-gap: the mean Likert score on the good conversation in each pair minus the mean score on the flawed one, sorted descending.

**Figure 16.** The pair-audit interface used for the human label-noise audit. The annotator inspects the…

Original abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
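The "strict joint correctness criterion" can be pictured as a conjunction of checks on a judge's verdict. The sketch below assumes the joint criterion covers three things (which conversation is worse, which turn carries the flaw, and the failure type); the abstract does not spell out the exact components, so the field names and the `strict` switch are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruth:
    flawed: str        # "a" or "b"
    flawed_turn: int   # index of the turn that received the flaw
    failure_type: str  # injected failure category


@dataclass(frozen=True)
class Verdict:
    worse: str         # conversation the judge says is worse
    flawed_turn: int   # turn the judge blames
    failure_type: str  # failure category the judge names


def is_correct(verdict, truth, strict=True):
    """Score one verdict under a strict (joint) or coarse (preference-only) rule."""
    picked_right = verdict.worse == truth.flawed
    if not strict:
        return picked_right
    return (
        picked_right
        and verdict.flawed_turn == truth.flawed_turn
        and verdict.failure_type == truth.failure_type
    )
```

Setting `strict=False` corresponds to the coarser criterion under which the abstract reports that rankings remain stable.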

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RankJudge gives a concrete generator for multi-turn judge benchmarks by single-flaw injection into reference-grounded conversations, but the claim of clean isolation needs more evidence against conversational side effects.

The punchline is that this paper presents a benchmark generator called RankJudge that creates paired multi-turn conversations for testing LLM judges by injecting one specific flaw into a single turn of an otherwise solid dialogue grounded in reference material. This is new because most existing LLM-as-a-judge benchmarks stick to single-turn Q&A, while this targets the multi-turn case that matters for real chat applications. They generate data in machine learning, biomedicine, and finance, then evaluate 21 different frontier models as judges. The rankings come from the Bradley-Terry model, and they show the order stays consistent even with partial conversation views or a different rating approach. They also use difficulty scores to select a subset and back it with human checks to cut down on label errors. That combination of generation method and validation steps is the practical part that stands out.

The soft spot sits in the core construction. The claim is that the single flaw makes the labeling unambiguous and pins the failure to one turn. But conversations are connected, so an issue in one turn can shift the information or expectations for the next turns, or the response might compensate in ways that change the overall quality without the flaw being the clear cause. The abstract does not describe any additional checks after injection to confirm the difference stays isolated, which leaves the strict joint correctness criterion on shaky ground if those interactions happen often.

This paper is for people who build or tune LLM judges for conversational systems in technical domains. It gives them a way to create controlled test cases at scale without relying entirely on human raters for every example.

I would send this to peer review. The method addresses a clear need, the experiments are reasonably broad, and referees can help sort out whether the flaw isolation works as intended in the generated data.

Referee Report

2 major / 2 minor

Summary. The paper introduces RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. It creates pairs of conversations differing by a single flaw injected into one turn, enabling unambiguous better/worse labels and precise isolation of failure categories to individual turns for a strict joint correctness criterion. The method is implemented across machine learning, biomedicine, and finance domains; 21 frontier LLM judges are evaluated and ranked via the Bradley-Terry model. Difficulty ratings allow dynamic curation of evaluation slices to reduce label noise (confirmed by human annotation), and judge rankings are reported as stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.

Significance. If the single-flaw construction reliably isolates quality differences, RankJudge offers a scalable, low-cost way to benchmark judges on complex conversational tasks where human evaluation is impractical. Strengths include the use of Bradley-Terry ranking, human validation of the curation process, explicit stability tests across multiple conditions, and domain-specific implementations. This directly addresses the gap in existing LLM-judge benchmarks that focus primarily on single-turn Q&A.
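The stability claims amount to comparing two judge rankings produced under different conditions. A rank correlation such as Spearman's is a common way to quantify that agreement; the sketch below uses the closed form for rankings without ties, which is an assumption made here for brevity. The function names and score dictionaries are illustrative.

```python
def ranks(scores):
    """Map each judge name to its rank (1 = highest score). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score tables over the same judges."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both conditions must score the same judges")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two judges")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A value near 1 means the two conditions order the judges almost identically; tied scores would require average ranks and the general correlation formula instead.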

major comments (2)

[§3 (RankJudge Construction)] The central claim that single-flaw injection into one turn produces pairs that can be labeled unambiguously as better or worse, while isolating failure categories, assumes the injected flaw remains the sole source of quality difference without being masked, compensated, or compounded by surrounding turns or the reference document. No post-injection consistency checks, turn-isolation metrics, or verification that context does not alter the net quality delta are described, which is load-bearing for the strict joint correctness criterion.
[§5 (Evaluation Protocols)] The exact prompting templates and decision rules used to apply the joint correctness criterion when scoring judge outputs on the paired conversations are not specified in sufficient detail to reproduce the isolation of failure categories or to confirm that multi-turn context effects do not undermine the labeling.

minor comments (2)

[§4.3] The description of how difficulty ratings are computed from the Bradley-Terry model and then used for dynamic curation could be expanded with a short pseudocode or equation for clarity.
[Table 1] Table 1 (domain statistics) would benefit from an additional column reporting the number of unique reference documents per domain to help readers assess diversity.
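
For orientation, the kind of equation the first minor comment asks for might look like the following. It is written in the standard Bradley-Terry form on the Elo scale and is not taken from the paper: the symbols $\theta_j$ (judge rating), $\delta_p$ (pair difficulty rating), and the threshold $\tau$ are assumed notation, and the thresholding rule is only one plausible reading of curation by removing the highest-rated pairs.

```latex
% Probability that judge j is scored correct on pair p (standard Bradley-Terry, Elo scale)
P(j \text{ correct on } p) = \frac{1}{1 + 10^{-(\theta_j - \delta_p)/400}}

% A curated evaluation slice that drops pairs rated above a difficulty threshold
\mathcal{S}_\tau = \{\, p : \delta_p \le \tau \,\}
```

Under this reading, pairs whose difficulty rating sits far above every judge are the ones most likely to carry label noise, which is what the human audit would be checking.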

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where additional detail would strengthen the manuscript. We address each major comment below with clarifications based on the existing construction and indicate revisions that will be incorporated.

Referee: [§3 (RankJudge Construction)] The central claim that single-flaw injection into one turn produces pairs that can be labeled unambiguously as better or worse, while isolating failure categories, assumes the injected flaw remains the sole source of quality difference without being masked, compensated, or compounded by surrounding turns or the reference document. No post-injection consistency checks, turn-isolation metrics, or verification that context does not alter the net quality delta are described, which is load-bearing for the strict joint correctness criterion.

Authors: The RankJudge construction generates each pair from an identical reference document and base conversation, modifying only a single targeted turn in one member of the pair with a flaw drawn from a predefined category. This design ensures that any quality difference is localized to that turn by construction. We acknowledge that explicit post-injection verification steps were not detailed in the initial submission. In the revision we will add a dedicated subsection describing sampling-based consistency checks (including human review of a subset of pairs) to confirm that surrounding context does not mask or compound the injected flaw, thereby reinforcing the foundation for the strict joint correctness criterion.
Revision: yes

Referee: [§5 (Evaluation Protocols)] The exact prompting templates and decision rules used to apply the joint correctness criterion when scoring judge outputs on the paired conversations are not specified in sufficient detail to reproduce the isolation of failure categories or to confirm that multi-turn context effects do not undermine the labeling.

Authors: We agree that full reproducibility requires the precise prompting templates and decision rules. Section 5 presents the joint correctness criterion at the conceptual level, but the concrete implementation details were omitted. In the revised manuscript we will append the complete judge prompts (including how the paired conversations and reference document are formatted) and the exact decision rules for applying the joint correctness criterion, explicitly addressing how multi-turn context is handled during scoring.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in RankJudge benchmark construction

The paper introduces RankJudge as an explicit synthetic benchmark generator that creates conversation pairs by injecting a single flaw into one turn of a reference-grounded multi-turn dialogue. This construction is presented as an external process for producing labeled data rather than a derivation from fitted parameters, self-referential equations, or self-citation chains. The subsequent ranking of LLM judges via the standard Bradley-Terry model is applied to the generated pairs without any reduction of the central claims back to the inputs by definition. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the described approach; the method remains self-contained with independent content from the flaw-injection mechanism itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central construction depends on domain assumptions about flaw injection rather than new free parameters or invented entities.

axioms (1)

domain assumption: Injecting a single flaw into one turn creates pairs that can be labeled unambiguously as better or worse while isolating failure categories.
This premise is required for the strict joint correctness criterion and precise isolation described in the abstract.

pith-pipeline@v0.9.0 · 5793 in / 1254 out tokens · 46788 ms · 2026-05-22T08:47:08.917275+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623, 2023.
[2] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. In Proceedings of the 41st International Conference on Machine Learning, 2024.
[3] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why Do Multi-Agent LLM Systems Fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
[4] Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs Get Lost In Multi-Turn Conversation. arXiv:2505.06120, 2025.
[5] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations, 2021.
[6] Curtis G Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. In Advances in Neural Information Processing Systems, volume 34, 2021.
[7] Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with MMLU? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologi...
[8] Xingjian Hu, Ziqian Zhang, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, and Lichao Sun. EIP: Weighted ranking of LLMs by quantifying question difficulty. In The Fourteenth International Conference on Learning Representations, 2026.
[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168, 2021.
[10] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. In The Twelfth International Conference on Learning Representations, 2024.
[11] Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–...
[12] Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, 2024.
[13] Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, and Maksym Andriushchenko. HalluHard: A Hard Multi-Turn Hallucination Benchmark. arXiv:2602.01031, 2026.
[14] Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, and Mirella Lapata. MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games. arXiv:2602.24188, 2026.
[15] Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv:2504.04717, 2025.
[16] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...
[17] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017.
[18] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021, 2020.
[19] Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, and Tuo Zhao. Ask a Strong LLM Judge when Your Reward Model is Uncertain. In Advances in Neural Information Processing Systems, volume 38, pages 74639–74664, 2025.
[20] Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is ChatGPT a Good NLG Evaluator? A Preliminary Study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11. Association for Computational Linguistics, 2023.
[21] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522. Association for Computational Linguistics, 2023.
[22] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings o...
[23] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024.
[24] Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason E Weston, and Tianlu Wang. Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 52565–52583, 2025.
[25] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian-Guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. In The Thirteenth International Conference on Learning Representations, 2025.
[26] Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 6562–6595, 2024.
[27] Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, and Guillaume Sartoretti. MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation. arXiv:2502.12468, 2025.
[28] Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason E Weston, Ilia Kulikov, and Swarnadeep Saha. J1: Incentivizing thinking in LLM-as-a-judge via reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026.
[29] Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, and Heng Ji. RM-R1: Reward Modeling as Reasoning. In The Fourteenth International Conference on Learning Representations, 2026.
[30] Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large language models at evaluating instruction following. In The Twelfth International Conference on Learning Representations, 2024.
[31] Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. JudgeBench: A Benchmark for Evaluating LLM-Based Judges. In The Thirteenth International Conference on Learning Representations, 2025.
[32] Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, and Shafiq Joty. Does context matter? ContextualJudgeBench for evaluating LLM-based judges in contextual settings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2025.
[33] Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. DHP benchmark: Are LLMs good NLG evaluators? In Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, April 2025.
[34] Yixin Liu, Kejian Shi, Alexander Fabbri, Yilun Zhao, PeiFeng Wang, Chien-Sheng Wu, Shafiq Joty, and Arman Cohan. ReIFE: Re-evaluating instruction-following evaluation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Associat...
[35] Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. JuStRank: Benchmarking LLM judges for system ranking. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, July 2025.
[36]

Judge Arena: Benchmarking LLMs as Evaluators, 2024

Kyle Dai, Maurice Burger, Roman Engeler, Max Bartolo, Clémentine Fourrier, Toby Drane, Mathias Leys, and Jake Golden. Judge Arena: Benchmarking LLMs as Evaluators, 2024. Accessed: 2026-05-01

work page 2024
[37]

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators

John Mendonça, Alon Lavie, and Isabel Trancoso. MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators. InFindings of the Association for Computational Linguistics: EACL 2026, 2026

work page 2026
[38] Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Associa...

[39] Ruiqi Wang, Xinchen Wang, Cuiyun Gao, Chun Yong Chong, Xin Xia, and Qing Liao. AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration. arXiv:2512.20159, 2025

[40] Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, and Shouling Ji. LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge. arXiv:2506.09443, 2025

[41] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. The As...

[42] Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and Deeper LLM Networks are Fairer LLM Evaluators. arXiv:2308.01862, 2023

[43] Ivan Kartáč, Mateusz Lango, and Ondřej Dušek. Reasoning Gets Harder for LLMs Inside A Dialogue. arXiv:2603.20133, 2026

[44] Google. Gemini 3.1 pro. https://gemini.google.com/, May 2026. Large language model

[45] Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, and Jesse C. Cresswell. Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3185–...

[46] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464. Association for Computational Linguistics, 2023

[47] Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems. In The Fourteenth International Conference on Learning Representations, 2026

[48] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv:2206.05802, 2022

[49] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s Verify Step by Step. In The Twelfth International Conference on Learning Representations, 2024

[50] David R Hunter and Runze Li. Variable Selection using MM Algorithms. Annals of Statistics, 33(4):1617, 2005

[51] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

[52] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022

[53] Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, and Juanzi Li. RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension. arXiv:2601.14289, 2026

[54] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577. Association for Computa...

[55] Jerry Loh. SP500-EDGAR-10K, 2026. Accessed: 2026-05-06

[56] OpenRouter. OpenRouter, 2026. Accessed: 2026-05-06

[57] Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Jiaheng Liu, and Wenbo Su. Think-J: Learning to Think for Generative LLM-as-a-Judge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31158–31166, 2026.

[58] Anthropic. Claude Opus 4.7 System Card. https://www.anthropic.com/system-cards, Accessed: 2026-05-06

[60] Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, 2024

[61] Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Yapeter, Ilya Stanevich, Felipe Perez, and Jesse C. Cresswell. Document summarization with conformal importance guarantees. In Advances in Neural Information Processing Systems, volume 38, pages 67107–67152, 2025

[62] Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, and Jesse C. Cresswell. Textual bayes: Quantifying prompt uncertainty in LLM-based systems. In The Fourteenth International Conference on Learning Representations, 2026.

A Additional Results

A.1 Detail...

[63]

Round 2 is mostly clear and foregrounds the main correction, so it does not actually exhibit the required disorganized flaw

and disorganized (1475) consistently beat judges, while unnecessary_refusal and fabricated_answer (both 849) are the easiest to catch: refusals stand out in context and fabrications are directly checkable against the grounding document. User behavior, by contrast, barely shifts pair difficulty: the seven categories span only 979 to 1247 in median Elo wi...

[64]

Triage by judge-disagreement signal. Open the Overview tab and scan the per-judge accuracy summary. Pairs that nearly all judges miss, or that split close to 50/50 between the two candidate conversations, are the most likely to be mislabelled and are inspected first; pairs on which the ensemble agrees with the declared verdict are quickly skimmed and moved through
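The triage rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (one boolean per judge, True when the judge matched the declared verdict), the function name `triage`, and the `split_band` tolerance around a 50/50 split are all hypothetical and are not taken from the paper or its tooling.

```python
from typing import Dict, List, Tuple


def triage(agreement: Dict[str, List[bool]],
           split_band: float = 0.1) -> Tuple[List[str], List[str]]:
    """Split pair ids into (inspect_first, skim).

    agreement[pair_id] holds one boolean per judge: True when that judge
    matched the declared verdict. split_band is an assumed tolerance
    around a 50/50 split, not a value from the source.
    """
    accuracy = {}
    for pair_id, votes in agreement.items():
        if not votes:
            raise ValueError(f"no judge votes recorded for {pair_id}")
        accuracy[pair_id] = sum(votes) / len(votes)

    cutoff = 0.5 + split_band
    # Pairs that most judges miss, or that split near 50/50, come first,
    # ordered from lowest to highest ensemble accuracy.
    inspect_first = sorted(
        (p for p, acc in accuracy.items() if acc <= cutoff),
        key=lambda p: accuracy[p],
    )
    # Pairs on which the ensemble clearly agrees are only skimmed.
    skim = [p for p, acc in accuracy.items() if acc > cutoff]
    return inspect_first, skim


example = {
    "pair-001": [False, False, False, False],  # nearly all judges miss
    "pair-002": [True, True, False, False],    # 50/50 split
    "pair-003": [True, True, True, True],      # ensemble agrees
}
inspect_first, skim = triage(example)
```

Sorting by ensemble accuracy is one possible ordering; the text only says that missed and evenly split pairs are inspected before the rest.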

[65]

Confirm the injected flaw is on-target. On the Plan tab, read plan.bad and verify that the planned weakness is consistent with the declared assistant_behavior_type. The plan is a single narrative paragraph rather than a turn-by-turn script, but it must be specific enough to determine which turn realises the flaw. Hold the planned weakness in mind for the r...

[66]

If convo_b exhibits a different assistant_behavior_type than the one declared, or if no turn realises the planned flaw, label noise

Confirm the bad conversation executes the plan. On the Conversations tab, locate the turn at bad_round_index in convo_b and verify that the assistant response realises the planned weakness, neither a different category nor a milder version. If convo_b exhibits a different assistant_behavior_type than the one declared, or if no turn realises the planned ...

[67]

Cross-check using the judge ensemble. Open the Judges tab. If the strongest judges in the registry (gemini-3.1-pro, gpt-5.5, opus-4.7) consistently miss the verdict while several weaker judges (gpt-oss-20b, gemma-4-31b) get it right, treat the pair as suspicious. The strong judges’ reasoning text usually points at a competing flaw in convo_a, which is the n...
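As a rough illustration of this cross-check, the sketch below flags a pair when every listed strong judge misses the declared verdict while enough weaker judges get it right. The judge names come from the text; reading "consistently" as "all strong judges present" and "several" as "at least two" are assumptions, as are the function name `is_suspicious` and the input layout.

```python
from typing import Dict

# Judge names as listed in the text; the two tiers are used only for this check.
STRONG_JUDGES = ("gemini-3.1-pro", "gpt-5.5", "opus-4.7")
WEAK_JUDGES = ("gpt-oss-20b", "gemma-4-31b")


def is_suspicious(correct: Dict[str, bool], min_weak_correct: int = 2) -> bool:
    """Return True when strong judges all miss but weaker judges succeed.

    correct[judge] is True when that judge matched the declared verdict.
    min_weak_correct is an assumed reading of "several weaker judges".
    """
    strong = [correct[j] for j in STRONG_JUDGES if j in correct]
    weak_correct = sum(1 for j in WEAK_JUDGES if correct.get(j, False))
    # Require at least one strong judge to be present, and all present to miss.
    all_strong_miss = bool(strong) and not any(strong)
    return all_strong_miss and weak_correct >= min_weak_correct


flagged = is_suspicious({
    "gemini-3.1-pro": False,
    "gpt-5.5": False,
    "opus-4.7": False,
    "gpt-oss-20b": True,
    "gemma-4-31b": True,
})
```

A flagged pair is only a candidate for closer reading; the check says nothing about which conversation actually contains the competing flaw.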

[68]

those little indicator things,

Verify the good conversation has no competing flaw. For each strong judge that picked the wrong side: (i) locate the disputed turn in convo_a on the Conversations tab, (ii) retrieve the relevant span from metadata.context (the source document), and (iii) consult an external LLM with a focused fact-check prompt that contains only the disputed turn, the matc...
