Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

2026 · cs.AI · arXiv 2604.15302

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
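The two diagnostics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the nonconformity score is assumed to be the absolute difference between the judge's score and a human reference score on a held-out calibration split, and the function names (`conformal_quantile`, `prediction_set`, `count_directed_3cycles`) and the `beats` preference table are hypothetical.

```python
import math
from itertools import combinations

LIKERT = (1, 2, 3, 4, 5)  # the 1-5 scale mentioned in the abstract


def conformal_quantile(judge_cal, human_cal, alpha):
    """Split-conformal threshold from a calibration split.

    Assumption: nonconformity = |judge score - human score|.
    """
    residuals = sorted(abs(j - h) for j, h in zip(judge_cal, human_cal))
    n = len(residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    # Too few calibration points for this alpha: fall back to the full scale.
    return math.inf if k > n else residuals[k - 1]


def prediction_set(judge_score, q_hat):
    """All Likert labels within q_hat of the judge's score."""
    return [y for y in LIKERT if abs(y - judge_score) <= q_hat]


def count_directed_3cycles(beats):
    """Count triples (a, b, c) whose pairwise preferences form a cycle.

    beats[a][b] is True when the judge prefers item a over item b.
    """
    cycles = 0
    for a, b, c in combinations(sorted(beats), 3):
        forward = beats[a][b] and beats[b][c] and beats[c][a]
        backward = beats[a][c] and beats[c][b] and beats[b][a]
        if forward or backward:
            cycles += 1
    return cycles
```

Under these assumptions, a wider `prediction_set` for a given summary signals lower per-instance reliability, and a document is flagged as intransitive when `count_directed_3cycles` returns a value above zero for the judge's pairwise preferences over its candidate summaries.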

representative citing papers

Can LLMs Rank? A Tale of Triads and Triage

cs.CY · 2026-06-29 · unverdicted · novelty 5.0

LLM ranking reliability for prioritization tasks can be assessed via coefficient of consistency ζ (intra-run circular triads) and Kendall's τ (inter-run distance), with three leading models showing distinct consistency profiles on homelessness allocation and ED triage.

citing papers explorer

Showing 1 of 1 citing paper.

Can LLMs Rank? A Tale of Triads and Triage

cs.CY · 2026-06-29 · unverdicted · none · ref 36 · internal anchor
