pith. machine review for the scientific record.

arxiv: 2604.15302 · v1 · submitted 2026-04-16 · 💻 cs.AI · cs.CL · cs.LG

Recognition: unknown

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 10:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG
keywords LLM-as-a-judge · conformal prediction · transitivity violations · NLG evaluation · reliability diagnostics · SummEval · prediction sets · Likert scores

The pith

Conformal prediction sets and transitivity analysis diagnose per-document reliability of LLM judges on NLG tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two tools to measure when LLM judges produce trustworthy scores on individual documents rather than just aggregate agreement. A transitivity check uncovers hidden inconsistencies, finding that one-third to two-thirds of documents contain at least one cyclic violation even when overall violation rates stay low. Conformal prediction then constructs guaranteed coverage sets over Likert scores, with narrower sets marking more reliable judgments. These set widths agree across judges, indicating they track document difficulty instead of judge-specific bias. The diagnostics show that reliability depends far more on the evaluation criterion than on which judge model is used.

Core claim

Split conformal prediction applied to LLM judge scores on SummEval produces sets with at least (1-α) coverage whose widths correlate with cross-judge agreement at rs = +0.576, while transitivity analysis reveals 33-67% of documents contain directed 3-cycles despite aggregate violation rates of only 0.8-4.1%. Both methods converge on the finding that prediction set widths remain consistent across judges (r-bar = 0.32-0.38) and that relevance yields the narrowest sets (average size ≈3.0) while fluency and consistency yield the widest (≈4.9).
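The 3-cycle diagnostic in the core claim can be sketched in a few lines: given one judge's pairwise preferences over the summaries of a single document, count the fraction of triples that form a directed cycle. The input format (a dict of ordered preference pairs, ties excluded) is a hypothetical simplification, not the paper's actual pairwise protocol.

```python
from itertools import combinations

def cycle_violation_rate(prefs):
    """Fraction of item triples forming a directed 3-cycle.

    prefs: dict mapping an ordered pair (a, b) -> True if the judge
    preferred a over b. Hypothetical input format for illustration.
    """
    items = sorted({x for pair in prefs for x in pair})
    triples = list(combinations(items, 3))
    if not triples:
        return 0.0
    cycles = 0
    for a, b, c in triples:
        # A 3-cycle exists iff preferences run a>b>c>a or a>c>b>a.
        if (prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a))) or \
           (prefs.get((a, c)) and prefs.get((c, b)) and prefs.get((b, a))):
            cycles += 1
    return cycles / len(triples)

# a>b and b>c, yet c>a: the lone triple is cyclic.
print(cycle_violation_rate({("a", "b"): True, ("b", "c"): True, ("c", "a"): True}))  # 1.0
```

Aggregating this rate across documents gives the low averages the paper reports, while the per-document values expose the 33-67% of inputs with at least one cycle.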

What carries the argument

The central mechanism is the joint use of transitivity violation detection on judgment graphs and split conformal prediction sets over 1-5 Likert scores, where set width functions as a per-instance reliability signal.
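The conformal half of that mechanism can be sketched with the standard split-CP recipe. The per-score probabilities below are hypothetical inputs; the paper's actual nonconformity score may differ.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.10):
    """Split conformal prediction sets over 1-5 Likert scores.

    A minimal sketch under an assumed nonconformity score:
    one minus the probability the judge assigns to the true score.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), np.asarray(cal_labels) - 1]
    # Conformal quantile with the finite-sample (n + 1) correction.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # A Likert score k enters the set iff its nonconformity is <= q;
    # wider sets flag less reliable judgments.
    return [(np.where(1.0 - p <= q)[0] + 1).tolist() for p in test_probs]

cal_probs = np.full((5, 5), 0.025)
cal_probs[:, 0] = 0.9                      # judge is confident score 1 is right
cal_labels = np.ones(5, dtype=int)
sets = conformal_sets(cal_probs, cal_labels, cal_probs[:1])
print(sets)  # [[1]] — a narrow set, read as a reliable judgment
```

By construction the set contains the true score with probability at least 1−α, so width is the free byproduct that serves as the per-instance reliability signal.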

If this is right

  • Prediction set width supplies a practical, single-judge signal for flagging unreliable document-level scores.
  • Evaluation frameworks can prioritize or reweight criteria according to measured reliability, with relevance treated as more trustworthy than fluency or consistency.
  • Aggregate agreement metrics mask substantial per-document inconsistency that transitivity analysis can surface.
  • Criterion-level differences in set size suggest that judge reliability is task- and aspect-dependent rather than model-dependent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same toolkit could be applied to other NLG evaluation datasets to map which aspects of text are systematically harder for automated judges to score consistently.
  • If set widths also predict error rates when the judgments are fed into larger pipelines, they could serve as filters before those pipelines run.
  • Extending the conformal sets to non-Likert or multi-turn judgment formats would test whether the reliability signal generalizes beyond the current 1-5 scale setup.

Load-bearing premise

LLM judge scores on different documents satisfy the exchangeability condition required for conformal prediction to deliver valid coverage guarantees, and transitivity violations in those scores reflect genuine practical unreliability.

What would settle it

A new collection of documents in which prediction set width shows no correlation with actual cross-judge agreement rates or in which high-transitivity-violation documents produce no measurable drop in downstream task performance when their judgments are used.

Figures

Figures reproduced from arXiv: 2604.15302 by Dhruv Kumar, Manan Gupta.

Figure 1
Figure 1. Two-pronged diagnostic pipeline. SummEval documents are evaluated by four LLM judges under two protocols. The pairwise protocol (40,320 API calls) feeds the transitivity diagnostic, which measures directed 3-cycle violation rates ρ(x) per input and tests whether MFAS ranking repair improves agreement with human rankings. The direct scoring protocol (3,840 API calls) feeds the conformal diagnostic, which pr… view at source ↗
Figure 2
Figure 2. Per-document violation rate distributions. Each violin shows the distribution of ρ(x) across 30 documents for one judge–criterion pair. Dashed horizontal line: random-baseline rate (0.25). All distributions are right-tailed with median = 0, but the upper tails, where a single document can expose > 30% violation rates, are practically significant. Fluency consistently shows the widest tails. view at source ↗
Figure 3
Figure 3. Average prediction set size at α=0.10 (green = small = reliable; red = large = unreliable). Each cell shows average set size (larger text) and empirical coverage (smaller text). The criterion axis drives variation far more than the judge axis: coherence and relevance (left two columns) are reliably judged (≈3.0), while fluency and consistency are near-maximally uncertain (≈5.0). All 16 cells meet the 90% … view at source ↗
Figure 4
Figure 4. Pooled reliability diagrams (all four judges, α=0.10). x-axis: prediction set width; y-axis: mean absolute error (MAE) vs. human score. Error bars: 95% CI. Annotations: sample count per width. Spearman rs and p-value shown per panel. Consistency shows the clearest signal (rs = +0.34, p<0.0001); relevance is the exception (rs ≈ 0, p=0.86). view at source ↗
Figure 5
Figure 5. Inter-judge width agreement matrices (Spearman r, α=0.10). Rows/columns: the four judges. Diagonal forced to 1.0. Coherence (leftmost) shows predominantly near-zero off-diagonal entries; fluency and relevance show consistently positive agreement, confirming that prediction width tracks document-level difficulty across model families. view at source ↗
Figure 6
Figure 6. Empirical coverage vs. α. Shaded bands: ±1 std across the four judges. Dashed line: theoretical guarantee 1−α. Coverage meets or exceeds the guarantee at every operating point for all four criteria. view at source ↗
read the original abstract

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a two-pronged diagnostic for LLM judge reliability on SummEval: (1) transitivity analysis revealing low aggregate violation rates (0.8-4.1%) but 33-67% of documents with at least one directed 3-cycle, indicating masked per-input inconsistencies; and (2) split conformal prediction sets on 1-5 Likert scores that provide guaranteed coverage, with set width proposed as a per-instance reliability signal (pooled r_s = +0.576). It reports consistent cross-judge agreement on set widths (r-bar = 0.32-0.38) and concludes that criterion (relevance most reliable, fluency/consistency least) matters more than judge choice. All code, prompts, and results are released.

Significance. If the conformal sets are valid, the work supplies a theoretically grounded per-instance uncertainty measure for LLM judges that goes beyond aggregate correlations, with the cross-judge width agreement and criterion differences offering actionable guidance for NLG evaluation. The open release of code and cached results is a clear strength for reproducibility.

major comments (2)
  1. [Split Conformal Prediction section] The claimed ≥(1-α) coverage and use of set width as a reliability indicator rest on the exchangeability assumption between calibration and test points. The manuscript applies split conformal prediction to LLM judge scores on heterogeneous SummEval documents without verifying or discussing this assumption (e.g., via permutation tests or sensitivity checks for document difficulty distributions or prompt sensitivity). This is load-bearing for the coverage guarantees and for interpreting the cross-judge r-bar = 0.32-0.38 as evidence that width captures intrinsic difficulty rather than shared biases.
  2. [Results on criterion vs. judge effects] The claim that criterion matters more than judge is supported by average set sizes (relevance ≈3.0, fluency/consistency ≈4.9) and the cross-judge width correlations, but the manuscript does not report a direct statistical comparison (e.g., ANOVA or effect-size contrast) of the variance attributable to criteria versus judges. Without this, the relative-importance conclusion is not fully substantiated by the data.
minor comments (2)
  1. [Abstract] The pooled Spearman correlation r_s = +0.576 is presented as linking set width to a 'per-instance reliability indicator,' but the exact quantity being correlated (e.g., human agreement, score variance) is not defined here; this should be stated explicitly.
  2. [Abstract] Notation: The symbols r-bar (cross-judge agreement) and rho-bar (violation rates) are used without immediate definition in the abstract; adding a short parenthetical or reference to the methods section would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the assumptions and strengthen the statistical support for our claims. We address each major point below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Split Conformal Prediction section] The claimed ≥(1-α) coverage and use of set width as a reliability indicator rest on the exchangeability assumption between calibration and test points. The manuscript applies split conformal prediction to LLM judge scores on heterogeneous SummEval documents without verifying or discussing this assumption (e.g., via permutation tests or sensitivity checks for document difficulty distributions or prompt sensitivity). This is load-bearing for the coverage guarantees and for interpreting the cross-judge r-bar = 0.32-0.38 as evidence that width captures intrinsic difficulty rather than shared biases.

    Authors: We agree that the exchangeability assumption is foundational for the coverage guarantees of split conformal prediction and that its applicability to heterogeneous documents merits explicit discussion. The original manuscript applied the standard split-CP procedure without additional verification steps for this dataset. In revision, we will add a dedicated paragraph in the methods section acknowledging the assumption, its potential sensitivity to document heterogeneity and prompt variations, and its implications for interpreting cross-judge width correlations. We will also include an empirical sensitivity check: randomly re-splitting the data multiple times, recomputing coverage on held-out points, and reporting stability of the observed coverage rates. These additions will make the load-bearing nature of the assumption transparent while providing evidence that the guarantees remain practically reliable in this setting. revision: yes

  2. Referee: [Results on criterion vs. judge effects] The claim that criterion matters more than judge is supported by average set sizes (relevance ≈3.0, fluency/consistency ≈4.9) and the cross-judge width correlations, but the manuscript does not report a direct statistical comparison (e.g., ANOVA or effect-size contrast) of the variance attributable to criteria versus judges. Without this, the relative-importance conclusion is not fully substantiated by the data.

    Authors: The referee correctly identifies that a direct statistical contrast would provide stronger substantiation for the claim that criterion effects dominate judge effects. The manuscript currently relies on descriptive averages and cross-judge correlations. In revision, we will add an ANOVA (or mixed-effects model) on prediction-set widths with fixed factors for criterion and judge (and their interaction), reporting partial eta-squared effect sizes, F-statistics, and p-values. This analysis will be performed on the pooled data and per-judge subsets, directly quantifying the relative variance explained by each factor and confirming that criterion accounts for substantially more variance than judge identity. revision: yes
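The re-splitting sensitivity check the authors commit to in their first response can be prototyped cheaply: repeatedly re-split the data into calibration and test halves, recompute the conformal quantile, and record held-out coverage. The synthetic data and the nonconformity score below are illustrative stand-ins for the paper's cached judge outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_coverage(probs, labels, alpha=0.10):
    """Coverage of split-CP sets on one random calibration/test split.

    Sketch of the proposed re-splitting check; `probs` (per-document
    score probabilities) and the nonconformity choice are assumptions.
    """
    n = len(labels)
    idx = rng.permutation(n)
    cal, test = idx[: n // 2], idx[n // 2 :]
    scores = 1.0 - probs[cal, labels[cal] - 1]
    level = min(1.0, np.ceil((len(cal) + 1) * (1 - alpha)) / len(cal))
    q = np.quantile(scores, level, method="higher")
    # A test point is covered iff its true score's nonconformity is <= q.
    return float(((1.0 - probs[test, labels[test] - 1]) <= q).mean())

# Synthetic exchangeable data: coverage should hover at or above 1 - alpha.
probs = rng.dirichlet(np.ones(5), size=400)
labels = np.array([rng.choice(5, p=p) + 1 for p in probs])
rates = [empirical_coverage(probs, labels) for _ in range(20)]
print(round(float(np.mean(rates)), 2))
```

Stable held-out coverage across many such re-splits is the evidence the rebuttal promises; systematic undercoverage on some splits would instead signal an exchangeability failure.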

Circularity Check

0 steps flagged

No significant circularity; derivations rely on direct empirical measurements and standard conformal guarantees

full rationale

The paper's core results—transitivity violation rates computed directly from pairwise judge comparisons on SummEval documents, and prediction set widths obtained via standard split conformal prediction on Likert scores—are independent of the target claims. Cross-judge agreement on set widths (r-bar = 0.32-0.38) is an observed Spearman correlation, not a quantity forced by construction or by fitting parameters to the same data. No step renames a known result, imports uniqueness via self-citation, or defines reliability in terms of the width it then 'predicts.' The exchangeability assumption for coverage is an unverified modeling choice (a validity issue), but the reported statistics and correlations do not reduce to it tautologically. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the standard exchangeability assumption for split conformal prediction to deliver coverage guarantees and treats transitivity as a diagnostic lens without introducing new fitted parameters or postulated entities.

axioms (1)
  • standard math — Data points are exchangeable, so that split conformal prediction delivers valid marginal coverage of at least 1-alpha
    This is the core theoretical assumption invoked to guarantee the prediction sets contain the true Likert score with the stated probability.

pith-pipeline@v0.9.0 · 5559 in / 1211 out tokens · 28183 ms · 2026-05-10T10:38:38.146846+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 5 canonical work pages · 1 internal anchor

  2. [2]

    Aggregating inconsistent information: Ranking and clustering

    Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the ACM, 55(5): 1–27, 2008

  3. [3]

    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

    Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021

  4. [4]

    Rank analysis of incomplete block designs: I. The method of paired comparisons

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4): 324–345, 1952

  5. [5]

    Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix

    Marie Jean Antoine Nicolas Caritat de Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. Imprimerie Royale, 1785

  6. [6]

    SummEval: Re-evaluating summarization evaluation

    Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9: 391–409, 2021

  7. [7]

    The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation

    Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, and Orhan Firat. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Proceedings of the Eighth Conference on Machine Translation, pp. 1066--10...

  8. [8]

    Unsupervised quality estimation for neural machine translation

    Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. Unsupervised quality estimation for neural machine translation. In Transactions of the Association for Computational Linguistics, volume 8, pp. 539–555, 2020

  9. [10]

    Benchmarking cognitive biases in large language models as evaluators

    Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012, 2023

  10. [11]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023

  11. [12]

    Conformal prediction with large language models for multi-choice question answering

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023

  12. [13]

    G-Eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511–2522. Association for Computational Linguistics, 2023

  13. [14]

    BERT-based conformal predictor for intent classification

    Lysimachos Maltoudoglou, Andreas Paisios, and Harris Sakkas. BERT-based conformal predictor for intent classification. In Proceedings of the Ninth Symposium on Conformal and Probabilistic Prediction and Applications, pp. 178–193, 2020

  14. [15]

    Topics on Tournaments

    John W. Moon. Topics on Tournaments. Holt, Rinehart and Winston, 1968

  15. [16]

    Inductive confidence machines for regression

    Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pp. 345–356. Springer, 2002

  16. [17]

    Large language models are effective text rankers with pairwise ranking prompting

    Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Shen, Tianyi Liu, Jiaming Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024

  17. [18]

    Conformal language modeling

    Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In The Twelfth International Conference on Learning Representations, 2024

  18. [19]

    Verbosity bias in preference labeling by large language models

    Keita Saito, Saku Sugawara, and Kentaro Inui. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076, 2023

  19. [20]

    A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single-winner election method

    Markus Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36(2): 267–303, 2011

  20. [21]

    Conformal prediction under covariate shift

    Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32, 2019

  21. [22]

    Algorithmic Learning in a Random World

    Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005

  22. [23]

    Large language models are not yet human-level evaluators for abstractive summarization

    Chenhui Wang, Yutai Yang, Chenghao Dang, and Wanxiang Che. Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4215–4233. Association for Computational Linguistics, 2023

  23. [24]

    FLASK: Fine-grained language model evaluation based on alignment skill sets

    Seonghyeon Ye, Doyoung Kim, Sungdong Jang, Hyungjoo Shin, Youngjae Baek, Juho Song, Dongha Park, and Minjoon Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  24. [25]

    Condorcet's theory of voting

    H. Peyton Young. Condorcet's theory of voting. American Political Science Review, 82(4): 1231–1244, 1988

  25. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023
