Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Pith reviewed 2026-05-10 10:38 UTC · model grok-4.3
The pith
Conformal prediction sets and transitivity analysis diagnose per-document reliability of LLM judges on NLG tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Split conformal prediction applied to LLM judge scores on SummEval produces sets with at least (1-α) coverage whose widths correlate with cross-judge agreement at r_s = +0.576, while transitivity analysis reveals that 33-67% of documents contain directed 3-cycles despite aggregate violation rates of only 0.8-4.1%. Both methods converge on criterion-level rather than judge-level differences: prediction set widths remain consistent across judges (r-bar = 0.32-0.38), relevance yields the narrowest sets (average size ≈ 3.0), and fluency and consistency yield the widest (≈ 4.9).
What carries the argument
The central mechanism is the joint use of transitivity violation detection on judgment graphs and split conformal prediction sets over 1-5 Likert scores, where set width functions as a per-instance reliability signal.
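To make the conformal prong concrete, here is a minimal sketch of split conformal prediction over a 5-point Likert scale, assuming the judge exposes a probability for each score; the paper's exact nonconformity score is not specified in this summary, so the standard one-minus-true-label-probability choice is used. Row sums of the returned membership matrix are the per-instance widths.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets over 1-5 Likert scores.

    cal_probs:  (n_cal, 5) judge probabilities per calibration document
    cal_labels: (n_cal,)   observed scores, 0-indexed into {1,...,5}
    test_probs: (n_test, 5) judge probabilities per test document
    Returns a boolean (n_test, 5) membership matrix; row sums are widths.
    """
    n = len(cal_labels)
    # Nonconformity: one minus the probability assigned to the true score.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(scores, q_level, method="higher")
    # A label enters the set when its nonconformity is within the threshold.
    return (1.0 - test_probs) <= q_hat

# Toy usage with hypothetical judge probabilities (not the paper's data):
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(5), size=200)
cal_y = rng.integers(0, 5, size=200)
test_p = rng.dirichlet(np.ones(5), size=10)
widths = split_conformal_sets(cal_p, cal_y, test_p).sum(axis=1)
```

Taking the quantile with method="higher" keeps the threshold conservative at finite n, which is what delivers the ≥(1-α) marginal coverage under exchangeability.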
If this is right
- Prediction set width supplies a practical, single-judge signal for flagging unreliable document-level scores.
- Evaluation frameworks can prioritize or reweight criteria according to measured reliability, with relevance treated as more trustworthy than fluency or consistency.
- Aggregate agreement metrics mask substantial per-document inconsistency that transitivity analysis can surface (see the cycle-detection sketch after this list).
- Criterion-level differences in set size suggest that judge reliability is task- and aspect-dependent rather than model-dependent.
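As a concrete companion to the masking point above, the sketch below checks one document's pairwise judgments for a directed 3-cycle; the pair encoding and function name are illustrative, not the paper's implementation.

```python
from itertools import combinations

def has_directed_3cycle(prefs):
    """Detect a transitivity violation in one document's judgment graph.

    prefs: dict mapping ordered pairs (a, b) -> True when the judge
           preferred summary a over summary b for this document.
    Returns True if some triple forms a cycle (a>b, b>c, c>a).
    """
    items = sorted({x for pair in prefs for x in pair})
    beats = lambda a, b: prefs.get((a, b), False)
    for a, b, c in combinations(items, 3):
        # A tournament triangle is cyclic in one of two orientations.
        if (beats(a, b) and beats(b, c) and beats(c, a)) or \
           (beats(b, a) and beats(c, b) and beats(a, c)):
            return True
    return False

# Hypothetical rock-paper-scissors judgments for one document:
prefs = {("s1", "s2"): True, ("s2", "s3"): True, ("s3", "s1"): True}
assert has_directed_3cycle(prefs)
```

The fraction of documents returning True is a per-input statistic of the kind behind the 33-67% figure, whereas the 0.8-4.1% aggregate rate pools violating triples over all triples; the gap between the two is exactly the masking effect.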
Where Pith is reading between the lines
- The same toolkit could be applied to other NLG evaluation datasets to map which aspects of text are systematically harder for automated judges to score consistently.
- If set widths also predict error rates when the judgments are fed into larger pipelines, they could serve as filters before those pipelines run.
- Extending the conformal sets to non-Likert or multi-turn judgment formats would test whether the reliability signal generalizes beyond the current 1-5 scale setup.
Load-bearing premise
LLM judge scores on different documents satisfy the exchangeability condition required for conformal prediction to deliver valid coverage guarantees, and transitivity violations in those scores reflect genuine practical unreliability.
What would settle it
A new collection of documents in which prediction set width shows no correlation with actual cross-judge agreement rates or in which high-transitivity-violation documents produce no measurable drop in downstream task performance when their judgments are used.
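The first half of that test is straightforward to run; a minimal sketch, assuming per-document set widths from one judge and a per-document cross-judge disagreement measure on the new collection (both inputs hypothetical):

```python
from scipy.stats import spearmanr

def width_reliability_test(widths, disagreement, alpha=0.05):
    """Spearman test of whether set width tracks cross-judge disagreement.

    widths:       per-document prediction-set widths from one judge
    disagreement: per-document cross-judge disagreement (e.g., score variance)
    The width-as-reliability claim fails on this collection when the
    correlation is indistinguishable from zero or negative.
    """
    r_s, p_value = spearmanr(widths, disagreement)
    return r_s, p_value, bool(p_value < alpha and r_s > 0)
```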
original abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: (1) a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates (ρ̄ = 0.8-4.1%), with 33-67% of documents exhibiting at least one directed 3-cycle; and (2) split conformal prediction sets over 1-5 Likert scores providing theoretically guaranteed ≥(1-α) coverage, with set width serving as a per-instance reliability indicator (r_s = +0.576, N = 1,918, p < 10^-100, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement (r̄ = 0.32-0.38), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size ≈ 3.0) and coherence moderately so (avg. set size ≈ 3.9), while fluency and consistency remain unreliable (avg. set size ≈ 4.9). We release all code, prompts, and cached results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-pronged diagnostic for LLM judge reliability on SummEval: (1) transitivity analysis revealing low aggregate violation rates (0.8-4.1%) but 33-67% of documents with at least one directed 3-cycle, indicating masked per-input inconsistencies; and (2) split conformal prediction sets on 1-5 Likert scores that provide guaranteed coverage, with set width proposed as a per-instance reliability signal (pooled r_s = +0.576). It reports consistent cross-judge agreement on set widths (r-bar = 0.32-0.38) and concludes that criterion (relevance most reliable, fluency/consistency least) matters more than judge choice. All code, prompts, and results are released.
Significance. If the conformal sets are valid, the work supplies a theoretically grounded per-instance uncertainty measure for LLM judges that goes beyond aggregate correlations, with the cross-judge width agreement and criterion differences offering actionable guidance for NLG evaluation. The open release of code and cached results is a clear strength for reproducibility.
major comments (2)
- [Split Conformal Prediction section] The claimed ≥(1-α) coverage and use of set width as a reliability indicator rest on the exchangeability assumption between calibration and test points. The manuscript applies split conformal prediction to LLM judge scores on heterogeneous SummEval documents without verifying or discussing this assumption (e.g., via permutation tests or sensitivity checks for document difficulty distributions or prompt sensitivity). This is load-bearing for the coverage guarantees and for interpreting the cross-judge r-bar = 0.32-0.38 as evidence that width captures intrinsic difficulty rather than shared biases.
- [Results on criterion vs. judge effects] The claim that criterion matters more than judge is supported by average set sizes (relevance ≈ 3.0, fluency/consistency ≈ 4.9) and the cross-judge width correlations, but the manuscript does not report a direct statistical comparison (e.g., ANOVA or effect-size contrast) of variance attributable to criteria versus judges. Without this, the relative importance conclusion is not fully substantiated by the data.
minor comments (2)
- [Abstract] The pooled Spearman correlation r_s = +0.576 is presented as linking set width to a 'per-instance reliability indicator,' but the exact quantity being correlated (e.g., human agreement, score variance) is not defined here; this should be stated explicitly.
- [Abstract] Notation: The symbols r-bar (cross-judge agreement) and rho-bar (violation rates) are used without immediate definition in the abstract; adding a short parenthetical or reference to the methods section would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the assumptions and strengthen the statistical support for our claims. We address each major point below and will incorporate revisions to improve the manuscript.
point-by-point responses
- Referee: [Split Conformal Prediction section] The claimed ≥(1-α) coverage and use of set width as a reliability indicator rest on the exchangeability assumption between calibration and test points. The manuscript applies split conformal prediction to LLM judge scores on heterogeneous SummEval documents without verifying or discussing this assumption (e.g., via permutation tests or sensitivity checks for document difficulty distributions or prompt sensitivity). This is load-bearing for the coverage guarantees and for interpreting the cross-judge r-bar = 0.32-0.38 as evidence that width captures intrinsic difficulty rather than shared biases.
Authors: We agree that the exchangeability assumption is foundational for the coverage guarantees of split conformal prediction and that its applicability to heterogeneous documents merits explicit discussion. The original manuscript applied the standard split-CP procedure without additional verification steps for this dataset. In revision, we will add a dedicated paragraph in the methods section acknowledging the assumption, its potential sensitivity to document heterogeneity and prompt variations, and its implications for interpreting cross-judge width correlations. We will also include an empirical sensitivity check: randomly re-splitting the data multiple times, recomputing coverage on held-out points, and reporting stability of the observed coverage rates. These additions will make the load-bearing nature of the assumption transparent while providing evidence that the guarantees remain practically reliable in this setting. revision: yes
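A minimal sketch of that re-splitting check, reusing the nonconformity construction from the sketch earlier in this review; the probability-matrix input is an assumption, not the authors' stated setup. Per-split coverage that is stable around 1-α is consistent with, though does not prove, exchangeability.

```python
import numpy as np

def coverage_over_resplits(probs, labels, alpha=0.1, n_splits=100, seed=0):
    """Empirical split-CP coverage across random calibration/test splits.

    probs:  (n, 5) judge probabilities; labels: (n,) scores in {0,...,4}.
    Returns one held-out coverage rate per random re-split.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    rates = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        cal, test = idx[: n // 2], idx[n // 2:]
        scores = 1.0 - probs[cal, labels[cal]]
        q_level = min(np.ceil((len(cal) + 1) * (1 - alpha)) / len(cal), 1.0)
        q_hat = np.quantile(scores, q_level, method="higher")
        # A test point is covered when its true label enters the set.
        rates.append(((1.0 - probs[test, labels[test]]) <= q_hat).mean())
    return np.array(rates)
```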
- Referee: [Results on criterion vs. judge effects] The claim that criterion matters more than judge is supported by average set sizes (relevance ≈ 3.0, fluency/consistency ≈ 4.9) and the cross-judge width correlations, but the manuscript does not report a direct statistical comparison (e.g., ANOVA or effect-size contrast) of variance attributable to criteria versus judges. Without this, the relative importance conclusion is not fully substantiated by the data.
Authors: The referee correctly identifies that a direct statistical contrast would provide stronger substantiation for the claim that criterion effects dominate judge effects. The manuscript currently relies on descriptive averages and cross-judge correlations. In revision, we will add an ANOVA (or mixed-effects model) on prediction-set widths with fixed factors for criterion and judge (and their interaction), reporting partial eta-squared effect sizes, F-statistics, and p-values. This analysis will be performed on the pooled data and per-judge subsets, directly quantifying the relative variance explained by each factor and confirming that criterion accounts for substantially more variance than judge identity. revision: yes
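One way that analysis could look, assuming set widths are gathered in a long-format table; the column names are illustrative, and a mixed-effects model with document random effects would be the natural refinement for the repeated-measures structure.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

def criterion_vs_judge_anova(df):
    """Two-way ANOVA on prediction-set widths with effect sizes.

    df: long-format DataFrame with columns 'width', 'criterion', 'judge'
        (one row per document x judge x criterion cell).
    Returns the type-II ANOVA table augmented with partial eta-squared.
    """
    model = ols("width ~ C(criterion) * C(judge)", data=df).fit()
    table = sm.stats.anova_lm(model, typ=2)
    resid_ss = table.loc["Residual", "sum_sq"]
    # Partial eta^2 = SS_effect / (SS_effect + SS_residual).
    table["partial_eta_sq"] = table["sum_sq"] / (table["sum_sq"] + resid_ss)
    return table
```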
Circularity Check
No significant circularity; derivations rely on direct empirical measurements and standard conformal guarantees
full rationale
The paper's core results—transitivity violation rates computed directly from pairwise judge comparisons on SummEval documents, and prediction set widths obtained via standard split conformal prediction on Likert scores—are independent of the target claims. Cross-judge agreement on set widths (r-bar = 0.32-0.38) is an observed Spearman correlation, not a quantity forced by construction or by fitting parameters to the same data. No step renames a known result, imports uniqueness via self-citation, or defines reliability in terms of the width it then 'predicts.' The exchangeability assumption for coverage is an unverified modeling choice (a validity issue), but the reported statistics and correlations do not reduce to it tautologically. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) Data points are exchangeable, so split conformal prediction delivers valid marginal coverage of at least 1-α.
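For reference, the guarantee this axiom purchases is the standard split conformal result (Papadopoulos et al., 2002; Vovk et al., 2005): with calibration nonconformity scores s_1, ..., s_n and q̂ their ⌈(n+1)(1-α)⌉/n empirical quantile,

```latex
\mathbb{P}\bigl(Y_{\text{test}} \in C(X_{\text{test}})\bigr) \ge 1 - \alpha,
\qquad
C(x) = \{\, y \in \{1,\dots,5\} : s(x, y) \le \hat{q} \,\},
```

with the probability taken marginally over the joint draw of calibration and test points.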
Reference graph
Works this paper leans on
- [2] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the ACM, 55(5):1-27, 2008.
- [3] Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
- [4] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.
- [5] Marie Jean Antoine Nicolas Caritat de Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. Imprimerie Royale, 1785.
- [6] Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391-409, 2021.
- [7] Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, and Orhan Firat. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Proceedings of the Eighth Conference on Machine Translation, pp. 1066-10…, 2023.
- [8] Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539-555, 2020.
- [10] Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012, 2023.
- [11] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023.
- [12] Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.
- [13] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511-2522. Association for Computational Linguistics, 2023.
- [14] Lysimachos Maltoudoglou, Andreas Paisios, and Harris Sakkas. BERT-based conformal predictor for intent classification. In Proceedings of the Ninth Symposium on Conformal and Probabilistic Prediction and Applications, pp. 178-193, 2020.
- [15] John W. Moon. Topics on Tournaments. Holt, Rinehart and Winston, 1968.
- [16] Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pp. 345-356. Springer, 2002.
- [17] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Shen, Tianyi Liu, Jiaming Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
- [18] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In The Twelfth International Conference on Learning Representations, 2024.
- [19] Keita Saito, Saku Sugawara, and Kentaro Inui. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076, 2023.
- [20] Markus Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36(2):267-303, 2011.
- [21] Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32, 2019.
- [22] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
- [23] Chenhui Wang, Yutai Yang, Chenghao Dang, and Wanxiang Che. Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4215-4233. Association for Computational Linguistics, 2023.
- [24] Seonghyeon Ye, Doyoung Kim, Sungdong Jang, Hyungjoo Shin, Youngjae Baek, Juho Song, Dongha Park, and Minjoon Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024.
- [25] H. Peyton Young. Condorcet's theory of voting. American Political Science Review, 82(4):1231-1244, 1988.
- [26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
discussion (0)