LLMs Judging LLMs: A Simplex Perspective
Pith reviewed 2026-05-19 12:36 UTC · model grok-4.3
The pith
Rankings based solely on LLM judges are robust on many but not all datasets; modeling epistemic uncertainty about judge quality with simplex-based Bayesian priors yields substantially higher coverage rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When LLM judges and candidates are placed as points on the (M-1)-dimensional probability simplex, geometric quantities such as triangle areas correspond to ranking concepts. This yields identifiability conditions that favor two-level scoring (M=2) over multi-level scoring (M>2), together with Bayesian priors on judge quality that produce rankings with substantially higher coverage rates than existing procedures on LLM benchmarks.
What carries the argument
The (M-1)-dimensional probability simplex, on which both judges and candidates are represented as points, with geometric relations (such as triangle areas) standing in for ranking semantics.
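As a minimal sketch of this representation, assume a judge's (or candidate's) simplex point is simply the vector of empirical score frequencies. The snippet below builds such a point and computes the area of a triangle spanned by three points for M=3. The helper names `simplex_point` and `triangle_area` are hypothetical, and the paper's exact correspondence between areas and ranking concepts is not reproduced here.

```python
from collections import Counter
import math

def simplex_point(scores, levels):
    """Empirical score frequencies over `levels`: a point on the (M-1)-simplex."""
    if not scores:
        raise ValueError("need at least one score")
    counts = Counter(scores)
    return [counts[level] / len(scores) for level in levels]

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c given as 3-vectors (M=3)."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    # Half the norm of the cross product of the two edge vectors.
    cross = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    return 0.5 * math.sqrt(sum(x * x for x in cross))

# Hypothetical 3-level scores assigned by one judge to one candidate.
point = simplex_point([1, 1, 2, 3], levels=[1, 2, 3])

# The whole 2-simplex is the triangle between the three unit vectors.
full_area = triangle_area([1, 0, 0], [0, 1, 0], [0, 0, 1])
```

Dividing a triangle's area by `full_area` gives a scale-free fraction of the simplex, which is one convenient way to compare regions across plots.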
If this is right
- Rankings derived from LLM judges alone remain stable across many but not all existing datasets.
- Explicitly modeling epistemic uncertainty about judge quality produces coverage rates that exceed those of current methods.
- Two-level scoring systems admit clearer identifiability than multi-level systems because their simplex geometry aligns more directly with ranking concepts.
- Sensitivity analysis over different priors on judge quality can be performed directly on the simplex.
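The two-level case in the bullets above can be illustrated with a small sketch. It assumes, purely for illustration, that a judge applies one fixed confusion matrix to every candidate, with true-positive rate `tpr` and false-positive rate `fpr`. These names and the function `observed_pass_rate` are hypothetical, and this is not the paper's identifiability condition. Under that assumption the judged pass rate is an affine function of the true pass rate, so the ranking survives whenever `tpr > fpr`.

```python
def observed_pass_rate(true_rate, tpr, fpr):
    """Pass rate that a noisy binary judge reports for a candidate."""
    # q = p * tpr + (1 - p) * fpr = fpr + p * (tpr - fpr)
    return true_rate * tpr + (1.0 - true_rate) * fpr

def rank(rates):
    """Candidate names sorted from best to worst by rate."""
    return sorted(rates, key=rates.get, reverse=True)

true_rates = {"A": 0.9, "B": 0.6, "C": 0.3}  # hypothetical candidates

# Sensitivity analysis over assumed judge quality (tpr, fpr).
for tpr, fpr in [(0.95, 0.05), (0.7, 0.4), (0.55, 0.5)]:
    judged = {k: observed_pass_rate(p, tpr, fpr) for k, p in true_rates.items()}
    assert rank(judged) == rank(true_rates)  # order preserved when tpr > fpr

# A judge that is worse than chance (tpr < fpr) reverses the order.
flipped = {k: observed_pass_rate(p, 0.3, 0.6) for k, p in true_rates.items()}
```

With more than two score levels the judged distribution is no longer governed by a single slope, which is one intuition for why the multi-level case needs stronger conditions.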
Where Pith is reading between the lines
- The same simplex construction might be applied to other LLM evaluation tasks such as pairwise preference or direct scoring.
- New benchmarks could be designed specifically to test the boundary cases where the simplex geometry predicts poor identifiability.
- The visual proofs on the simplex offer a way to communicate uncertainty in judge quality to practitioners who are not Bayesian statisticians.
Load-bearing premise
The mapping that places LLM judges and candidates onto simplex points continues to preserve the original ranking relationships implied by the scoring task.
What would settle it
A re-run of the benchmark experiments in which the proposed Bayesian method fails to produce higher coverage rates than the existing procedures, or in which small changes in assumed judge quality reverse the reported rankings.
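To make the first test concrete, coverage can be estimated as the fraction of repeated runs whose uncertainty interval contains the reference value. The sketch below assumes intervals are given as `(low, high)` pairs alongside known reference values. `coverage_rate` is a hypothetical helper, and the paper's definition of coverage for rankings may differ.

```python
def coverage_rate(intervals, truths):
    """Fraction of (low, high) intervals that contain the matching true value."""
    if len(intervals) != len(truths) or not intervals:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(low <= t <= high for (low, high), t in zip(intervals, truths))
    return hits / len(intervals)

# Hypothetical intervals from four repeated runs and their reference values.
intervals = [(0.4, 0.6), (0.1, 0.3), (0.7, 0.9), (0.2, 0.5)]
truths = [0.5, 0.35, 0.8, 0.2]
rate = coverage_rate(intervals, truths)
```

A method whose nominal 95% intervals cover far less than 95% of the time is understating its uncertainty.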
Original abstract
Given the challenge of automatically evaluating free-form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfectly accurate, it is unclear when such an approach is theoretically valid and practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for $M$-level scoring systems, both LLM judges and candidates can be represented as points on an $(M-1)$-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the "folk wisdom" that LLM judges are more effective for two-level scoring ($M=2$) than multi-level scoring ($M>2$). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a geometric framework for analyzing LLM-as-judge rankings of LLM candidates by representing both as points on an (M-1)-dimensional probability simplex. Geometric quantities such as triangle areas are used to derive identifiability conditions, formalize the advantage of binary (M=2) over multi-level scoring, and construct Bayesian priors that encode epistemic uncertainty about judge quality. Experiments on LLM benchmarks indicate that pure LLM-judge rankings are robust in many but not all datasets, while the proposed Bayesian method achieves substantially higher coverage rates than existing procedures.
Significance. If the core simplex mapping holds, the work supplies an intuitive geometric lens that explains empirical patterns in LLM evaluation and demonstrates concrete gains from modeling epistemic uncertainty via sensitivity analyses on geometric priors. The visual proofs and coverage improvements constitute clear strengths for a practical problem in automated evaluation.
major comments (1)
- The central claim that geometric quantities on the (M-1)-simplex directly correspond to ranking identifiability and uncertainty (introduced in the abstract and developed in the theoretical sections) rests on the assumption that the mapping of LLM judges and candidates to simplex coordinates faithfully encodes the empirical distribution of assigned scores. The manuscript provides no explicit empirical checks for exchangeability of score probabilities or for whether judge-quality uncertainty maps to the stated prior variance; if this correspondence fails, both the identifiability theorems and the reported coverage gains become interpretive rather than predictive. This is load-bearing for the entire geometric argument and the Bayesian construction.
minor comments (1)
- Notation for the simplex coordinates and the precise definition of the geometric priors could be stated more explicitly in a single dedicated subsection to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful feedback. The major comment raises a substantive point about the empirical grounding of our geometric and Bayesian constructions, which we address directly below with a proposed revision.
Referee: The central claim that geometric quantities on the (M-1)-simplex directly correspond to ranking identifiability and uncertainty (introduced in the abstract and developed in the theoretical sections) rests on the assumption that the mapping of LLM judges and candidates to simplex coordinates faithfully encodes the empirical distribution of assigned scores. The manuscript provides no explicit empirical checks for exchangeability of score probabilities or for whether judge-quality uncertainty maps to the stated prior variance; if this correspondence fails, both the identifiability theorems and the reported coverage gains become interpretive rather than predictive. This is load-bearing for the entire geometric argument and the Bayesian construction.
Authors: We thank the referee for highlighting this foundational modeling choice. The simplex coordinates are defined directly as the normalized probability vectors of the observed score distributions for each judge and candidate; the mapping therefore encodes the empirical score frequencies by construction rather than as a separate hypothesis. Geometric quantities such as triangle areas are then derived as functions of these probability vectors, yielding identifiability conditions that hold within the resulting probabilistic model. Exchangeability of score probabilities follows from treating assignments as multinomial draws from the simplex point, a standard assumption for categorical scoring that aligns with the data-generating process. The geometric Bayesian priors encode epistemic uncertainty by placing mass over plausible simplex locations for judge quality, with the reported sensitivity analyses varying concentration parameters to probe robustness. While these elements are theoretically consistent, we agree that explicit empirical verification would make the correspondence more transparent. In the revised manuscript we will add a short subsection (new Section 4.4) that reports (i) empirical checks on score-probability stability across repeated judgments within each benchmark and (ii) a comparison of observed judge variability against the prior variances used in the coverage experiments. This addition will strengthen the link between the geometric model and the reported gains without altering the core theorems.
Revision: partial
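The multinomial-draw assumption and the role of the concentration parameter described in the response can be sketched with the standard Dirichlet-multinomial conjugate update. This uses a textbook symmetric Dirichlet prior, not the paper's geometric prior; `posterior_mean` and the example counts are hypothetical.

```python
def posterior_mean(counts, concentration):
    """Posterior mean of a simplex point under a symmetric Dirichlet prior.

    `concentration` is the total prior mass, split evenly across score levels.
    """
    m = len(counts)
    alpha = [concentration / m] * m  # symmetric prior parameters
    total = sum(counts) + concentration
    return [(a + n) / total for a, n in zip(alpha, counts)]

counts = [8, 2]  # hypothetical: 8 "pass" and 2 "fail" judgments

# Sensitivity analysis: a larger concentration pulls the estimate toward uniform.
for concentration in (0.5, 2.0, 20.0):
    print(concentration, posterior_mean(counts, concentration))
```

Comparing how far the posterior mean moves across concentrations against the spread of repeated judgments is one simple form of the prior-versus-observed-variability check the authors propose.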
Circularity Check
No significant circularity in simplex modeling or Bayesian priors
The paper introduces the (M-1)-simplex representation as an explicit modeling framework that maps judges and candidates to points where geometric quantities are defined to correspond to ranking concepts. This is a foundational assumption rather than a derived claim. Theoretical conditions and identifiability results follow directly from this representation by construction of the model, not by reducing to fitted data or prior self-citations. Geometric Bayesian priors are designed to encode epistemic uncertainty via the same simplex geometry and are varied for sensitivity analysis; they are not fitted to the benchmark ranking data used for coverage evaluation. Empirical results on LLM benchmarks serve as independent checks. No load-bearing step equates a prediction to its inputs by definition or self-referential fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM judges and candidates can be represented as points on an (M-1)-dimensional probability simplex such that geometric quantities correspond to ranking concepts.
Forward citations
Cited by 1 Pith paper
- Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.