LLMs Judging LLMs: A Simplex Perspective

Adarsh Subbaswamy; Fan Xia; Jean Feng; Patrick Vossler; Yifan Mai

arXiv: 2505.21972 · v3 · submitted 2025-05-28 · cs.LG · cs.AI · stat.ML

Patrick Vossler, Fan Xia, Yifan Mai, Adarsh Subbaswamy, Jean Feng

Pith reviewed 2026-05-19 12:36 UTC · model grok-4.3

classification: cs.LG · cs.AI · stat.ML

keywords: LLM as judge, simplex geometry, epistemic uncertainty, Bayesian ranking, model evaluation, ranking identifiability, coverage rates

The pith

LLM judges produce robust rankings for many but not all datasets when epistemic uncertainty is modeled with simplex-based Bayesian priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that both LLM judges and the candidates they evaluate can be placed as points on an (M-1)-dimensional probability simplex for any M-level scoring system. Geometric features such as triangle areas then correspond directly to ranking properties, supplying clear conditions for when a ranking can be recovered from judge scores alone. This geometry explains the practical observation that binary scoring is more reliable than finer-grained scales. The authors introduce priors that capture uncertainty about judge quality and demonstrate on benchmarks that the resulting Bayesian rankings achieve higher coverage than standard approaches.
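To make the simplex picture concrete, here is a minimal sketch of how a list of assigned scores could be turned into a simplex point, and how a triangle area could be computed for a 3-level system. The function names, the toy scores, and the choice to report area as a fraction of the full simplex triangle are assumptions made for illustration; the paper's own definitions may differ.

```python
from collections import Counter

def simplex_point(scores, levels):
    """Map observed scores to normalized frequencies over the M levels.

    The result is a length-M vector of nonnegative entries summing to 1,
    i.e. a point on the (M-1)-dimensional probability simplex.
    """
    counts = Counter(scores)
    n = len(scores)
    return [counts[level] / n for level in levels]

def triangle_area_fraction(p, q, r):
    """Area of triangle (p, q, r) as a fraction of the whole 2-simplex (M = 3).

    p, q, r are barycentric 3-vectors. For barycentric coordinates the area
    ratio equals the absolute value of the 3x3 determinant with rows p, q, r.
    """
    det = (p[0] * (q[1] * r[2] - q[2] * r[1])
           - p[1] * (q[0] * r[2] - q[2] * r[0])
           + p[2] * (q[0] * r[1] - q[1] * r[0]))
    return abs(det)

# Hypothetical 3-level scores (0, 1, 2) assigned to one candidate.
toy_scores = [0, 1, 1, 2]
point = simplex_point(toy_scores, levels=[0, 1, 2])
print(point)

# Midpoints of the simplex edges span the medial triangle.
medial = triangle_area_fraction([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5])
print(medial)
```

A degenerate triangle (three collinear or identical points) has area zero under this definition, which is the kind of geometric condition the summary above associates with ranking properties.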

Core claim

By placing LLM judges and candidates as points on an (M-1)-simplex, geometric quantities such as areas become equivalent to ranking concepts; this yields identifiability conditions that are stronger for two-level scoring than for multi-level scoring, together with Bayesian priors on judge quality that produce rankings with substantially higher coverage rates than existing procedures on LLM benchmarks.
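The contrast between two-level and multi-level scoring can be illustrated with a toy calculation. The sketch below assumes candidates are ranked by mean score and that a judge acts through a confusion matrix; both are illustrative assumptions, and the confusion matrices and candidate distributions are invented for the example rather than taken from the paper.

```python
def judged_distribution(true_dist, confusion):
    """Push a candidate's true score distribution through a judge.

    confusion[k][m] is the probability that the judge assigns level m
    when the true level is k (each row sums to 1).
    """
    levels = range(len(true_dist))
    return [sum(true_dist[k] * confusion[k][m] for k in levels) for m in levels]

def mean_score(dist):
    """Expected score when levels are coded 0, 1, ..., M-1."""
    return sum(level * p for level, p in enumerate(dist))

# M = 2: the judged positive rate is pi * TPR + (1 - pi) * FPR, which is
# increasing in the true positive rate pi whenever TPR > FPR.
binary_judge = [[0.7, 0.3],   # true level 0: FPR = 0.3
                [0.2, 0.8]]   # true level 1: TPR = 0.8
weak, strong = [0.6, 0.4], [0.3, 0.7]
print(mean_score(judged_distribution(weak, binary_judge)),
      mean_score(judged_distribution(strong, binary_judge)))

# M = 3: a judge that is exact at the extremes but harsh on the middle level
# can reverse the order of two candidates.
harsh_judge = [[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.0],
               [0.0, 0.0, 1.0]]
cand_a = [0.5, 0.0, 0.5]   # true mean 1.0
cand_b = [0.0, 0.9, 0.1]   # true mean 1.1, so B truly ranks above A
print(mean_score(judged_distribution(cand_a, harsh_judge)),
      mean_score(judged_distribution(cand_b, harsh_judge)))
```

In the binary case a single inequality on the judge (TPR > FPR) is enough to preserve the order, while in the 3-level case the order depends on where each candidate places its mass, which is one way to read the claim that identifiability is harder for M > 2.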

What carries the argument

The (M-1)-dimensional probability simplex, on which both judges and candidates are represented as points, with geometric relations (such as triangle areas) standing in for ranking semantics.

If this is right

Rankings derived from LLM judges alone remain stable across many but not all existing datasets.
Explicitly modeling epistemic uncertainty about judge quality produces coverage rates that exceed those of current methods.
Two-level scoring systems admit clearer identifiability than multi-level systems because their simplex geometry aligns more directly with ranking concepts.
Sensitivity analysis over different priors on judge quality can be performed directly on the simplex.
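A sensitivity analysis of this kind can be sketched for the binary case. The code below is only a stand-in: it uses independent Beta priors on a judge's true-positive and false-positive rates (not the paper's geometric priors) and estimates how often a prior draw yields a judge with TPR > FPR, the condition under which the observed order of two candidates matches their true order. The function name and the prior parameters are hypothetical.

```python
import random

def prob_ranking_preserved(tpr_prior, fpr_prior, draws=20000, seed=0):
    """Monte Carlo estimate of P(TPR > FPR) under independent Beta priors.

    tpr_prior and fpr_prior are (alpha, beta) pairs for Beta distributions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        tpr = rng.betavariate(*tpr_prior)
        fpr = rng.betavariate(*fpr_prior)
        hits += tpr > fpr
    return hits / draws

# Vary the prior from uninformative to confident that the judge is good.
for name, tpr_prior, fpr_prior in [
    ("flat", (1, 1), (1, 1)),
    ("mildly informative", (4, 2), (2, 4)),
    ("confident", (8, 2), (2, 8)),
]:
    print(name, prob_ranking_preserved(tpr_prior, fpr_prior))
```

If the conclusion about which candidate ranks higher changes materially as the prior moves across this range, the ranking is sensitive to assumptions about judge quality rather than pinned down by the data.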

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same simplex construction might be applied to other LLM evaluation tasks such as pairwise preference or direct scoring.
New benchmarks could be designed specifically to test the boundary cases where the simplex geometry predicts poor identifiability.
The visual proofs on the simplex offer a way to communicate uncertainty in judge quality to practitioners who are not Bayesian statisticians.

Load-bearing premise

The mapping that places LLM judges and candidates onto simplex points continues to preserve the original ranking relationships implied by the scoring task.

What would settle it

A re-run of the benchmark experiments in which the proposed Bayesian method fails to produce higher coverage rates than the existing procedures, or in which small changes in assumed judge quality reverse the reported rankings.

Figures

Figures reproduced from arXiv: 2505.21972 by Adarsh Subbaswamy, Fan Xia, Jean Feng, Patrick Vossler, Yifan Mai.

**Figure 1.** LLM judge workflow: For each benchmark question, LLM judges score each candidate’s answer according to a rubric. Candidates are ranked based on their judge-assigned scores. Shaded boxes indicate cases where the same LLM serves as both candidate and judge (self-judging). … key ranking concepts. Epistemic uncertainty manifests as uncertainty in the location of judge points, while aleatoric uncertainty manifest…

**Figure 2.** Figure 2a shows how the judge and candidate in a 2-level scoring system all fall on a line segment and Figure 2b shows how these points in a 3-level scoring system all fall within a triangle. Using this geometric perspective, we can then establish equivalences between geometric concepts to key ranking concepts: 1. True score distributions correspond to barycentric coordinates. $\pi_k$ corresponds precisely to the baryc…

**Figure 3.** Visualization of judge assumptions for the …

**Figure 4.** Non-identifiability in 3-level scoring. Same …

**Figure 5.** Weight propagation framework for encoding …

**Figure 6.** Top: Sensitivity of estimated rankings when …

**Figure 7.** Estimated rankings for the top 10 candidates …

Original abstract

Given the challenge of automatically evaluating free-form outputs from large language models (LLMs), an increasingly common solution is to use LLMs themselves as the judging mechanism, without any gold-standard scores. Implicitly, this practice accounts for only sampling variability (aleatoric uncertainty) and ignores uncertainty about judge quality (epistemic uncertainty). While this is justified if judges are perfectly accurate, it is unclear when such an approach is theoretically valid and practically robust. We study these questions for the task of ranking LLM candidates from a novel geometric perspective: for $M$-level scoring systems, both LLM judges and candidates can be represented as points on an $(M-1)$-dimensional probability simplex, where geometric concepts (e.g., triangle areas) correspond to key ranking concepts. This perspective yields intuitive theoretical conditions and visual proofs for when rankings are identifiable; for instance, we provide a formal basis for the ``folk wisdom'' that LLM judges are more effective for two-level scoring ($M=2$) than multi-level scoring ($M>2$). Leveraging the simplex, we design geometric Bayesian priors that encode epistemic uncertainty about judge quality and vary the priors to conduct sensitivity analyses. Experiments on LLM benchmarks show that rankings based solely on LLM judges are robust in many but not all datasets, underscoring both their widespread success and the need for caution. Our Bayesian method achieves substantially higher coverage rates than existing procedures, highlighting the importance of modeling epistemic uncertainty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The simplex geometry gives a clean formal handle on when LLM judge rankings are identifiable and shows coverage gains from modeling epistemic uncertainty, but the mapping from real score distributions to simplex points is the part that still needs stronger defense.

The paper's main point is that by mapping LLM judges and the things they rank onto a probability simplex, you can use geometry to figure out when the rankings are reliable and to build better uncertainty estimates. This is a fresh angle on the LLM-as-judge problem.

What is new here is the simplex representation itself and the way they pull identifiability conditions out of geometric arguments like areas of triangles. It gives a formal reason for the common observation that binary scoring (M=2) is more robust than using more levels. They also construct priors from the same geometry to handle uncertainty about how accurate the judge is, and they vary those priors to test sensitivity.

The experiments are useful. They find that rankings from LLM judges alone are stable across quite a few benchmarks but break on others, and their Bayesian method gets substantially better coverage than the usual methods. That part shows why ignoring epistemic uncertainty is risky.

The potential issue is whether the simplex mapping actually keeps the original ranking meaning intact. The stress test raises a good point: if the way LLMs assign scores doesn't match the exchangeability or other properties needed for the geometry to line up with real ranking behavior, then the identifiability results and the coverage gains might not generalize as cleanly as claimed. Without seeing the full derivations and the precise prior specifications or data filters, it's hard to tell how sensitive the conclusions are to those details.

This paper is for people who use or study LLM-based evaluation in practice. Anyone selecting models or deploying systems based on automatic rankings would benefit from the caution it provides and from the improved uncertainty handling. It has enough new ideas and empirical support to deserve a full referee process rather than a desk reject.

I would recommend sending it for peer review, but with specific asks for more justification on the semantic preservation of the simplex and for additional robustness checks on the experimental pipeline.

Referee Report

1 major / 1 minor

Summary. The manuscript develops a geometric framework for analyzing LLM-as-judge rankings of LLM candidates by representing both as points on an (M-1)-dimensional probability simplex. Geometric quantities such as triangle areas are used to derive identifiability conditions, formalize the advantage of binary (M=2) over multi-level scoring, and construct Bayesian priors that encode epistemic uncertainty about judge quality. Experiments on LLM benchmarks indicate that pure LLM-judge rankings are robust in many but not all datasets, while the proposed Bayesian method achieves substantially higher coverage rates than existing procedures.

Significance. If the core simplex mapping holds, the work supplies an intuitive geometric lens that explains empirical patterns in LLM evaluation and demonstrates concrete gains from modeling epistemic uncertainty via sensitivity analyses on geometric priors. The visual proofs and coverage improvements constitute clear strengths for a practical problem in automated evaluation.

major comments (1)

The central claim that geometric quantities on the (M-1)-simplex directly correspond to ranking identifiability and uncertainty (introduced in the abstract and developed in the theoretical sections) rests on the assumption that the mapping of LLM judges and candidates to simplex coordinates faithfully encodes the empirical distribution of assigned scores. The manuscript provides no explicit empirical checks for exchangeability of score probabilities or for whether judge-quality uncertainty maps to the stated prior variance; if this correspondence fails, both the identifiability theorems and the reported coverage gains become interpretive rather than predictive. This is load-bearing for the entire geometric argument and the Bayesian construction.

minor comments (1)

Notation for the simplex coordinates and the precise definition of the geometric priors could be stated more explicitly in a single dedicated subsection to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and insightful feedback. The major comment raises a substantive point about the empirical grounding of our geometric and Bayesian constructions, which we address directly below with a proposed revision.

Referee: The central claim that geometric quantities on the (M-1)-simplex directly correspond to ranking identifiability and uncertainty (introduced in the abstract and developed in the theoretical sections) rests on the assumption that the mapping of LLM judges and candidates to simplex coordinates faithfully encodes the empirical distribution of assigned scores. The manuscript provides no explicit empirical checks for exchangeability of score probabilities or for whether judge-quality uncertainty maps to the stated prior variance; if this correspondence fails, both the identifiability theorems and the reported coverage gains become interpretive rather than predictive. This is load-bearing for the entire geometric argument and the Bayesian construction.

Authors: We thank the referee for highlighting this foundational modeling choice. The simplex coordinates are defined directly as the normalized probability vectors of the observed score distributions for each judge and candidate; the mapping therefore encodes the empirical score frequencies by construction rather than as a separate hypothesis. Geometric quantities such as triangle areas are then derived as functions of these probability vectors, yielding identifiability conditions that hold within the resulting probabilistic model. Exchangeability of score probabilities follows from treating assignments as multinomial draws from the simplex point, a standard assumption for categorical scoring that aligns with the data-generating process. The geometric Bayesian priors encode epistemic uncertainty by placing mass over plausible simplex locations for judge quality, with the reported sensitivity analyses varying concentration parameters to probe robustness. While these elements are theoretically consistent, we agree that explicit empirical verification would make the correspondence more transparent. In the revised manuscript we will add a short subsection (new Section 4.4) that reports (i) empirical checks on score-probability stability across repeated judgments within each benchmark and (ii) a comparison of observed judge variability against the prior variances used in the coverage experiments. This addition will strengthen the link between the geometric model and the reported gains without altering the core theorems.

Revision: partial
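The rebuttal's description of multinomial draws from a simplex point, combined with a prior whose concentration is varied, can be sketched with the standard Dirichlet-multinomial update. This is a generic conjugate-prior illustration under an assumed symmetric Dirichlet prior, not the paper's geometric prior on judge quality; the counts and function names are hypothetical.

```python
import random

def dirichlet_posterior(counts, concentration):
    """Posterior Dirichlet parameters for multinomial counts under a
    symmetric Dirichlet(concentration, ..., concentration) prior."""
    return [concentration + c for c in counts]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution, itself a point on the simplex."""
    total = sum(alpha)
    return [a / total for a in alpha]

def sample_dirichlet(alpha, rng):
    """Draw one simplex point by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical counts of 3-level scores assigned by one judge.
counts = [2, 5, 13]
rng = random.Random(0)
for concentration in (1.0, 50.0):
    alpha = dirichlet_posterior(counts, concentration)
    print(concentration, dirichlet_mean(alpha), sample_dirichlet(alpha, rng))
```

A larger concentration pulls the posterior mean toward the center of the simplex, so sweeping it is one simple way to see how strongly conclusions depend on the prior rather than on the observed scores.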

Circularity Check

0 steps flagged

No significant circularity in simplex modeling or Bayesian priors

The paper introduces the (M-1)-simplex representation as an explicit modeling framework that maps judges and candidates to points where geometric quantities are defined to correspond to ranking concepts. This is a foundational assumption rather than a derived claim. Theoretical conditions and identifiability results follow directly from this representation by construction of the model, not by reducing to fitted data or prior self-citations. Geometric Bayesian priors are designed to encode epistemic uncertainty via the same simplex geometry and are varied for sensitivity analysis; they are not fitted to the benchmark ranking data used for coverage evaluation. Empirical results on LLM benchmarks serve as independent checks. No load-bearing step equates a prediction to its inputs by definition or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the simplex embedding faithfully represents ranking semantics and that the chosen geometric priors adequately capture epistemic uncertainty about judge quality. No free parameters are explicitly named in the abstract, but the sensitivity analysis implies that prior hyperparameters are varied by hand.

axioms (1)

Domain assumption: LLM judges and candidates can be represented as points on an (M-1)-dimensional probability simplex such that geometric quantities correspond to ranking concepts.
Stated in the abstract as the novel geometric perspective that yields the theoretical conditions.

pith-pipeline@v0.9.0 · 5796 in / 1343 out tokens · 32147 ms · 2026-05-19T12:36:27.796058+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
cs.AI 2025-10 unverdicted novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (arXiv 2023)

Association for Computational Linguistics. Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational ...

[2] Nodes represent pairs $(m_1, m_2)$ where $m_1$ is the true score and $m_2$ is the assigned score

$\sim \mathrm{Dirichlet}(\vec{\beta}_{(m_1,m_2)})$, $\theta^{(j)}_{m_1',m_2'} = \sum_{(m_1,m_2)\to(m_1',m_2')} \theta^{(j)}_{(m_1,m_2)}\,\alpha_{(m_1,m_2)\to(m_1',m_2')}$. Figure B.10: Transition weight framework for encoding judge quality priors in 3-level scoring. Nodes represent pairs $(m_1, m_2)$ where $m_1$ is the true score and $m_2$ is the assigned score. Edges show allowed transitions, with weights $\alpha$ drawn from Dirichlet priors paramete...

[3] The transition weights α(m1,m2)→(m′1,m′ …

$\sim \mathrm{Dirichlet}(\vec{\beta}_{(m_1,m_2)})$, $\theta^{(j)}_{m_1',m_2'} = \sum_{(m_1,m_2)\to(m_1',m_2')} \theta^{(j)}_{(m_1,m_2)}\,\alpha_{(m_1,m_2)\to(m_1',m_2')}$. Each node $(m_1, m_2)$ in the transition graph represents a confusion matrix entry: the probability of assigning score $m_2$ when the true score is $m_1$. The transition weights α(m1,m2)→(m′1,m′ …

[4] label-flippers (2021)

control how probability mass flows from parent to child nodes. All outgoing weights from any parent node sum to one, ensuring the result remains a valid probability distribution. The final confusion matrix entry $\theta^{(j)}_{m_1',m_2'}$ for judge $j$ is a weighted average of its parent nodes’ values, weighted by the incoming edge weights. For example, in Figure 5, the...

[5] evaluations (2021)

is implemented as an alternative ranking method. PPI uses a small labeled dataset to calibrate predictions from LLM judges on a larger unlabeled dataset, providing statistically valid confidence intervals for candidate rankings. The implementation: (i) randomly partitions questions into labeled and unlabeled sets (using 5% or 10% labeled fractions), (ii) ...

[6] Carefully read the original news article provided below

[7] Read the candidate summaries presented in the <CANDIDATE #i ANSWER> sections

[8] evaluations

Rate each summary on a scale from 1 (very low) to 5 (very high) based on its relevance, consistency, fluency, and coherence. Note that summaries that are very similar on an axis may receive the same score. Definitions: * Relevance: The rating measures how well the summary captures the key points of the article. Summaries in which all and only the importan...

[9] Carefully read the original question to understand what is being asked

[10] Read each candidate answer carefully

[11] Rate each answer according to the criteria below based on general mathematical knowledge and reasoning

[12] evaluations

Provide clear justification for each score with specific references to the candidate’s answer. Rate each answer using the following criteria: ### Accuracy Assessment (1 for correct, 0 for partially correct/borderline, -1 for incorrect) Based on your mathematical knowledge, how accurate is the candidate answer? Strive to categorize answers as either Correc...

[13] Carefully read the original question

[14] Carefully read the ground truth reference answer to understand the correct approach and solution

[15] evaluations

For each candidate answer: - Read the entire response - Evaluate it against the ground truth reference answer - Score it according to the criteria below - Provide clear justification for each score with specific references to both the candidate answer and ground truth Rate each answer using the following criteria relative to the ground truth reference ans...
