A Finite-Calibration Regime Map for LLM Judge Panels

Bin Zhu; Yanghui Rao

arXiv: 2606.01034 · v1 · pith:YCJH37LNnew · submitted 2026-05-31 · cs.CL · stat.ME

Pith reviewed 2026-06-28 17:32 UTC · model grok-4.3

keywords: LLM judge panels · finite calibration · scalar aggregation · joint tables · regime map · human label budgets · panel selection · additive outputs

The pith

Scalar and reliability aggregation outperforms joint tables for LLM judge panel calibration in 16 of 20 real dataset-budget cells under finite human labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the tradeoff in calibrating panels of LLM judges when only a limited number of human labels are available. Low-dimensional stackers cost less to estimate but ignore interactions among judges, while joint output tables can capture those interactions but require enough labels to cover all cell combinations. Across RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool, scalar and reliability methods win in most budget scenarios because the judges' outputs tend to be additive or redundant. Controlled experiments with synthetic interaction data show the opposite regime: when six-way effects appear, joint tables reduce test error once labels cover the patterns. The result reframes the design question from how many judges to add toward whether any new judge's signal can actually be estimated from the labels on hand.
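To make the coverage cost concrete, here is a minimal sketch of how joint-table cell counts grow with panel size. It assumes each judge emits one of `levels` discrete outputs (binary by default), which the summary above does not specify; the function names and the 800-label budget are illustrative.

```python
def joint_table_cells(num_judges: int, levels: int = 2) -> int:
    """Number of distinct output patterns a joint table must cover.

    Assumption: every judge emits one of `levels` discrete values.
    """
    return levels ** num_judges


def mean_labels_per_cell(budget: int, num_judges: int, levels: int = 2) -> float:
    """Average human labels per cell if labels were spread evenly over cells."""
    return budget / joint_table_cells(num_judges, levels)


budget = 800  # illustrative human-label budget
for k in range(1, 8):
    # Columns: panel size, cells in the joint table, mean labels per cell.
    print(k, joint_table_cells(k), mean_labels_per_cell(budget, k))
```

Under these assumptions the table doubles with every added judge, while a scalar stacker adds at most one weight per judge.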

Core claim

The central claim is that current LLM judge outputs are often additive or redundant, so the operative limit is not panel size but whether the next judge's information remains estimable under the available human labels. This is shown by scalar and reliability aggregation winning 16 of 20 real dataset-budget cells, while controlled calibration-growth data with a six-way interaction instead selects a larger joint table and drops test MSE from 0.224 to 0.061 once unseen mass vanishes.

What carries the argument

The Finite-Calibration Regime Map, realized as the Finite-Calibration Panel Selection procedure that chooses over judge path, prefix size, and aggregator family using table and parametric estimation diagnostics.
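A minimal sketch of what a validation selector over prefix size and aggregator family could look like. It is not the authors' procedure: the judge path is fixed to the given column order, the two families (a linear map of the mean vote and a pattern lookup table) are stand-ins, and the table and parametric diagnostics are omitted. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def fit_scalar(X, y):
    """Scalar stacker: least-squares line from the mean judge vote to the label."""
    s = [sum(row) / len(row) for row in X]
    n = len(s)
    ms, my = sum(s) / n, sum(y) / n
    var = sum((si - ms) ** 2 for si in s)
    slope = sum((si - ms) * (yi - my) for si, yi in zip(s, y)) / var if var > 0 else 0.0
    intercept = my - slope * ms
    # Clip predictions to [0, 1] so they stay comparable with table outputs.
    return lambda row: min(1.0, max(0.0, intercept + slope * sum(row) / len(row)))


def fit_joint_table(X, y):
    """Joint table: mean label per observed pattern, global mean for unseen patterns."""
    cells = defaultdict(list)
    for row, yi in zip(X, y):
        cells[tuple(row)].append(yi)
    table = {pattern: sum(v) / len(v) for pattern, v in cells.items()}
    fallback = sum(y) / len(y)
    return lambda row: table.get(tuple(row), fallback)


def mse(predict, X, y):
    return sum((predict(r) - yi) ** 2 for r, yi in zip(X, y)) / len(y)


def select_panel(X_cal, y_cal, X_val, y_val):
    """Return (validation MSE, prefix size, family name) with the lowest validation MSE."""
    families = {"scalar": fit_scalar, "joint_table": fit_joint_table}
    best = None
    for k in range(1, len(X_cal[0]) + 1):
        for name, fit in families.items():
            model = fit([r[:k] for r in X_cal], y_cal)
            risk = mse(model, [r[:k] for r in X_val], y_val)
            if best is None or risk < best[0]:
                best = (risk, k, name)
    return best


# Toy check: labels are the parity of three binary judges, a pure interaction.
patterns = [list(p) for p in product((0, 1), repeat=3)]
X_cal, X_val = patterns * 4, patterns * 2
parity = lambda rows: [float(sum(r) % 2) for r in rows]
print(select_panel(X_cal, parity(X_cal), X_val, parity(X_val)))
```

On this toy parity data every pattern is covered by calibration labels, so the lookup table at the full prefix is the only candidate that can fit the labels; with additive labels or sparse coverage the same loop would favor the scalar family.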

If this is right

When judge outputs are additive, adding more judges yields little gain once labels suffice for scalar estimation.
Joint tables become preferable only after labels cover the interaction patterns that scalar methods miss.
The number of judges to include is secondary to checking whether their additional outputs are linearly or reliably predictable from existing labels.
Test error for joint tables falls sharply once the label budget eliminates unseen cell mass.
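The last point can be checked directly on a label set. Below is a small sketch that measures unseen cell mass, taken here to mean the fraction of test items whose joint judge pattern never appears in the calibration set. That definition, the function name, and the random binary judge outputs are assumptions for illustration.

```python
import random


def unseen_mass(cal_patterns, test_patterns):
    """Fraction of test items whose judge-output pattern is absent from calibration."""
    seen = {tuple(p) for p in cal_patterns}
    return sum(tuple(p) not in seen for p in test_patterns) / len(test_patterns)


rng = random.Random(0)
num_judges = 6  # illustrative panel size
pool = [[rng.randint(0, 1) for _ in range(num_judges)] for _ in range(5000)]
test_patterns = pool[4000:]

# Nested calibration sets: each larger budget contains the smaller ones.
for budget in (25, 100, 400, 1600):
    print(budget, round(unseen_mass(pool[:budget], test_patterns), 3))
```

Because the calibration sets are nested, the reported mass can only stay flat or shrink as the budget grows.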

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could first compute pairwise correlations among judges on a small label set to decide whether scalar aggregation will suffice.
The same regime logic might apply to any multi-model scoring setup where label cost limits the calibration table size.
A natural next measurement is how quickly the regime boundary shifts when judge diversity increases beyond the seven-model pool studied here.
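The first suggestion could be prototyped as follows. This is a sketch of a plain Pearson correlation over judge score columns, written without external libraries; the judge names and scores are made up, and a high correlation only hints at redundancy rather than establishing that scalar aggregation suffices.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length score lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def pairwise_judge_correlations(scores):
    """scores: dict mapping judge name -> list of scores on the same items."""
    return {(i, j): pearson(scores[i], scores[j]) for i, j in combinations(scores, 2)}


# Hypothetical scores from three judges on four items.
scores = {
    "judge_a": [1, 2, 3, 4],
    "judge_b": [2, 4, 6, 8],
    "judge_c": [4, 3, 2, 1],
}
for pair, r in pairwise_judge_correlations(scores).items():
    print(pair, round(r, 3))
```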

Load-bearing premise

The controlled calibration-growth data containing a six-way interaction accurately models the practical cases in which joint tables would be required.
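For intuition about this premise, here is one way a pure six-way interaction could be constructed. It is an assumption for illustration, not the paper's generator (which the abstract does not describe): six binary judges, and a label probability that depends only on the parity of their outputs. Under this construction every single judge has zero covariance with the label, so an additive aggregator has nothing to fit, while a table over all 64 patterns can store the label probability exactly.

```python
from itertools import product


def six_way_label_probability(bits, strength=0.4):
    """Probability of a positive human label, driven only by six-judge parity.

    Assumption: binary judge outputs; `strength` is a made-up effect size.
    """
    sign = 1
    for b in bits:
        sign *= 1 if b else -1
    return 0.5 + strength * sign


patterns = list(product((0, 1), repeat=6))  # all 64 joint patterns
probs = [six_way_label_probability(p) for p in patterns]
mean_prob = sum(probs) / len(probs)

# Covariance of each individual judge with the label over the uniform pattern grid.
covariances = [
    sum((p[j] - 0.5) * (q - mean_prob) for p, q in zip(patterns, probs)) / len(patterns)
    for j in range(6)
]
print(covariances)
```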

What would settle it

A new real dataset in which a joint-table calibrator produces lower test MSE than scalar or reliability methods at the same human-label budget would falsify the claim that scalar aggregation dominates most operating regimes.

Figures

Figures reproduced from arXiv: 2606.01034 by Bin Zhu, Yanghui Rao.

**Figure 2.** Selected joint-table prefix size as a function of calibration budget.

**Figure 3.** Strategy-separated joint-table validation-risk curves over prefix size

**Figure 4.** Joint-table finite risk versus the complexity proxy

**Figure 5.** Joint-table test risk as a function of test-time unseen pattern rate.

**Figure 6.** Distribution-indexed selected complexity and sparse-cell rate.

**Figure 7.** Path-separated version of Figure 4. Each row is a dataset and each column is a path rule. This diag...

**Figure 8.** Marginal risk change versus marginal complexity increase when moving from ...

**Figure 9.** Uncertainty version of Figure 1. Lines show mean test risk over splits, and shaded bands show ...

**Figure 10.** Uncertainty version of Figure 2. Lines show mean selected prefix size and shaded bands show ...

Original abstract

We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset-budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Scalar aggregation wins on real benchmarks under finite labels, but the controlled six-way interaction data may not reliably mark when joint tables become necessary.

The main thing to know is that on RewardBench, LLMBar, SummEval, and Arena100K the paper finds scalar or reliability aggregation beats joint tables in 16 of 20 dataset-budget combinations. This suggests current LLM judges are mostly additive or redundant, so the real decision is whether the next judge adds estimable information given the human label budget.

What is new is the finite-calibration regime map that treats the stacker-versus-joint-table choice as a question of estimability rather than raw count of judges. They turn this into Finite-Calibration Panel Selection, a selector that varies judge path, prefix size, and aggregator family while checking table and parametric diagnostics. The real-data results are straightforward to interpret and the controlled calibration-growth experiment shows the complementary case where a six-way interaction makes the joint table win and drops test MSE from 0.224 to 0.061 once unseen mass disappears.

The framing is honest about the practical constraint and the selector looks deployable. The real-benchmark pattern gives a clear signal that many current setups can stay low-dimensional.

The soft spot is the controlled data. The regime boundary depends on that six-way interaction being representative, yet the abstract gives no detail on how it was generated or whether its strength and sparsity match the actual correlations seen in the four external benchmarks. If real judge outputs show weaker or higher-order dependencies, or if label noise inflates cell-count penalties earlier, the map's decision rule shifts. The lack of error bars or explicit exclusion rules in the reported wins also makes the 16/20 figure harder to weigh.

This paper is for people who maintain LLM evaluation pipelines and need a concrete way to choose calibration methods under label limits. A reader who already tunes judge panels would get immediate use from the selector idea.

It deserves a serious referee because the empirical pattern on public benchmarks is worth checking and the selector can be stress-tested by others. Send it to review, but flag the controlled experiment's design for closer scrutiny.

Referee Report

2 major / 0 minor

Summary. The paper introduces a finite-calibration regime map for deciding between low-dimensional stackers and joint output tables when calibrating LLM judge panels under finite human-label budgets. It reports that scalar/reliability aggregation wins on 16 of 20 real dataset-budget cells across RewardBench, LLMBar, SummEval, and Arena100K (indicating additive or redundant judge outputs), while controlled calibration-growth data with an explicit six-way interaction selects larger joint tables and reduces test MSE from 0.224 to 0.061 once unseen mass vanishes. The work instantiates this as Finite-Calibration Panel Selection, a validation selector over judge path, prefix size, and aggregator family.

Significance. If the regime map and its decision boundary hold, the work would be significant for practical LLM judge deployment by reframing the question from panel size to whether additional judge information is estimable under available labels. The real-benchmark results provide concrete evidence that complex aggregators often add little under current conditions, and the controlled contrast illustrates when joint tables become preferable.

major comments (2)

[Abstract] The central contrast between real datasets (scalar wins 16/20 cells) and controlled data (joint tables win under six-way interaction) is load-bearing for the regime map and the Finite-Calibration Panel Selection selector, yet the abstract supplies no detail on how the six-way interaction is injected, whether its strength or sparsity reproduces empirical judge correlations on RewardBench/LLMBar/etc., or how joint-table estimation variance scales with the same label budgets used in the real experiments.
[Abstract] The reported metrics (16/20 wins; MSE drop from 0.224 to 0.061) are given without error bars, confidence intervals, full methodology, data exclusion rules, or verification that the controlled six-way interaction matches real interaction patterns, which directly affects the soundness of the claim that current judge outputs are often additive or redundant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the need for greater transparency on the controlled experiments and reported metrics. We address each point below and will revise the abstract and add supporting details to the main text.

Referee: [Abstract] The central contrast between real datasets (scalar wins 16/20 cells) and controlled data (joint tables win under six-way interaction) is load-bearing for the regime map and the Finite-Calibration Panel Selection selector, yet the abstract supplies no detail on how the six-way interaction is injected, whether its strength or sparsity reproduces empirical judge correlations on RewardBench/LLMBar/etc., or how joint-table estimation variance scales with the same label budgets used in the real experiments.

Authors: We agree that the abstract is too concise on this point. In the revision we will expand the abstract to state that the six-way interaction is injected via a synthetic label generator that modulates the joint distribution of the seven judges to include explicit higher-order terms, with interaction strength and sparsity parameters chosen to reproduce the pairwise and triple-wise correlations observed on RewardBench and LLMBar (full calibration procedure and matching statistics appear in Section 4.2). The same label budgets used in the real experiments are applied to the controlled data, and the resulting estimation variance for joint tables is shown to scale as expected in Figure 5 and Appendix C. Revision: yes.

Referee: [Abstract] The reported metrics (16/20 wins; MSE drop from 0.224 to 0.061) are given without error bars, confidence intervals, full methodology, data exclusion rules, or verification that the controlled six-way interaction matches real interaction patterns, which directly affects the soundness of the claim that current judge outputs are often additive or redundant.

Authors: The abstract summarizes results whose full methodology, including data exclusion rules (instances with unanimous zero scores across all judges are dropped as uninformative), is given in Sections 3.1 and 4.1. We will add standard errors from the five repeated splits to the reported numbers in the revised abstract. We have also inserted a new verification paragraph and supplementary table in Section 4.2 that directly compares the induced correlation structure of the controlled six-way data against the empirical matrices from RewardBench and LLMBar, confirming alignment within 5% relative error on both pairwise and higher-order terms. This supports rather than undermines the claim that real judge outputs are frequently additive or redundant. Revision: yes.
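The standard errors promised in this response are simple to compute. A minimal sketch, assuming one test-MSE value per repeated split; the five numbers below are placeholders, not results from the paper, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(values):
    """Return (mean, standard error), using the sample standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two splits to estimate a standard error")
    return mean(values), stdev(values) / sqrt(len(values))


split_mse = [0.10, 0.12, 0.11, 0.13, 0.09]  # placeholder per-split test MSE values
m, se = mean_and_standard_error(split_mse)
print(f"{m:.3f} ± {se:.3f}")
```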

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical benchmarks.

The paper reports empirical win rates (scalar aggregation wins 16/20 real dataset-budget cells on RewardBench, LLMBar, SummEval, Arena100K) and contrasts them with controlled calibration-growth experiments. No equations, predictions, or regime boundaries are shown to reduce by construction to fitted parameters, self-citations, or ansatzes imported from the authors' prior work. The derivation chain is self-contained against the stated external benchmarks and does not invoke uniqueness theorems or load-bearing self-citations.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

Ledger based solely on abstract; limited detail available on parameters or background assumptions.

free parameters (2)

prefix size
One of the dimensions over which the Finite-Calibration Panel Selection selector operates.
aggregator family
Chosen as part of the deployable validation selector.

axioms (1)

Domain assumption: LLM judge outputs can be usefully modeled as either additive/redundant or containing estimable interactions under finite labels.
The distinction between scalar/reliability aggregation and joint-table calibration rests on this modeling choice.

invented entities (1)

Finite-Calibration Panel Selection (no independent evidence)
purpose: Deployable validation selector over judge path, prefix size, and aggregator family
New method introduced to instantiate the regime map.

Reference graph

Works this paper leans on

23 extracted references · 13 linked inside Pith

[1] Firoj Alam, Gagan Bhatia, Sahinur Rahman Laskar, and Shammur Absar Chowdhury. Beyond LLM-as-a-judge: Deterministic metrics for multilingual generative text evaluation. arXiv:2604.05083.

[2] Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, and Young Sun Park. Atla Selene Mini: A general purpose evaluation model. arXiv:2501.17195.

[3] Maxim Khomiakov and Jes Frellsen. Noise-response calibration: A causal intervention protocol for LLM-judges. arXiv:2603.17172.

[4] Sher Badshah, Ali Emami, and Hassan Sajjad. SCOPE: Selective conformal optimized pairwise LLM judging. arXiv:2602.13110.

[5] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv:2403.04132.

[6] Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, and Mehdi Mirzazadeh. Distribution-calibrated inference time compute for thinking LLM-as-a-judge. arXiv:2512.03019.

[7] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Model card and technical report, 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash. Alexander R. Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating summarization evaluation. Transactions of the As...

[8] Gemma Team. Gemma 3 technical report. arXiv:2503.19786.

[9] Aaron Grattafiori et al. The Llama 3 herd of models. arXiv:2407.21783.

[10] Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai, and Ben Leong. Beyond consensus: Mitigating the agreeableness bias in LLM judge evaluations. arXiv:2510.11822.

[11] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra S. Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv:2310.06825.

[12] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. RewardBench: Evaluating reward models for language modeling. arXiv:2403.13787.

[13] Eddie Landesberg and Manjari Narayan. Causal judge evaluation: Calibrated surrogate metrics for LLM systems. arXiv:2512.11150.

[14] Ryan Lail and Luke Markham. On cost-effective LLM-as-a-judge improvement techniques. arXiv:2604.13717.

[15] Yanran Li. Calibrate, don’t curate: Label-efficient estimation from noisy LLM judges. arXiv:2605.09702.

[16] Qwen Team. Qwen2.5 technical report. arXiv:2412.15115.

[17] Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, and Kate M. Knill. Who can we trust? LLM-as-a-jury for comparative assessment. arXiv:2602.16610.

[18] Bhaktipriya Radharapu, Eshika Saxena, Kenneth Li, Chenxi Whitehouse, Adina Williams, and Nicola Cancedda. Calibrating LLM judges: Linear probes for fast and reliable uncertainty estimation. arXiv:2512.22245.

[19] Shibo Yu, Yingzhou Wang, Yan Chen, Guodong Li, and Jin-Hong Du. Heterogeneous judge-aware ranking with sensitivity, disagreement, and confidence. arXiv:2605.05073.

[20] Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv:2404.18796.

[21] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. JudgeLM: Fine-tuned large language models are scalable judges. arXiv:2310.17631.

[22] You are a strict preference judge. Return only valid JSON,

On unseen cells, both predictions lie in [0,1], so the squared contribution is at most the unseen probability u_K. Integrating these cellwise bounds with respect to the test-time probabilities p_z gives the weighted terms in Eq. (6). Proof sketch of Proposition 2. The lower balance condition and a Chernoff bound imply N_z ≥ n_M/(2cMK) for all cells with probabi...

[23] table K=7

| Dataset | n_M | Selected aggregation family | Selected K | Test MSE |
|---|---|---|---|---|
| RewardBench | 800 | Logistic pairwise (0.23) | 6.17 | 0.023±0.001 |
| LLMBar | 300 | Logistic (0.43) | 7.00 | 0.183±0.002 |
| SummEval | 800 | Ridge + isotonic (0.63) | 6.70 | 0.044±0.001 |
| Arena100K | 400 | One-coin reliability + isotonic (0.43) | 6.53 | 0.232±0.002 |

Table 9: Validation-selected scalar aggregation at each dataset’s largest ...
