Instance-Optimal Estimation with Multiple LLM Judges on a Budget
Pith reviewed 2026-05-25 05:23 UTC · model grok-4.3
The pith
An adaptive algorithm using optimistically biased variance estimates matches the oracle inverse-variance weighted estimator rate for multi-judge LLM score estimation under a fixed budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that EST-IVWE, an adaptive procedure that builds and uses optimistically biased variance estimates, matches the error rate of the oracle inverse-variance weighted estimator up to lower-order terms in the budget. A matching local minimax lower bound, obtained via an Assouad-type in-expectation argument based on local perturbations, establishes that the proposed algorithms are instance-optimal. This bound is sharper than what Fano-type packing arguments can deliver because the latter lose the local variance information that determines the optimal allocation.
What carries the argument
The inverse-variance weighted estimator (IVWE), whose error is minimized by an oracle allocation that depends on the unknown query-judge variances. EST-IVWE extends this to the unknown-variance case by constructing optimistically biased variance estimates that stabilize the empirical allocation without rate loss.
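For readers unfamiliar with inverse-variance weighting, here is a minimal sketch of the textbook estimator for a single prompt-response pair. It assumes the per-judge variances are known, and that judge observations are independent and unbiased; the function name `ivwe` and its argument layout are illustrative and not taken from the paper.

```python
from typing import Sequence, Tuple


def ivwe(
    means: Sequence[float],
    counts: Sequence[int],
    variances: Sequence[float],
) -> Tuple[float, float]:
    """Inverse-variance weighted estimate for ONE prompt-response pair.

    means[j]     : sample mean of the scores returned by judge j
    counts[j]    : number of queries sent to judge j (may be 0)
    variances[j] : known per-query variance of judge j (assumed > 0)

    Returns (estimate, variance_of_estimate).
    """
    # Judge j contributes precision n_j / sigma_j^2; unqueried judges are skipped.
    used = [(m, n / v) for m, n, v in zip(means, counts, variances) if n > 0]
    total_precision = sum(prec for _, prec in used)
    if total_precision == 0:
        raise ValueError("at least one judge must be queried")
    estimate = sum(m * prec for m, prec in used) / total_precision
    # The variance of the weighted mean is the reciprocal of the total precision.
    return estimate, 1.0 / total_precision


# Two equally noisy judges, four queries to the first and one to the second.
estimate, variance = ivwe([0.6, 0.8], [4, 1], [1.0, 1.0])
```

With equal variances the weights reduce to the query counts, so the estimate is the pooled mean of all five scores.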
If this is right
- EST-IVWE attains the oracle IVWE error rate up to lower-order budget terms even when variances are unknown.
- A local minimax lower bound shows the achieved rate is instance-optimal for each fixed variance configuration.
- The Assouad-type argument based on local perturbations yields an allocation-dependent lower bound that Fano-type arguments cannot recover.
- Numerical comparisons on synthetic data and HelpSteer2 confirm lower error than uniform allocation under the same budget.
Where Pith is reading between the lines
- The same optimistic-bias stabilization technique may extend to other budgeted allocation problems where measurement costs and noise levels are heterogeneous and initially unknown.
- Local-perturbation lower-bound constructions could be applied to other estimation settings where global packing arguments erase the structure that governs optimal resource use.
- The instance-optimality result implies that uniform or non-adaptive allocations are provably suboptimal on instances with strong variance heterogeneity.
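The gap between uniform and variance-aware allocation can be illustrated with a small deterministic calculation. The sketch below uses the oracle constant quoted elsewhere on this page, and assumes that j*(k) denotes the judge minimizing c_j σ²_{k,j} for pair k. The error proxy (each pair's variance is a_k / b_k, with integer rounding and moment constants ignored) and all function names are simplifications introduced here, not the paper's definitions.

```python
from typing import List, Sequence


def cheapest_noise(costs: Sequence[float], variances_k: Sequence[float]) -> float:
    """a_k = min_j c_j * sigma_{k,j}^2 (assumed meaning of j*(k))."""
    return min(c * v for c, v in zip(costs, variances_k))


def oracle_constant(costs, variances, p: float) -> float:
    """A*_p = (sum_k a_k^{p/(p+2)})^{(p+2)/p}."""
    a = [cheapest_noise(costs, row) for row in variances]
    return sum(x ** (p / (p + 2)) for x in a) ** ((p + 2) / p)


def oracle_split(costs, variances, p: float, budget: float) -> List[float]:
    """Budget per pair proportional to a_k^{p/(p+2)} (Lagrangian solution of the proxy)."""
    weights = [cheapest_noise(costs, row) ** (p / (p + 2)) for row in variances]
    total = sum(weights)
    return [budget * w / total for w in weights]


def error_proxy(costs, variances, budgets, p: float) -> float:
    """(sum_k (a_k / b_k)^{p/2})^{2/p}: a squared l_p-type error proxy."""
    a = [cheapest_noise(costs, row) for row in variances]
    return sum((x / b) ** (p / 2) for x, b in zip(a, budgets)) ** (2 / p)


# Hypothetical instance: 2 judges (costs 1 and 4), 2 pairs with very different noise.
costs = [1.0, 4.0]
variances = [[1.0, 0.1], [4.0, 2.0]]
budget = 100.0
p = 2.0

oracle_err = error_proxy(costs, variances, oracle_split(costs, variances, p, budget), p)
uniform_err = error_proxy(costs, variances, [budget / 2, budget / 2], p)
```

On this instance the oracle split spends more of the budget on the noisier pair, and its proxy error equals `oracle_constant(...) / budget`, which is below the uniform-split value.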
Load-bearing premise
The adaptive algorithm can construct and leverage optimistically biased variance estimates to stabilize the empirical allocation without degrading the final estimator's rate.
What would settle it
On synthetic instances where the true variances are known, if the squared error of EST-IVWE exceeds the oracle IVWE error by more than lower-order terms in the budget for large enough budgets, the rate-matching claim would be falsified.
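A falsification test of this kind needs a harness that measures an estimator's empirical squared error against the oracle variance. The following is a rough sketch of such a harness for a single pair. It assumes Gaussian judge noise (which ignores the bounded-score assumption) and fixed query counts. EST-IVWE is not specified on this page, so the harness is exercised on the oracle IVWE itself; any other estimator with the same call signature could be plugged in. All names are hypothetical.

```python
import math
import random
from typing import Callable, Sequence


def oracle_ivwe(means: Sequence[float], counts: Sequence[int],
                variances: Sequence[float]) -> float:
    """IVWE with known variances; the weight of judge j is n_j / sigma_j^2."""
    precisions = [n / v for n, v in zip(counts, variances)]
    return sum(w * m for w, m in zip(precisions, means)) / sum(precisions)


def oracle_variance(counts: Sequence[int], variances: Sequence[float]) -> float:
    """Theoretical variance of the oracle IVWE: 1 / sum_j (n_j / sigma_j^2)."""
    return 1.0 / sum(n / v for n, v in zip(counts, variances))


def empirical_mse(estimator: Callable[..., float], true_score: float,
                  counts: Sequence[int], variances: Sequence[float],
                  trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo mean squared error of `estimator` on one synthetic pair."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The mean of n_j Gaussian scores is drawn directly as N(mu, sigma_j^2 / n_j).
        means = [rng.gauss(true_score, math.sqrt(v / n))
                 for n, v in zip(counts, variances)]
        err = estimator(means, counts, variances) - true_score
        total += err * err
    return total / trials


counts = [5, 3]
variances = [0.04, 0.01]
ratio = empirical_mse(oracle_ivwe, 0.5, counts, variances) / oracle_variance(counts, variances)
```

For the oracle estimator the ratio should sit close to 1; a candidate adaptive estimator whose ratio stays bounded away from 1 as the budget grows would be evidence against the rate-matching claim.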
Original abstract
Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes budgeted heteroskedastic multi-judge estimation for LLM evaluations with heterogeneous judge costs and instance difficulties. It analyzes the inverse-variance weighted estimator (IVWE), derives its oracle allocation minimizing ℓ_p error under a fixed budget, and proposes the adaptive EST-IVWE algorithm that uses optimistically biased variance estimates to handle unknown variances while matching the oracle rate up to lower-order terms. It establishes instance-optimality via a matching local minimax lower bound derived from an Assouad-type in-expectation argument with local perturbations (avoiding coarse Fano packings), and validates the approach empirically against uniform allocation on synthetic data and the HelpSteer2 dataset.
Significance. If the central claims hold, the work delivers a practically relevant, instance-optimal framework for cost-efficient LLM-as-a-judge scoring that adapts to per-instance and per-judge variance heterogeneity. The local minimax lower bound that preserves the variance structure governing optimal allocation, together with the explicit adaptive procedure for unknown variances, constitutes a technical contribution beyond standard inverse-variance weighting. The empirical results on HelpSteer2 further support applicability.
major comments (2)
- [EST-IVWE algorithm and its analysis] The claim that EST-IVWE matches the oracle IVWE rate up to lower-order terms (abstract) rests on the optimistic bias construction stabilizing allocation without degrading the leading 1/sqrt(B) constant. The bias must be strong enough to avoid unstable allocations on high-variance instances yet weak enough that the resulting estimator retains the exact oracle leading term; an explicit bias definition and concentration argument showing the bias term is o(1/sqrt(B)) are required to confirm this.
- [Local minimax lower bound section] The Assouad-type local-perturbation argument is presented as yielding the sharp allocation-dependent lower bound. The specific local perturbation construction and the in-expectation calculation that retains the per-instance variance structure (rather than averaging it away) should be verified to ensure the lower bound exactly matches the oracle upper bound's leading constant.
minor comments (2)
- Clarify the precise meaning of 'lower-order terms in the budget' (e.g., whether o(1/sqrt(B)) or O(log B / sqrt(B)) is intended) and state the dependence on K, J, and p explicitly.
- The synthetic data generation process and the precise definition of the ℓ_p error metric used in the experiments should be described in more detail to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight opportunities to strengthen the explicitness of our technical arguments, and we address each point below. We are prepared to revise the manuscript to incorporate additional details where needed.
- Referee: [EST-IVWE algorithm and its analysis] The claim that EST-IVWE matches the oracle IVWE rate up to lower-order terms (abstract) rests on the optimistic bias construction stabilizing allocation without degrading the leading 1/sqrt(B) constant. The bias must be strong enough to avoid unstable allocations on high-variance instances yet weak enough that the resulting estimator retains the exact oracle leading term; an explicit bias definition and concentration argument showing the bias term is o(1/sqrt(B)) are required to confirm this.
  Authors: We agree that the optimistic bias construction merits a more explicit treatment to confirm the leading constant is preserved. In the revision we will add a dedicated subsection (or lemma) in Section 4 that (i) states the precise bias term (a multiple of the estimated standard deviation scaled by a slowly growing function of the number of samples per instance), (ii) proves that the resulting allocation deviates from the oracle allocation by an o(1/sqrt(B)) term in total variation with high probability, and (iii) shows via a direct calculation that this deviation contributes only lower-order terms to the final ℓ_p error. This will make the matching claim fully rigorous.
  Revision: yes
- Referee: [Local minimax lower bound section] The Assouad-type local-perturbation argument is presented as yielding the sharp allocation-dependent lower bound. The specific local perturbation construction and the in-expectation calculation that retains the per-instance variance structure (rather than averaging it away) should be verified to ensure the lower bound exactly matches the oracle upper bound's leading constant.
  Authors: The local perturbation is constructed by adding an independent Rademacher perturbation of size Θ(1/σ_{k,j}) to each instance-judge mean, with the scale chosen small enough to remain inside the bounded score interval. The in-expectation lower bound is obtained by linearity of expectation over the independent sign flips; because each coordinate's contribution appears separately in the total risk and the variance of the estimator for that coordinate is exactly the reciprocal of the total weight allocated to it, the per-instance variance structure is retained and the resulting lower bound matches the leading 1/sqrt(B) term of the oracle upper bound. We will insert a short clarifying paragraph after the main proof in Section 5 that spells out this coordinate-wise calculation.
  Revision: partial
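The statement that a coordinate's variance is the reciprocal of the total weight allocated to it can be written out as a short worked step. This is a sketch in notation that is not taken from the paper: $n_{k,j}$ is the number of queries of judge $j$ on pair $k$, $\bar X_{k,j}$ the corresponding sample mean, $c_j$ the per-query cost, and $b_k$ the budget spent on pair $k$. Observations are assumed independent and unbiased.

```latex
\hat\mu_k = \frac{\sum_{j} (n_{k,j}/\sigma_{k,j}^2)\,\bar X_{k,j}}{\sum_{j} n_{k,j}/\sigma_{k,j}^2},
\qquad
\operatorname{Var}(\hat\mu_k) = \Bigl(\sum_{j} \frac{n_{k,j}}{\sigma_{k,j}^2}\Bigr)^{-1}.

% Budget constraint on pair k: \sum_j c_j n_{k,j} \le b_k. Then
\sum_{j} \frac{n_{k,j}}{\sigma_{k,j}^2}
  = \sum_{j} \frac{c_j\, n_{k,j}}{c_j\, \sigma_{k,j}^2}
  \le \frac{b_k}{\min_j c_j \sigma_{k,j}^2}
\quad\Longrightarrow\quad
\operatorname{Var}(\hat\mu_k) \ge \frac{\min_j c_j \sigma_{k,j}^2}{b_k}.
```

Under these assumptions each pair enters the total risk through its own term $\min_j c_j \sigma_{k,j}^2 / b_k$, so the per-instance variances are not averaged away.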
Circularity Check
No significant circularity; derivations rely on independent technical arguments
The paper analyzes the inverse-variance weighted estimator to derive an oracle allocation, then proposes EST-IVWE using optimistically biased variance estimates to match the oracle rate up to lower-order terms, with a matching local minimax lower bound obtained via a new Assouad-type in-expectation argument based on local perturbations. No load-bearing step reduces by construction to its inputs, fitted parameters renamed as predictions, or self-citation chains; the central technical insight (preserving local variance structure in the lower bound) is presented as novel and independent of the algorithm definition. This matches the expectation that most papers are non-circular when the proof techniques are self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The target score vector is bounded.
- domain assumption: Judges have known per-query costs but unknown query-judge variances.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms... matching local minimax lower bound... Assouad-type in-expectation argument"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "optimal allocation... $A^*_p(\sigma, c) = \left(\sum_k \left(c_{j^*(k)}\, \sigma^2_{k,j^*(k)}\right)^{p/(p+2)}\right)^{(p+2)/p}$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.