The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime
Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3
The pith
The statistical limit for estimating an AI model's calibration error is Theta((L epsilon / m)^{1/3}) when the error rate is epsilon, and no estimator can improve on this rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that the minimax rate for estimating calibration error with model error rate epsilon is Theta((L epsilon / m)^{1/3}), and no estimator can beat it. This verification tax implies that as AI models improve, verifying their calibration becomes fundamentally harder, with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at m epsilon approx 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; and (4) verification cost grows exponentially with pipeline depth at rate L^K.
What carries the argument
The minimax rate Theta((L epsilon / m)^{1/3}) for calibration error estimation in the rare-error regime under a Lipschitz calibration function.
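A heuristic way to see where the 1/3 exponent comes from, offered as a sketch and not as the paper's proof: assume the calibration function is estimated by local averaging over a confidence window of width h (a hypothetical symbol not used in the source), that confidences are spread roughly evenly so about m h samples fall in the window, that the Lipschitz condition bounds the bias by L h, and that each correctness label has variance of order epsilon in the rare-error regime. Balancing squared bias against variance then recovers the stated rate.

```latex
\begin{align*}
\text{bias}(h) &\lesssim L h, \\
\text{variance}(h) &\asymp \frac{\epsilon}{m h}, \\
L^2 h^2 \asymp \frac{\epsilon}{m h}
  &\;\Longrightarrow\; h^\ast \asymp \left(\frac{\epsilon}{L^2 m}\right)^{1/3}, \\
\text{error}(h^\ast) \asymp L h^\ast
  &= \left(\frac{L^3 \epsilon}{L^2 m}\right)^{1/3}
   = \left(\frac{L \epsilon}{m}\right)^{1/3}.
\end{align*}
```

This only illustrates why an upper bound of this order is plausible; it says nothing about the matching lower bound, which the paper obtains through a packing argument.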
If this is right
- Self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute.
- A sharp phase transition exists at m epsilon approximately 1 below which miscalibration is undetectable.
- Active querying eliminates the Lipschitz constant, collapsing estimation to detection.
- Verification cost grows exponentially with pipeline depth at rate L to the power K.
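The quantities in the list above can be turned into a small calculator. The following is a minimal sketch, assuming the rate's hidden constant equals 1 (the parameter `c` is a hypothetical stand-in, since the source gives only the order of magnitude) and treating m epsilon = 1 as a hard cutoff. All function names are illustrative and do not come from the paper.

```python
def verification_floor(L: float, eps: float, m: int, c: float = 1.0) -> float:
    """Order-of-magnitude floor c * (L * eps / m) ** (1/3).

    c is an assumed constant; the source states the rate only up to constants.
    """
    if L <= 0 or m <= 0 or not 0 < eps < 1:
        raise ValueError("need L > 0, m > 0 and 0 < eps < 1")
    return c * (L * eps / m) ** (1.0 / 3.0)


def below_detection_threshold(m: int, eps: float, threshold: float = 1.0) -> bool:
    """True when the expected number of observed errors m * eps is under the threshold."""
    return m * eps < threshold


def depth_cost_factor(L: float, K: int) -> float:
    """Multiplicative growth L ** K for a pipeline of depth K."""
    return L ** K


if __name__ == "__main__":
    # Illustrative values only, not figures from the paper.
    print(verification_floor(L=1.0, eps=0.05, m=10_000))
    print(below_detection_threshold(m=100, eps=0.001))
    print(depth_cost_factor(L=2.0, K=3))
```

With these illustrative inputs the floor is roughly 0.017, and multiplying the sample size by eight halves it, which is the practical meaning of the cube-root rate.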
Where Pith is reading between the lines
- Practitioners should prioritize active querying methods for calibration checks on high-accuracy models rather than relying on passive benchmarks.
- The exponential scaling with depth suggests that verification should occur at intermediate stages in complex AI pipelines instead of only at the end.
- Future model releases should include the computed verification floor based on observed error rates to make calibration claims credible.
Load-bearing premise
The calibration function is Lipschitz continuous with some constant L and the model's error rate epsilon is known or fixed in advance.
What would settle it
An experiment that demonstrates an estimator achieving a strictly faster rate than (L epsilon / m)^{1/3} for calibration error, or that detects miscalibration in a regime where m epsilon is much less than 1.
Original abstract
The most cited calibration result in deep learning -- post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) -- is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate epsilon is Theta((L epsilon / m)^{1/3}), and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder -- with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at m epsilon approx 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate L^K. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
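The abstract reports ECE values with 95% bootstrap CIs. As a minimal sketch of what such an interval involves, the code below computes an equal-width binned ECE and a percentile bootstrap interval in plain Python. The binning scheme, the percentile method, and every name here are assumptions for illustration; the source does not say which ECE estimator or bootstrap variant the authors used.

```python
import random
from typing import List, Sequence, Tuple


def binned_ece(conf: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Equal-width binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(conf) != len(correct) or not conf:
        raise ValueError("conf and correct must be non-empty and the same length")
    # Per bin: [count, sum of confidences, sum of correctness labels].
    bins: List[List[float]] = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p is assumed to lie in [0, 1]
        bins[b][0] += 1
        bins[b][1] += p
        bins[b][2] += y
    n = len(conf)
    # (count / n) * |sum_y / count - sum_p / count| simplifies to |sum_y - sum_p| / n.
    return sum(abs(s_y - s_p) / n for count, s_p, s_y in bins if count)


def bootstrap_ci(conf: Sequence[float], correct: Sequence[int], n_boot: int = 1000,
                 alpha: float = 0.05, n_bins: int = 10, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for binned ECE, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(conf)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(binned_ece([conf[i] for i in idx], [correct[i] for i in idx], n_bins))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi
```

Comparing the width of such an interval with a reported ECE gain is one informal way to judge whether the gain sits above benchmark resolution. Binned ECE is biased upward on small samples, so the interval should be read as descriptive and not as a formal test.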
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proves that the minimax rate for estimating calibration error of a model with error rate ε is Θ((Lε/m)^{1/3}), where L is the Lipschitz constant of the calibration function and m is the number of samples. This 'verification tax' implies that verifying calibration becomes harder as models improve. Four theoretical results are established: self-evaluation yields zero information about calibration, a phase transition at mε ≈ 1 below which miscalibration is undetectable, active querying removes the dependence on L, and verification cost grows exponentially with pipeline depth at rate L^K. These are validated empirically on 27 LLM-benchmark pairs (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande) using 95% bootstrap CIs and permutation tests, with self-evaluation non-significant in 80% of cases and 23% of frontier-model comparisons indistinguishable from noise.
Significance. If the minimax characterization holds, the work supplies a rigorous information-theoretic basis for the practical difficulties of auditing calibration in high-accuracy AI systems. It explains why post-hoc ECE values on frontier models often lie below the noise floor and motivates a shift toward active querying. The combination of the phase-transition result and the exponential depth dependence offers concrete guidance for evaluation design, while the large-scale empirical study on 27 model-benchmark pairs demonstrates that many current calibration claims cannot be distinguished from statistical noise.
major comments (2)
- [Section 3] Main theorem (Section 3): the claim that the minimax rate is Θ((Lε/m)^{1/3}) requires both an Ω lower bound and a matching O upper bound. The lower bound is obtained via a packing argument over L-Lipschitz functions with effective sample size scaled by ε, but the manuscript does not exhibit or analyze an estimator (e.g., a suitably bandwidth-tuned local polynomial regressor adapted to the Poisson sampling of errors) whose risk is O((Lε/m)^{1/3}). Without the explicit upper-bound construction the Θ statement is not fully established.
- [Section 5] Empirical section (Section 5, permutation-test paragraph): the claim that self-evaluation is non-significant in 80% of the 27 pairs rests on permutation tests, yet the precise null distribution (under perfect calibration and the rare-error Poisson sampling) is not specified. This detail is load-bearing for interpreting the reported non-significance rate and the 23% indistinguishability figure for frontier models.
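The null-distribution question in the second major comment can be made concrete. One standard way to build a reference distribution under perfect calibration is consistency resampling: redraw each label as a Bernoulli variable whose success probability equals the stated confidence, then recompute the statistic. The sketch below does this for binned ECE. It is a swapped-in technique for illustration, not the manuscript's permutation procedure (which the comment notes is unspecified), and all names are hypothetical.

```python
import random
from typing import Sequence


def binned_ece(conf: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Equal-width binned ECE; confidences are assumed to lie in [0, 1]."""
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    for p, y in zip(conf, correct):
        b = min(int(p * n_bins), n_bins - 1)
        sum_p[b] += p
        sum_y[b] += y
    # Empty bins contribute |0 - 0| = 0, so no special case is needed.
    return sum(abs(sy - sp) for sp, sy in zip(sum_p, sum_y)) / len(conf)


def calibration_p_value(conf: Sequence[float], correct: Sequence[int],
                        n_sim: int = 1000, n_bins: int = 10, seed: int = 0) -> float:
    """Monte Carlo p-value for H0: P(correct | confidence = p) = p.

    Labels are simulated as Bernoulli(p) for each item (consistency resampling),
    an assumed null model that is not taken from the paper.
    """
    rng = random.Random(seed)
    observed = binned_ece(conf, correct, n_bins)
    exceed = 0
    for _ in range(n_sim):
        simulated = [1 if rng.random() < p else 0 for p in conf]
        if binned_ece(conf, simulated, n_bins) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

A large p-value here means the observed ECE is within the range produced by a perfectly calibrated model on the same confidences, which is the sense in which a measurement can be "indistinguishable from noise".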
minor comments (3)
- [Abstract] The four theoretical results are listed in the abstract but never enumerated with section references; adding a short roadmap paragraph would improve navigation.
- [Section 2] Notation for the calibration function f and the error indicator is introduced informally; a formal definition in a preliminaries subsection would eliminate ambiguity when the Lipschitz assumption is invoked.
- [Section 4] Figure 4 (verification cost vs. depth) shows exponential growth but omits confidence bands or sensitivity analysis with respect to L; adding these would strengthen the visual claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects that will improve the rigor of both the theoretical and empirical contributions. We address each major comment below and will incorporate the necessary revisions.
Point-by-point responses
- Referee: [Section 3] Main theorem (Section 3): the claim that the minimax rate is Θ((Lε/m)^{1/3}) requires both an Ω lower bound and a matching O upper bound. The lower bound is obtained via a packing argument over L-Lipschitz functions with effective sample size scaled by ε, but the manuscript does not exhibit or analyze an estimator (e.g., a suitably bandwidth-tuned local polynomial regressor adapted to the Poisson sampling of errors) whose risk is O((Lε/m)^{1/3}). Without the explicit upper-bound construction the Θ statement is not fully established.
Authors: We agree that a complete proof of the minimax rate requires an explicit upper bound matching the lower bound. While the lower bound via the packing argument is novel and central to the verification tax result, the manuscript did not provide a detailed construction for the upper bound. In the revised manuscript, we will add an analysis of a suitably adapted local polynomial regressor (with bandwidth chosen as a function of L, ε, and m) under the Poisson sampling model for rare errors, establishing the O((Lε/m)^{1/3}) risk bound. This will fully establish the Θ characterization. revision: yes
- Referee: [Section 5] Empirical section (Section 5, permutation-test paragraph): the claim that self-evaluation is non-significant in 80% of the 27 pairs rests on permutation tests, yet the precise null distribution (under perfect calibration and the rare-error Poisson sampling) is not specified. This detail is load-bearing for interpreting the reported non-significance rate and the 23% indistinguishability figure for frontier models.
Authors: We thank the referee for pointing this out. The permutation tests assess whether the observed calibration error is consistent with perfect calibration under the rare-error regime. Under the null hypothesis of perfect calibration, the error indicators are generated as independent Poisson random variables with rate ε, independent of the model's confidence scores. In the revision, we will explicitly state this null distribution and describe how the permutations are performed to generate the reference distribution for the test statistic. This will provide the necessary detail for interpreting the 80% non-significance rate and the 23% figure for frontier models. revision: yes
Circularity Check
No circularity: minimax rate is a self-contained information-theoretic derivation
full rationale
The paper derives the claimed Theta((L epsilon/m)^{1/3}) minimax rate via standard nonparametric estimation arguments (Le Cam/Fano lower bounds on a packing of L-Lipschitz calibration functions, with effective sample size scaled by the rare-error probability epsilon). This does not reduce to any fitted parameter, self-citation chain, or ansatz imported from prior work by the same author. The four listed results follow directly from the same rate analysis without re-using the target quantity as an input. Empirical sections rely on external benchmarks and permutation tests rather than internal fits. The derivation is therefore independent and self-contained against external statistical theory.
Axiom & Free-Parameter Ledger
free parameters (1)
- L
axioms (1)
- [standard math] Standard minimax lower-bound techniques from statistical estimation theory apply to calibration error functionals.
Reference graph
Works this paper leans on
- [1] Jaroslaw Blasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727--1740, 2023.
- [2] Lawrence D Brown and Mark G Low. Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics, 24(6):2384--2398, 1996.
- [3] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In International Conference on Machine Learning, 2024.
- [4] Futoshi Futami and Masahiro Fujisawa. An information-theoretic analysis of expected calibration error. In Advances in Neural Information Processing Systems, volume 37, 2024.
- [5] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321--1330. PMLR, 2017.
- [6] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1--36, 2021.
- [7] Lunjia Hu, Kevin Perera, and Sebastian Casalaina-Martin. Testing calibration in nearly-linear time. In Advances in Neural Information Processing Systems, volume 37, 2024. Also accepted at STOC 2025.
- [8] Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. Active testing: Sample-efficient model evaluation. In International Conference on Machine Learning. PMLR, 2021.
- [9] Oleg V Lepski, Arkadi Nemirovski, and Vladimir Spokoiny. On estimation of the L_r norm of a regression function. Probability Theory and Related Fields, 113(2):221--253, 1999.
- [10] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023.
- [11] Michael Nussbaum. Asymptotic equivalence of density estimation and Gaussian white noise. Annals of Statistics, 24(6):2399--2430, 1996.
- [12] Shubhanshu Shekhar and Aaditya Ramdas. Auditing fairness by betting. Advances in Neural Information Processing Systems, 36, 2024.
- [13] Shuo Sun, Jiaqi Song, and Alfred O Hero. On the estimation of expected calibration error. In Advances in Neural Information Processing Systems, volume 36, 2023. Spotlight.
- [14] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.
- [15] Aad W van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
- [16] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023.