arxiv: 2604.12951 · v1 · submitted 2026-04-14 · 💻 cs.LG

The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime

Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords calibration error · minimax estimation · rare error regime · AI verification · phase transition · active querying · ECE · verification tax

The pith

The statistical limit for estimating an AI model's calibration error is Theta((L epsilon / m)^{1/3}) when the error rate is epsilon, and no estimator can improve on this rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a fundamental statistical bound on how accurately one can measure the calibration of an AI model's confidence scores when the model makes few errors. The best achievable accuracy scales as the cube root of L epsilon / m, where L is the Lipschitz constant of the calibration function, epsilon is the model's error rate, and m is the number of samples. A reader should care because many calibration metrics reported for strong models fall below this noise floor, so current evaluation practice may not reliably detect miscalibration. The paper derives four results that contradict standard practice, including that checking calibration on the model's own outputs without external labels yields no information about calibration.
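To make the scaling concrete, here is a minimal Python sketch of the rate expression. The function name `verification_floor` is hypothetical, and the constant hidden by the Theta notation is assumed to be 1, so the output illustrates scaling only and is not the paper's exact floor.

```python
def verification_floor(lipschitz: float, error_rate: float, num_samples: int) -> float:
    """Rate-only floor (L * eps / m) ** (1/3).

    Assumption: the constant hidden by Theta(.) is taken to be 1.
    """
    if lipschitz <= 0 or not 0 < error_rate <= 1 or num_samples <= 0:
        raise ValueError("need L > 0, 0 < eps <= 1, and m > 0")
    return (lipschitz * error_rate / num_samples) ** (1.0 / 3.0)


if __name__ == "__main__":
    # Setting quoted in the Figure 5 caption: m = 10,000, L = 1, eps = 0.05.
    base = verification_floor(1.0, 0.05, 10_000)
    print(f"floor at m = 10,000: {base:.4f}")
    # Cube-root scaling: eight times more samples only halves the floor.
    print(f"floor at m = 80,000: {verification_floor(1.0, 0.05, 80_000):.4f}")
```

Under these assumptions, a reported calibration error smaller than the returned value would sit below the rate-level floor for that sample size.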

Core claim

We prove that the minimax rate for estimating calibration error with model error rate epsilon is Theta((L epsilon / m)^{1/3}), and no estimator can beat it. This verification tax implies that as AI models improve, verifying their calibration becomes fundamentally harder, with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at m epsilon ≈ 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; and (4) verification cost grows exponentially with pipeline depth at rate L^K.

What carries the argument

The minimax rate Theta((L epsilon / m)^{1/3}) for calibration error estimation in the rare-error regime under a Lipschitz calibration function.

If this is right

  • Self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute.
  • A sharp phase transition exists at m epsilon approximately 1 below which miscalibration is undetectable.
  • Active querying eliminates the Lipschitz constant, collapsing estimation to detection.
  • Verification cost grows exponentially with pipeline depth at rate L to the power K.
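The last bullet can be illustrated by inverting the rate for the sample size, m ≈ L epsilon / delta^3, and substituting L^K for the system constant of a homogeneous K-stage pipeline. The sketch below is an illustration under stated assumptions: both function names are hypothetical, all constants are set to 1, and the demo values (L = 2, epsilon = 0.05, delta = epsilon / 2) are chosen for the example, not taken from the paper's experiments.

```python
def required_samples(lipschitz: float, error_rate: float, tolerance: float) -> float:
    """Invert delta = (L * eps / m) ** (1/3) for m (constants assumed to be 1)."""
    if lipschitz <= 0 or error_rate <= 0 or tolerance <= 0:
        raise ValueError("all arguments must be positive")
    return lipschitz * error_rate / tolerance ** 3


def pipeline_required_samples(
    lipschitz: float, depth: int, error_rate: float, tolerance: float
) -> float:
    """Homogeneous K-stage pipeline: the system constant is taken as L ** K."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return required_samples(lipschitz ** depth, error_rate, tolerance)


if __name__ == "__main__":
    eps = 0.05
    delta = eps / 2  # target resolution, as in the Figure 5 caption
    for depth in range(1, 5):
        needed = pipeline_required_samples(2.0, depth, eps, delta)
        print(f"K = {depth}: about {needed:,.0f} samples")
```

Each added stage multiplies the requirement by L, which is the exponential growth in depth that the bullet describes.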

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners should prioritize active querying methods for calibration checks on high-accuracy models rather than relying on passive benchmarks.
  • The exponential scaling with depth suggests that verification should occur at intermediate stages in complex AI pipelines instead of only at the end.
  • Future model releases should include the computed verification floor based on observed error rates to make calibration claims credible.

Load-bearing premise

The calibration function is Lipschitz continuous with some constant L and the model's error rate epsilon is known or fixed in advance.

What would settle it

An experiment that demonstrates an estimator achieving a strictly faster rate than (L epsilon / m)^{1/3} for calibration error, or that detects miscalibration in a regime where m epsilon is much less than 1.
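One way to screen for such an estimator is to measure its error at several sample sizes and fit the slope on a log-log scale. A slope persistently steeper than -1/3 would be worth investigating. The helper below is a hypothetical sketch in plain Python; a finite-sample slope fit can flag a candidate but cannot by itself confirm or refute a minimax claim.

```python
import math


def fit_loglog_slope(sample_sizes, errors):
    """Least-squares slope of log(error) against log(m)."""
    if len(sample_sizes) != len(errors) or len(sample_sizes) < 2:
        raise ValueError("need at least two (m, error) pairs of equal length")
    xs = [math.log(m) for m in sample_sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sample sizes must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic errors that follow the claimed rate exactly (L = 1, eps = 0.05).
    sizes = [1_000, 10_000, 100_000, 1_000_000]
    errs = [(0.05 / m) ** (1.0 / 3.0) for m in sizes]
    print(round(fit_loglog_slope(sizes, errs), 3))  # -0.333
```

In practice the errors would come from repeated runs of the estimator under test, with enough repetitions that the fitted slope is stable.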

Figures

Figures reproduced from arXiv: 2604.12951 by Jason Z Wang.

Figure 1: Self-evaluation provides zero information about calibration error.
Figure 2: Left: Phase transition, detection power vs. m · ε across four error rates (synthetic). Right: Leaderboard noise, pairwise accuracy gaps vs. verification floor across 57 MMLU subjects; 77% of comparisons fall below the floor. See Appendix for enlarged version.
  Section 5, Surprise 3: Active Querying Eliminates L. Theorem 12 (Active Verification Rate). With adaptive confidence-level selection, R*_active(m, ε, L) = Θ(√(ε/…
Figure 3: Active verification on real MMLU data. Left: Active vs. passive estimation error (log-log). Right: Error at fixed m=2000 vs. L̂; active (hatched) is L-independent, passive correlates with L.
  Section 6, Surprise 4: Composition Is Exponential. Theorem 13 (Compositional Verification Tax). For a K-component pipeline with per-component Lipschitz constants L_1, …, L_K: L_sys ≤ ∏_k L_k + 1, and m_sys = Ω(L_sys · ε/δ^3). For homogeneou…
Figure 4: Compositional verification tax (synthetic).
Figure 5: (a) Minimax estimation error vs. error rate (fixed m=10,000, L=1). Sun et al.'s ε-independent rate is flat; ours decreases with ε, revealing a 2.7× gap at frontier error rates (ε=0.05). (b) Required samples for meaningful verification (δ=ε/2). Both curves grow as ε → 0; the verification horizon (dashed) marks where passive verification exceeds typical benchmark sizes.
Figure 6: Verification floors across 5 benchmarks. Only MMLU ( …
Figure 7: TruthfulQA self-evaluation analysis. Confidence tracks accuracy more reliably than …
Figure 8: TruthfulQA subsampling convergence. Estimation error shrinks with sample size but …
Figure 9: TruthfulQA phase transition. Detection power rises sharply once …
Figure 10: Real 2-stage pipeline verification error. The pipeline has consistently higher estimation …
Figure 11: Real 2-stage pipeline Lipschitz comparison. The estimated system Lipschitz constant is …
Figure 12: Synthetic slope study: estimation error versus sample size for increasing numbers of …
Figure 13: Synthetic slope convergence to the minimax …
Figure 14: Pseudo-classifier control: ECE estimate ± std versus m, with the verification floor overlaid.
  [Plot text: panel title "Real-Model Phase Transition: ECE Detection"; x-axis: normalized sample size m; y-axis: detection power; marker at m = 1.]
Figure 15: Pseudo-classifier control: detection power versus …
Original abstract

The most cited calibration result in deep learning -- post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) -- is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate epsilon is Theta((L epsilon / m)^{1/3}), and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder -- with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at m epsilon ≈ 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate L^K. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proves that the minimax rate for estimating calibration error of a model with error rate ε is Θ((Lε/m)^{1/3}), where L is the Lipschitz constant of the calibration function and m is the number of samples. This 'verification tax' implies that verifying calibration becomes harder as models improve. Four theoretical results are established: self-evaluation yields zero information about calibration, a phase transition at mε ≈ 1 below which miscalibration is undetectable, active querying removes the dependence on L, and verification cost grows exponentially with pipeline depth at rate L^K. These are validated empirically on 27 LLM-benchmark pairs (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande) using 95% bootstrap CIs and permutation tests, with self-evaluation non-significant in 80% of cases and 23% of frontier-model comparisons indistinguishable from noise.

Significance. If the minimax characterization holds, the work supplies a rigorous information-theoretic basis for the practical difficulties of auditing calibration in high-accuracy AI systems. It explains why post-hoc ECE values on frontier models often lie below the noise floor and motivates a shift toward active querying. The combination of the phase-transition result and the exponential depth dependence offers concrete guidance for evaluation design, while the large-scale empirical study on 27 model-benchmark pairs demonstrates that many current calibration claims cannot be distinguished from statistical noise.

major comments (2)
  1. [Section 3] Main theorem (Section 3): the claim that the minimax rate is Θ((Lε/m)^{1/3}) requires both an Ω lower bound and a matching O upper bound. The lower bound is obtained via a packing argument over L-Lipschitz functions with effective sample size scaled by ε, but the manuscript does not exhibit or analyze an estimator (e.g., a suitably bandwidth-tuned local polynomial regressor adapted to the Poisson sampling of errors) whose risk is O((Lε/m)^{1/3}). Without the explicit upper-bound construction the Θ statement is not fully established.
  2. [Section 5] Empirical section (Section 5, permutation-test paragraph): the claim that self-evaluation is non-significant in 80% of the 27 pairs rests on permutation tests, yet the precise null distribution (under perfect calibration and the rare-error Poisson sampling) is not specified. This detail is load-bearing for interpreting the reported non-significance rate and the 23% indistinguishability figure for frontier models.
minor comments (3)
  1. [Abstract] The four theoretical results are listed in the abstract but never enumerated with section references; adding a short roadmap paragraph would improve navigation.
  2. [Section 2] Notation for the calibration function f and the error indicator is introduced informally; a formal definition in a preliminaries subsection would eliminate ambiguity when the Lipschitz assumption is invoked.
  3. [Section 4] Figure 4 (verification cost vs. depth) shows exponential growth but omits confidence bands or sensitivity analysis with respect to L; adding these would strengthen the visual claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects that will improve the rigor of both the theoretical and empirical contributions. We address each major comment below and will incorporate the necessary revisions.

Point-by-point responses
  1. Referee: [Section 3] Main theorem (Section 3): the claim that the minimax rate is Θ((Lε/m)^{1/3}) requires both an Ω lower bound and a matching O upper bound. The lower bound is obtained via a packing argument over L-Lipschitz functions with effective sample size scaled by ε, but the manuscript does not exhibit or analyze an estimator (e.g., a suitably bandwidth-tuned local polynomial regressor adapted to the Poisson sampling of errors) whose risk is O((Lε/m)^{1/3}). Without the explicit upper-bound construction the Θ statement is not fully established.

    Authors: We agree that a complete proof of the minimax rate requires an explicit upper bound matching the lower bound. While the lower bound via the packing argument is novel and central to the verification tax result, the manuscript did not provide a detailed construction for the upper bound. In the revised manuscript, we will add an analysis of a suitably adapted local polynomial regressor (with bandwidth chosen as a function of L, ε, and m) under the Poisson sampling model for rare errors, establishing the O((Lε/m)^{1/3}) risk bound. This will fully establish the Θ characterization. revision: yes

  2. Referee: [Section 5] Empirical section (Section 5, permutation-test paragraph): the claim that self-evaluation is non-significant in 80% of the 27 pairs rests on permutation tests, yet the precise null distribution (under perfect calibration and the rare-error Poisson sampling) is not specified. This detail is load-bearing for interpreting the reported non-significance rate and the 23% indistinguishability figure for frontier models.

    Authors: We thank the referee for pointing this out. The permutation tests assess whether the observed calibration error is consistent with perfect calibration under the rare-error regime. Under the null hypothesis of perfect calibration, the error indicators are generated as independent Poisson random variables with rate ε, independent of the model's confidence scores. In the revision, we will explicitly state this null distribution and describe how the permutations are performed to generate the reference distribution for the test statistic. This will provide the necessary detail for interpreting the 80% non-significance rate and the 23% figure for frontier models. revision: yes
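The permutation procedure promised in the second response could be sketched as follows. This is one possible reading, not the authors' method. It assumes error indicators are 0/1 values, confidences lie in [0, 1], the test statistic is an equal-width binned ECE, and the null is realized by shuffling correctness labels against fixed confidences. The function names, the bin count, and the one-sided direction (a small p-value means the real pairing gives a lower ECE than random pairings) are all assumptions introduced here.

```python
import random


def binned_ece(confidences, correct, num_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf = 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def permutation_p_value(confidences, correct, num_permutations=1000, seed=0):
    """Share of label shufflings whose ECE is at most the observed ECE."""
    rng = random.Random(seed)
    observed = binned_ece(confidences, correct)
    shuffled = list(correct)
    as_low = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # break any link between confidence and correctness
        if binned_ece(confidences, shuffled) <= observed:
            as_low += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (as_low + 1) / (num_permutations + 1)


if __name__ == "__main__":
    confs = [0.9] * 50 + [0.1] * 50
    hits = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45
    print(permutation_p_value(confs, hits, num_permutations=500))
```

Fixing the seed makes the reference distribution reproducible, which matters when the reported quantity is a non-significance rate across many benchmark-model pairs.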

Circularity Check

0 steps flagged

No circularity: minimax rate is a self-contained information-theoretic derivation

full rationale

The paper derives the claimed Theta((L epsilon/m)^{1/3}) minimax rate via standard nonparametric estimation arguments (Le Cam/Fano lower bounds on a packing of L-Lipschitz calibration functions, with effective sample size scaled by the rare-error probability epsilon). This does not reduce to any fitted parameter, self-citation chain, or ansatz imported from prior work by the same author. The four listed results follow directly from the same rate analysis without re-using the target quantity as an input. Empirical sections rely on external benchmarks and permutation tests rather than internal fits. The derivation is therefore independent and self-contained against external statistical theory.
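For intuition about where a cube-root exponent of this form comes from, the following is a standard bias-variance balancing heuristic, offered as a sketch and not as the paper's proof. It assumes a local-averaging estimator with bandwidth h, roughly uniformly spread confidence scores, and per-sample noise variance of order epsilon. These assumptions are introduced here and do not come from the manuscript.

```latex
\begin{align*}
\text{bias}(h) &\lesssim L h, &
\text{sd}(h) &\asymp \sqrt{\frac{\varepsilon}{m h}}, \\
L h \asymp \sqrt{\frac{\varepsilon}{m h}}
  &\;\Longrightarrow\; h^{*} \asymp \left(\frac{\varepsilon}{L^{2} m}\right)^{1/3}, &
L h^{*} &\asymp \left(\frac{L \varepsilon}{m}\right)^{1/3}, \\
\delta \asymp \left(\frac{L \varepsilon}{m}\right)^{1/3}
  &\;\Longleftrightarrow\; m \asymp \frac{L \varepsilon}{\delta^{3}}.
\end{align*}
```

The last line has the same form as the sample-size requirement m_sys = Ω(L_sys · ε/δ^3) quoted alongside Figure 3.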

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard statistical minimax theory for estimating a functional of a distribution under a rare-event regime; L is treated as a model parameter rather than fitted; no new entities are postulated.

free parameters (1)
  • L
    Lipschitz constant of the calibration function appearing in the rate expression; treated as given rather than estimated from data.
axioms (1)
  • [standard math] Standard minimax lower-bound techniques from statistical estimation theory apply to calibration error functionals.
    Invoked to obtain the Theta((L epsilon/m)^{1/3}) rate.

pith-pipeline@v0.9.0 · 5583 in / 1381 out tokens · 86036 ms · 2026-05-10T15:23:34.293554+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    A unifying theory of distance from calibration

    Jaroslaw Blasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727--1740, 2023

  2. [2]

    Asymptotic equivalence of nonparametric regression and white noise

    Lawrence D Brown and Mark G Low. Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics, 24(6):2384--2398, 1996

  3. [3]

    Chatbot arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In International Conference on Machine Learning, 2024

  4. [4]

    An information-theoretic analysis of expected calibration error

    Futoshi Futami and Masahiro Fujisawa. An information-theoretic analysis of expected calibration error. In Advances in Neural Information Processing Systems, volume 37, 2024

  5. [5]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321--1330. PMLR, 2017

  6. [6]

    Time-uniform, nonparametric, nonasymptotic confidence sequences

    Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1--36, 2021

  7. [7]

    Testing calibration in nearly-linear time

    Lunjia Hu, Kevin Perera, and Sebastian Casalaina-Martin. Testing calibration in nearly-linear time. In Advances in Neural Information Processing Systems, volume 37, 2024. Also accepted at STOC 2025

  8. [8]

    Active testing: Sample-efficient model evaluation

    Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. Active testing: Sample-efficient model evaluation. In International Conference on Machine Learning. PMLR, 2021

  9. [9]

    On estimation of the L_r norm of a regression function

    Oleg V Lepski, Arkadi Nemirovski, and Vladimir Spokoiny. On estimation of the L_r norm of a regression function. Probability Theory and Related Fields, 113(2):221--253, 1999

  10. [10]

    Holistic evaluation of language models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023

  11. [11]

    Asymptotic equivalence of density estimation and Gaussian white noise

    Michael Nussbaum. Asymptotic equivalence of density estimation and Gaussian white noise. Annals of Statistics, 24(6):2399--2430, 1996

  12. [12]

    Auditing fairness by betting

    Shubhanshu Shekhar and Aaditya Ramdas. Auditing fairness by betting. Advances in Neural Information Processing Systems, 36, 2024

  13. [13]

    On the estimation of expected calibration error

    Shuo Sun, Jiaqi Song, and Alfred O Hero. On the estimation of expected calibration error. In Advances in Neural Information Processing Systems, volume 36, 2023. Spotlight

  14. [14]

    Introduction to Nonparametric Estimation

    Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009

  15. [15]

    Asymptotic Statistics

    Aad W van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998

  16. [16]

    Judging LLM-as-a-judge with MT-Bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023
