A Perfectly Truthful Calibration Measure
Pith reviewed 2026-05-18 22:19 UTC · model grok-4.3
The pith
Averaged two-bin calibration error provides the first perfectly and strictly truthful calibration measure in the batch setting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case.
What carries the argument
Averaged two-bin calibration error (ATB): for a threshold q, the predictions are split into two bins (below q, and at or above q), the summed prediction-minus-outcome error of each bin is squared, and the result is averaged over q drawn uniformly from [0,1]. Its truthfulness is inherited from the variance additivity of independent random variables.
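A minimal sketch of how such a measure could be computed, assuming the threshold-averaged form quoted from Definition 1.2 further down this page, and assuming the 1/T² factor scales both squared bin sums. The function name `atb` and its arguments are hypothetical, and the sort makes this O(T log T), so it is not the paper's linear-time tester.

```python
from typing import Sequence


def atb(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Averaged two-bin error under the ASSUMED form
    E_{q~Unif[0,1]} (1/T^2) * [(sum_{r_t<q} (r_t-y_t))^2 + (sum_{r_t>=q} (r_t-y_t))^2].
    """
    T = len(preds)
    if T == 0 or T != len(outcomes):
        raise ValueError("preds and outcomes must be non-empty and equally long")
    if any(r < 0.0 or r > 1.0 for r in preds):
        raise ValueError("predictions must lie in [0, 1]")

    # Sort by prediction so that every threshold q splits the batch
    # into a prefix (r_t < q) and a suffix (r_t >= q).
    pairs = sorted(zip(preds, outcomes))
    total = sum(r - y for r, y in pairs)

    acc = 0.0
    prefix = 0.0  # summed residuals of the items already passed
    prev = 0.0    # left end of the current threshold interval
    for r, y in pairs:
        # Every q in (prev, r] puts exactly the items passed so far below q.
        acc += (r - prev) * (prefix ** 2 + (total - prefix) ** 2)
        prefix += r - y
        prev = r
    # Thresholds above the largest prediction: all items fall in the lower bin.
    acc += (1.0 - prev) * (prefix ** 2 + (total - prefix) ** 2)
    return acc / T ** 2


if __name__ == "__main__":
    print(atb([0.2, 0.8], [0, 1]))
```

Integrating over q exactly, rather than sampling thresholds, keeps the value deterministic for a given batch; tied predictions contribute zero-width intervals and need no special handling.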
If this is right
- ATB is zero exactly when the predictor is perfectly calibrated, providing both soundness and completeness.
- ATB is quadratically related to smooth calibration error and lower distance to calibration.
- ATB supports the first linear-time algorithm for testing whether a predictor is calibrated.
- The same variance-additivity recipe yields additional truthful measures such as quantile-binned l2-ECE.
Where Pith is reading between the lines
- ATB could be used as a default objective when training models that must report probabilities for downstream decision making on held-out data.
- The linear-time testing procedure might make routine calibration audits feasible for very large test sets in production systems.
- If the two-bin construction generalizes cleanly, similar truthful measures could be derived for multi-class or continuous prediction settings.
Load-bearing premise
The variance additivity property of independent random variables carries over to establish strict truthfulness when ATB is computed on a finite batch sample.
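The premise can be made concrete with a short calculation. This is a sketch under stated assumptions rather than the paper's proof: predictions r_1, …, r_T are fixed before outcomes are drawn, outcomes y_t are independent Bernoulli(p_t), and ATB has the threshold-averaged form quoted from Definition 1.2 further down this page, with 1/T² scaling both squared bin sums.

```latex
% Assumptions (not taken from the paper's text): r_t fixed, y_t ~ Bernoulli(p_t) independent.
% For any bin B \subseteq \{1,\dots,T\} that does not depend on the outcomes:
\begin{align*}
\mathbb{E}\Bigl[\Bigl(\sum_{t\in B}(r_t-y_t)\Bigr)^2\Bigr]
  &= \Bigl(\sum_{t\in B}(r_t-p_t)\Bigr)^2 + \operatorname{Var}\Bigl(\sum_{t\in B} y_t\Bigr) \\
  &= \Bigl(\sum_{t\in B}(r_t-p_t)\Bigr)^2 + \sum_{t\in B} p_t(1-p_t).
\end{align*}
% Adding the two bins \{t : r_t<q\} and \{t : r_t\ge q\}, then averaging over q:
\begin{align*}
\mathbb{E}_y\bigl[\mathrm{ATB}(r,y)\bigr]
  &= \frac{1}{T^2}\sum_{t=1}^{T} p_t(1-p_t)
   + \frac{1}{T^2}\,\mathbb{E}_{q\sim\mathrm{Unif}([0,1])}
     \Bigl[\Bigl(\sum_{t:\,r_t<q}(r_t-p_t)\Bigr)^2
         + \Bigl(\sum_{t:\,r_t\ge q}(r_t-p_t)\Bigr)^2\Bigr].
\end{align*}
```

The first term does not depend on the predictions, and the second is non-negative and vanishes at r = p, so under these assumptions truthful predictions attain the minimum expected value. The step that uses independence is the second equality of the first display; it would fail if bin membership depended on the outcomes. Whether the minimum is attained only by truthful predictions (strictness) is not settled by this sketch.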
What would settle it
A concrete finite-sample example or simulation in which a predictor that reports the true conditional probabilities yields a strictly higher expected ATB value than a predictor that systematically distorts those probabilities.
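One way to search for such a counterexample on small batches is exact enumeration over all outcome vectors, which avoids Monte Carlo noise. The sketch below assumes the threshold-averaged form of ATB quoted from Definition 1.2 further down this page (with 1/T² scaling both bin terms) and independent Bernoulli outcomes; `atb`, `expected_atb`, and the example probabilities are all hypothetical.

```python
from itertools import product
from typing import Sequence


def atb(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """ATB under the ASSUMED form: average over q ~ Unif[0,1] of
    (1/T^2) * (squared residual sum below q + squared residual sum at/above q)."""
    T = len(preds)
    pairs = sorted(zip(preds, outcomes))
    total = sum(r - y for r, y in pairs)
    acc, prefix, prev = 0.0, 0.0, 0.0
    for r, y in pairs:
        # Thresholds in (prev, r] place exactly the items passed so far below q.
        acc += (r - prev) * (prefix ** 2 + (total - prefix) ** 2)
        prefix += r - y
        prev = r
    acc += (1.0 - prev) * (prefix ** 2 + (total - prefix) ** 2)
    return acc / T ** 2


def expected_atb(preds: Sequence[float], true_probs: Sequence[float]) -> float:
    """Exact E[ATB] with y_t ~ Bernoulli(true_probs[t]) independently,
    computed by enumerating all 2^T outcome vectors (small T only)."""
    expectation = 0.0
    for ys in product((0, 1), repeat=len(preds)):
        weight = 1.0
        for y, p in zip(ys, true_probs):
            weight *= p if y == 1 else 1.0 - p
        expectation += weight * atb(preds, ys)
    return expectation


# Illustrative ground truth and two distorted reports (made-up numbers).
true_probs = [0.2, 0.5, 0.9]
truthful = expected_atb(true_probs, true_probs)
extreme = expected_atb([0.0, 0.5, 1.0], true_probs)  # pushed toward 0 and 1
shrunk = expected_atb([0.4, 0.5, 0.6], true_probs)   # pulled toward 0.5

if __name__ == "__main__":
    print("truthful:", truthful)
    print("extreme: ", extreme)
    print("shrunk:  ", shrunk)
```

A counterexample in the sense above would be any distorted report whose value falls strictly below `truthful`; looping over a grid of candidate reports in place of the two hand-picked ones turns this into a brute-force search for small T.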
Original abstract
Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the averaged two-bin calibration error (ATB) as a calibration measure for the batch setting. It claims that ATB is perfectly and strictly truthful (minimized in expectation only at the true conditional probabilities), sound, and complete. The construction uses a general recipe based on variance additivity of independent random variables, establishes quadratic relations to smooth calibration error (smCal) and lower distance to calibration (distCal), and yields a linear-time calibration testing algorithm.
Significance. If the truthfulness and completeness claims hold with the stated finite-sample guarantees, the result would be a meaningful advance: it supplies the first perfectly truthful batch calibration measure, removing the incentive for predictors to deviate from true probabilities merely to minimize the reported error. The general variance-additivity recipe and the linear-time tester are additional assets that could be reused or extended.
major comments (1)
- [§3] §3 (Truthfulness proof) and the finite-sample definition of ATB: the argument invokes Var(X+Y)=Var(X)+Var(Y) for independent random variables to obtain strict minimization in expectation. Because ATB is computed on a finite batch of size n with empirical binning and averaging, the proof must explicitly verify that cross terms or dependence induced by shared samples do not allow a non-truthful predictor to achieve a strictly lower expected value. Please supply the missing finite-n calculation or concentration argument that preserves the strict inequality.
minor comments (2)
- [Abstract] Abstract: the claimed linear-time testing algorithm would benefit from an explicit big-O statement (e.g., O(n) or O(n log n)) and a brief description of the data structures used.
- [§2] Notation: ensure that the two-bin error term is defined uniformly before it is averaged; a single displayed equation for the per-bin contribution would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need to make the finite-sample truthfulness argument fully explicit. We address the comment below and will revise the manuscript to incorporate the requested clarification.
Point-by-point responses
- Referee: [§3] §3 (Truthfulness proof) and the finite-sample definition of ATB: the argument invokes Var(X+Y)=Var(X)+Var(Y) for independent random variables to obtain strict minimization in expectation. Because ATB is computed on a finite batch of size n with empirical binning and averaging, the proof must explicitly verify that cross terms or dependence induced by shared samples do not allow a non-truthful predictor to achieve a strictly lower expected value. Please supply the missing finite-n calculation or concentration argument that preserves the strict inequality.
  Authors: We agree that an explicit finite-n derivation is required. In the revision we will expand §3 (and add a short appendix) with the following calculation. Let the n samples be drawn i.i.d. Let p_1,…,p_n be the (random) predictions produced by the fixed predictor on these samples. ATB is defined by first partitioning the n predictions into two empirical bins (via a data-dependent threshold that depends only on the multiset of p_i’s) and then computing a normalized squared deviation within each bin. Define, for each sample i, two indicator random variables I_i^{(1)} and I_i^{(2)} that mark membership in the two bins; these indicators are functions of the entire vector (p_1,…,p_n) but are still measurable with respect to the sample. The contribution of sample i to ATB can be written as a sum of two terms, each of which is a centered random variable whose conditional variance (given the p-vector) is strictly positive unless the predictor equals the true conditional probability on the support of that bin. Because the n samples are independent, the cross-covariance terms E[(term_i)(term_j)] for i≠j vanish after taking the outer expectation over the p-vector; the only surviving terms are the per-sample variances. Consequently Var(∑_i term_i) = ∑_i Var(term_i) still holds, and the expectation of ATB is therefore strictly minimized precisely when every per-sample variance is zero, i.e., when the predictor is truthful. The same argument yields the quadratic relation to smCal and distCal at finite n. We will include the full algebraic expansion and the conditioning argument in the revised manuscript.
  Revision: yes
Circularity Check
Derivation of ATB truthfulness relies on standard variance additivity without reduction to inputs or self-citations
full rationale
The paper establishes strict truthfulness of ATB via a general recipe that invokes the standard mathematical property Var(X+Y)=Var(X)+Var(Y) for independent random variables. This property is external and not defined in terms of the target result. No equations reduce the claimed prediction or first-principles result to a fitted parameter, self-definition, or load-bearing self-citation chain. The citation to Hu et al. (2024) concerns only an algorithmic improvement and is not used to justify the truthfulness guarantee. The derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Variance additivity of independent random variables
invented entities (1)
- Averaged two-bin calibration error (ATB): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: E_{y∼p}[ℓ_2-BinECE_{B′}(p,y)] = E_{y∼p}[ℓ_2-BinECE_B(p,y)] for any partitions B, B′ … we crucially use the variance additivity of independent random variables … ∑_{t∈B_i} p_t(1−p_t)
- IndisputableMonolith.Foundation.BranchSelection.lean, branch_selection (unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: Definition 1.2 … E_{q∼Unif([0,1])} [ (1/T²) (∑_{r_t<q} (r_t−y_t))² + (∑_{r_t≥q} (r_t−y_t))² ]
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Instance-Adaptive Online Multicalibration
  A single online multicalibration algorithm adaptively refines a dyadic grid and achieves instance-dependent rates: O(T^{2/3}) worst-case, O(sqrt T) for marginal stochastic data, and O(sqrt(JT)) for J-piecewise station...
- Testable and Actionable Calibration for Full Swap Regret
  Introduces SCDL as a calibration measure that is fully actionable for full swap regret and testable with nearly optimal sample error while satisfying continuity and consistency.
- Instance-Adaptive Online Multicalibration
  A single algorithm for online multicalibration achieves instance-adaptive rates by dynamically refining a dyadic prediction grid, recovering the worst-case Õ(T^{2/3}) bound and improving to Õ(√T) in marginal stochasti...