A Perfectly Truthful Calibration Measure

Jason Hartline; Lunjia Hu; Yifan Wu

arxiv: 2508.13100 · v3 · submitted 2025-08-18 · 💻 cs.LG · cs.DS · stat.ML

Pith reviewed 2026-05-18 22:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.DS · stat.ML

keywords calibration measure · truthful calibration · batch setting · calibration error · machine learning · prediction probabilities · calibration testing

The pith

Averaged two-bin calibration error provides the first perfectly and strictly truthful calibration measure in the batch setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces averaged two-bin calibration error, called ATB, as a calibration measure that is perfectly and strictly truthful when evaluated on a finite random sample from a batch. This means the measure reaches its minimum in expectation exactly when the predictor outputs the ground-truth conditional probabilities rather than some distorted version that might look better on the sample. Earlier calibration measures all create an incentive to lie because their values can decrease when a predictor adjusts outputs away from the truth to better match the realized sample. ATB is also sound and complete, equaling zero if and only if the predictor is perfectly calibrated, and it connects quadratically to the smooth calibration error and the lower distance to calibration. The authors supply a general construction recipe that uses variance additivity of independent random variables to guarantee truthfulness, with ATB as one instance and quantile-binned l2-ECE as another.

Core claim

We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case.

What carries the argument

Averaged two-bin calibration error (ATB): for a threshold q drawn uniformly from [0,1], the predictions are split into two bins (below q, and at or above q), each bin's summed calibration error is squared, and the result is averaged over q. The measure inherits strict truthfulness from the variance additivity of independent random variables.
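To make the definition concrete, here is a minimal sketch that evaluates the formula excerpted from Definition 1.2 further down this page, assuming the 1/T² factor applies to both squared bin sums. The function name `atb` and the prefix-sum approach are illustrative choices, not the paper's implementation. Because the sketch sorts the predictions it runs in O(T log T), so it is not the linear-time procedure the paper describes.

```python
import numpy as np

def atb(preds, labels):
    """Hypothetical helper: exact average over q ~ Unif([0, 1]) of
    (1/T^2) * [(sum_{r_t < q} (r_t - y_t))^2 + (sum_{r_t >= q} (r_t - y_t))^2]."""
    r = np.asarray(preds, dtype=float)
    y = np.asarray(labels, dtype=float)
    T = r.size
    order = np.argsort(r, kind="stable")
    r_sorted = r[order]
    resid = (r - y)[order]
    # prefix[k] = summed residual of the k smallest predictions (the "low" bin).
    prefix = np.concatenate(([0.0], np.cumsum(resid)))
    total = prefix[-1]
    # The two-bin split is constant for q between consecutive sorted predictions,
    # so the average over q is a sum weighted by the interval lengths.
    edges = np.concatenate(([0.0], r_sorted, [1.0]))
    widths = np.diff(edges)  # widths[k]: measure of q with exactly k predictions below q
    low = prefix
    high = total - prefix
    return float(np.sum(widths * (low ** 2 + high ** 2)) / T ** 2)

# Two predictions, one below and one above most thresholds.
example = atb([0.2, 0.8], [0, 1])
print(example)
```

Tied predictions produce zero-width intervals, so they always land in the same bin, as the definition requires.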

If this is right

ATB is zero exactly when the predictor is perfectly calibrated, providing both soundness and completeness.
ATB is quadratically related to smooth calibration error and lower distance to calibration.
ATB supports the first linear-time algorithm for testing whether a predictor is calibrated.
The same variance-additivity recipe yields additional truthful measures such as quantile-binned l2-ECE.
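The role of variance additivity in that recipe can be checked numerically on a toy case. The sketch below assumes a binned squared error of the form (1/T²) · Σ over bins of (sum of residuals in the bin)², which is an assumption about the general shape and not the paper's exact definition of quantile-binned l2-ECE. The names `binned_sq_error` and `expected_binned_sq_error` are hypothetical. It enumerates all label outcomes for three independent Bernoulli labels and shows that, for a truthful predictor, the expected error is the same for every fixed partition.

```python
from itertools import product
import numpy as np

def binned_sq_error(preds, labels, bins):
    """Assumed form: (1/T^2) * sum over bins of (sum of residuals in the bin)^2."""
    resid = np.asarray(preds, dtype=float) - np.asarray(labels, dtype=float)
    return float(sum(resid[list(b)].sum() ** 2 for b in bins) / resid.size ** 2)

def expected_binned_sq_error(preds, true_probs, bins):
    """Exact expectation over independent labels y_t ~ Bernoulli(true_probs[t])."""
    p = np.asarray(true_probs, dtype=float)
    total = 0.0
    for outcome in product([0.0, 1.0], repeat=p.size):
        y = np.array(outcome)
        weight = float(np.prod(np.where(y == 1.0, p, 1.0 - p)))
        total += weight * binned_sq_error(preds, y, bins)
    return total

p = [0.3, 0.6, 0.8]  # illustrative ground-truth probabilities
partitions = {
    "singletons": [[0], [1], [2]],
    "two bins": [[0, 1], [2]],
    "one bin": [[0, 1, 2]],
}
# Truthful predictor: preds equal the true probabilities.
values = {name: expected_binned_sq_error(p, p, bins) for name, bins in partitions.items()}
# Variance additivity predicts sum_t p_t (1 - p_t) / T^2 for every partition.
variance_floor = sum(q * (1 - q) for q in p) / len(p) ** 2
for name, value in values.items():
    print(name, value, variance_floor)
```

The bins here are fixed index sets chosen independently of the labels; a partition that depended on the realized labels would fall outside this sketch.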

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

ATB could be used as a default objective when training models that must report probabilities for downstream decision making on held-out data.
The linear-time testing procedure might make routine calibration audits feasible for very large test sets in production systems.
If the two-bin construction generalizes cleanly, similar truthful measures could be derived for multi-class or continuous prediction settings.

Load-bearing premise

The variance additivity property of independent random variables carries over to establish strict truthfulness when ATB is computed on a finite batch sample.
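The premise can be made concrete with a short calculation. It is a sketch under assumptions not spelled out on this page: conditional on the predictions r_t, the labels y_t are independent Bernoulli(p_t) draws, and the bin S is determined by the predictions and the threshold q only, never by the labels.

```latex
\begin{align*}
\mathbb{E}_{y}\Big[\Big(\sum_{t \in S} (r_t - y_t)\Big)^{2}\Big]
  &= \Big(\sum_{t \in S} (r_t - p_t)\Big)^{2}
     + \operatorname{Var}\Big(\sum_{t \in S} y_t\Big) \\
  &= \Big(\sum_{t \in S} (r_t - p_t)\Big)^{2}
     + \sum_{t \in S} p_t (1 - p_t).
\end{align*}
```

Summing over the two bins S = {t : r_t < q} and its complement, the variance terms add up to Σ_t p_t(1 − p_t) for every q and every report r, so only the squared bias terms depend on the report, and they vanish when r_t = p_t. This shows that the truthful report minimizes the expectation in this simplified model; strictness needs the paper's fuller argument, since in this fixed-p sketch a report whose biases cancel within every bin would tie with the truthful one.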

What would settle it

A concrete finite-sample example or simulation in which a predictor that reports the true conditional probabilities yields a strictly higher expected ATB value than a predictor that systematically distorts those probabilities.
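A small exact check of this kind can be sketched as follows. It assumes the ATB form excerpted from Definition 1.2 further down this page (with the 1/T² factor applied to both squared bin sums), independent Bernoulli labels, and three illustrative ground-truth probabilities. The helper names `atb` and `expected_atb` are hypothetical. Rather than sampling, it enumerates all 2³ label outcomes, so the comparison has no Monte Carlo noise.

```python
from itertools import product
import numpy as np

def atb(preds, labels):
    """Hypothetical helper: ATB averaged exactly over q ~ Unif([0, 1])."""
    r = np.asarray(preds, dtype=float)
    y = np.asarray(labels, dtype=float)
    order = np.argsort(r, kind="stable")
    # prefix[k] = summed residual of the k smallest predictions.
    prefix = np.concatenate(([0.0], np.cumsum((r - y)[order])))
    widths = np.diff(np.concatenate(([0.0], r[order], [1.0])))
    low, high = prefix, prefix[-1] - prefix
    return float(np.sum(widths * (low ** 2 + high ** 2)) / r.size ** 2)

def expected_atb(preds, true_probs):
    """Exact expectation of ATB over independent labels y_t ~ Bernoulli(true_probs[t])."""
    p = np.asarray(true_probs, dtype=float)
    total = 0.0
    for outcome in product([0.0, 1.0], repeat=p.size):
        y = np.array(outcome)
        weight = float(np.prod(np.where(y == 1.0, p, 1.0 - p)))
        total += weight * atb(preds, y)
    return total

true_probs = [0.3, 0.6, 0.8]  # illustrative values, not from the paper
truthful = expected_atb(true_probs, true_probs)
distorted = expected_atb([0.1, 0.6, 0.8], true_probs)      # shades one prediction down
overconfident = expected_atb([0.0, 1.0, 1.0], true_probs)  # rounds to 0/1

print("truthful     ", truthful)
print("distorted    ", distorted)
print("overconfident", overconfident)
```

In this toy instance both distorted reports have a larger expected ATB than the truthful one. A counterexample to the paper's claim would have to show the reverse ordering under the paper's own sampling model, which this sketch does not reproduce.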

Original abstract

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives the first perfectly truthful batch calibration measure via ATB and a variance additivity recipe, with the finite-sample step as the main point to verify.

This paper's key contribution is a calibration measure, the averaged two-bin calibration error (ATB), that is perfectly and strictly truthful in the batch setting. No prior work had achieved that, even without sequential complications. The authors define ATB simply and prove it is minimized in expectation only when the predictions match the true conditional probabilities. Their general recipe relies on variance additivity for independent random variables, with ATB as a special case; it also yields other measures, such as quantile-binned l2-ECE. On top of that, the simplicity supports a linear-time algorithm for testing calibration, improving on Hu et al. (2024).

The work does well by making the measure easy to compute while fixing the incentive issue that plagued earlier ones: predictors no longer have reason to distort their outputs just to score better on the measure.

The potential soft spot is exactly the one in the stress test. The proof uses independence to get strict truthfulness, but in a finite batch the empirical averages and binning might introduce dependencies or cross terms that affect the expectation. The paper needs to show explicitly that the minimum remains unique at the true probabilities after these steps. If that holds, the claim is solid.

This is for readers interested in proper evaluation of probabilistic predictions, especially in settings where truthfulness matters for high-stakes use. It engages the recent literature on calibration measures directly. I would recommend sending it for peer review. The idea is novel enough and the construction concrete enough that referees can check the proofs and clarify any finite-sample gaps.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the averaged two-bin calibration error (ATB) as a calibration measure for the batch setting. It claims that ATB is perfectly and strictly truthful (minimized in expectation only at the true conditional probabilities), sound, and complete. The construction uses a general recipe based on variance additivity of independent random variables, establishes quadratic relations to smooth calibration error (smCal) and lower distance to calibration (distCal), and yields a linear-time calibration testing algorithm.

Significance. If the truthfulness and completeness claims hold with the stated finite-sample guarantees, the result would be a meaningful advance: it supplies the first perfectly truthful batch calibration measure, removing the incentive for predictors to deviate from true probabilities merely to minimize the reported error. The general variance-additivity recipe and the linear-time tester are additional assets that could be reused or extended.

major comments (1)

[§3] §3 (Truthfulness proof) and the finite-sample definition of ATB: the argument invokes Var(X+Y)=Var(X)+Var(Y) for independent random variables to obtain strict minimization in expectation. Because ATB is computed on a finite batch of size n with empirical binning and averaging, the proof must explicitly verify that cross terms or dependence induced by shared samples do not allow a non-truthful predictor to achieve a strictly lower expected value. Please supply the missing finite-n calculation or concentration argument that preserves the strict inequality.

minor comments (2)

[Abstract] Abstract: the claimed linear-time testing algorithm would benefit from an explicit big-O statement (e.g., O(n) or O(n log n)) and a brief description of the data structures used.
[§2] Notation: ensure that the two-bin error term is defined uniformly before it is averaged; a single displayed equation for the per-bin contribution would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the need to make the finite-sample truthfulness argument fully explicit. We address the comment below and will revise the manuscript to incorporate the requested clarification.

Referee: [§3] §3 (Truthfulness proof) and the finite-sample definition of ATB: the argument invokes Var(X+Y)=Var(X)+Var(Y) for independent random variables to obtain strict minimization in expectation. Because ATB is computed on a finite batch of size n with empirical binning and averaging, the proof must explicitly verify that cross terms or dependence induced by shared samples do not allow a non-truthful predictor to achieve a strictly lower expected value. Please supply the missing finite-n calculation or concentration argument that preserves the strict inequality.

Authors: We agree that an explicit finite-n derivation is required. In the revision we will expand §3 (and add a short appendix) with the following calculation.

Let the n samples be drawn i.i.d. Let p_1,…,p_n be the (random) predictions produced by the fixed predictor on these samples. ATB is defined by first partitioning the n predictions into two empirical bins (via a data-dependent threshold that depends only on the multiset of p_i's) and then computing a normalized squared deviation within each bin. Define, for each sample i, two indicator random variables I_i^{(1)} and I_i^{(2)} that mark membership in the two bins; these indicators are functions of the entire vector (p_1,…,p_n) but are still measurable with respect to the sample.

The contribution of sample i to ATB can be written as a sum of two terms, each of which is a centered random variable whose conditional variance (given the p-vector) is strictly positive unless the predictor equals the true conditional probability on the support of that bin. Because the n samples are independent, the cross-covariance terms E[(term_i)(term_j)] for i≠j vanish after taking the outer expectation over the p-vector; the only surviving terms are the per-sample variances. Consequently Var(∑_i term_i) = ∑_i Var(term_i) still holds, and the expectation of ATB is therefore strictly minimized precisely when every per-sample variance is zero, i.e., when the predictor is truthful.

The same argument yields the quadratic relation to smCal and distCal at finite n. We will include the full algebraic expansion and the conditioning argument in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

Derivation of ATB truthfulness relies on standard variance additivity without reduction to inputs or self-citations

The paper establishes strict truthfulness of ATB via a general recipe that invokes the standard mathematical property Var(X+Y)=Var(X)+Var(Y) for independent random variables. This property is external and not defined in terms of the target result. No equations reduce the claimed prediction or first-principles result to a fitted parameter, self-definition, or load-bearing self-citation chain. The citation to Hu et al. (2024) concerns only an algorithmic improvement and is not used to justify the truthfulness guarantee. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the variance additivity property and the specific definition of ATB; no free parameters or invented physical entities are indicated in the abstract.

axioms (1)

Variance additivity of independent random variables (standard math)
Invoked in the general recipe that proves truthfulness of ATB as a special case.

invented entities (1)

Averaged two-bin calibration error (ATB) (no independent evidence)
purpose: A new calibration measure that is perfectly truthful
Defined directly in the paper as the average of two-bin calibration errors.

pith-pipeline@v0.9.0 · 5810 in / 1299 out tokens · 54851 ms · 2026-05-18T22:19:13.137348+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$\mathbb{E}_{y\sim p}[\ell_2\text{-BinECE}_{B'}(p,y)] = \mathbb{E}_{y\sim p}[\ell_2\text{-BinECE}_{B}(p,y)]$ for any partitions $B, B'$ … we crucially use the variance additivity of independent random variables … $\sum_{t\in B_i} p_t(1-p_t)$
IndisputableMonolith.Foundation.BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

Definition 1.2 … $\mathbb{E}_{q\sim\mathrm{Unif}([0,1])}\Big[\frac{1}{T^2}\Big(\big(\sum_{r_t<q}(r_t-y_t)\big)^2 + \big(\sum_{r_t\ge q}(r_t-y_t)\big)^2\Big)\Big]$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Instance-Adaptive Online Multicalibration
cs.LG 2026-05 conditional novelty 8.0

A single online multicalibration algorithm adaptively refines a dyadic grid and achieves instance-dependent rates: O(T^{2/3}) worst-case, O(sqrt T) for marginal stochastic data, and O(sqrt(JT)) for J-piecewise station...
Testable and Actionable Calibration for Full Swap Regret
cs.LG 2026-05 unverdicted novelty 7.0

Introduces SCDL as a calibration measure that is fully actionable for full swap regret and testable with nearly optimal sample error while satisfying continuity and consistency.
Instance-Adaptive Online Multicalibration
cs.LG 2026-05 unverdicted novelty 7.0

A single algorithm for online multicalibration achieves instance-adaptive rates by dynamically refining a dyadic prediction grid, recovering the worst-case Õ(T^{2/3}) bound and improving to Õ(√T) in marginal stochasti...