arxiv: 2508.13100 · v3 · submitted 2025-08-18 · 💻 cs.LG · cs.DS · stat.ML

A Perfectly Truthful Calibration Measure

Pith reviewed 2026-05-18 22:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.DS · stat.ML
keywords calibration measure · truthful calibration · batch setting · calibration error · machine learning · prediction probabilities · calibration testing

The pith

Averaged two-bin calibration error provides the first perfectly and strictly truthful calibration measure in the batch setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces averaged two-bin calibration error, called ATB, as a calibration measure that is perfectly and strictly truthful when evaluated on a finite random sample in the batch setting. This means the measure reaches its minimum in expectation exactly when the predictor outputs the ground-truth conditional probabilities rather than some distorted version that might look better on the sample. According to the abstract, all previously known calibration measures create an incentive to lie, because their expected value on a random sample can decrease when a predictor moves its outputs away from the truth. ATB is also sound and complete, equaling zero if and only if the predictor is perfectly calibrated, and it is quadratically related to the smooth calibration error and the lower distance to calibration. The authors supply a general construction recipe that uses variance additivity of independent random variables to guarantee truthfulness, with ATB as one instance and quantile-binned l_2-ECE as another.
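The incentive to lie can be seen in a small simulation. The sketch below is illustrative and not taken from the paper: it assumes a plug-in l_1 ECE that groups samples by exact prediction value (the helper names `exact_value_ece` and `mean_score` are hypothetical), draws true probabilities uniformly on [0, 1], and compares a truthful predictor with one that always reports 0.5.

```python
import numpy as np

def exact_value_ece(preds, labels):
    """Plug-in l_1 ECE that bins samples by exact prediction value (an assumed variant)."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Each distinct prediction value forms its own bin.
    _, group = np.unique(preds, return_inverse=True)
    # Per-bin residual sums; ECE = (1/n) * sum over bins of |sum of (y - p)|.
    sums = np.bincount(group.ravel(), weights=labels - preds)
    return float(np.abs(sums).sum() / len(preds))

def mean_score(predict, n=200, trials=500, seed=0):
    """Monte Carlo estimate of the expected ECE for a prediction rule."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        truth = rng.uniform(size=n)                      # true probabilities
        labels = (rng.uniform(size=n) < truth).astype(float)
        scores.append(exact_value_ece(predict(truth), labels))
    return float(np.mean(scores))

truthful = mean_score(lambda p: p)                        # reports the truth
constant = mean_score(lambda p: np.full_like(p, 0.5))     # always reports 0.5
print(f"truthful: {truthful:.3f}  constant: {constant:.3f}")
```

Under these assumptions the truthful predictor scores about 1/3 on average, because every sample sits alone in its bin and contributes |y - p|, while the uninformative constant predictor scores close to zero. The measure's expectation is therefore not minimized at the truth, which is the failure of truthfulness the paper sets out to repair.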

Core claim

We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case.

What carries the argument

Averaged two-bin calibration error (ATB), which partitions predictions into two bins and averages their squared calibration errors in a way that inherits strict truthfulness from the variance additivity of independent random variables.

If this is right

  • ATB is zero exactly when the predictor is perfectly calibrated, providing both soundness and completeness.
  • ATB is quadratically related to smooth calibration error and lower distance to calibration.
  • ATB supports the first linear-time algorithm for testing whether a predictor is calibrated.
  • The same variance-additivity recipe yields additional truthful measures such as quantile-binned l_2-ECE.
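One reason a two-bin measure is cheap to evaluate: once samples are sorted by prediction, a single prefix-sum pass gives the residual sums of both bins at every threshold. The sketch below is illustrative only. The function name `two_bin_sweep`, the choice of thresholds (each observed prediction), and the 1/n^2 normalization are assumptions, not the paper's definition of ATB. The sort makes this O(n log n), so it should not be read as the paper's linear-time tester.

```python
import numpy as np

def two_bin_sweep(preds, labels):
    """Average an assumed two-bin statistic over thresholds t = each observed prediction.

    For a threshold t the bins are {p <= t} and {p > t}; the statistic is the sum of
    the squared residual sums of the two bins, divided by n^2 (hypothetical scaling).
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(preds)
    order = np.argsort(preds, kind="stable")
    sorted_p = preds[order]
    prefix = np.cumsum((labels - preds)[order])   # running residual sums
    total = prefix[-1]
    # Index of the last sample with prediction <= t, so ties land in the low bin.
    last = np.searchsorted(sorted_p, sorted_p, side="right") - 1
    low = prefix[last]
    high = total - low
    return float(np.mean(low**2 + high**2) / n**2)

# Usage: compare against a direct evaluation on data with tied predictions.
rng = np.random.default_rng(1)
p = np.round(rng.uniform(size=50), 1)
y = (rng.uniform(size=50) < p).astype(float)
r = y - p
brute = np.mean([r[p <= t].sum() ** 2 + r[p > t].sum() ** 2 for t in p]) / 50**2
print(two_bin_sweep(p, y), brute)
```

The direct evaluation costs O(n) per threshold, so O(n^2) over all thresholds; the prefix-sum version does the same work in one pass after sorting.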

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • ATB could be used as a default objective when training models that must report probabilities for downstream decision making on held-out data.
  • The linear-time testing procedure might make routine calibration audits feasible for very large test sets in production systems.
  • If the two-bin construction generalizes cleanly, similar truthful measures could be derived for multi-class or continuous prediction settings.

Load-bearing premise

The variance additivity property of independent random variables carries over to establish strict truthfulness when ATB is computed on a finite batch sample.
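To make the premise concrete, here is the standard second-moment identity written out for a two-bin partition. This is a sketch under stated assumptions, not the paper's proof: the labels y_i are taken to be independent with y_i ~ Bernoulli(p_i^*) given the features, q_i denotes the reported prediction, and the bins B_1, B_2 depend only on the predictions. All expectations are conditional on the features, and the symbols p_i^*, q_i, B_k are introduced here for illustration.

```latex
% Assumed setup (not from the paper): y_i independent, y_i ~ Bernoulli(p_i^*);
% q_i = reported prediction; B_1, B_2 partition {1, ..., n}.
\begin{align*}
\mathbb{E}\Big[\Big(\sum_{i \in B_k} (y_i - q_i)\Big)^2\Big]
  &= \operatorname{Var}\Big(\sum_{i \in B_k} y_i\Big)
   + \Big(\sum_{i \in B_k} (p_i^* - q_i)\Big)^2 \\
  &= \sum_{i \in B_k} p_i^*(1 - p_i^*)
   + \Big(\sum_{i \in B_k} (p_i^* - q_i)\Big)^2, \\
\sum_{k=1}^{2} \mathbb{E}\Big[\Big(\sum_{i \in B_k} (y_i - q_i)\Big)^2\Big]
  &= \sum_{i=1}^{n} p_i^*(1 - p_i^*)
   + \sum_{k=1}^{2} \Big(\sum_{i \in B_k} (p_i^* - q_i)\Big)^2 .
\end{align*}
```

The first term on the last line is the same for every predictor and every partition, which is where variance additivity enters; the second term is nonnegative and vanishes at q_i = p_i^*. For a single fixed partition this only shows that the truth is a minimizer, not the unique one, so the strictness claim needs more than this identity.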

What would settle it

A concrete finite-sample example or simulation in which a predictor that reports the true conditional probabilities yields a strictly higher expected ATB value than a predictor that systematically distorts those probabilities.
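A simulation of this kind is easy to set up, with the caveat that it can only probe particular distortions and cannot establish truthfulness. The sketch below does not use the paper's definition of ATB: `averaged_two_bin` is a hypothetical stand-in that averages, over a fixed grid of thresholds, the squared residual sums of the bins {p <= t} and {p > t} divided by n^2. The two distorted predictors (shrinking toward 0.5 and always reporting 0.5) are likewise illustrative choices.

```python
import numpy as np

def averaged_two_bin(preds, labels, thresholds):
    """Hypothetical stand-in for ATB: mean over thresholds of a two-bin statistic."""
    preds = np.asarray(preds, dtype=float)
    resid = np.asarray(labels, dtype=float) - preds
    n = len(preds)
    values = []
    for t in thresholds:
        low = preds <= t
        # Squared residual sum in each bin, with an assumed 1/n^2 scaling.
        values.append((resid[low].sum() ** 2 + resid[~low].sum() ** 2) / n**2)
    return float(np.mean(values))

def expected_score(predict, n=200, trials=2000, seed=0):
    """Monte Carlo estimate of the expected score of a prediction rule."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.1, 0.9, 9)
    total = 0.0
    for _ in range(trials):
        truth = rng.uniform(size=n)                      # true probabilities
        labels = (rng.uniform(size=n) < truth).astype(float)
        total += averaged_two_bin(predict(truth), labels, thresholds)
    return total / trials

# Same seed for every rule, so all three see identical samples.
truthful = expected_score(lambda p: p)
shrunk = expected_score(lambda p: 0.5 + 0.5 * (p - 0.5))   # pulled toward 0.5
constant = expected_score(lambda p: np.full_like(p, 0.5))  # always 0.5
print(f"truthful {truthful:.5f}  shrunk {shrunk:.5f}  constant {constant:.5f}")
```

For this stand-in statistic, the truthful rule should come out lowest in both comparisons. A run in which a distorted rule scored reliably lower on the paper's actual ATB would be the kind of counterexample described above.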

Original abstract

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the averaged two-bin calibration error (ATB) as a calibration measure for the batch setting. It claims that ATB is perfectly and strictly truthful (minimized in expectation only at the true conditional probabilities), sound, and complete. The construction uses a general recipe based on variance additivity of independent random variables, establishes quadratic relations to smooth calibration error (smCal) and lower distance to calibration (distCal), and yields a linear-time calibration testing algorithm.

Significance. If the truthfulness and completeness claims hold with the stated finite-sample guarantees, the result would be a meaningful advance: it supplies the first perfectly truthful batch calibration measure, removing the incentive for predictors to deviate from true probabilities merely to minimize the reported error. The general variance-additivity recipe and the linear-time tester are additional assets that could be reused or extended.

major comments (1)
  1. [§3] §3 (Truthfulness proof) and the finite-sample definition of ATB: the argument invokes Var(X+Y)=Var(X)+Var(Y) for independent random variables to obtain strict minimization in expectation. Because ATB is computed on a finite batch of size n with empirical binning and averaging, the proof must explicitly verify that cross terms or dependence induced by shared samples do not allow a non-truthful predictor to achieve a strictly lower expected value. Please supply the missing finite-n calculation or concentration argument that preserves the strict inequality.
minor comments (2)
  1. [Abstract] Abstract: the claimed linear-time testing algorithm would benefit from an explicit big-O statement (e.g., O(n) or O(n log n)) and a brief description of the data structures used.
  2. [§2] Notation: ensure that the two-bin error term is defined uniformly before it is averaged; a single displayed equation for the per-bin contribution would improve readability.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the need to make the finite-sample truthfulness argument fully explicit. We address the comment below and will revise the manuscript to incorporate the requested clarification.

point-by-point responses
  1. Referee: [§3] §3 (Truthfulness proof) and the finite-sample definition of ATB: the argument invokes Var(X+Y)=Var(X)+Var(Y) for independent random variables to obtain strict minimization in expectation. Because ATB is computed on a finite batch of size n with empirical binning and averaging, the proof must explicitly verify that cross terms or dependence induced by shared samples do not allow a non-truthful predictor to achieve a strictly lower expected value. Please supply the missing finite-n calculation or concentration argument that preserves the strict inequality.

    Authors: We agree that an explicit finite-n derivation is required. In the revision we will expand §3 (and add a short appendix) with the following calculation. Let the n samples be drawn i.i.d. Let p_1,…,p_n be the (random) predictions produced by the fixed predictor on these samples. ATB is defined by first partitioning the n predictions into two empirical bins (via a data-dependent threshold that depends only on the multiset of p_i’s) and then computing a normalized squared deviation within each bin. Define, for each sample i, two indicator random variables I_i^{(1)} and I_i^{(2)} that mark membership in the two bins; these indicators are functions of the entire vector (p_1,…,p_n) but are still measurable with respect to the sample. The contribution of sample i to ATB can be written as a sum of two terms, each of which is a centered random variable whose conditional variance (given the p-vector) is strictly positive unless the predictor equals the true conditional probability on the support of that bin. Because the n samples are independent, the cross-covariance terms E[(term_i)(term_j)] for i≠j vanish after taking the outer expectation over the p-vector; the only surviving terms are the per-sample variances. Consequently Var(∑_i term_i) = ∑_i Var(term_i) still holds, and the expectation of ATB is therefore strictly minimized precisely when every per-sample variance is zero, i.e., when the predictor is truthful. The same argument yields the quadratic relation to smCal and distCal at finite n. We will include the full algebraic expansion and the conditioning argument in the revised manuscript.

    revision: yes

Circularity Check

0 steps flagged

Derivation of ATB truthfulness relies on standard variance additivity without reduction to inputs or self-citations

full rationale

The paper establishes strict truthfulness of ATB via a general recipe that invokes the standard mathematical property Var(X+Y)=Var(X)+Var(Y) for independent random variables. This property is external and not defined in terms of the target result. No equations reduce the claimed prediction or first-principles result to a fitted parameter, self-definition, or load-bearing self-citation chain. The citation to Hu et al. (2024) concerns only an algorithmic improvement and is not used to justify the truthfulness guarantee. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the variance additivity property and the specific definition of ATB; no free parameters or invented physical entities are indicated in the abstract.

axioms (1)
  • [standard math] Variance additivity of independent random variables
    Invoked in the general recipe that proves truthfulness of ATB as a special case.
invented entities (1)
  • Averaged two-bin calibration error (ATB) [no independent evidence]
    purpose: A new calibration measure that is perfectly truthful
    Defined directly in the paper as the average of two-bin calibration errors.

pith-pipeline@v0.9.0 · 5810 in / 1299 out tokens · 54851 ms · 2026-05-18T22:19:13.137348+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Instance-Adaptive Online Multicalibration

    cs.LG 2026-05 conditional novelty 8.0

    A single online multicalibration algorithm adaptively refines a dyadic grid and achieves instance-dependent rates: O(T^{2/3}) worst-case, O(sqrt T) for marginal stochastic data, and O(sqrt(JT)) for J-piecewise station...

  2. Testable and Actionable Calibration for Full Swap Regret

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces SCDL as a calibration measure that is fully actionable for full swap regret and testable with nearly optimal sample error while satisfying continuity and consistency.

  3. Instance-Adaptive Online Multicalibration

    cs.LG 2026-05 unverdicted novelty 7.0

    A single algorithm for online multicalibration achieves instance-adaptive rates by dynamically refining a dyadic prediction grid, recovering the worst-case Õ(T^{2/3}) bound and improving to Õ(√T) in marginal stochasti...