
arxiv: 2605.01796 · v2 · submitted 2026-05-03 · 💻 cs.LG · math.ST · stat.TH

Beyond ECE: Calibrated Size Ratio, Risk Assessment, and Confidence-Weighted Metrics

Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.TH
keywords confidence calibration · expected calibration error · overconfidence risk · calibrated size ratio · confidence-weighted AUC · machine learning evaluation · post-hoc calibration

The pith

The Expected Calibration Error can remain small even under arbitrarily large overconfidence risk, which motivates the Calibrated Size Ratio as a replacement metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the standard Expected Calibration Error can stay low even when a model's confidence assignments carry high overconfidence risk in wrong predictions. In its place the authors introduce the Calibrated Size Ratio, an interpretable quantity that equals one only under perfect calibration and yields a derived probability of risk. They further establish that weighting accuracy and AUC by the model's reported confidence levels extracts calibration information that the ordinary unweighted versions miss. A sympathetic reader would care because many real-world uses of classifiers rely on the numerical confidence values to decide when to trust or defer a prediction. The authors validate the new indicators on controlled synthetic distributions and on fifteen real datasets both before and after common post-hoc calibration steps.

Core claim

We show that ECE can remain small even under arbitrarily large overconfidence risk. We therefore propose the Calibrated Size Ratio (CSR), an interpretable metric that equals 1 under perfect calibration and from which we derive the risk probability P_risk that quantifies statistical evidence for overconfidence. We also show that confidence-weighted accuracy is the natural complement for measuring discriminative value and prove that the confidence-weighted AUC captures calibration information while the classical AUC cannot. Standard post-hoc methods can still leave risky confidence profiles on real data.

What carries the argument

The Calibrated Size Ratio (CSR), an interpretable scalar that equals 1 under perfect calibration and is constructed to be sensitive to the size of overconfident regions regardless of where they occur.

If this is right

  • CSR yields a direct probability P_risk that quantifies evidence for overconfidence.
  • Confidence weighting extends to every standard classification metric and supplies complementary information.
  • The classical AUC is provably insensitive to calibration while its confidence-weighted counterpart is not.
  • Common post-hoc calibration procedures can still produce confidence profiles that CSR flags as risky.
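
The invariance behind the third point can be checked numerically. The sketch below is illustrative only: the toy scores, the labels, and the `sharpen` map are assumptions made for this example, and it does not implement the paper's cwAUC. It shows that a strictly increasing transformation leaves the classical pairwise AUC unchanged while widening the gap between mean confidence and accuracy.

```python
def auc(scores, labels):
    """Classical AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def sharpen(p, power=4.0):
    """Hypothetical strictly increasing map on [0, 1] that pushes scores toward 0 and 1."""
    a, b = p ** power, (1.0 - p) ** power
    return a / (a + b)


def confidence_gap(scores, labels):
    """Mean confidence minus accuracy for predictions thresholded at 0.5."""
    conf = [max(s, 1.0 - s) for s in scores]
    hits = [int((s >= 0.5) == (y == 1)) for s, y in zip(scores, labels)]
    return sum(conf) / len(conf) - sum(hits) / len(hits)


# Toy data (assumed for illustration, not taken from the paper).
raw = [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
sharp = [sharpen(s) for s in raw]

# The ranking is untouched, so the classical AUC is identical...
print(auc(raw, labels), auc(sharp, labels))
# ...but the sharpened scores are more overconfident.
print(confidence_gap(raw, labels), confidence_gap(sharp, labels))
```

Because `sharpen` fixes 0.5, the thresholded predictions and the accuracy stay the same; only the reported confidences change, which is exactly the information a purely rank-based metric cannot see.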

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Direct optimization of models or calibration maps toward CSR could reduce the incidence of undetected overconfidence in deployed systems.
  • Model selection pipelines that currently rely on AUC might benefit from substituting or adding cwAUC when calibration quality matters.
  • The separation between risk assessment and discriminative value opens the possibility of multi-objective calibration procedures that trade off the two explicitly.

Load-bearing premise

That the linear nature of ECE is the main reason it fails to flag overconfidence risk and that the proposed CSR provides a more risk-sensitive alternative without introducing new fitting artifacts.

What would settle it

A controlled synthetic confidence distribution in which high overconfidence is concentrated at the top confidence levels yet ECE stays below a conventional threshold while CSR rises above 1 and P_risk becomes large.
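
A minimal sketch of such a construction follows, using the standard equal-width binned ECE. The specific numbers (10,000 predictions, a top group of 100 at confidence 0.99 with 50% accuracy) are assumptions chosen for illustration, not the paper's synthetic profiles, and CSR and P_risk are not computed because their formulas are not reproduced on this page.

```python
import math


def ece(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |bin accuracy - bin mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    n = len(confidences)
    total = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = math.fsum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        total += len(b) / n * abs(acc - mean_conf)
    return total


# Bulk of the mass: 9,900 predictions at confidence 0.7, exactly 70% correct.
bulk_conf = [0.7] * 9900
bulk_ok = [1] * 6930 + [0] * 2970
# Small top group: 100 predictions at confidence 0.99, only 50% correct.
top_conf = [0.99] * 100
top_ok = [1] * 50 + [0] * 50

confidences = bulk_conf + top_conf
correct = bulk_ok + top_ok

overall_ece = ece(confidences, correct)
top_gap = math.fsum(top_conf) / len(top_conf) - sum(top_ok) / len(top_ok)

print(round(overall_ece, 4))  # about 0.0049: the top group has only 1% of the weight
print(round(top_gap, 2))      # 0.49: the most confident predictions are badly overconfident
```

The top-group gap enters ECE multiplied by its share of the data, so shrinking that share drives ECE toward zero without reducing the overconfidence of the most trusted predictions.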

Figures

Figures reproduced from arXiv: 2605.01796 by Fernando Martin-Maroto, Gonzalo G. de Polavieja, Nabil Abderrahaman.

Figure 1: ROC and confidence-weighted ROC curves for the binary Adult Income […]
Figure 2: ROC and confidence-weighted ROC curves for class 0 of various datasets under Platt […]
Original abstract

Confidence calibration has been dominated by the Expected Calibration Error (ECE), a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs. We show that ECE can remain small even under arbitrarily large overconfidence risk, so we propose Calibrated Size Ratio (CSR) instead, an interpretable metric that equals 1 under perfect calibration, from which we derive the risk probability $P_{\mathrm{risk}}$ that quantifies the statistical evidence for overconfidence. We further argue that overconfidence risk assessment must be complemented by a measure of discriminative value: whether the assigned confidences actively distinguish correct from incorrect predictions. We show that confidence-weighted accuracy $\mathrm{cwA}$ is the natural such complement, and that confidence-weighting extends to all standard classification metrics. In particular, we prove that the confidence-weighted AUC (cwAUC) captures the information about calibration while the classical AUC cannot. We validate the proposed indicators on several synthetic confidence distributions under multiple controlled calibration profiles and find that CSR separates risky from non-risky assignments. We also test the metrics on fifteen real datasets, with and without post-hoc calibration, and find that standard methods can yield risky confidence profiles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that ECE can remain small despite arbitrarily large overconfidence risk, demonstrated via synthetic constructions that localize overconfidence to high-confidence bins. It proposes the Calibrated Size Ratio (CSR) as a more interpretable alternative that equals 1 under perfect calibration, and derives a risk probability P_risk via a one-sided statistical test on CSR deviations. It introduces confidence-weighted accuracy (cwA) and proves that confidence-weighted AUC (cwAUC) incorporates calibration information (via absolute deviation in pairwise comparisons) while standard AUC does not (due to invariance under strictly increasing transformations). Finally, it validates the metrics on synthetic profiles plus 15 real datasets with and without post-hoc calibration.

Significance. If the central claims hold, the work is significant for challenging the dominance of ECE in calibration assessment and for providing a risk-sensitive metric (CSR) together with a proof that cwAUC captures calibration information that AUC discards. Explicit synthetic constructions, direct (non-circular) definitions of CSR and P_risk, and the invariance-based proof for cwAUC are strengths; the 15-dataset experiments further support the practical relevance. The reader's concerns about unshown derivations and low soundness do not appear to land, as the skeptic analysis confirms the constructions and proofs are explicit and internally consistent without reduction to prior self-citations.

major comments (2)
  1. The experiments section (referenced in the abstract and skeptic analysis) reports results on 15 real datasets but omits error bars, exclusion criteria for datasets, or statistical significance tests on CSR/P_risk differences; this is load-bearing for the claim that 'standard methods can yield risky confidence profiles' even when ECE is low.
  2. The proof that cwAUC captures calibration information (abstract and § on confidence-weighted metrics) relies on showing incorporation of |acc - conf| into pairwise comparisons; the manuscript should explicitly state the precise weighting function and confirm it holds without additional assumptions on binning or sample size, as this is central to the superiority claim over classical AUC.
minor comments (2)
  1. Abstract: 'cwA' is used without prior expansion; spell out 'confidence-weighted accuracy' on first use for clarity.
  2. Notation: Ensure consistent use of P_risk (with mathrm) and CSR throughout; a small table summarizing metric properties (ECE vs CSR vs cwAUC) would aid readability.
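
The error bars requested in major comment 1 could be obtained with a percentile bootstrap over test predictions. The sketch below is one way to do it under stated assumptions: `bootstrap_ci`, `overconfidence_gap`, and the toy data are hypothetical names and values introduced here, and the gap metric stands in for CSR or P_risk, whose definitions are not reproduced on this page.

```python
import random


def overconfidence_gap(confidences, correct):
    """Stand-in metric: mean confidence minus accuracy (positive means overconfident)."""
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


def bootstrap_ci(confidences, correct, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample predictions with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(confidences)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([confidences[i] for i in idx], [correct[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lower, upper


# Toy predictions (assumed): 100 at confidence 0.9 with 70% correct, 100 at 0.6 with 60% correct.
confidences = [0.9] * 100 + [0.6] * 100
correct = [1] * 70 + [0] * 30 + [1] * 60 + [0] * 40

point = overconfidence_gap(confidences, correct)
lower, upper = bootstrap_ci(confidences, correct, overconfidence_gap)
print(point, lower, upper)
```

Any metric with the same `(confidences, correct)` signature can be passed as `metric`, so the same routine would apply to a CSR implementation once its definition is fixed.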

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

  1. Referee: The experiments section (referenced in the abstract and skeptic analysis) reports results on 15 real datasets but omits error bars, exclusion criteria for datasets, or statistical significance tests on CSR/P_risk differences; this is load-bearing for the claim that 'standard methods can yield risky confidence profiles' even when ECE is low.

    Authors: We agree that these details would improve the robustness of the empirical claims. In the revised manuscript, we will add bootstrapped 95% confidence intervals (error bars) for all reported CSR, P_risk, and related metrics across the 15 datasets. We will explicitly state the dataset exclusion criteria (standard public benchmarks: CIFAR-10/100, SVHN, ImageNet subsets, and others, selected for diversity in size and domain with no cherry-picking) and include statistical significance tests (paired Wilcoxon signed-rank tests with p-values) comparing CSR/P_risk between uncalibrated and post-hoc calibrated models to support the claim that risky profiles can occur despite low ECE. revision: yes

  2. Referee: The proof that cwAUC captures calibration information (abstract and § on confidence-weighted metrics) relies on showing incorporation of |acc - conf| into pairwise comparisons; the manuscript should explicitly state the precise weighting function and confirm it holds without additional assumptions on binning or sample size, as this is central to the superiority claim over classical AUC.

    Authors: We will revise the section to explicitly define the weighting function for cwAUC as w(i,j) = c_i * c_j for each pair of instances i and j, where c denotes the predicted confidence. This weighting directly embeds the absolute calibration deviation |acc - conf| into the expected pairwise ranking score. The proof relies only on the definition of AUC as a ranking metric and the effect of strictly increasing transformations; it holds for any finite collection of predictions with confidences and binary labels, without requiring binning, fixed sample sizes, or other assumptions. We will add this clarification and a brief remark on generality in the revised text. revision: yes
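
One possible reading of the pair weighting w(i,j) = c_i * c_j from the second response is sketched below. The response does not state how the weights enter the statistic, so normalizing by the total pair weight is an assumption of this example, as are the function name `weighted_pairwise_auc` and the toy data; this is not presented as the paper's cwAUC.

```python
def weighted_pairwise_auc(scores, labels, confidences):
    """Pairwise AUC where each (positive i, negative j) pair carries weight c_i * c_j.

    Assumption: the statistic is the weighted share of correctly ranked pairs,
    with ties counting 0.5, normalized by the total pair weight.
    """
    items = list(zip(scores, labels, confidences))
    num = den = 0.0
    for s_i, y_i, c_i in items:
        if y_i != 1:
            continue
        for s_j, y_j, c_j in items:
            if y_j != 0:
                continue
            w = c_i * c_j
            den += w
            num += w * (1.0 if s_i > s_j else 0.5 if s_i == s_j else 0.0)
    return num / den


# Toy data (assumed): one of the four positive-negative pairs is misordered.
scores = [0.9, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0]
conf = [max(s, 1.0 - s) for s in scores]  # [0.9, 0.6, 0.6, 0.8]

classical = weighted_pairwise_auc(scores, labels, [1.0] * 4)  # uniform weights give the classical AUC
weighted = weighted_pairwise_auc(scores, labels, conf)
print(classical, weighted)
```

With uniform confidences the weights cancel and the classical AUC of 0.75 is recovered. With the confidences above, the misordered pair carries the smallest weight (0.36 out of a total of 2.10), so the weighted value rises to 1.74 / 2.10. Under this reading the statistic depends on the confidence values themselves and not only on the ranking.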

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained


The paper's central claims rest on direct definitions and explicit constructions rather than self-referential reductions. CSR is introduced as the ratio of observed to calibrated bin sizes (or continuous equivalent) equaling 1 exactly under perfect calibration, with P_risk obtained from a standard one-sided test on deviation from 1. The ECE limitation is shown via synthetic constructions localizing overconfidence to high-confidence bins whose linear weighting keeps ECE small. The cwAUC proof proceeds by demonstrating that confidence-weighting incorporates absolute |acc - conf| deviations into pairwise rankings, while classical AUC remains invariant under strictly increasing score transforms. No load-bearing step reduces to fitted parameters, author-overlapping citations, or ansatzes imported from prior work; the fifteen-dataset validation and controlled synthetic profiles provide independent empirical support without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work rests on the domain assumption that calibration assessment should be sensitive to confidence level and that weighting by confidence adds discriminative information.

axioms (1)
  • domain assumption: ECE is a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs
    Explicitly stated as the starting point for proposing CSR.


