Beyond ECE: Calibrated Size Ratio, Risk Assessment, and Confidence-Weighted Metrics
Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3
The pith
The Expected Calibration Error can remain small even under arbitrarily large overconfidence risk, which motivates the Calibrated Size Ratio as a replacement metric.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that ECE can remain small even under arbitrarily large overconfidence risk. We therefore propose the Calibrated Size Ratio (CSR), an interpretable metric that equals 1 under perfect calibration and from which we derive the risk probability P_risk that quantifies statistical evidence for overconfidence. We also show that confidence-weighted accuracy is the natural complement for measuring discriminative value and prove that the confidence-weighted AUC captures calibration information while the classical AUC cannot. Standard post-hoc methods can still leave risky confidence profiles on real data.
What carries the argument
The Calibrated Size Ratio (CSR), an interpretable scalar that equals 1 under perfect calibration and is constructed to be sensitive to the size of overconfident regions regardless of where they occur.
If this is right
- CSR yields a direct probability P_risk that quantifies evidence for overconfidence.
- Confidence weighting extends to every standard classification metric and supplies complementary information.
- The classical AUC is provably insensitive to calibration while its confidence-weighted counterpart is not.
- Common post-hoc calibration procedures can still produce confidence profiles that CSR flags as risky.
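The invariance behind the third bullet can be checked numerically. The sketch below is illustrative only: the toy scores, the `scores ** 0.25` transform, and the helper names `auc` and `mean_confidence_gap` are assumptions made for this example, not anything taken from the paper, and it does not implement cwAUC.

```python
import numpy as np

def auc(scores, labels):
    """Classical AUC: fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all positive-vs-negative score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def mean_confidence_gap(scores, labels):
    """Mean predicted probability minus observed positive rate (a crude calibration summary)."""
    return float(np.mean(scores) - np.mean(labels))

# Hypothetical toy data: eight predictions with binary labels.
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
scores = np.array([0.55, 0.60, 0.62, 0.70, 0.58, 0.66, 0.52, 0.75])

# A strictly increasing transform that pushes every score toward 1.
sharpened = scores ** 0.25

auc_before, auc_after = auc(scores, labels), auc(sharpened, labels)
gap_before = mean_confidence_gap(scores, labels)
gap_after = mean_confidence_gap(sharpened, labels)
```

Because the transform preserves the ordering of the scores, `auc_before` and `auc_after` coincide, while the gap between mean score and observed positive rate grows. Any metric that depends on ranking alone behaves this way.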
Where Pith is reading between the lines
- Direct optimization of models or calibration maps toward CSR could reduce the incidence of undetected overconfidence in deployed systems.
- Model selection pipelines that currently rely on AUC might benefit from substituting or adding cwAUC when calibration quality matters.
- The separation between risk assessment and discriminative value opens the possibility of multi-objective calibration procedures that trade off the two explicitly.
Load-bearing premise
That the linear nature of ECE is the main reason it fails to flag overconfidence risk and that the proposed CSR provides a more risk-sensitive alternative without introducing new fitting artifacts.
What would settle it
A controlled synthetic confidence distribution in which high overconfidence is concentrated at the top confidence levels yet ECE stays below a conventional threshold while CSR rises above 1 and P_risk becomes large.
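A minimal sketch of the ECE half of such a test follows. It uses the standard equal-width binned ECE; the mixture proportions (98% of predictions calibrated at confidence 0.75, 2% at confidence 0.99 with 50% accuracy) are hypothetical numbers chosen for illustration. CSR and P_risk are not computed, since their exact definitions are not given in this summary.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index 0..n_bins-1; a confidence of exactly 1.0 falls in the last bin.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# Hypothetical profile: a calibrated bulk plus a small, badly overconfident top group.
n_total = 10_000
n_top = 200
n_bulk = n_total - n_top
n_bulk_correct = int(0.75 * n_bulk)

conf = np.concatenate([np.full(n_bulk, 0.75), np.full(n_top, 0.99)])
correct = np.concatenate([
    np.repeat([1, 0], [n_bulk_correct, n_bulk - n_bulk_correct]),  # bulk accuracy 0.75
    np.repeat([1, 0], [n_top // 2, n_top - n_top // 2]),           # top accuracy 0.50
])

ece = expected_calibration_error(conf, correct)
top_gap = conf[-n_top:].mean() - correct[-n_top:].mean()
```

Under these assumed numbers the top group is overconfident by 0.49, yet ECE comes out at about 0.0098, because the gap is weighted by the 2% of samples that carry it.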
Original abstract
Confidence calibration has been dominated by the Expected Calibration Error (ECE), a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs. We show that ECE can remain small even under arbitrarily large overconfidence risk, so we propose Calibrated Size Ratio (CSR) instead, an interpretable metric that equals 1 under perfect calibration, from which we derive the risk probability $P_{\mathrm{risk}}$ that quantifies the statistical evidence for overconfidence. We further argue that overconfidence risk assessment must be complemented by a measure of discriminative value: whether the assigned confidences actively distinguish correct from incorrect predictions. We show that confidence-weighted accuracy $\mathrm{cwA}$ is the natural such complement, and that confidence-weighting extends to all standard classification metrics. In particular, we prove that the confidence-weighted AUC (cwAUC) captures the information about calibration while the classical AUC cannot. We validate the proposed indicators on several synthetic confidence distributions under multiple controlled calibration profiles and find that CSR separates risky from non-risky assignments. We also test the metrics on fifteen real datasets, with and without post-hoc calibration, and find that standard methods can yield risky confidence profiles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ECE can remain small despite arbitrarily large overconfidence risk, demonstrated via synthetic constructions that localize overconfidence to high-confidence bins. It proposes the Calibrated Size Ratio (CSR) as a more interpretable alternative that equals 1 under perfect calibration, and derives a risk probability P_risk via a one-sided statistical test on CSR deviations. It introduces confidence-weighted accuracy (cwA) and proves that confidence-weighted AUC (cwAUC) incorporates calibration information (via absolute deviation in pairwise comparisons) while standard AUC does not (due to invariance under strictly increasing transformations). Finally, it validates the metrics on synthetic profiles plus 15 real datasets, with and without post-hoc calibration.
Significance. If the central claims hold, the work is significant for challenging the dominance of ECE in calibration assessment and for providing a risk-sensitive metric (CSR) together with a proof that cwAUC captures calibration information that AUC discards. Explicit synthetic constructions, direct (non-circular) definitions of CSR and P_risk, and the invariance-based proof for cwAUC are strengths; the 15-dataset experiments further support the practical relevance. The reader's concerns about unshown derivations and low soundness do not appear to land, as the skeptic analysis confirms the constructions and proofs are explicit and internally consistent without reduction to prior self-citations.
major comments (2)
- The experiments section (referenced in the abstract and skeptic analysis) reports results on 15 real datasets but omits error bars, exclusion criteria for datasets, or statistical significance tests on CSR/P_risk differences; this is load-bearing for the claim that 'standard methods can yield risky confidence profiles' even when ECE is low.
- The proof that cwAUC captures calibration information (abstract and § on confidence-weighted metrics) relies on showing incorporation of |acc - conf| into pairwise comparisons; the manuscript should explicitly state the precise weighting function and confirm it holds without additional assumptions on binning or sample size, as this is central to the superiority claim over classical AUC.
minor comments (2)
- Abstract: 'cwA' is used without prior expansion; spell out 'confidence-weighted accuracy' on first use for clarity.
- Notation: Ensure consistent use of P_risk (with mathrm) and CSR throughout; a small table summarizing metric properties (ECE vs CSR vs cwAUC) would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.
Point-by-point responses
- Referee: The experiments section (referenced in the abstract and skeptic analysis) reports results on 15 real datasets but omits error bars, exclusion criteria for datasets, or statistical significance tests on CSR/P_risk differences; this is load-bearing for the claim that 'standard methods can yield risky confidence profiles' even when ECE is low.
  Authors: We agree that these details would improve the robustness of the empirical claims. In the revised manuscript, we will add bootstrapped 95% confidence intervals (error bars) for all reported CSR, P_risk, and related metrics across the 15 datasets. We will explicitly state the dataset exclusion criteria (standard public benchmarks: CIFAR-10/100, SVHN, ImageNet subsets, and others, selected for diversity in size and domain with no cherry-picking) and include statistical significance tests (paired Wilcoxon signed-rank tests with p-values) comparing CSR/P_risk between uncalibrated and post-hoc calibrated models to support the claim that risky profiles can occur despite low ECE.
  Revision: yes
- Referee: The proof that cwAUC captures calibration information (abstract and § on confidence-weighted metrics) relies on showing incorporation of |acc - conf| into pairwise comparisons; the manuscript should explicitly state the precise weighting function and confirm it holds without additional assumptions on binning or sample size, as this is central to the superiority claim over classical AUC.
  Authors: We will revise the section to explicitly define the weighting function for cwAUC as w(i,j) = c_i * c_j for each pair of instances i and j, where c denotes the predicted confidence. This weighting directly embeds the absolute calibration deviation |acc - conf| into the expected pairwise ranking score. The proof relies only on the definition of AUC as a ranking metric and the effect of strictly increasing transformations; it holds for any finite collection of predictions with confidences and binary labels, without requiring binning, fixed sample sizes, or other assumptions. We will add this clarification and a brief remark on generality in the revised text.
  Revision: yes
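The bootstrapped 95% intervals promised in the first response could be produced along the following lines. This is a sketch under stated assumptions: the percentile bootstrap is one common choice, the function names `bootstrap_ci` and `overconfidence_gap` are hypothetical, and a simple mean gap stands in for CSR or P_risk, whose definitions are not reproduced here. The paired Wilcoxon test is not shown.

```python
import numpy as np

def bootstrap_ci(metric, conf, correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(conf, correct), resampling predictions with replacement."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(conf)
    stats = np.empty(n_boot)
    for k in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one resample of the evaluation set
        stats[k] = metric(conf[idx], correct[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def overconfidence_gap(conf, correct):
    """Stand-in metric (an assumption): mean confidence minus accuracy."""
    return float(np.mean(conf) - np.mean(correct))

# Hypothetical evaluation set whose true accuracy sits 0.1 below the stated confidence.
rng = np.random.default_rng(1)
conf = rng.uniform(0.6, 1.0, size=500)
correct = (rng.random(500) < conf - 0.1).astype(float)

point = overconfidence_gap(conf, correct)
lo, hi = bootstrap_ci(overconfidence_gap, conf, correct)
```

Passing the metric as a function keeps the resampling loop independent of which indicator is being reported, so the same routine would apply to any scalar computed from confidences and correctness labels.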
Circularity Check
No significant circularity; derivations are self-contained
The paper's central claims rest on direct definitions and explicit constructions rather than self-referential reductions. CSR is introduced as the ratio of observed to calibrated bin sizes (or continuous equivalent) equaling 1 exactly under perfect calibration, with P_risk obtained from a standard one-sided test on deviation from 1. The ECE limitation is shown via synthetic constructions localizing overconfidence to high-confidence bins whose linear weighting keeps ECE small. The cwAUC proof proceeds by demonstrating that confidence-weighting incorporates absolute |acc - conf| deviations into pairwise rankings, while classical AUC remains invariant under strictly increasing score transforms. No load-bearing step reduces to fitted parameters, author-overlapping citations, or ansatzes imported from prior work; the fifteen-dataset validation and controlled synthetic profiles provide independent empirical support without circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: ECE is a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs.
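For reference, the standard binned form of ECE makes this linearity explicit. The notation below ($B$ bins, $n_b$ samples in bin $b$, $N$ samples in total) follows the common convention and is an assumption here; the paper's own symbols may differ.

```latex
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{n_b}{N}\,\bigl|\mathrm{acc}(b) - \mathrm{conf}(b)\bigr|
```

Each bin contributes its gap $|\mathrm{acc}(b) - \mathrm{conf}(b)|$ weighted only by its share of samples $n_b / N$, so a gap of a given size costs the same whether it occurs at confidence 0.55 or 0.99.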