Beyond ECE: Calibrated Size Ratio, Risk Assessment, and Confidence-Weighted Metrics

Fernando Martin-Maroto; Gonzalo G. de Polavieja; Nabil Abderrahaman

arXiv: 2605.01796 · v2 · submitted 2026-05-03 · cs.LG · math.ST · stat.TH

Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3

classification cs.LG · math.ST · stat.TH

keywords confidence calibration · expected calibration error · overconfidence risk · calibrated size ratio · confidence-weighted AUC · machine learning evaluation · post-hoc calibration

The pith

The Expected Calibration Error can remain small even under arbitrarily large overconfidence risk, which motivates the Calibrated Size Ratio as a replacement metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the standard Expected Calibration Error can stay low even when a model's confidence assignments carry high overconfidence risk in wrong predictions. In its place the authors introduce the Calibrated Size Ratio, an interpretable quantity that equals one only under perfect calibration and yields a derived probability of risk. They further establish that weighting accuracy and AUC by the model's reported confidence levels extracts calibration information that the ordinary unweighted versions miss. A sympathetic reader would care because many real-world uses of classifiers rely on the numerical confidence values to decide when to trust or defer a prediction. The authors validate the new indicators on controlled synthetic distributions and on fifteen real datasets both before and after common post-hoc calibration steps.

Core claim

We show that ECE can remain small even under arbitrarily large overconfidence risk. We therefore propose the Calibrated Size Ratio (CSR), an interpretable metric that equals 1 under perfect calibration and from which we derive the risk probability P_risk that quantifies statistical evidence for overconfidence. We also show that confidence-weighted accuracy is the natural complement for measuring discriminative value and prove that the confidence-weighted AUC captures calibration information while the classical AUC cannot. Standard post-hoc methods can still leave risky confidence profiles on real data.

What carries the argument

The Calibrated Size Ratio (CSR), an interpretable scalar that equals 1 under perfect calibration and is constructed to be sensitive to the size of overconfident regions regardless of where they occur.

If this is right

CSR yields a direct probability P_risk that quantifies evidence for overconfidence.
Confidence weighting extends to every standard classification metric and supplies complementary information.
The classical AUC is provably insensitive to calibration while its confidence-weighted counterpart is not.
Common post-hoc calibration procedures can still produce confidence profiles that CSR flags as risky.
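
The AUC-invariance point in the list above can be checked numerically. The sketch below is a minimal illustration in plain Python, not the paper's code: `pairwise_auc` is a hypothetical helper that computes the classical AUC as the fraction of positive-negative pairs ranked correctly, and the scores, labels, and squaring transform are made-up inputs. Squaring is strictly increasing on [0, 1], so it changes every confidence value but leaves the ranking, and therefore the AUC, untouched.

```python
def pairwise_auc(scores, labels):
    """Classical AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half. Hypothetical helper for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


# Made-up scores in [0, 1] with binary labels.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# A strictly increasing transform: every score changes, the order does not.
squared = [s ** 2 for s in scores]

auc_original = pairwise_auc(scores, labels)
auc_squared = pairwise_auc(squared, labels)

# Average amount by which the transform moved the reported confidences.
mean_shift = sum(abs(a - b) for a, b in zip(scores, squared)) / len(scores)
```

Here the two AUC values coincide even though the scores moved by roughly 0.17 on average, which is why a ranking-only metric cannot register a change in calibration.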

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Direct optimization of models or calibration maps toward CSR could reduce the incidence of undetected overconfidence in deployed systems.
Model selection pipelines that currently rely on AUC might benefit from substituting or adding cwAUC when calibration quality matters.
The separation between risk assessment and discriminative value opens the possibility of multi-objective calibration procedures that trade off the two explicitly.

Load-bearing premise

That the linear nature of ECE is the main reason it fails to flag overconfidence risk and that the proposed CSR provides a more risk-sensitive alternative without introducing new fitting artifacts.

What would settle it

A controlled synthetic confidence distribution in which high overconfidence is concentrated at the top confidence levels yet ECE stays below a conventional threshold while CSR rises above 1 and P_risk becomes large.
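
A minimal sketch of the ECE half of such a construction, under stated assumptions: the data are made up and deterministic, `expected_calibration_error` is a hypothetical helper implementing the standard equal-width binned ECE, and the 0.01 reading of "conventional threshold" is only an example. The sketch does not compute CSR or P_risk, whose definitions are not reproduced on this page.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width binned ECE: sum_b (n_b / N) * |acc_b - conf_b|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# 9,900 predictions at confidence 0.75 that are exactly calibrated (7,425 correct),
# plus 100 predictions at confidence 0.999 of which only half are correct.
confidences = [0.75] * 9900 + [0.999] * 100
correct = [1] * 7425 + [0] * 2475 + [1] * 50 + [0] * 50

ece = expected_calibration_error(confidences, correct)

# Compare stated and observed error rates inside the top-confidence region.
top = [(c, ok) for c, ok in zip(confidences, correct) if c > 0.99]
observed_error = 1 - sum(ok for _, ok in top) / len(top)
stated_error = 1 - sum(c for c, _ in top) / len(top)
error_ratio = observed_error / stated_error
```

In this toy profile the ECE is about 0.005, because the miscalibrated region holds only 1% of the samples, while the observed error rate in that region is roughly 500 times the stated one.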

Figures

Figures reproduced from arXiv: 2605.01796 by Fernando Martin-Maroto, Gonzalo G. de Polavieja, Nabil Abderrahaman.

**Figure 2.** ROC and confidence-weighted ROC curves for class 0 of various datasets under Platt

Original abstract

Confidence calibration has been dominated by the Expected Calibration Error (ECE), a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs. We show that ECE can remain small even under arbitrarily large overconfidence risk, so we propose Calibrated Size Ratio (CSR) instead, an interpretable metric that equals 1 under perfect calibration, from which we derive the risk probability $P_{\mathrm{risk}}$ that quantifies the statistical evidence for overconfidence. We further argue that overconfidence risk assessment must be complemented by a measure of discriminative value: whether the assigned confidences actively distinguish correct from incorrect predictions. We show that confidence-weighted accuracy $\mathrm{cwA}$ is the natural such complement, and that confidence-weighting extends to all standard classification metrics. In particular, we prove that the confidence-weighted AUC (cwAUC) captures the information about calibration while the classical AUC cannot. We validate the proposed indicators on several synthetic confidence distributions under multiple controlled calibration profiles and find that CSR separates risky from non-risky assignments. We also test the metrics on fifteen real datasets, with and without post-hoc calibration, and find that standard methods can yield risky confidence profiles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows ECE can miss high overconfidence risk and introduces CSR plus P_risk as direct alternatives, with a clean proof that cwAUC picks up calibration information AUC ignores.

The central point is that ECE stays low even when overconfidence risk is large, because it weights errors linearly by bin size. The authors offer CSR as a ratio that equals exactly 1 only under perfect per-region calibration, and from it they derive P_risk via a simple one-sided test. They also show that weighting accuracy and AUC by confidence turns those metrics into calibration-sensitive ones, with an explicit argument that cwAUC incorporates absolute |acc - conf| deviations into pairwise rankings while plain AUC is invariant to any strictly increasing score transform.

The synthetic constructions make the ECE blind spot concrete by localizing overconfidence to high-confidence bins, and the fifteen-dataset runs confirm that common post-hoc methods can leave confidence profiles that CSR flags as risky even when ECE looks acceptable. The derivations avoid circularity and do not rely on the authors' prior results. The experiments are straightforward and support the separation claim without obvious confounds. One minor limitation is that the write-up could give more explicit guidance on binning choices and on sample-size requirements for stable P_risk estimates in very large models, though the core statements hold without those details.

This work is aimed at engineers who need to decide whether a deployed classifier's confidence scores are safe to trust in practice, and at researchers who want better diagnostic tools than ECE alone. It is worth sending to referees because the ideas are clearly motivated, the supporting constructions are reproducible from the description, and the claims are narrow enough to be checked directly.

Referee Report

2 major / 2 minor

Summary. The paper claims that ECE can remain small despite arbitrarily large overconfidence risk, demonstrated via synthetic constructions that localize overconfidence to high-confidence bins. It proposes the Calibrated Size Ratio (CSR) as a more interpretable alternative that equals 1 under perfect calibration, and derives a risk probability P_risk via a one-sided statistical test on CSR deviations. It introduces confidence-weighted accuracy (cwA) and proves that confidence-weighted AUC (cwAUC) incorporates calibration information (via absolute deviation in pairwise comparisons) while standard AUC does not (due to its invariance under strictly increasing transformations). Finally, it validates the metrics on synthetic profiles plus 15 real datasets, with and without post-hoc calibration.

Significance. If the central claims hold, the work is significant for challenging the dominance of ECE in calibration assessment and for providing a risk-sensitive metric (CSR) together with a proof that cwAUC captures calibration information that AUC discards. Explicit synthetic constructions, direct (non-circular) definitions of CSR and P_risk, and the invariance-based proof for cwAUC are strengths; the 15-dataset experiments further support the practical relevance. The reader's concerns about unshown derivations and low soundness do not appear to land, as the skeptic analysis confirms the constructions and proofs are explicit and internally consistent without reduction to prior self-citations.

major comments (2)

The experiments section (referenced in the abstract and skeptic analysis) reports results on 15 real datasets but omits error bars, exclusion criteria for datasets, or statistical significance tests on CSR/P_risk differences; this is load-bearing for the claim that 'standard methods can yield risky confidence profiles' even when ECE is low.
The proof that cwAUC captures calibration information (abstract and § on confidence-weighted metrics) relies on showing incorporation of |acc - conf| into pairwise comparisons; the manuscript should explicitly state the precise weighting function and confirm it holds without additional assumptions on binning or sample size, as this is central to the superiority claim over classical AUC.

minor comments (2)

Abstract: 'cwA' is used without prior expansion; spell out 'confidence-weighted accuracy' on first use for clarity.
Notation: Ensure consistent use of P_risk (with mathrm) and CSR throughout; a small table summarizing metric properties (ECE vs CSR vs cwAUC) would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Referee: The experiments section (referenced in the abstract and skeptic analysis) reports results on 15 real datasets but omits error bars, exclusion criteria for datasets, or statistical significance tests on CSR/P_risk differences; this is load-bearing for the claim that 'standard methods can yield risky confidence profiles' even when ECE is low.

Authors: We agree that these details would improve the robustness of the empirical claims. In the revised manuscript, we will add bootstrapped 95% confidence intervals (error bars) for all reported CSR, P_risk, and related metrics across the 15 datasets. We will explicitly state the dataset exclusion criteria (standard public benchmarks: CIFAR-10/100, SVHN, ImageNet subsets, and others, selected for diversity in size and domain with no cherry-picking) and include statistical significance tests (paired Wilcoxon signed-rank tests with p-values) comparing CSR/P_risk between uncalibrated and post-hoc calibrated models to support the claim that risky profiles can occur despite low ECE. revision: yes
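
The bootstrapped 95% intervals promised in this response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' procedure: `bootstrap_ci` and `overconfidence_gap` are hypothetical helpers, the percentile bootstrap is one common choice among several, and the 200-sample evaluation set is made up. The gap metric stands in for CSR and P_risk, which are not defined on this page.

```python
import random


def bootstrap_ci(records, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(records)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(records)
    stats = []
    for _ in range(n_boot):
        # Resample the evaluation set with replacement, keeping its size.
        resample = [records[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(resample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def overconfidence_gap(records):
    """Mean confidence minus accuracy; positive values mean overconfidence."""
    mean_conf = sum(c for c, _ in records) / len(records)
    accuracy = sum(ok for _, ok in records) / len(records)
    return mean_conf - accuracy


# Made-up evaluation set: 200 predictions at confidence 0.9, 160 of them correct.
records = [(0.9, 1)] * 160 + [(0.9, 0)] * 40

point = overconfidence_gap(records)
low, high = bootstrap_ci(records, overconfidence_gap)
```

Any per-dataset metric that maps a list of (confidence, correct) records to a number can be passed as `metric`, so the same resampling loop would serve for every indicator reported across the datasets.
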
Referee: The proof that cwAUC captures calibration information (abstract and § on confidence-weighted metrics) relies on showing incorporation of |acc - conf| into pairwise comparisons; the manuscript should explicitly state the precise weighting function and confirm it holds without additional assumptions on binning or sample size, as this is central to the superiority claim over classical AUC.

Authors: We will revise the section to explicitly define the weighting function for cwAUC as w(i,j) = c_i * c_j for each pair of instances i and j, where c denotes the predicted confidence. This weighting directly embeds the absolute calibration deviation |acc - conf| into the expected pairwise ranking score. The proof relies only on the definition of AUC as a ranking metric and the effect of strictly increasing transformations; it holds for any finite collection of predictions with confidences and binary labels, without requiring binning, fixed sample sizes, or other assumptions. We will add this clarification and a brief remark on generality in the revised text. revision: yes
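
To make the pair weight in this response concrete, here is one possible reading in plain Python. It is an illustration only: `pair_weighted_auc` is a hypothetical helper, taking c_i = max(s_i, 1 - s_i) as the confidence of a binary prediction and normalizing by the total pair weight are both assumptions, and the paper's actual cwAUC definition may differ. The scores and labels are made up.

```python
def pair_weighted_auc(scores, labels, confidences=None):
    """AUC over (positive, negative) pairs with pair weight w(i, j) = c_i * c_j.

    With confidences=None every weight is 1 and this is the classical AUC.
    Normalizing by the total pair weight is an assumption of this sketch.
    """
    if confidences is None:
        confidences = [1.0] * len(scores)
    pos = [(s, c) for s, c, y in zip(scores, confidences, labels) if y == 1]
    neg = [(s, c) for s, c, y in zip(scores, confidences, labels) if y == 0]
    num = 0.0
    den = 0.0
    for s_i, c_i in pos:
        for s_j, c_j in neg:
            w = c_i * c_j
            den += w
            if s_i > s_j:
                num += w
            elif s_i == s_j:
                num += 0.5 * w
    return num / den


def binary_confidence(scores):
    """Assumed confidence of a binary prediction: max(s, 1 - s)."""
    return [max(s, 1.0 - s) for s in scores]


scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
squared = [s ** 2 for s in scores]  # strictly increasing transform on [0, 1]

auc_plain = pair_weighted_auc(scores, labels)
auc_plain_sq = pair_weighted_auc(squared, labels)
auc_weighted = pair_weighted_auc(scores, labels, binary_confidence(scores))
auc_weighted_sq = pair_weighted_auc(squared, labels, binary_confidence(squared))
```

Under these assumptions the unweighted value is identical before and after the transform, while the weighted value shifts because the weights depend on the confidence values themselves and not only on their order.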

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained

The paper's central claims rest on direct definitions and explicit constructions rather than self-referential reductions. CSR is introduced as the ratio of observed to calibrated bin sizes (or continuous equivalent) equaling 1 exactly under perfect calibration, with P_risk obtained from a standard one-sided test on deviation from 1. The ECE limitation is shown via synthetic constructions localizing overconfidence to high-confidence bins whose linear weighting keeps ECE small. The cwAUC proof proceeds by demonstrating that confidence-weighting incorporates absolute |acc - conf| deviations into pairwise rankings, while classical AUC remains invariant under strictly increasing score transforms. No load-bearing step reduces to fitted parameters, author-overlapping citations, or ansatzes imported from prior work; the fifteen-dataset validation and controlled synthetic profiles provide independent empirical support without circularity.
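
For readers unfamiliar with how a one-sided test can quantify evidence for overconfidence, the following sketch uses an exact binomial lower tail. It is a generic textbook test, not the paper's P_risk, whose construction from CSR is not reproduced on this page; `binomial_lower_tail` and the counts are hypothetical.

```python
from math import comb


def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical high-confidence region: 100 predictions at stated confidence 0.99.
n, stated_conf = 100, 0.99

# Null hypothesis: each prediction is correct with probability 0.99.
# A small lower-tail probability is evidence that the confidence is too high.
p_value_80_correct = binomial_lower_tail(80, n, stated_conf)  # 80 of 100 correct
p_value_99_correct = binomial_lower_tail(99, n, stated_conf)  # 99 of 100 correct
```

With 99 correct out of 100 the tail probability is about 0.63, which gives no evidence against the stated confidence, whereas 80 correct out of 100 gives a vanishingly small value.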

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work rests on the domain assumption that calibration assessment should be sensitive to confidence level and that weighting by confidence adds discriminative information.

axioms (1)

domain assumption: ECE is a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs.
Explicitly stated as the starting point for proposing CSR.

