Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models

Andreas Rauh; Jelke Wibbeke; Nico Schönfisch; Sebastian Rohjans

arxiv: 2508.17761 · v3 · submitted 2025-08-25 · 💻 cs.LG · stat.ML


Pith reviewed 2026-05-18 20:43 UTC · model grok-4.3


keywords: calibration metrics, regression, uncertainty quantification, recalibration, evaluation inconsistency, ENCE, CWC, safety-critical applications


The pith

Calibration metrics for regression models frequently disagree when evaluating the same recalibration results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that different calibration metrics for regression uncertainty estimates often lead to conflicting or even contradictory assessments of how well a model has been recalibrated. This matters because safety-critical applications rely on trustworthy uncertainty estimates, and inconsistent metrics undermine the ability to judge model reliability with confidence. The authors systematically categorize metrics from the literature and test them through controlled experiments on real-world, synthetic, and artificially miscalibrated data. Their analysis shows that this inconsistency potentially allows cherry-picking of favorable metrics. They identify ENCE and CWC as the most dependable metrics in their tests.

Core claim

Through controlled experiments with real-world, synthetic and artificially miscalibrated data, calibration metrics frequently produce conflicting results. Many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. The Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) are identified as the most dependable metrics in the tests.
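For readers unfamiliar with ENCE, the following is a minimal sketch of the metric as it is commonly defined in the literature: samples are binned by predicted standard deviation, and within each bin the root mean variance is compared with the root mean squared error. The function name `ence`, the equal-count binning, and the default of 10 bins are illustrative assumptions, not the paper's implementation.

```python
import math


def ence(y_true, y_pred, sigma, n_bins=10):
    """Expected Normalized Calibration Error with equal-count bins over sigma.

    Assumes every predicted standard deviation in `sigma` is strictly positive.
    """
    # Sort sample indices by predicted standard deviation.
    order = sorted(range(len(sigma)), key=lambda i: sigma[i])
    n_bins = min(n_bins, len(order))
    total = 0.0
    for b in range(n_bins):
        # Equal-count binning: contiguous chunks of the sorted indices.
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        # Root mean variance: what the model claims its error should be.
        rmv = math.sqrt(sum(sigma[i] ** 2 for i in idx) / len(idx))
        # Root mean squared error: what the error actually was.
        rmse = math.sqrt(sum((y_true[i] - y_pred[i]) ** 2 for i in idx) / len(idx))
        total += abs(rmv - rmse) / rmv
    return total / n_bins


# Toy check: residuals whose size matches sigma exactly, then sigma halved.
y_true = [1.0, -1.0, 2.0, -2.0]
y_pred = [0.0, 0.0, 0.0, 0.0]
matched = ence(y_true, y_pred, [1.0, 1.0, 2.0, 2.0], n_bins=2)
halved = ence(y_true, y_pred, [0.5, 0.5, 1.0, 1.0], n_bins=2)
print(matched, halved)
```

In this toy case the matched uncertainties give a lower ENCE than the halved (overconfident) ones. Results on real data depend on the binning scheme, which is one reason implementations of the same metric can differ.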

What carries the argument

Systematic categorization and independent benchmarking of regression calibration metrics on controlled datasets, independent of specific modeling methods or recalibration approaches.

If this is right

Recalibration methods may appear successful under one metric but fail under others, so reported improvements may not generalize across notions of calibration.
Researchers could select only favorable metrics to present misleading impressions of recalibration success.
Metric choice becomes a decisive factor in how calibration research outcomes are interpreted and compared.
ENCE and CWC offer more stable evaluations than many alternatives in the tested scenarios.
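The kind of disagreement described in the points above can be illustrated with a small sketch. It assumes the classic form of the Coverage Width-based Criterion, CWC = NMPIW · (1 + γ · exp(−η · (PICP − μ))) with γ = 1 only when coverage falls below the nominal level μ. The toy targets, the interval half-widths, and the value η = 50 are hypothetical choices for illustration, and the paper may use a different variant.

```python
import math


def picp(y, lower, upper):
    """Prediction interval coverage probability: share of targets inside their interval."""
    return sum(lo <= t <= hi for t, lo, hi in zip(y, lower, upper)) / len(y)


def nmpiw(y, lower, upper):
    """Mean interval width, normalized by the range of the targets (a sharpness measure)."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(y)
    return mean_width / (max(y) - min(y))


def cwc(y, lower, upper, nominal=0.9, eta=50.0):
    """Coverage Width-based Criterion; the exponential penalty applies only below nominal coverage."""
    coverage = picp(y, lower, upper)
    gamma = 1.0 if coverage < nominal else 0.0
    return nmpiw(y, lower, upper) * (1.0 + gamma * math.exp(-eta * (coverage - nominal)))


# Hypothetical targets; the point prediction is 0 everywhere.
y = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 1.5, 2.0]


def intervals(half_width):
    """Symmetric intervals around the point prediction 0."""
    return [-half_width] * len(y), [half_width] * len(y)


before = intervals(1.2)  # narrow intervals, coverage below the 0.9 target
after = intervals(3.0)   # widened intervals after a hypothetical recalibration

for name, (lo, hi) in {"before": before, "after": after}.items():
    print(name, picp(y, lo, hi), nmpiw(y, lo, hi), cwc(y, lo, hi))
```

Here the widened intervals improve the coverage gap and CWC but worsen the sharpness measure, so a report that cites only one of these numbers could suggest either success or failure.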

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future studies might benefit from reporting results across a standardized set of multiple metrics to limit selective interpretation.
Developing metrics with stronger mutual agreement could address the root cause of conflicting conclusions.
Practical deployment of calibrated models may need ensemble checks using several metrics rather than any single one.

Load-bearing premise

Controlled experiments using artificially miscalibrated data can reliably expose the strengths and weaknesses of calibration metrics as they would behave on real-world data.

What would settle it

Demonstrating consistent agreement across a wide range of calibration metrics on diverse real-world recalibrated regression models would challenge the finding of substantial inconsistencies.

Figures

Figures reproduced from arXiv: 2508.17761 by Andreas Rauh, Jelke Wibbeke, Nico Schönfisch, Sebastian Rohjans.

**Figure 2.** Comparing the correlation reveals two distinct groups that show internal agreement in how they assess model …

**Figure 2.** Heatmap of the Spearman’s rank correlation coefficients using the real-world data. The evaluation metrics …

**Figure 3.** Normalized metric values for the deep ensembles across 16 real-world datasets. The values are normalized …

**Figure 4.** Heatmap of the Spearman’s rank correlation coefficients using synthetic data. The evaluation metrics label …

**Figure 5.** Normalized metric values for the deep ensembles across 10 synthetic datasets. The values are normalized …

**Figure 6.** Relative change in metric value after model recalibration. Each heatmap shows whether a metric detects a …

**Figure 7.** Relative change in metric values after artificially miscalibrating predictions across four scenarios. Each …

**Figure 8.** Stochastic evaluation of metric sensitivity under controlled miscalibration. The heatmap shows the fraction …
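The rank-correlation heatmaps named in the captions above rest on Spearman's rank correlation between metric scores. The following sketch shows how such a pairwise matrix could be computed; the metric names and score values are hypothetical placeholders, not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores of three metrics on five models.
scores = {
    "metric_a": [0.10, 0.20, 0.30, 0.40, 0.50],
    "metric_b": [1.0, 4.0, 9.0, 16.0, 25.0],  # monotone in metric_a: same ranking
    "metric_c": [5.0, 4.0, 3.0, 2.0, 1.0],    # reversed ranking
}
names = list(scores)
matrix = {(p, q): spearman(scores[p], scores[q]) for p in names for q in names}
for p in names:
    print(p, [round(matrix[(p, q)], 2) for q in names])
```

Metrics that rank models identically get a coefficient of 1 regardless of scale, while a reversed ranking gives −1, which is why rank correlation suits metrics defined on different scales.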

Original abstract

In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Calibration metrics for regression often disagree on the same results, but the artificial miscalibration experiments may overstate how common or severe those disagreements are in practice.


The main thing to know is that this paper finds calibration metrics for regression models frequently disagree when scoring the same recalibration, sometimes reaching opposite conclusions. That observation flags a real risk of cherry-picking metrics in safety-critical work. They back it with controlled tests across real, synthetic, and artificially adjusted data, and they flag ENCE and CWC as more consistent performers in their runs.

The systematic pull-together of metrics from the literature and the independent benchmarking without locking to one model or method is the clearest new piece here. It gives a practical map of how different definitions and scales play out, which prior papers often skipped by testing only a narrow set. The experiments are set up to isolate metric behavior, which is a reasonable way to surface the inconsistencies.

The softer spot is the reliance on artificial miscalibration steps like variance scaling or mean shifts. Those perturbations can create clean, low-dimensional trade-offs between sharpness and coverage that naturally trained regressors rarely produce in the same way. If the joint distribution of errors and uncertainties in the artificial cases does not match what shows up after real training, the reported conflict rate could be inflated. The abstract mentions real-world data as well, but the headline conflicts seem tied to the controlled artificial setups.

This paper is aimed at people who build or evaluate regression models where uncertainty quality affects decisions. A reader who needs to pick or defend a calibration metric would get direct value from the comparisons and the warning. It is worth sending for peer review so the experimental construction can be checked against more natural miscalibration patterns.
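The variance-scaling and mean-shift perturbations discussed in the letter can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's procedure: predictions are treated as Gaussian, the helper names `perturb` and `coverage`, the scale factors, and the shift of 1.0 are all hypothetical, and the data are simulated.

```python
import random
from statistics import NormalDist


def perturb(mu, sigma, scale=1.0, shift=0.0):
    """Artificially miscalibrate Gaussian predictions: scale every sigma, shift every mean."""
    return [m + shift for m in mu], [s * scale for s in sigma]


def coverage(y, mu, sigma, level=0.9):
    """Empirical coverage of the central Gaussian interval at the given nominal level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return sum(abs(t - m) <= z * s for t, m, s in zip(y, mu, sigma)) / len(y)


# Simulated, perfectly calibrated predictions with heteroscedastic noise.
rng = random.Random(0)
n = 5000
sigma = [rng.uniform(0.5, 2.0) for _ in range(n)]
mu = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

calibrated = coverage(y, mu, sigma)
overconfident = coverage(y, *perturb(mu, sigma, scale=0.5))   # sigma too small
underconfident = coverage(y, *perturb(mu, sigma, scale=2.0))  # sigma too large
biased = coverage(y, *perturb(mu, sigma, shift=1.0))          # means shifted

print(calibrated, overconfident, underconfident, biased)
```

Because one global factor or offset is applied to every sample, the resulting miscalibration is uniform across inputs. That uniformity is the property the letter contrasts with the input-dependent patterns a trained regressor may produce.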

Referee Report

2 major / 2 minor

Summary. The paper systematically categorizes regression calibration metrics from the literature and evaluates them through controlled experiments on real-world, synthetic, and artificially miscalibrated data. It claims that these metrics frequently disagree or yield contradictory verdicts on identical recalibration outcomes, raising concerns about cherry-picking, and identifies ENCE and CWC as the most reliable based on the tests.

Significance. If the inconsistencies are robust, the work would highlight a methodological weakness in how calibration is assessed across studies, potentially improving standards for uncertainty quantification in safety-critical regression applications. The empirical benchmarking approach, independent of specific models, is a strength if the experimental construction holds.

major comments (2)

[Experimental setup / controlled experiments section] Experimental setup (artificial miscalibration): The perturbations such as uniform variance scaling and mean shifts may generate miscalibration patterns whose joint error-uncertainty distribution differs from those arising in naturally trained regressors (e.g., input-dependent heteroscedasticity or quantile distortions correlated with features). This risks inflating reported metric disagreements; a direct comparison to naturally miscalibrated models from standard training procedures is needed to confirm the inconsistencies indict the metrics rather than the construction.
[Results / analysis of inconsistencies] Results on conflicting verdicts: The claim that 'some even indicate contradictory conclusions' requires explicit quantification (e.g., percentage of recalibration cases where one metric improves while another worsens, with statistical significance). Without this, it is unclear whether the inconsistencies are frequent enough to undermine typical recalibration evaluations.

minor comments (2)

[Metric categorization] Clarify the exact definitions and formulas for all categorized metrics in a dedicated table or appendix to aid reproducibility.
[Abstract and results] The abstract mentions 'real-world, synthetic and artificially miscalibrated data' but the relative contribution of each to the inconsistency findings should be broken out more clearly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and robustness of our analysis. We address each major comment in detail below.


Referee: [Experimental setup / controlled experiments section] Experimental setup (artificial miscalibration): The perturbations such as uniform variance scaling and mean shifts may generate miscalibration patterns whose joint error-uncertainty distribution differs from those arising in naturally trained regressors (e.g., input-dependent heteroscedasticity or quantile distortions correlated with features). This risks inflating reported metric disagreements; a direct comparison to naturally miscalibrated models from standard training procedures is needed to confirm the inconsistencies indict the metrics rather than the construction.

Authors: We agree that artificial perturbations such as uniform variance scaling and mean shifts may produce joint error-uncertainty distributions that differ from those in naturally trained regressors. While our experiments already incorporate real-world datasets (where miscalibrations arise from standard training) and synthetic data, we acknowledge that a more explicit comparison would strengthen the claim that metric inconsistencies are not an artifact of the artificial construction. In the revised manuscript we will add a subsection that directly compares the distributions and metric behaviors under artificial perturbations versus naturally miscalibrated models obtained from standard training procedures on the same datasets. revision: yes
Referee: [Results / analysis of inconsistencies] Results on conflicting verdicts: The claim that 'some even indicate contradictory conclusions' requires explicit quantification (e.g., percentage of recalibration cases where one metric improves while another worsens, with statistical significance). Without this, it is unclear whether the inconsistencies are frequent enough to undermine typical recalibration evaluations.

Authors: We agree that explicit quantification would make the prevalence of conflicting verdicts clearer. In the revised results section we will add a table reporting the percentage of recalibration cases (across all datasets, models, and recalibration methods) in which one metric registers improvement while another registers degradation or no change. We will also include statistical significance assessments using paired Wilcoxon tests or bootstrap confidence intervals to evaluate whether the observed disagreements are systematic rather than due to sampling variability. revision: yes
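The quantification requested here can be sketched as follows. The example computes the share of recalibration cases in which two metrics move in opposite directions, together with a percentile bootstrap confidence interval. The delta values are hypothetical, only strictly opposite signs are counted as conflicts (the "no change" case is left out), and a bootstrap is shown instead of the paired Wilcoxon test because it needs only the standard library.

```python
import random


def conflict_rate(delta_a, delta_b):
    """Fraction of cases where two metrics move in opposite directions.

    Each delta is (metric after recalibration - metric before); for
    lower-is-better metrics a negative delta means improvement.
    """
    conflicts = [da * db < 0 for da, db in zip(delta_a, delta_b)]
    return sum(conflicts) / len(conflicts)


def bootstrap_ci(delta_a, delta_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the conflict rate."""
    rng = random.Random(seed)
    n = len(delta_a)
    rates = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping the two metrics paired.
        idx = [rng.randrange(n) for _ in range(n)]
        rates.append(conflict_rate([delta_a[i] for i in idx],
                                   [delta_b[i] for i in idx]))
    rates.sort()
    lower = rates[int(alpha / 2 * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical metric changes for eight recalibration cases.
delta_a = [-0.10, -0.05, -0.20, 0.03, -0.08, -0.01, 0.04, -0.12]
delta_b = [0.02, -0.04, 0.10, 0.05, -0.03, 0.06, -0.02, -0.09]

rate = conflict_rate(delta_a, delta_b)
ci_low, ci_high = bootstrap_ci(delta_a, delta_b)
print(rate, ci_low, ci_high)
```

With so few cases the interval is wide, which shows why a reported conflict percentage is hard to interpret without an accompanying uncertainty estimate.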

Circularity Check

0 steps flagged

No circularity: empirical benchmarking of calibration metrics on controlled data


The paper performs an empirical analysis by cataloging regression calibration metrics from the literature and evaluating them through direct comparisons on real-world, synthetic, and artificially miscalibrated datasets. Central claims about metric inconsistencies and contradictory conclusions arise from observable differences in metric outputs under these experiments, without any derivation chains, fitted parameters renamed as predictions, or self-referential definitions that reduce the results to their inputs by construction. The study is self-contained against external benchmarks, relying on reproducible experimental disagreements rather than theoretical constructs or self-citations that bear the load of the main findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no free parameters, invented entities, or ad-hoc axioms are introduced. The work relies on standard domain assumptions about calibration definitions and the validity of synthetic miscalibration.

axioms (1)

Domain assumption: Calibration metrics can be meaningfully compared and benchmarked independently of any specific modeling method or recalibration technique.
This premise enables the paper's core experimental design of testing metrics in isolation.


