Enhancing Deep Neural Network Reliability with Refinement and Calibration

Ajay Shastry; Chetan Arora; Ramya Hebbalaguppe; Soumya Suvra Ghosal

arxiv: 2605.23249 · v1 · pith:E2MU55ZXnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-25 04:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords deep neural networks · model calibration · refinement · reliability · supervised contrastive learning · expected calibration error · class imbalance

The pith

A new loss function and RefCal framework jointly optimize calibration, refinement, and accuracy to make deep neural network confidence estimates more reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that deep neural networks can produce more trustworthy confidence estimates when trained with a loss that explicitly promotes refinement alongside calibration and accuracy. Refinement requires the model to assign markedly different confidence scores to correct versus incorrect predictions, a property that many calibration techniques erode. The authors introduce a novel loss optimized through supervised contrastive learning and embed it in the RefCal framework for joint optimization. A sympathetic reader would care because post-processing calibration often masks rather than fixes unreliable uncertainty, limiting trust in model decisions on real tasks.

Core claim

The paper claims that its RefCal framework, built around a novel loss function that promotes refinement and is optimized through supervised contrastive learning, jointly optimizes calibration, refinement, and accuracy during training. This approach addresses the observed trade-off where existing calibration methods improve one metric at the expense of the other, yielding more reliable uncertainty estimates on imbalanced datasets without depending on post-processing.

What carries the argument

RefCal, the unified training framework that uses a refinement-promoting loss function optimized via supervised contrastive learning to balance calibration, refinement, and accuracy.

If this is right

Models will assign confidence scores that are both well-aligned with correctness probability and sharply different for correct versus incorrect predictions.
The common trade-off between calibration and refinement will be reduced, allowing simultaneous gains in both.
Reliance on separate post-processing steps for calibration will decrease because uncertainty quality improves during training.
Performance on long-tailed or imbalanced data will show better overall reliability in terms of accuracy, refinement, and calibration error.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The joint optimization approach could be tested on tasks beyond image classification, such as natural language processing, to check if the refinement benefit holds more broadly.
This suggests that contrastive objectives might be adapted for other uncertainty-related goals like selective classification or out-of-distribution detection.
In deployed systems, the method might allow simpler pipelines by reducing the need for separate calibration modules after training.

Load-bearing premise

Optimizing the novel loss through supervised contrastive learning produces genuine improvements in the model's uncertainty estimation rather than merely improving post-hoc metrics on the evaluation set.

What would settle it

Training a model with RefCal on a new dataset or architecture and finding that both refinement and calibration metrics fail to improve over a baseline trained with standard loss plus post-hoc calibration.
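
The baseline in this test, standard loss plus post-hoc calibration, is commonly instantiated with temperature scaling (the CE+TS entry named in the figure captions). A minimal sketch follows, assuming held-out logits and labels are available. The helper names, the grid range, and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def log_prob_of_label(logits, label, temperature):
    """Log-softmax probability of `label` after dividing logits by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return scaled[label] - log_norm


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the labels at a given temperature."""
    total = sum(log_prob_of_label(row, y, temperature)
                for row, y in zip(logit_rows, labels))
    return -total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the grid temperature that minimizes held-out NLL (grid is an assumption)."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 101)]  # 0.05, 0.10, ..., 5.00
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy held-out set: an overconfident binary classifier that is right 3 times out of 4.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]

temperature = fit_temperature(val_logits, val_labels)
print("fitted temperature:", temperature)
print("NLL before:", mean_nll(val_logits, val_labels, 1.0))
print("NLL after: ", mean_nll(val_logits, val_labels, temperature))
```

Dividing logits by a positive temperature does not change their argmax, so this baseline leaves accuracy untouched. Any accuracy difference against RefCal therefore has to come from training rather than from the calibration step.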

Figures

Figures reproduced from arXiv: 2605.23249 by Ajay Shastry, Chetan Arora, Ramya Hebbalaguppe, Soumya Suvra Ghosal.

**Figure 1.** [Why RefCal?]: Our proposed training regime, RefCal, optimizes both refinement and calibration. (a) shows that calibrating a ResNet-50 model with MMCE (Kumar et al., 2018) on CIFAR100-LT lowers the separation between the confidence values of the correctly and incorrectly predicted classes (red and blue). To this end, when we use the proposed proxy refinement loss along with calibration …

**Figure 2.** Effect of RefCal Training: Spider plot comparing Top-1% Accuracy, …

**Figure 3.** (Left) AUC vs. ECE trade-off and (Right) Top 1% accuracy vs. ECE trade-off for ResNet50 on CIFAR100-LT (IF=10%). Higher AUC and Top 1% accuracy with lower ECE (top-left) are desirable. Lower variance in AUC and ECE highlights the reliability of RefCal variants. […] approach to non-calibrated/directly calibrated baseline models. Notice a jump in accuracy of over 10% in every case; similarly, on AUC we do see a …

**Figure 4.** [Grad-CAM]: Our proposed training regime, RefCal, allows for joint optimization of calibration and refinement. Grad-CAM visualizations for ResNet-18 trained on ImageNet-LT, using a particular calibration technique (column titled Calibration Loss (C.L.)), and then jointly optimizing with the same calibration but adding our refinement loss (columns titled RefCal, i.e., columns 3, 5, and 7). Her…

**Figure 5.** [Robustness of RefCal to varying degrees of corruption for the Brightness distortion from Hendrycks & Dietterich (2018)]: We report results on CIFAR100-C using ResNet50 (He et al., 2016). We observe that RefCal variants offer superior performance in both Top 1% accuracy and AUROC in comparison with any train-time calibration and SOTA refinement methods taken independently. Our approach shows relatively l…

**Figure 6.** (Left) AUC vs. ECE trade-off and (Right) Top 1% accuracy vs. ECE trade-off. We use ResNet-50 trained on CIFAR100-LT (imbalance factor 10%). Higher AUC and lower ECE are better. Higher Top 1% accuracy with lower calibration error (top-left location) is desirable. The lower variance in AUC and ECE in the plot emphasizes the reliability of RefCal variants. Note that RefCal+CE+TS performs the best. We have run …

**Figure 7.** [Why RefCal? Example 2]: Our proposed training regime, RefCal, allows for joint optimization of calibration and refinement. (a) shows that calibrating a ResNet-50 model with MbLS (Liu et al., 2022) on CIFAR100-LT lowers the separation between the confidence values of the two classes (red and blue). To overcome this, we jointly train using the proposed proxy refinement loss along with c…

**Figure 8.** [Grad-CAM]: Our proposed training regime, RefCal, allows for joint optimization of calibration and refinement. Grad-CAM visualizations for ResNet-18 trained on ImageNet-LT, using a particular calibration technique (column titled Calibration Loss (C.L.)), and then jointly optimizing with the same calibration but adding our refinement loss (columns titled RefCal, i.e., columns 3, 5, and 7). Her…

Original abstract

Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially compromising user trust in their decisions. This has motivated research on calibrated models, where calibration measures how well a model's predicted confidence aligns with the empirical probability of correctness. However, calibration metrics can often be improved through post-processing techniques that merely mimic training-time uncertainty without genuinely improving the model's understanding. For this reason, statisticians recommend that models be not only calibrated but also refined. Intuitively, a model is considered more refined if it assigns significantly different confidence scores to correct and incorrect predictions, a property also referred to as sharpness. We observe that many existing calibration methods improve calibration at the cost of reduced refinement. To address this limitation, we propose: (1) a novel loss function that explicitly promotes refinement and can be optimized through supervised contrastive learning; and (2) a unified training framework, RefCal, that jointly optimizes calibration, refinement, and accuracy to improve DNN reliability. On the CIFAR-100-LT dataset with 10 percent class imbalance, RefCal achieves (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), substantially outperforming the widely used Correctness Ranking Loss, which achieves (46.27, 93.7, 0.22).
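
As a rough illustration of the two properties the abstract distinguishes, the sketch below computes a binned expected calibration error and an AUROC-style refinement score on toy predictions. It is not the paper's evaluation code. The function names, the 15-bin equal-width ECE, and the choice to score refinement as the AUROC of confidence for detecting correct predictions are assumptions made here; the figure captions plot AUC/AUROC, which suggests but does not confirm this reading.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        bins[b].append((c, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def refinement_auroc(confidences, correct):
    """AUROC of confidence as a detector of correct predictions (ties count half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")  # undefined when one group is empty
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions with identical accuracy (3 of 4 correct).
correct = [True, True, True, False]
flat = [0.75, 0.75, 0.75, 0.75]   # same confidence for every prediction
sharp = [0.95, 0.90, 0.85, 0.30]  # correct predictions receive higher confidence

for name, conf in [("flat", flat), ("sharp", sharp)]:
    print(name, expected_calibration_error(conf, correct), refinement_auroc(conf, correct))
```

Both toy confidence vectors have the same accuracy. The flat one has zero binned calibration error yet an AUROC of 0.5, meaning no separation at all between correct and incorrect predictions, which is the gap between calibration and refinement that the abstract describes.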

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main claim is a new supervised contrastive loss inside RefCal that jointly targets accuracy, calibration, and refinement, but the large accuracy lift makes it unclear whether the refinement gain is independent or mostly a side effect.

The abstract describes a supervised contrastive loss meant to push correct and incorrect predictions apart in confidence space, plus a joint training setup called RefCal that also handles calibration. That combination is presented as new. On CIFAR-100-LT with 10% imbalance it reports accuracy 58.81, refinement 95.67, ECE 0.08 against a correctness-ranking baseline at 46.27 / 93.7 / 0.22. The numbers are the clearest thing the abstract gives us. The work does a service by reminding readers that post-hoc calibration can mask poor refinement and that training-time fixes are worth trying.

The main objection holds up: the accuracy gap is large enough that it could be driving most of the ECE and refinement improvement, and nothing in the visible text shows an equation that penalizes low separation between correct and incorrect cases once accuracy is controlled for. Without the loss derivation, ablations, or error analysis, it is impossible to tell whether the contrastive term adds anything beyond representation regularization.

The paper is aimed at people who deploy DNNs on imbalanced data and care about both calibration and sharpness. A reader who already works on contrastive losses or reliability metrics might pick up the idea, but only after seeing the full method and controls. It is worth sending to peer review so the claims can be checked against the actual loss and experiments rather than the abstract alone.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes RefCal, a unified training framework for deep neural networks that introduces a novel loss function explicitly promoting refinement (sharp separation of confidence scores between correct and incorrect predictions) and optimizes it jointly with calibration and accuracy via supervised contrastive learning. On the CIFAR-100-LT dataset with 10% class imbalance, RefCal is reported to achieve (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), outperforming the Correctness Ranking Loss baseline at (46.27, 93.7, 0.22).

Significance. If the central claim holds—that the contrastive loss produces intrinsic refinement gains independent of accuracy improvements—this would address a noted limitation of post-hoc calibration methods and provide a training-time approach to more reliable uncertainty estimates. The explicit joint optimization of the three properties is a potentially useful direction, though the single-dataset evaluation limits broader impact assessment.

major comments (2)

[Abstract] No equation or derivation of the novel loss function is provided, so it cannot be verified whether the loss explicitly penalizes low separation between correct/incorrect confidences (as claimed) or primarily acts as a representation regularizer whose effect on refinement and ECE is incidental to the large accuracy gain (58.81 vs 46.27).
[Abstract] Results are reported on only a single dataset (CIFAR-100-LT with 10% imbalance); without additional datasets, ablations isolating the contrastive term's contribution to refinement, or analysis showing gains are independent of hyperparameter choices tuned to these metrics, the general claim of a unified framework improving reliability remains unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and strengthen the empirical support.

Referee: [Abstract] No equation or derivation of the novel loss function is provided, so it cannot be verified whether the loss explicitly penalizes low separation between correct/incorrect confidences (as claimed) or primarily acts as a representation regularizer whose effect on refinement and ECE is incidental to the large accuracy gain (58.81 vs 46.27).

Authors: We agree the abstract omits the equation due to length limits. The full manuscript (Section 3) derives the supervised contrastive loss with an explicit term that penalizes insufficient separation between correct and incorrect prediction embeddings, directly targeting refinement rather than acting only as a generic regularizer. We will insert the key loss equation into the abstract in the revision so readers can immediately verify the formulation.
revision: yes

Referee: [Abstract] Results are reported on only a single dataset (CIFAR-100-LT with 10% imbalance); without additional datasets, ablations isolating the contrastive term's contribution to refinement, or analysis showing gains are independent of hyperparameter choices tuned to these metrics, the general claim of a unified framework improving reliability remains unsupported.

Authors: We acknowledge the single-dataset limitation. CIFAR-100-LT with 10% imbalance is a demanding benchmark for joint accuracy-calibration-refinement evaluation, but broader claims require more evidence. In revision we will add results on CIFAR-10-LT and balanced CIFAR-100, include ablations that remove the contrastive term while holding accuracy fixed, and report sensitivity to hyperparameter choices to demonstrate that refinement and ECE gains are not incidental.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with novel loss claim

full rationale

The provided abstract and text contain no equations, derivations, or self-citations that reduce any claimed result to its inputs by construction. The paper proposes a novel loss function for refinement via supervised contrastive learning and a joint optimization framework RefCal, with empirical performance reported on CIFAR-100-LT. No fitted parameters are renamed as predictions, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled. The central claims rest on the proposed loss and training procedure rather than reducing to prior fitted values or self-referential definitions. This is the common case of an independent proposal supported by experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be identified from the text; typical ML frameworks of this type introduce loss-weighting hyperparameters.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Definition 2 (Proposed refinement loss). L_ref = ∑_i [ min_{p∈P_i} ½∥z_i − z_p∥² − min_{n∈N_i} ½∥z_i − z_n∥² ]

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Lemma 2. L_SC > L_ref (via Jensen, log-sum-exp and Lemma 1 cosine identity)
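
To make Definition 2 concrete, here is a minimal sketch that evaluates L_ref on a small batch of embeddings. The quoted passage does not say how the sets P_i and N_i are built, so this sketch borrows the usual supervised contrastive convention as an assumption: P_i holds the other batch items sharing anchor i's label and N_i holds the items with a different label. Anchors with an empty P_i or N_i are skipped, which is also a choice made here. The function names and toy embeddings are hypothetical.

```python
def half_sq_dist(u, v):
    """Half the squared Euclidean distance between two embedding vectors."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(u, v))


def refinement_loss(embeddings, labels):
    """Sum over anchors of (nearest-positive distance) - (nearest-negative distance).

    Assumption: positives share the anchor's label, negatives do not.
    """
    total = 0.0
    for i, (z_i, y_i) in enumerate(zip(embeddings, labels)):
        pos = [half_sq_dist(z_i, z_j)
               for j, (z_j, y_j) in enumerate(zip(embeddings, labels))
               if j != i and y_j == y_i]
        neg = [half_sq_dist(z_i, z_j)
               for z_j, y_j in zip(embeddings, labels)
               if y_j != y_i]
        if pos and neg:  # skip anchors that lack a positive or a negative
            total += min(pos) - min(neg)
    return total


# Four toy embeddings forming two tight pairs that are far apart.
embeddings = [[0, 0], [0, 1], [3, 0], [3, 1]]

# Labels that agree with the geometry give a negative (low) loss.
print(refinement_loss(embeddings, [0, 0, 1, 1]))
# Labels that cut across the geometry give a positive (high) loss.
print(refinement_loss(embeddings, [0, 1, 0, 1]))
```

Under these assumptions the loss falls when each anchor's nearest positive is closer than its nearest negative. Lemma 2 as quoted states L_SC > L_ref; the sketch does not check that inequality.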

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Han Cai, Chuang Gan, and Song Han. Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv preprint arXiv:2205.14756.
[2] Zhipeng Ding, Xu Han, Peirong Liu, and Marc Niethammer. Local temperature scaling for probability calibration. CoRR, abs/2008.05105.
[3] Soumya Suvra Ghosal, Ramya Hebbalaguppe, and Dinesh Manocha. Better features, better calibration: A simple fix for overconfident networks. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings, Part I, pp. 231–247, Berlin, Heidelberg.
[4] Springer-Verlag. ISBN 978-3-032-05961-1. doi: 10.1007/978-3-032-05962-8
[5] URL https://doi.org/10.1007/978-3-032-05962-8_14. A. Ghosh, T. Schaaf, and M. Gormley. Adafocal: Calibration-aware adaptive focal loss. In Advances in NeurIPS, volume 35, pp. 1583–1595.
[6] R. Hebbalaguppe, S. Ghosal, J. Prakash, H. Khadilkar, and C. Arora. A novel data augmentation technique for out-of-distribution sample detection using compounded corruptions. In ECML, pp. 529–545. Springer, 2022a. ICLR 2026 Workshop: Principled Design for Trustworthy AI. R. Hebbalaguppe, M. Baranwal, K. Anand, and C. Arora. Calibration transfer via knowl...
[7] Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Hao Song, Peter Flach, et al. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with dirichlet calibration. arXiv preprint arXiv:1910.12656.
[8] Ananya Kumar, Percy Liang, and Tengyu Ma. Verified uncertainty calibration. arXiv preprint arXiv:1909.10155.
[9] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
[10] Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? arXiv preprint arXiv:1906.02629.
[11] Allan H Murphy. A new vector partition of the probability score. Journal of Applied Meteorology and Climatology, 12(4):595–600.
[12] Tianbao Yang. Algorithmic foundation of deep x-risk optimization. arXiv preprint arXiv:2206.00439.
[14] From Tab. 6, we infer RefCal variants namely, RefCal+NLL, RefCal+LS, RefCal+CE+TS, RefCal+MMCE, RefCal+CRL, RefCal Method CIFAR-10-LT Top1 (%)↑ AUROC↑ ECE (%)↓ SCE (%)↓ ACE (%)↓ smECE (%)↓ ×10−2 ×10−2 ×10−2 ×10−2 NLL (CE) 85.30 98.40 09.35 02.22 02.02 08.91 +RefCal (Ours) 89.70 98.82 08.60 01.83 01.45 06.85 LS Sze...
[15] B.2 ROBUSTNESS OF REFCAL B.2.1 ROBUSTNESS TO NATURAL CORRUPTIONS DNNs lack robustness to out-of-distribution data or natural corruptions such as noise, blur, etc. Hendrycks et al. Hendrycks & Dietterich (2018) benchmark robustness of a DNN using 15 algorithmically generated image corruptions that mimic natural corruptions. Each corruption severity ranges fro...
[16] These hyperparameters enable us to reproduce all the tables in main and supplemental material. We used a fixed seed value of 1234 across all datasets and architectures for RefCal. D.1 BASELINES • CE+TS Guo et al. (2017) The authors find that contemporary neural networks exhibit poor calibration, a departure from neural networks developed a decade ago. Extensiv...
[17] The dataset includes 10,000 test images, with 1000 images per class, and fewer than 50,000 training images. Each image is assigned one label. • STL-10 Coates et al. (2011) STL10 is a dataset inspired by the CIFAR-10 dataset, featuring certain modifications. Notably, each class in STL10 has fewer labeled training examples compared to CIFAR-10. The higher res...
[18] There are 73,527 images in the training set and 26,032 images in the test set. • CIFAR-100: The CIFAR-100 dataset Krizhevsky & Hinton (2009) has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the cla...
[19] The dataset comprises fewer than 50,000 training images and 10,000 test images, with 100 images per class in the test set. These 100 classes are further grouped into 20 overarching superclasses. Each image is labeled with two annotations: a fine label indicating the specific class, and a coarse label representing the corresponding superclass. • CIFAR-100-C...
