An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

Farah E. Shamout; L. Julián Lechuga López; Tim G. J. Rudner

arxiv: 2603.02719 · v4 · pith:6RASYPHOnew · submitted 2026-03-03 · 💻 cs.LG

Pith reviewed 2026-05-25 07:28 UTC · model grok-4.3

keywords: selective prediction, calibration, multimodal learning, clinical AI, ICU data, miscalibration, multilabel classification, uncertainty estimation

The pith

Selective prediction can degrade performance in multimodal clinical classification because of class-dependent miscalibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether uncertainty-based selective prediction reliably improves safety in multilabel clinical condition classification from multimodal ICU data. It finds that selective prediction often worsens results even when standard metrics appear strong. The root cause is class-dependent miscalibration: models express high uncertainty on correct predictions and low uncertainty on incorrect ones, especially for rare conditions. Aggregate metrics mask these failures, so the work calls for calibration-aware checks before relying on deferral in clinical settings.

Core claim

Across state-of-the-art unimodal and multimodal models on multimodal ICU data, selective prediction based on uncertainty estimates can substantially degrade performance in multilabel clinical condition classification despite strong standard evaluation metrics. This failure arises from severe class-dependent miscalibration, in which models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Commonly used aggregate metrics can obscure these per-class effects, which limits their ability to assess selective prediction behavior in this setting.
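To make the failure mode concrete, here is a minimal Python sketch of uncertainty-based selective prediction for a single label. It is an illustration under stated assumptions, not the paper's evaluation code: the uncertainty score (binary predictive entropy), the 0.5 decision threshold, the function names, and the toy numbers are all hypothetical choices.

```python
import numpy as np

def binary_entropy(p):
    """Predictive entropy of a Bernoulli probability, in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def selective_accuracy(probs, labels, coverage, threshold=0.5):
    """Accuracy on the `coverage` fraction of samples with the lowest uncertainty.

    The remaining samples are treated as deferred to a human reviewer.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_keep = max(1, int(round(coverage * len(probs))))
    # Sort from most certain to least certain and keep the first n_keep.
    order = np.argsort(binary_entropy(probs), kind="stable")
    kept = order[:n_keep]
    preds = (probs[kept] >= threshold).astype(int)
    return float((preds == labels[kept]).mean())

# Toy rare condition (made-up numbers): the model is hesitant on the
# negatives it gets right and confident on the positives it misses.
toy_labels = np.array([0, 0, 0, 0, 1, 1])
toy_probs = np.array([0.45, 0.40, 0.45, 0.40, 0.05, 0.02])

print(round(selective_accuracy(toy_probs, toy_labels, coverage=1.0), 3))  # 0.667
print(round(selective_accuracy(toy_probs, toy_labels, coverage=0.5), 3))  # 0.333
```

In this toy case, deferring the most uncertain half of the samples removes only correct predictions, so accuracy on the retained half falls from 0.667 to 0.333. That is the qualitative pattern the core claim describes.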

What carries the argument

Class-dependent miscalibration, in which uncertainty estimates for underrepresented classes are inverted relative to prediction correctness: high on correct predictions and low on incorrect ones.
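One way to test for this inversion per class is to ask how well uncertainty ranks errors above correct predictions. The sketch below computes, for each label, the AUROC of uncertainty as an error detector; values below 0.5 mean the ranking is inverted. The uncertainty score (distance of the probability from 0.5), the threshold, and the function names are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def error_detection_auroc(uncertainty, correct):
    """Probability that a random error has higher uncertainty than a random
    correct prediction (ties count as 0.5). Returns NaN if either group is empty."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    u_err, u_ok = uncertainty[~correct], uncertainty[correct]
    if len(u_err) == 0 or len(u_ok) == 0:
        return float("nan")
    # Pairwise comparison of every error against every correct prediction.
    diff = u_err[:, None] - u_ok[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def per_class_error_auroc(probs, labels, threshold=0.5):
    """probs, labels: arrays of shape (n_samples, n_classes) for a multilabel task."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Uncertainty is 1 at p = 0.5 and 0 at p = 0 or p = 1.
    uncertainty = 1.0 - np.abs(2.0 * probs - 1.0)
    correct = (probs >= threshold).astype(int) == labels
    return [
        error_detection_auroc(uncertainty[:, k], correct[:, k])
        for k in range(probs.shape[1])
    ]
```

A class with an AUROC near 1.0 is safe to defer on; a class near 0.0 is one where deferral removes correct predictions and keeps the errors.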

If this is right

Selective prediction cannot be assumed safe in clinical tasks without per-class calibration checks.
Aggregate metrics alone are insufficient to certify reliability for selective prediction.
Underrepresented conditions are the primary points of failure in uncertainty-driven deferral.
Calibration-aware evaluation is required to support safety claims in multimodal clinical AI.
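A per-class calibration check of the kind these points call for could look like the following sketch. It computes a binned expected calibration error (ECE) separately for each label, so a badly calibrated rare class is not averaged away. The equal-width binning, the bin count, and the function names are illustrative assumptions; the paper may define its calibration metrics differently.

```python
import numpy as np

def binary_ece(probs, labels, n_bins=10):
    """Binned ECE for one label: the weighted gap between the mean predicted
    probability and the observed positive rate within each bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each probability to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

def per_class_ece(probs, labels, n_bins=10):
    """probs, labels: arrays of shape (n_samples, n_classes)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return np.array([
        binary_ece(probs[:, k], labels[:, k], n_bins)
        for k in range(probs.shape[1])
    ])
```

Reporting the full vector from per_class_ece, rather than only its mean, keeps the worst-calibrated condition visible.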

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar class-dependent miscalibration could appear in other imbalanced medical imaging or sensor tasks.
Per-class uncertainty histograms should become a standard reporting requirement for clinical models.
Retraining with explicit calibration objectives on rare classes offers a testable route to fix the observed failure mode.
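The per-class uncertainty histograms suggested above could be produced along the following lines. This is a sketch under assumptions: uncertainty is taken to be binary entropy in bits, predictions use a 0.5 threshold, and the function name and return layout are hypothetical.

```python
import numpy as np

def uncertainty_histograms(probs, labels, n_bins=5, threshold=0.5):
    """For each class, histogram the uncertainty of correct and incorrect
    predictions separately. probs, labels: shape (n_samples, n_classes)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    p = np.clip(probs, 1e-12, 1 - 1e-12)
    # Binary entropy in bits lies in [0, 1]; clip to guard against rounding.
    entropy = np.clip(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)), 0.0, 1.0)
    correct = (probs >= threshold).astype(int) == labels
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    result = {}
    for k in range(probs.shape[1]):
        ok_counts, _ = np.histogram(entropy[correct[:, k], k], bins=edges)
        err_counts, _ = np.histogram(entropy[~correct[:, k], k], bins=edges)
        result[k] = {"correct": ok_counts, "incorrect": err_counts}
    return edges, result
```

For a well-behaved class, the "incorrect" counts should concentrate in the high-entropy bins. A class whose errors pile up in the low-entropy bins shows the inverted pattern described in this review.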

Load-bearing premise

That the state-of-the-art models and multimodal ICU dataset tested here represent typical real-world clinical deployment scenarios.

What would settle it

A new clinical dataset or model in which uncertainty-based selective prediction improves aggregate performance while uncertainty remains positively correlated with error rate across all classes.

Original abstract

As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows selective prediction can degrade performance in multimodal clinical multilabel classification because of class-dependent miscalibration on rare conditions, but the single-dataset experiments leave the generality of that failure mode unclear.

The central observation is that uncertainty-based selective prediction hurts accuracy here even when standard metrics are strong, because the model is overconfident on errors for underrepresented classes and underconfident on correct predictions for those same classes. The work runs this check across several unimodal and multimodal models on ICU data and shows that aggregate scores mask the per-class breakdown. That part is useful: it gives a concrete example of why calibration-aware checks matter in imbalanced clinical settings where deferral decisions have real stakes. The empirical comparison of model types is direct and stays on the data rather than adding new theory.

The stress-test concern holds up. Everything rests on one dataset and the models trained on it, with no reported checks on other clinical sources, different label distributions, or alternative architectures. Without those, the claim that this is a task-specific failure mode rather than a setup-specific artifact stays provisional. The abstract also gives no numbers on splits, statistical tests, or uncertainty quantification details, so the strength of the evidence depends entirely on what the full methods and results sections actually contain.

This is for people who build or review selective prediction pipelines for medical multilabel tasks. A reader already working on calibration in imbalanced settings would pick up a practical warning. It deserves peer review because the safety implication is worth referee time, though any acceptance would need added robustness experiments to support the broader conclusion.

Referee Report

1 major / 2 minor

Summary. The paper claims that in multilabel clinical condition classification using multimodal ICU data, uncertainty-based selective prediction can substantially degrade performance due to severe class-dependent miscalibration, where models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, especially for underrepresented conditions. Aggregate metrics obscure these effects, necessitating calibration-aware evaluation for safety in clinical AI.

Significance. If the findings are robust, this work identifies a critical failure mode in selective prediction for clinical applications, showing how miscalibration can lead to unreliable deferral decisions in imbalanced multilabel settings. It provides empirical evidence across multiple models that standard evaluation can be misleading, which is valuable for the field of reliable machine learning in healthcare.

major comments (1)

[Abstract] The characterization of the observed miscalibration as a 'task-specific failure mode' relies on the assumption that the single ICU dataset and chosen models are representative. Without additional experiments on other datasets or tasks, this generalization is not fully supported and is load-bearing for the central claim.

minor comments (2)

Ensure that all experimental details, including data splits, statistical tests, and exact definitions of uncertainty measures, are clearly presented in the methods section to allow reproducibility.
[Figures] The figures illustrating the class-dependent effects should include error bars or confidence intervals to convey uncertainty in the reported metrics.
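The confidence intervals requested for the figures could be obtained with a percentile bootstrap over test samples, sketched below. The resampling scheme, the default of 1000 resamples, the 95% level, and the function name are assumptions for illustration; the paper's own procedure, if any, is not described in the abstract.

```python
import numpy as np

def bootstrap_ci(metric_fn, probs, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric_fn(probs, labels).

    Returns (point_estimate, lower, upper). Samples are resampled with
    replacement along the first axis, so probs and labels stay aligned.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    n = len(probs)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        stats[b] = metric_fn(probs[idx], labels[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(metric_fn(probs, labels)), float(lower), float(upper)

def accuracy(probs, labels, threshold=0.5):
    """Example metric for one label: accuracy at a fixed threshold."""
    return float(((probs >= threshold).astype(int) == labels).mean())
```

For a rare condition, the interval from bootstrap_ci(accuracy, probs_k, labels_k) will typically be wide, which is itself useful information when reading per-class plots. Metrics that are undefined on some resamples (for example, when a resample contains no positives) would need extra handling that this sketch omits.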

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address the major comment below and have revised the manuscript accordingly to clarify the scope of our claims.

Referee: [Abstract] The characterization of the observed miscalibration as a 'task-specific failure mode' relies on the assumption that the single ICU dataset and chosen models are representative. Without additional experiments on other datasets or tasks, this generalization is not fully supported and is load-bearing for the central claim.

Authors: We agree that the phrasing 'task-specific failure mode' in the abstract could imply a broader generalization than our single-dataset empirical study supports. Our experiments demonstrate the failure mode consistently across multiple unimodal and multimodal models on the MIMIC-IV multimodal ICU data for multilabel clinical condition classification, with particular emphasis on class-dependent effects for underrepresented conditions. However, we do not claim this behavior holds for all clinical tasks or datasets. To address the concern, we will revise the abstract and introduction to state that the findings characterize this failure mode 'in multimodal clinical condition classification using ICU data' rather than as a general 'task-specific' property, and we will add explicit discussion of the single-dataset limitation as a scope condition for the observed effects. This revision removes the load-bearing generalization while preserving the core empirical contribution regarding aggregate metrics and calibration-aware evaluation in this setting.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical analysis with no derivations or fitted predictions

This paper performs an empirical evaluation of uncertainty-based selective prediction and class-dependent miscalibration on multimodal ICU data using existing models. No equations, derivations, parameter fits, or self-citation chains are present that could reduce claims to inputs by construction. All reported findings (degraded selective prediction performance, miscalibration patterns) are direct experimental observations from standard metrics on held-out data, independently verifiable against external benchmarks. The generalization concern raised in the skeptic note is a question of external validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the validity of the experimental observations from the tested models and ICU data. No free parameters, axioms beyond standard statistical assumptions, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5710 in / 1176 out tokens · 38615 ms · 2026-05-25T07:28:10.706565+00:00 · methodology
