Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models
Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3
The pith
Predictive entropy from one forward pass identifies both miscalibrated and rephrase-sensitive predictions in medical vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing, with AUROC 0.711 on MedGemma and 0.878 on LLaVA-RAD. This enables a single entropy threshold to flag both unreliable and rephrase-sensitive predictions, because both failures trace to proximity to the decision boundary.
What carries the argument
Predictive entropy computed from one forward pass, which quantifies uncertainty and signals proximity to the decision boundary to connect calibration failures with paraphrase sensitivity.
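As a rough illustration of the quantity described above, the sketch below computes predictive entropy from the answer logits of a single forward pass and compares it against a screening threshold. The function names, the toy logits, and the threshold value are all hypothetical; the paper's actual answer space and threshold are not given here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of answer logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution from one forward pass."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_for_review(logits, threshold):
    """True when entropy exceeds the screening threshold (threshold is an assumption)."""
    return predictive_entropy(logits) > threshold

# Illustrative yes/no logits; these numbers are made up, not taken from the paper.
confident = [4.0, -4.0]
borderline = [0.1, -0.1]
THRESHOLD = 0.5  # assumption: in practice this would be tuned on held-out data

print(flag_for_review(confident, THRESHOLD), flag_for_review(borderline, THRESHOLD))
```

The same flag is meant to serve both purposes named in the core claim: a high-entropy sample is treated as potentially unreliable and as potentially rephrase-sensitive.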
Load-bearing premise
Proximity to the decision boundary is the shared cause of miscalibration and rephrase sensitivity, and the tested models and chest X-ray datasets are representative of other medical VLMs and clinical uses.
What would settle it
Finding a medical VLM or new dataset where predictive entropy shows no link to which predictions flip under rephrasing, or where high-entropy samples stay stable when questions are reworded.
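One way to run that test is to score each sample by its entropy and check how well the score separates flipped from stable predictions. The sketch below is a plain rank-based AUROC under assumed inputs (`entropies` and `flipped` are toy values, not data from the paper); a result near 0.5 on a new model or dataset would be the falsifying outcome described above.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: entropy per sample, and whether the answer flipped under rephrasing.
entropies = [0.05, 0.10, 0.60, 0.65, 0.30, 0.68]
flipped = [0, 0, 1, 1, 0, 0]

print(auroc(entropies, flipped))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; a sort-based implementation would be the usual choice at test-set scale.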
Original abstract
Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD; p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE 4.3%) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs. 0.657) and paraphrase screening. Simple methods win.
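For readers unfamiliar with the calibration metric quoted in the abstract, the following is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The binning scheme and bin count are assumptions; the paper may use a different variant. The function returns a fraction, so it would be multiplied by 100 to compare with percentages.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Toy example: the model says 0.9 four times but is right only three times.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```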
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that miscalibration and paraphrase sensitivity in medical VLMs share a common cause (proximity to the decision boundary), demonstrated empirically by showing that predictive entropy from a single forward pass predicts rephrase-induced prediction flips (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p<10^-4). It benchmarks five UQ methods (including predictive entropy, MC Dropout, and a 5-member LoRA ensemble) on MedGemma-4B-IT and LLaVA-RAD across MIMIC-CXR (in-distribution) and PadChest (OOD) chest X-ray datasets, reporting that single-model entropy outperforms the ensemble on error detection (AUROC 0.743 vs 0.657) and paraphrase screening while MC Dropout achieves the best calibration (ECE 4.3).
Significance. If the empirical correlations hold, the work offers a low-cost, single-pass method to flag both unreliable and rephrase-sensitive predictions in medical VLMs, with potential to improve selective prediction and deployment safety. Strengths include concrete metrics (ECE, AUROC, accuracy, coverage at 5% risk), cross-dataset and cross-architecture validation, and the finding that simple methods can outperform ensembles under distribution shift.
major comments (2)
- [Abstract] Abstract and results on paraphrase sensitivity: the claim that proximity to the decision boundary is the shared causal mechanism for miscalibration and rephrase sensitivity rests solely on predictive entropy correlations; no direct quantification of boundary proximity (logit margin, embedding distance to decision surface, or minimal perturbation) or controls for alternatives (e.g., token-level generation variance) is performed, leaving the single-threshold interpretation untested.
- [Abstract] Reported AUROC values (0.711, 0.878, 0.743) and ECE numbers lack error bars, confidence intervals, or full statistical details on variability across runs or question-generation methods, which is load-bearing for the cross-method and cross-shift comparisons.
minor comments (2)
- [Abstract] Abstract contains minor typos and formatting issues: 'outof distribution', 'mis calibrated', 'MedGemma 4BIT', 'LLaVA RAD7B', and 'LLaVARAD' should be standardized for clarity.
- [Results] The ensemble collapse on MIMIC-to-PadChest shift (42.9 ECE) is reported but the paper does not discuss whether this is due to LoRA fine-tuning specifics or general ensemble fragility under medical domain shift.
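The logit margin named in the first major comment is one of the cheapest direct boundary probes. A minimal sketch, assuming per-sample answer logits are available (the function name is hypothetical and nothing here is taken from the paper's code):

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit; a small gap means near the boundary."""
    if len(logits) < 2:
        raise ValueError("need at least two answer options")
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up

# Toy logits for a three-option question and for a yes/no question.
print(logit_margin([2.0, 0.5, -1.0]))
print(logit_margin([0.1, -0.1]))
```

Note that for a two-way softmax, entropy is a monotone function of the absolute margin, so the two scores would rank samples identically; the probe adds information mainly when there are more than two answer options or when margins are taken in a space other than the output logits.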
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract and statistical reporting. We respond to each major point below and indicate where revisions will be made.
- Referee: [Abstract] Abstract and results on paraphrase sensitivity: the claim that proximity to the decision boundary is the shared causal mechanism for miscalibration and rephrase sensitivity rests solely on predictive entropy correlations; no direct quantification of boundary proximity (logit margin, embedding distance to decision surface, or minimal perturbation) or controls for alternatives (e.g., token-level generation variance) is performed, leaving the single-threshold interpretation untested.
Authors: We agree that the manuscript infers the shared mechanism from the observed predictive correlations with entropy rather than direct boundary measurements. Entropy is used as a standard proxy for proximity to the decision boundary, and the AUROC results empirically support the single-threshold screening utility. However, we did not quantify logit margins, embedding distances, or control for token-level variance. In revision we will replace causal phrasing with 'empirically linked via uncertainty' in the abstract, add a limitations paragraph discussing these alternatives, and note that direct boundary probes are left for future work. The single-threshold claim remains supported by the reported AUROCs but will be presented as an empirical finding rather than a fully tested causal mechanism.
revision: partial
- Referee: [Abstract] Reported AUROC values (0.711, 0.878, 0.743) and ECE numbers lack error bars, confidence intervals, or full statistical details on variability across runs or question-generation methods, which is load-bearing for the cross-method and cross-shift comparisons.
Authors: We acknowledge the lack of variability measures. The revised manuscript will add bootstrap 95% confidence intervals for all AUROC and ECE values, computed across multiple random seeds and distinct question-rephrasing generation procedures. These will be reported in the abstract, results tables, and text to support the cross-method and cross-dataset comparisons.
revision: yes
Circularity Check
No circularity; purely empirical benchmarking with no derivations or self-referential steps
full rationale
The paper conducts an empirical benchmarking study of five uncertainty quantification methods (predictive entropy, MC Dropout, ensemble, etc.) on MedGemma and LLaVA-RAD models using MIMIC-CXR and PadChest datasets. It reports observed correlations such as AUROC 0.711/0.878 for entropy predicting paraphrase flips and AUROC 0.743 for error detection, without any mathematical derivations, equations, fitted parameters renamed as predictions, or self-citations that bear the central claim. The inference that proximity to the decision boundary is a shared cause is presented as an interpretation of the empirical results rather than a reduction to inputs by construction. No self-definitional loops, uniqueness theorems, or ansatzes are invoked. This is a standard data-driven evaluation that remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Proximity to the decision boundary causes both miscalibration and paraphrase sensitivity in VLMs.
Reference graph
Works this paper leans on
- [1] Draw B = 2,000 bootstrap replicates by sampling n observations with replacement from the test set.
- [2] Compute the metric $\hat{\theta}^{*(b)}$ on each bootstrap replicate $b = 1, \dots, B$.
- [3] Report the 2.5th and 97.5th percentiles of the bootstrap distribution as the 95% confidence interval: $\mathrm{CI}_{95} = [\hat{\theta}^{*}_{(0.025)}, \hat{\theta}^{*}_{(0.975)}]$. For metrics computed on paired data (e.g., ECE difference between two methods on the same test set), we bootstrap the pairs jointly to preserve the correlation structure.
E.2 PAIRWISE METHOD COMPARISONS To test whether o...
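The three bootstrap steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the record layout `(confidence, correct)`, the `accuracy` metric, and the linear-interpolation percentile are all choices made for the example. Resampling whole records is what keeps paired quantities together.

```python
import random

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of pre-sorted values, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def bootstrap_ci(records, metric, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI; records are resampled whole so pairs stay intact."""
    rng = random.Random(seed)
    n = len(records)
    stats = []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]  # n draws with replacement
        stats.append(metric(sample))
    stats.sort()
    return percentile(stats, 0.025), percentile(stats, 0.975)

def accuracy(sample):
    """Example metric: fraction of correct predictions in the resampled set."""
    return sum(ok for _, ok in sample) / len(sample)

# Toy records (confidence, correct); the numbers are made up.
records = [(0.9, 1)] * 70 + [(0.6, 0)] * 30
low, high = bootstrap_ci(records, accuracy)
print(low, high)
```

Any metric that takes a list of records and returns a number, such as an AUROC or ECE function, could be passed in place of `accuracy`.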
- [4] ID→ID: Calibrate on MIMIC holdout, evaluate on MIMIC test.
- [5] ID→OOD: Calibrate on MIMIC holdout, evaluate on PadChest.
- [6] OOD→OOD: Calibrate on PadChest holdout, evaluate on PadChest test.
- [7] OOD→ID: Calibrate on PadChest holdout, evaluate on MIMIC.
All protocols use cached margins from the softmax_entropy JSONL files, applying temperature scaling offline without re-running the model. If ID→OOD transfer degrades ECE substantially compared to OOD→OOD, this motivates site-specific recalibration.
Table 19: Temperature scaling cross-domain transfe...
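A minimal sketch of offline temperature scaling on cached margins follows. It assumes a binary yes/no setting in which each cached margin is the logit gap `logit(yes) - logit(no)` and labels are 1 for "yes"; that reading of "margins", the grid-search fit, and all function names are assumptions for illustration, and the JSONL file layout is not modelled at all.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nll(margins, labels, temperature):
    """Mean binary negative log-likelihood with p(yes) = sigmoid(margin / T)."""
    total = 0.0
    for m, y in zip(margins, labels):
        z = m / temperature
        s = -z if y == 1 else z  # loss is softplus(s)
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(margins)

def fit_temperature(margins, labels, grid=None):
    """Grid-search the temperature that minimises NLL on the calibration split."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05 to 10.0
    return min(grid, key=lambda t: nll(margins, labels, t))

def apply_temperature(margins, temperature):
    """Rescaled p(yes) for each cached margin; the model is never re-run."""
    return [sigmoid(m / temperature) for m in margins]

# Toy calibration split: margins of +/-4 look very confident but are right 75% of the time.
cal_margins = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
cal_labels = [1, 1, 1, 0, 0, 0, 0, 1]
T = fit_temperature(cal_margins, cal_labels)
print(T, apply_temperature([4.0], T)[0])
```

Under this sketch, the four protocols differ only in which split is passed to `fit_temperature` and which to `apply_temperature`; ID→OOD, for example, would fit on the MIMIC holdout margins and apply the resulting temperature to PadChest margins.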
- [8] Calibration · LoRA reduces ECE from 44.1% to 4.3% · MC-Drop
- [9] Shift · AUGRC stable or improving under corruption · Targeted LoRA
- [10]
- [11] Decomposition · Ensemble MI = 0.082; MC-Drop MI ≈ 0 · Softmax (AUROC 0.743)
- [12] Bridge · Single-model H predicts flips (0.711) · Softmax / Margin
- [13] Cross-arch · Bridge holds on LLaVA-RAD (0.706/0.878) · Softmax
Four themes run through these results. First, LoRA fine-tuning is the primary calibration mechanism; no post-hoc UQ method applied to the base model comes close. Second, deep ensembles of LoRA adapters can fail OOD, but the failure is model-specific: MedGemma’s ensemble collapses on PadChest (...