pith. machine review for the scientific record.

arxiv: 2604.08941 · v1 · submitted 2026-04-10 · 💻 cs.LG

Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords predictive entropy · model calibration · paraphrase sensitivity · medical vision-language models · uncertainty quantification · decision boundary · ensemble comparison

The pith

Predictive entropy from one forward pass identifies both miscalibrated and rephrase-sensitive predictions in medical vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical vision-language models often give overconfident answers that change when questions are rephrased. The paper argues that these two problems share a cause: predictions that sit near the decision boundary. On chest X-ray tasks, predictive entropy from a single model pass forecasts which answers will flip under rephrasing, reaching AUROC 0.711 on MedGemma and 0.878 on LLaVA-RAD. This link lets one entropy threshold screen for both unreliable and rephrase-sensitive outputs. Simple single-pass methods outperform ensembles at error detection and paraphrase screening.

Core claim

For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing, with AUROC 0.711 on MedGemma and 0.878 on LLaVA-RAD. This enables a single entropy threshold to flag both unreliable and rephrase-sensitive predictions, because both failures trace to proximity to the decision boundary.

What carries the argument

Predictive entropy computed from one forward pass, which quantifies uncertainty and signals proximity to the decision boundary to connect calibration failures with paraphrase sensitivity.
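
As a rough illustration of the quantity described above, the following sketch computes the Shannon entropy of a single predictive distribution and flags samples above a threshold. The function names, the example probabilities, and the threshold value are hypothetical and are not taken from the paper; the paper's exact answer space and normalization are not stated on this page.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_uncertain(prob_rows, threshold):
    """Return True for each sample whose entropy exceeds the threshold."""
    return [predictive_entropy(row) > threshold for row in prob_rows]

# Hypothetical yes/no probabilities for three questions (not data from the paper).
example_probs = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20]]

# The threshold is illustrative only; in practice it would be tuned on held-out data.
flags = flag_uncertain(example_probs, threshold=0.6)
```

For a binary answer the entropy peaks at ln 2 when the two probabilities are equal, which is why high entropy is commonly read as a proxy for closeness to the decision boundary.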

Load-bearing premise

Proximity to the decision boundary is the shared cause of miscalibration and rephrase sensitivity, and the tested models and chest X-ray datasets represent other medical VLMs and clinical uses.

What would settle it

Finding a medical VLM or new dataset where predictive entropy shows no link to which predictions flip under rephrasing, or where high-entropy samples stay stable when questions are reworded.
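
One way to run that check is to score each sample by its entropy and compute the AUROC against observed flip labels; a value near 0.5 would indicate no link. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC in plain Python. The data and names are hypothetical, and the paper's own evaluation code may differ.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical per-sample entropies and flip labels
# (1 = the answer changed under rephrasing); not data from the paper.
entropies = [0.10, 0.69, 0.50, 0.65, 0.05, 0.30]
flipped = [0, 1, 0, 1, 0, 1]

bridge_auroc = auroc(entropies, flipped)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sanity check; a rank-based implementation would be preferable on large test sets.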

Figures

Figures reproduced from arXiv: 2604.08941 by Binesh Sadanandan, Vahid Behzadan.

Figure 2: Reliability diagrams on clean test data. Targeted LoRA methods (center) track the diagonal on both datasets. The ...
Figure 3: Per-member ensemble diagnostics on PadChest.
Figure 5: Mutual information vs. number of forward passes
Figure 7: Reliability diagrams under Gaussian noise at three ...
Figure 8: Recommended deployment protocol (two tiers).
Figure 6: Mean paraphrase margin variance for flip-prone ...
Figure 9: UQ-PSF bridge AUROC by method. Single-model ...
read the original abstract

Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE 4.3%) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs 0.657) and paraphrase screening. Simple methods win.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that miscalibration and paraphrase sensitivity in medical VLMs share a common cause (proximity to the decision boundary), demonstrated empirically by showing that predictive entropy from a single forward pass predicts rephrase-induced prediction flips (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p < 10^-4). It benchmarks five UQ methods (including predictive entropy, MC Dropout, and a 5-member LoRA ensemble) on MedGemma-4B-IT and LLaVA-RAD across MIMIC-CXR (in-distribution) and PadChest (OOD) chest X-ray datasets. It reports that single-model entropy outperforms the ensemble on error detection (AUROC 0.743 vs 0.657) and paraphrase screening, while MC Dropout achieves the best calibration (ECE 4.3%).

Significance. If the empirical correlations hold, the work offers a low-cost, single-pass method to flag both unreliable and rephrase-sensitive predictions in medical VLMs, with potential to improve selective prediction and deployment safety. Strengths include concrete metrics (ECE, AUROC, accuracy, coverage at 5% risk), cross-dataset and cross-architecture validation, and the finding that simple methods can outperform ensembles under distribution shift.
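
For readers unfamiliar with the calibration metric cited above, here is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. The bin count, function name, and example inputs are assumptions for illustration; the paper's binning scheme is not stated on this page.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Hypothetical overconfident model: 95% confidence, but only half the answers are right.
example_ece = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
```

ECE depends on the number of bins, so values are only comparable across methods when the same binning is used.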

major comments (2)
  1. [Abstract] Abstract and results on paraphrase sensitivity: the claim that proximity to the decision boundary is the shared causal mechanism for miscalibration and rephrase sensitivity rests solely on predictive entropy correlations; no direct quantification of boundary proximity (logit margin, embedding distance to decision surface, or minimal perturbation) or controls for alternatives (e.g., token-level generation variance) is performed, leaving the single-threshold interpretation untested.
  2. [Abstract] Reported AUROC values (0.711, 0.878, 0.743) and ECE numbers lack error bars, confidence intervals, or full statistical details on variability across runs or question-generation methods, which is load-bearing for the cross-method and cross-shift comparisons.
minor comments (2)
  1. [Abstract] Abstract contains minor typos and formatting issues: 'outof distribution', 'mis calibrated', 'MedGemma 4BIT', 'LLaVA RAD7B', and 'LLaVARAD' should be standardized for clarity.
  2. [Results] The ensemble collapse on MIMIC-to-PadChest shift (42.9 ECE) is reported but the paper does not discuss whether this is due to LoRA fine-tuning specifics or general ensemble fragility under medical domain shift.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and statistical reporting. We respond to each major point below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results on paraphrase sensitivity: the claim that proximity to the decision boundary is the shared causal mechanism for miscalibration and rephrase sensitivity rests solely on predictive entropy correlations; no direct quantification of boundary proximity (logit margin, embedding distance to decision surface, or minimal perturbation) or controls for alternatives (e.g., token-level generation variance) is performed, leaving the single-threshold interpretation untested.

    Authors: We agree that the manuscript infers the shared mechanism from the observed predictive correlations with entropy rather than direct boundary measurements. Entropy is used as a standard proxy for proximity to the decision boundary, and the AUROC results empirically support the single-threshold screening utility. However, we did not quantify logit margins, embedding distances, or control for token-level variance. In revision we will replace causal phrasing with 'empirically linked via uncertainty' in the abstract, add a limitations paragraph discussing these alternatives, and note that direct boundary probes are left for future work. The single-threshold claim remains supported by the reported AUROCs but will be presented as an empirical finding rather than a fully tested causal mechanism. revision: partial

  2. Referee: [Abstract] Reported AUROC values (0.711, 0.878, 0.743) and ECE numbers lack error bars, confidence intervals, or full statistical details on variability across runs or question-generation methods, which is load-bearing for the cross-method and cross-shift comparisons.

    Authors: We acknowledge the lack of variability measures. The revised manuscript will add bootstrap 95% confidence intervals for all AUROC and ECE values, computed across multiple random seeds and distinct question-rephrasing generation procedures. These will be reported in the abstract, results tables, and text to support the cross-method and cross-dataset comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical benchmarking with no derivations or self-referential steps

full rationale

The paper conducts an empirical benchmarking study of five uncertainty quantification methods (predictive entropy, MC Dropout, ensemble, etc.) on MedGemma and LLaVA-RAD models using MIMIC-CXR and PadChest datasets. It reports observed correlations such as AUROC 0.711/0.878 for entropy predicting paraphrase flips and AUROC 0.743 for error detection, without any mathematical derivations, equations, fitted parameters renamed as predictions, or self-citations that bear the central claim. The inference that proximity to the decision boundary is a shared cause is presented as an interpretation of the empirical results rather than a reduction to inputs by construction. No self-definitional loops, uniqueness theorems, or ansatzes are invoked. This is a standard data-driven evaluation that remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or ad-hoc axioms beyond standard assumptions of uncertainty quantification methods; central claim rests on empirical observation of the decision-boundary link.

axioms (1)
  • domain assumption: Proximity to the decision boundary causes both miscalibration and paraphrase sensitivity in VLMs.
    Invoked as the common cause but not formally derived or proven in the abstract.

pith-pipeline@v0.9.0 · 5521 in / 1258 out tokens · 54707 ms · 2026-05-10T16:34:12.055495+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Draw B = 2,000 bootstrap replicates by sampling n observations with replacement from the test set

  2. [2]

    Compute the metric $\hat{\theta}^{*(b)}$ on each bootstrap replicate $b = 1, \dots, B$

  3. [3]

    Report the 2.5th and 97.5th percentiles of the bootstrap distribution as the 95% confidence interval: $\mathrm{CI}_{95} = [\hat{\theta}^{*}_{(0.025)}, \hat{\theta}^{*}_{(0.975)}]$. For metrics computed on paired data (e.g., ECE difference between two methods on the same test set), we bootstrap the pairs jointly to preserve the correlation structure. E.2 PAIRWISE METHOD COMPARISONS To test whether o...
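
The percentile bootstrap in entries [1] to [3] can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the seed, and the example data are hypothetical, and the percentiles are taken as simple order statistics without interpolation. Passing a list of tuples resamples whole rows, which corresponds to bootstrapping paired data jointly.

```python
import random

def bootstrap_ci(data, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(data).

    Each element of data is resampled as a unit, so tuples stay paired.
    """
    rng = random.Random(seed)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        # Sample n observations with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(sample))
    stats.sort()
    # Order-statistic approximation of the alpha/2 and 1 - alpha/2 percentiles.
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lower, upper

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    return sum(1 for pred, label in pairs if pred == label) / len(pairs)

# Hypothetical test set: 70 correct and 30 incorrect binary predictions.
pairs = [(1, 1)] * 70 + [(1, 0)] * 30
ci_low, ci_high = bootstrap_ci(pairs, accuracy)
```

The same helper works for any metric that maps a list of rows to a number, for example a difference in ECE between two methods evaluated on the same rows.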

  4. [4]

    ID→ID: Calibrate on MIMIC holdout, evaluate on MIMIC test

  5. [5]

    ID→OOD: Calibrate on MIMIC holdout, evaluate on PadChest

  6. [6]

    OOD→OOD: Calibrate on PadChest holdout, evaluate on PadChest test

  7. [7]

    OOD→ID: Calibrate on PadChest holdout, evaluate on MIMIC. All protocols use cached margins from the softmax_entropy JSONL files, applying temperature scaling offline without re-running the model. If ID→OOD transfer degrades ECE substantially compared to OOD→OOD, this motivates site-specific recalibration. Table 19: Temperature scaling cross-domain transfe...
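
Offline temperature scaling on cached margins, as described in the protocols above, might look like the sketch below. It assumes a binary yes/no task in which each cached margin is the logit difference between the two answers, so the calibrated probability is sigmoid(margin / T); that reading of "margin", the grid search, and all names are assumptions rather than details confirmed by the source. Loading the JSONL files is omitted because their schema is not given.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def nll(margins, labels, temperature):
    """Mean binary negative log-likelihood after scaling margins by 1/T."""
    eps = 1e-12
    total = 0.0
    for m, y in zip(margins, labels):
        p = min(max(sigmoid(m / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(margins)

def fit_temperature(margins, labels):
    """Pick T from a grid over (0, 10] that minimizes holdout NLL."""
    grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: nll(margins, labels, t))

def apply_temperature(margins, temperature):
    """Calibrated probability of the positive answer for each margin."""
    return [sigmoid(m / temperature) for m in margins]

# Hypothetical holdout: large margins (about 98% confidence) but only 75% accuracy.
holdout_margins = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
holdout_labels = [1, 1, 1, 0, 0, 0, 0, 1]
temperature = fit_temperature(holdout_margins, holdout_labels)

# Apply the fitted temperature to margins from another split without re-running the model.
calibrated = apply_temperature([4.0, -4.0], temperature)
```

Under this setup, an ID→OOD protocol would fit the temperature on the MIMIC holdout margins and apply it to PadChest margins, while OOD→OOD would fit and apply on PadChest splits.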

  8. [8]

    Calibration · LoRA reduces ECE from 44.1% to 4.3% · MC-Drop

  9. [9]

    Shift · AUGRC stable or improving under corruption · Targeted LoRA

  10. [10]

    Selective · MC-Drop: 21.5% Cov@5% (vs. 7.3%) · MC-Drop

  11. [11]

    Decomposition · Ensemble MI = 0.082; MC-Drop MI ≈ 0 · Softmax (AUROC 0.743)

  12. [12]

    Bridge · Single-model H predicts flips (0.711) · Softmax / Margin

  13. [13]

    Cross-arch · Bridge holds on LLaVA-RAD (0.706/0.878) · Softmax

    Four themes run through these results. First, LoRA fine-tuning is the primary calibration mechanism; no post-hoc UQ method applied to the base model comes close. Second, deep ensembles of LoRA adapters can fail OOD, but the failure is model-specific: MedGemma’s ensemble collapses on PadChest (...