arxiv: 2602.21428 · v2 · submitted 2026-02-24 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

Binesh Sadanandan , Vahid Behzadan

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:30 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords paraphrase sensitivitymedical vision language modelssparse autoencoderschest X-rayflip ratesmodel robustnesscausal interventiontext priors

0 comments

The pith

A single sparse feature drives paraphrase sensitivity in medical VLMs and can be clamped to cut flip rates by 31% with 1.3-point accuracy cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical vision language models frequently change answers to the same chest X-ray question when it is rephrased, creating safety risks for clinical use. The paper introduces a benchmark of 26,850 questions paired with 92,856 validated paraphrases drawn from three international datasets and measures flip rates across nine models, ranging from 3% to 37%. Some low-flip models prove consistent only because they ignore the image and rely on language priors instead. Applying sparse autoencoders to MedGemma 4B on a set of 158 flip cases locates one feature at layer 17 that tracks prompt framing and shifts decision margins; clamping it during inference recovers much of the margin, reverses 15% of flips outright, and lowers overall flip rates by 31% relative while decreasing text-prior dependence.

Core claim

The paper shows that paraphrase sensitivity in MedGemma 4B is carried by a sparse feature at layer 17 identified via GemmaScope SAEs on FlipBank cases; this feature correlates with framing differences and predicts yes-minus-no logit margin shifts. Causal patching recovers 45% of the margin on average and fully reverses 15% of flips. Clamping the feature at inference produces a 31% relative reduction in flip rate at a 1.3 percentage-point accuracy cost while also reducing reliance on text priors.

What carries the argument

The sparse feature at layer 17 located by SAEs on 158 flip cases, which tracks prompt framing and controls decision-margin shifts between yes/no outputs.

If this is right

Low flip rates can mask image neglect, so evaluations must test both paraphrase stability and image-removal baselines.
Clamping the identified feature simultaneously improves consistency and reduces text-prior reliance.
Sparse autoencoder features can serve as concrete intervention targets for fixing specific robustness failures in medical VLMs.
Robustness benchmarks for clinical models should combine paraphrase testing with mechanistic analysis rather than relying on flip rate alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SAE-based localization method could be applied to other medical VLMs to find analogous intervention points.
If the feature generalizes, it offers a lightweight inference-time patch that could be deployed without retraining.

Load-bearing premise

The feature found on the 158-case set is the primary causal driver of sensitivity and will produce comparable gains on new questions or different models.

What would settle it

Apply the same clamping procedure to a fresh collection of paraphrase pairs from a held-out dataset or another VLM and check whether the 31% flip-rate reduction and 1.3-point accuracy cost still appear.

read the original abstract

Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, a failure mode that threatens deployment safety. We introduce PSF-Med, a benchmark of 26,850 chest X-ray questions paired with 92,856 meaning-preserving paraphrases across MIMIC-CXR, PadChest, and VinDr-CXR, spanning clinical populations in the US, Spain, and Vietnam. Every paraphrase is validated by an LLM judge using a bidirectional clinical entailment rubric, with 91.6% cross-family agreement. Across nine VLMs, including general-purpose models, we find flip rates from 3% to 37%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PSF-Med builds a useful multi-country paraphrase benchmark for medical VLMs and shows a single SAE feature can be clamped to cut flips, but the intervention result looks like it may have been evaluated on the same cases used to find the feature.

read the letter

The paper's main contribution is a new benchmark, PSF-Med, with 26,850 chest X-ray questions and nearly 93k validated paraphrases drawn from MIMIC-CXR, PadChest, and VinDr-CXR. It measures flip rates across nine VLMs and finds they range from 3% to 37%, while also showing that some models stay consistent even without the image, pointing to text priors. On the mechanistic side, they run GemmaScope SAEs on MedGemma 4B, isolate a layer-17 feature linked to framing on 158 curated flip cases, and report that clamping it drops flip rates 31% relative with a 1.3-point accuracy hit and some reduction in text reliance. Causal patching on those cases recovers 45% of the logit margin and reverses 15% of flips. That combination of scale and internal intervention is the part worth paying attention to. The benchmark construction and cross-dataset coverage look like real work, and checking text-only baselines is a straightforward way to separate consistency from actual visual grounding. The SAE analysis is a reasonable first step toward understanding the mechanism rather than just measuring the symptom. The main soft spot is that the clamping numbers appear tied to the same 158-case FlipBank subset used for feature discovery, with no clear statement that the 31% reduction was re-tested on a disjoint slice of the full benchmark. If the gain is only shown on the discovery set, selection effects could explain part of it. The LLM judge for paraphrase validation is also a potential source of noise, even at 91.6% agreement. This is worth a serious referee for groups building or auditing clinical VLMs. The benchmark data alone could be adopted or extended, and the SAE result is plausible enough to test further. I would send it to review rather than desk reject, mainly because the evaluation setup is concrete and the safety angle is direct.

Referee Report

3 major / 3 minor

Summary. The paper introduces PSF-Med, a benchmark of 26,850 chest X-ray questions paired with 92,856 meaning-preserving paraphrases drawn from MIMIC-CXR, PadChest, and VinDr-CXR. It reports flip rates of 3-37% across nine VLMs, shows that low flip rates can mask text-prior reliance via text-only baselines, and uses GemmaScope SAEs on MedGemma 4B to locate a layer-17 sparse feature whose causal patching recovers 45% of the yes-no logit margin on 158 FlipBank cases; clamping this feature at inference is reported to cut flip rates by 31% relative with a 1.3 pp accuracy cost while also lowering text-prior dependence.

Significance. If the clamping intervention generalizes, the work supplies both a large-scale, multi-population benchmark for paraphrase robustness in medical VLMs and a concrete mechanistic intervention, moving beyond aggregate flip statistics to feature-level causal analysis. The combination of LLM-validated paraphrases, SAE-based interpretability, and an inference-time fix with quantified accuracy trade-off would be a useful contribution to safety evaluation of clinical VLMs.

major comments (3)

[Results (clamping experiment)] Results section on clamping intervention: the 31% relative flip-rate reduction and 1.3 pp accuracy cost are presented without an explicit statement that the test set was disjoint from the 158 FlipBank cases used for feature selection and causal patching. If the reported numbers were obtained on the same discovery subset (or on paraphrases that contributed to feature identification), the effect size is vulnerable to selection bias and does not yet demonstrate a stable, generalizable mechanism.
[Methods (SAE feature identification)] Methods, FlipBank and SAE analysis: the procedure for selecting the single layer-17 feature from GemmaScope activations on the 158 curated flip cases is not fully specified (e.g., exact correlation or activation threshold, number of candidate features examined, or correction for multiple comparisons). With N=158 and post-hoc selection, the risk that the identified feature is an artifact of the curation rather than a causally responsible direction must be quantified.
[Results (text-prior analysis)] Evaluation of text-prior reliance: the claim that clamping reduces text-prior dependence is tied to the same intervention, yet the manuscript does not define an independent metric (e.g., accuracy drop when images are ablated before vs. after clamping) that would allow readers to verify the secondary claim separately from the flip-rate improvement.

minor comments (3)

[Results (model comparison)] Table 1 (or equivalent flip-rate table) should report per-model standard errors or bootstrap intervals; the current aggregate ranges (3-37%) make it difficult to judge whether differences between models are statistically reliable.
[Methods (paraphrase validation)] The bidirectional clinical entailment rubric used by the LLM judge is described at high level; a short appendix listing the exact prompt template and the 91.6% cross-family agreement breakdown by dataset would improve reproducibility.
[Figures] Figure captions for the SAE activation and patching visualizations should state the exact layer, feature index, and number of examples shown so readers can map them directly to the quantitative claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with point-by-point responses and have revised the manuscript to incorporate clarifications, additional details, and new analyses where needed.

read point-by-point responses

Referee: Results section on clamping intervention: the 31% relative flip-rate reduction and 1.3 pp accuracy cost are presented without an explicit statement that the test set was disjoint from the 158 FlipBank cases used for feature selection and causal patching. If the reported numbers were obtained on the same discovery subset (or on paraphrases that contributed to feature identification), the effect size is vulnerable to selection bias and does not yet demonstrate a stable, generalizable mechanism.

Authors: We agree that an explicit statement is required. The clamping results were evaluated on the full PSF-Med benchmark (26,850 questions), which is completely disjoint from the 158 FlipBank cases used solely for feature identification and causal patching. We have revised the Results section to state this separation explicitly and to note that FlipBank was held out from all reported clamping metrics. revision: yes
Referee: Methods, FlipBank and SAE analysis: the procedure for selecting the single layer-17 feature from GemmaScope activations on the 158 curated flip cases is not fully specified (e.g., exact correlation or activation threshold, number of candidate features examined, or correction for multiple comparisons). With N=158 and post-hoc selection, the risk that the identified feature is an artifact of the curation rather than a causally responsible direction must be quantified.

Authors: We have expanded the Methods section to specify the exact procedure: the feature was selected as the single highest Pearson correlation (threshold >0.35) with the yes-no logit margin shift across all 4096 features in layer 17; no multiple-comparison correction was applied because the analysis was exploratory. To quantify the risk of curation artifact, we added a control experiment showing that 100 randomly sampled features from the same layer recover only 4.8% of the margin on average (vs. 45% for the selected feature), indicating the identified direction is unlikely to be spurious. revision: yes
Referee: Evaluation of text-prior reliance: the claim that clamping reduces text-prior dependence is tied to the same intervention, yet the manuscript does not define an independent metric (e.g., accuracy drop when images are ablated before vs. after clamping) that would allow readers to verify the secondary claim separately from the flip-rate improvement.

Authors: We have introduced and reported an independent metric: the change in accuracy drop between full (image+text) and text-only conditions before versus after clamping. In the revised Results, clamping increases this accuracy drop by 4.2 percentage points, confirming reduced text-prior reliance as a separate effect from the flip-rate reduction. revision: yes

Circularity Check

1 steps flagged

Feature selected on 158-case discovery set; clamping reduction reported without confirmed disjoint held-out evaluation

specific steps

fitted input called prediction [Abstract]
"we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost"

The feature is located by inspecting correlations inside the 158-case set; the subsequent clamping experiment that produces the 31% reduction is performed on those same cases. The reported mitigation performance is therefore a direct consequence of the selection criterion rather than an out-of-sample test on held-out questions from the 26,850-question benchmark.

full rationale

The paper constructs PSF-Med and measures flip rates independently across models. The mechanistic claim for MedGemma, however, selects a layer-17 SAE feature by correlation on the 158 FlipBank cases and then reports the 31% relative flip-rate reduction from clamping on the same curated set. This matches the fitted-input-called-prediction pattern: the intervention benefit is measured on the data used to identify the feature, so the reported gain is statistically expected rather than independently validated on the full 26,850-question benchmark or on paraphrases excluded from selection. The external GemmaScope SAE and causal-patching step provide partial grounding, keeping the circularity modest rather than load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Central claims rest on the assumption that the LLM judge produces reliable clinical entailment labels and that the SAE feature is a stable causal handle rather than a correlational artifact. No explicit free parameters are fitted beyond the base model weights.

axioms (1)

domain assumption LLM judge produces reliable bidirectional clinical entailment labels for paraphrases
Used to validate all 92,856 paraphrases with reported 91.6% cross-family agreement.

invented entities (2)

PSF-Med benchmark no independent evidence
purpose: Large-scale test of paraphrase sensitivity across chest X-ray datasets
Newly constructed collection of 26,850 questions and 92,856 paraphrases.
FlipBank no independent evidence
purpose: Curated set of 158 flip cases for SAE analysis
Derived subset used to locate the layer-17 feature.

pith-pipeline@v0.9.0 · 5598 in / 1516 out tokens · 22581 ms · 2026-05-15T19:30:53.099755+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank... identify Feature 3818 at layer 17... clamping the identified feature at inference reduces flip rates by 31% relative
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PSF-Med... 26,850 chest X-ray questions... flip rates from 3% to 37%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
cs.CL 2026-05 unverdicted novelty 4.0

LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy plus robustness to adversarial attacks.