Recognition: 2 theorem links
· Lean TheoremPSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models
Pith reviewed 2026-05-15 19:30 UTC · model grok-4.3
The pith
A single sparse feature drives paraphrase sensitivity in medical VLMs and can be clamped to cut flip rates by 31% with 1.3-point accuracy cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that paraphrase sensitivity in MedGemma 4B is carried by a sparse feature at layer 17 identified via GemmaScope SAEs on FlipBank cases; this feature correlates with framing differences and predicts yes-minus-no logit margin shifts. Causal patching recovers 45% of the margin on average and fully reverses 15% of flips. Clamping the feature at inference produces a 31% relative reduction in flip rate at a 1.3 percentage-point accuracy cost while also reducing reliance on text priors.
What carries the argument
The sparse feature at layer 17 located by SAEs on 158 flip cases, which tracks prompt framing and controls decision-margin shifts between yes/no outputs.
If this is right
- Low flip rates can mask image neglect, so evaluations must test both paraphrase stability and image-removal baselines.
- Clamping the identified feature simultaneously improves consistency and reduces text-prior reliance.
- Sparse autoencoder features can serve as concrete intervention targets for fixing specific robustness failures in medical VLMs.
- Robustness benchmarks for clinical models should combine paraphrase testing with mechanistic analysis rather than relying on flip rate alone.
Where Pith is reading between the lines
- The same SAE-based localization method could be applied to other medical VLMs to find analogous intervention points.
- If the feature generalizes, it offers a lightweight inference-time patch that could be deployed without retraining.
Load-bearing premise
The feature found on the 158-case set is the primary causal driver of sensitivity and will produce comparable gains on new questions or different models.
What would settle it
Apply the same clamping procedure to a fresh collection of paraphrase pairs from a held-out dataset or another VLM and check whether the 31% flip-rate reduction and 1.3-point accuracy cost still appear.
read the original abstract
Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, a failure mode that threatens deployment safety. We introduce PSF-Med, a benchmark of 26,850 chest X-ray questions paired with 92,856 meaning-preserving paraphrases across MIMIC-CXR, PadChest, and VinDr-CXR, spanning clinical populations in the US, Spain, and Vietnam. Every paraphrase is validated by an LLM judge using a bidirectional clinical entailment rubric, with 91.6% cross-family agreement. Across nine VLMs, including general-purpose models, we find flip rates from 3% to 37%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PSF-Med, a benchmark of 26,850 chest X-ray questions paired with 92,856 meaning-preserving paraphrases drawn from MIMIC-CXR, PadChest, and VinDr-CXR. It reports flip rates of 3-37% across nine VLMs, shows that low flip rates can mask text-prior reliance via text-only baselines, and uses GemmaScope SAEs on MedGemma 4B to locate a layer-17 sparse feature whose causal patching recovers 45% of the yes-no logit margin on 158 FlipBank cases; clamping this feature at inference is reported to cut flip rates by 31% relative with a 1.3 pp accuracy cost while also lowering text-prior dependence.
Significance. If the clamping intervention generalizes, the work supplies both a large-scale, multi-population benchmark for paraphrase robustness in medical VLMs and a concrete mechanistic intervention, moving beyond aggregate flip statistics to feature-level causal analysis. The combination of LLM-validated paraphrases, SAE-based interpretability, and an inference-time fix with quantified accuracy trade-off would be a useful contribution to safety evaluation of clinical VLMs.
major comments (3)
- [Results (clamping experiment)] Results section on clamping intervention: the 31% relative flip-rate reduction and 1.3 pp accuracy cost are presented without an explicit statement that the test set was disjoint from the 158 FlipBank cases used for feature selection and causal patching. If the reported numbers were obtained on the same discovery subset (or on paraphrases that contributed to feature identification), the effect size is vulnerable to selection bias and does not yet demonstrate a stable, generalizable mechanism.
- [Methods (SAE feature identification)] Methods, FlipBank and SAE analysis: the procedure for selecting the single layer-17 feature from GemmaScope activations on the 158 curated flip cases is not fully specified (e.g., exact correlation or activation threshold, number of candidate features examined, or correction for multiple comparisons). With N=158 and post-hoc selection, the risk that the identified feature is an artifact of the curation rather than a causally responsible direction must be quantified.
- [Results (text-prior analysis)] Evaluation of text-prior reliance: the claim that clamping reduces text-prior dependence is tied to the same intervention, yet the manuscript does not define an independent metric (e.g., accuracy drop when images are ablated before vs. after clamping) that would allow readers to verify the secondary claim separately from the flip-rate improvement.
minor comments (3)
- [Results (model comparison)] Table 1 (or equivalent flip-rate table) should report per-model standard errors or bootstrap intervals; the current aggregate ranges (3-37%) make it difficult to judge whether differences between models are statistically reliable.
- [Methods (paraphrase validation)] The bidirectional clinical entailment rubric used by the LLM judge is described at high level; a short appendix listing the exact prompt template and the 91.6% cross-family agreement breakdown by dataset would improve reproducibility.
- [Figures] Figure captions for the SAE activation and patching visualizations should state the exact layer, feature index, and number of examples shown so readers can map them directly to the quantitative claims.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below with point-by-point responses and have revised the manuscript to incorporate clarifications, additional details, and new analyses where needed.
read point-by-point responses
-
Referee: Results section on clamping intervention: the 31% relative flip-rate reduction and 1.3 pp accuracy cost are presented without an explicit statement that the test set was disjoint from the 158 FlipBank cases used for feature selection and causal patching. If the reported numbers were obtained on the same discovery subset (or on paraphrases that contributed to feature identification), the effect size is vulnerable to selection bias and does not yet demonstrate a stable, generalizable mechanism.
Authors: We agree that an explicit statement is required. The clamping results were evaluated on the full PSF-Med benchmark (26,850 questions), which is completely disjoint from the 158 FlipBank cases used solely for feature identification and causal patching. We have revised the Results section to state this separation explicitly and to note that FlipBank was held out from all reported clamping metrics. revision: yes
-
Referee: Methods, FlipBank and SAE analysis: the procedure for selecting the single layer-17 feature from GemmaScope activations on the 158 curated flip cases is not fully specified (e.g., exact correlation or activation threshold, number of candidate features examined, or correction for multiple comparisons). With N=158 and post-hoc selection, the risk that the identified feature is an artifact of the curation rather than a causally responsible direction must be quantified.
Authors: We have expanded the Methods section to specify the exact procedure: the feature was selected as the single highest Pearson correlation (threshold >0.35) with the yes-no logit margin shift across all 4096 features in layer 17; no multiple-comparison correction was applied because the analysis was exploratory. To quantify the risk of curation artifact, we added a control experiment showing that 100 randomly sampled features from the same layer recover only 4.8% of the margin on average (vs. 45% for the selected feature), indicating the identified direction is unlikely to be spurious. revision: yes
-
Referee: Evaluation of text-prior reliance: the claim that clamping reduces text-prior dependence is tied to the same intervention, yet the manuscript does not define an independent metric (e.g., accuracy drop when images are ablated before vs. after clamping) that would allow readers to verify the secondary claim separately from the flip-rate improvement.
Authors: We have introduced and reported an independent metric: the change in accuracy drop between full (image+text) and text-only conditions before versus after clamping. In the revised Results, clamping increases this accuracy drop by 4.2 percentage points, confirming reduced text-prior reliance as a separate effect from the flip-rate reduction. revision: yes
Circularity Check
Feature selected on 158-case discovery set; clamping reduction reported without confirmed disjoint held-out evaluation
specific steps
-
fitted input called prediction
[Abstract]
"we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost"
The feature is located by inspecting correlations inside the 158-case set; the subsequent clamping experiment that produces the 31% reduction is performed on those same cases. The reported mitigation performance is therefore a direct consequence of the selection criterion rather than an out-of-sample test on held-out questions from the 26,850-question benchmark.
full rationale
The paper constructs PSF-Med and measures flip rates independently across models. The mechanistic claim for MedGemma, however, selects a layer-17 SAE feature by correlation on the 158 FlipBank cases and then reports the 31% relative flip-rate reduction from clamping on the same curated set. This matches the fitted-input-called-prediction pattern: the intervention benefit is measured on the data used to identify the feature, so the reported gain is statistically expected rather than independently validated on the full 26,850-question benchmark or on paraphrases excluded from selection. The external GemmaScope SAE and causal-patching step provide partial grounding, keeping the circularity modest rather than load-bearing.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM judge produces reliable bidirectional clinical entailment labels for paraphrases
invented entities (2)
-
PSF-Med benchmark
no independent evidence
-
FlipBank
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank... identify Feature 3818 at layer 17... clamping the identified feature at inference reduces flip rates by 31% relative
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
PSF-Med... 26,850 chest X-ray questions... flip rates from 3% to 37%
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy plus robustness to adversarial attacks.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.