When Background Matters: Breaking Medical Vision Language Models by Transferable Attack

Akash Ghosh; Sriparna Saha; Subhadip Baidya; Xiuying Chen

arxiv: 2604.17318 · v1 · submitted 2026-04-19 · 💻 cs.CV

Pith reviewed 2026-05-10 05:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial attacks · medical vision-language models · black-box attack · transferable attack · image perturbations · diagnostic robustness

The pith

MedFocusLeak fools medical vision-language models by perturbing only non-diagnostic background regions to induce wrong but plausible diagnoses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedFocusLeak as a black-box multimodal attack designed specifically for clinical VLMs. It works by adding coordinated, imperceptible changes to background areas that carry no diagnostic information and by distracting the model's attention away from actual pathology. Tests across six different medical imaging modalities show this produces incorrect yet realistic diagnostic outputs that transfer well between models. The authors also supply a new unified evaluation setup with metrics that measure both how often the attack succeeds and how natural the altered images remain.

Core claim

MedFocusLeak achieves state-of-the-art performance in generating misleading yet realistic diagnostic outputs across diverse VLMs by injecting coordinated perturbations into non-diagnostic background regions and employing an attention distraction mechanism.

What carries the argument

MedFocusLeak, the attack method that limits perturbations to background regions while using attention distraction to shift model focus away from pathological areas.
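
The abstract gives no equations, so the following is only a minimal sketch of the general idea of confining a perturbation to background pixels. It is not the authors' algorithm. The function names, the `lesion_mask` input, and the L-infinity budget `eps` are all assumptions made for illustration, and a random `delta` stands in for whatever optimization the real attack performs. The attention distraction mechanism is not modeled.

```python
import numpy as np

def project_to_background(delta, lesion_mask, eps):
    """Clip delta to an L-inf budget and zero it inside the diagnostic region.

    lesion_mask is a hypothetical boolean array: True marks diagnostic pixels.
    """
    background = 1.0 - lesion_mask.astype(delta.dtype)
    return np.clip(delta, -eps, eps) * background

def apply_perturbation(image, delta, lesion_mask, eps=4 / 255):
    """Return an attacked image in [0, 1] whose diagnostic pixels are untouched."""
    delta = project_to_background(delta, lesion_mask, eps)
    return np.clip(image + delta, 0.0, 1.0)

# Toy example: an 8x8 grayscale "scan" with a 3x3 diagnostic region.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
lesion_mask = np.zeros((8, 8), dtype=bool)
lesion_mask[2:5, 2:5] = True

# Random noise is a placeholder for an optimized perturbation (assumption).
delta = rng.normal(scale=0.1, size=image.shape)
attacked = apply_perturbation(image, delta, lesion_mask)
```

In this toy setup the pixels under `lesion_mask` are identical before and after, and no background pixel moves by more than `eps`. Whether the paper uses a mask of this kind, or this norm, cannot be confirmed from the abstract.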

If this is right

Medical VLMs will need robustness techniques that protect against background-only manipulations.
Current clinical deployment of VLMs carries hidden risk of imperceptible adversarial inputs.
Evaluation of future medical VLMs should include joint measures of attack success and image fidelity.
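
To make "joint measures of attack success and image fidelity" concrete, here is one hypothetical way such a score could be defined. The paper's actual metrics are not described in the abstract. The PSNR threshold, the sample layout, and the name `joint_success_rate` are assumptions chosen only for this sketch.

```python
import math

def psnr(original, attacked, max_value=1.0):
    """Peak signal-to-noise ratio between two flat pixel lists, in dB."""
    mse = sum((a - b) ** 2 for a, b in zip(original, attacked)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

def joint_success_rate(samples, min_psnr=40.0):
    """Fraction of samples where the diagnosis flipped AND fidelity stayed high.

    Each sample is a dict with keys 'original', 'attacked' (pixel lists)
    and 'flipped' (bool). The 40 dB threshold is an illustrative assumption.
    """
    hits = sum(
        1 for s in samples
        if s["flipped"] and psnr(s["original"], s["attacked"]) >= min_psnr
    )
    return hits / len(samples)

samples = [
    # Flipped with a tiny change: counts as a joint success.
    {"original": [0.5, 0.5], "attacked": [0.501, 0.5], "flipped": True},
    # Flipped, but the change is large enough to be visible: does not count.
    {"original": [0.5, 0.5], "attacked": [0.7, 0.5], "flipped": True},
    # Tiny change, but the diagnosis did not flip: does not count.
    {"original": [0.5, 0.5], "attacked": [0.501, 0.5], "flipped": False},
]
rate = joint_success_rate(samples)
```

A conjunctive score like this penalizes attacks that succeed only by visibly degrading the image, which is the failure mode the abstract attributes to attacks transferred from natural images.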

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that modern medical VLMs may be relying more on surrounding context than on lesion-specific features.
Training procedures that explicitly penalize background sensitivity could reduce this vulnerability.
The attack pattern might extend to other multimodal medical tasks such as report generation or treatment planning.

Load-bearing premise

That coordinated perturbations limited to non-diagnostic background regions combined with attention distraction will reliably produce clinically plausible incorrect diagnoses that remain imperceptible to clinicians.

What would settle it

A controlled test in which practicing radiologists review original and attacked image pairs side-by-side, fail to detect the changes, and consistently rate the induced wrong diagnoses as medically plausible.
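
One way to analyze such a reader study, sketched under assumptions: suppose each trial is a forced choice in which a radiologist must say which image of a pair was altered, so that guessing yields 50% accuracy. A one-sided exact binomial test then asks whether the observed detection rate exceeds chance. The trial design, the function name, and the counts below are hypothetical and do not come from the paper.

```python
from math import comb

def detection_p_value(correct, trials, chance=0.5):
    """One-sided exact binomial p-value: P(X >= correct) under pure guessing."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical outcome: 27 correct identifications in 50 forced-choice trials.
p_example = detection_p_value(27, 50)
print(f"p-value for detection above chance: {p_example:.3f}")
```

A large p-value would be consistent with the perturbations going undetected, though failing to reject chance is not proof of imperceptibility. A study of this kind would also need enough trials to have reasonable power.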

Original abstract

Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect yet clinically plausible diagnoses while keeping perturbations imperceptible. The method injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model's focus away from pathological areas. Extensive evaluations across six medical imaging modalities show that MedFocusLeak achieves state-of-the-art performance, generating misleading yet realistic diagnostic outputs across diverse VLMs. We further introduce a unified evaluation framework with novel metrics that jointly capture attack success and image fidelity, revealing a critical weakness in the reasoning capabilities of modern clinical VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Only the abstract is available, so the claims of a new SOTA transferable attack on medical VLMs can't be checked yet.

The main thing to know about this paper is that only the abstract is available, which means the big claims about MedFocusLeak being a highly transferable attack that breaks medical VLMs with imperceptible changes can't be checked at all right now.

What they propose looks like a reasonable step forward. They target background regions that aren't part of the diagnosis and combine that with an attention distraction trick to push the model toward incorrect but believable outputs. This tries to fix two issues from earlier attacks: the ones from natural images that look too fake, and medical-specific ones that aren't good at transferring across models. Adding a unified evaluation framework with metrics for both success and image quality is also a good idea on paper.

The problem is everything else. There are no methods, no results, no tables, and no way to see if it really works across modalities or stays hidden from clinicians. The central idea rests on the hope that these subtle background tweaks plus distraction will create plausible errors, but without any numbers or examples, that stays speculative. It's hard to judge the soundness when the support is missing.

This would be of interest to people studying robustness in medical vision-language models or adversarial ML applied to healthcare. A reader could take the high-level idea and try to implement something similar, but there's not much to engage with deeply yet. I'd recommend sending the full version to peer review if the experiments back up the abstract. The topic matters for real-world deployment of these models, so it's worth referee time once the details are there.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MedFocusLeak, a highly transferable black-box multimodal adversarial attack on medical vision-language models (VLMs). The attack injects coordinated perturbations limited to non-diagnostic background regions combined with an attention distraction mechanism to induce incorrect yet clinically plausible diagnoses while remaining imperceptible to clinicians. It claims state-of-the-art performance across six medical imaging modalities, supported by extensive evaluations, and introduces a unified evaluation framework with novel metrics that jointly assess attack success and image fidelity.

Significance. If the empirical results and method details were provided and substantiated, this work would be significant for exposing vulnerabilities in clinical VLMs, particularly their reliance on background context and attention patterns, and for providing a standardized framework to evaluate such attacks. It could inform robustness research in safety-critical medical AI. However, with only the abstract available and no methods, data, results, or quantitative evidence, the actual significance cannot be determined.

major comments (2)

[Abstract] The central claims of 'state-of-the-art performance' and 'extensive evaluations across six medical imaging modalities' are asserted without any quantitative results, baseline comparisons, specific attack success rates, image fidelity metrics, error bars, or implementation details of the coordinated perturbations and attention distraction mechanism. This absence is load-bearing, as it prevents verification of whether the attack reliably produces clinically plausible incorrect diagnoses or remains imperceptible.
[Abstract] The method description ('injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model's focus away from pathological areas') is too high-level and lacks any equations, algorithmic steps, or pseudocode, making it impossible to assess technical soundness, novelty relative to existing transferable attacks, or reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and for highlighting areas where the abstract could be clarified. We provide point-by-point responses below. The full manuscript contains the detailed methods, results, and evaluations that support our claims.

Referee: The central claims of 'state-of-the-art performance' and 'extensive evaluations across six medical imaging modalities' are asserted without any quantitative results, baseline comparisons, specific attack success rates, image fidelity metrics, error bars, or implementation details of the coordinated perturbations and attention distraction mechanism. This absence is load-bearing, as it prevents verification of whether the attack reliably produces clinically plausible incorrect diagnoses or remains imperceptible.

Authors: The abstract is a concise summary and does not include specific numerical results or detailed implementation, consistent with standard academic practice to keep abstracts brief. The full manuscript provides extensive quantitative evaluations across the six modalities, including attack success rates, comparisons to baselines, image fidelity metrics, error bars, and implementation details of the perturbations and attention distraction mechanism. These substantiate the state-of-the-art performance and the production of clinically plausible misdiagnoses while maintaining imperceptibility. If only the abstract was available for review, we apologize for any submission issue and are happy to provide the complete paper.

revision: no

Referee: The method description ('injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model's focus away from pathological areas') is too high-level and lacks any equations, algorithmic steps, or pseudocode, making it impossible to assess technical soundness, novelty relative to existing transferable attacks, or reproducibility.

Authors: We agree that the abstract's method description is high-level. The full manuscript includes the mathematical formulations for generating the coordinated perturbations in background regions, the steps of the attention distraction mechanism, and pseudocode for the overall attack algorithm. This allows assessment of technical soundness, novelty in the medical VLM context, and reproducibility. We can add a brief reference to the methods section in the abstract if the editor deems it necessary, but we believe the current form is appropriate.

revision: no

Circularity Check

0 steps flagged

No circularity: empirical attack proposal with no derivation chain or self-referential elements

The provided abstract describes an empirical adversarial attack method (MedFocusLeak) that injects perturbations and uses attention distraction, claiming SOTA performance via 'extensive evaluations across six medical imaging modalities.' No equations, derivations, fitted parameters, predictions, or self-citations appear in the text. The central claim rests on asserted experimental results rather than on any mathematical reduction or self-definition, so there is no derivation chain to audit and nothing that could be circular. This matches the expected non-circular outcome for a methods paper without load-bearing theoretical steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new entities; the work is an empirical attack method relying on standard adversarial ML techniques.

pith-pipeline@v0.9.0 · 5434 in / 965 out tokens · 54184 ms · 2026-05-10T05:51:18.643891+00:00 · methodology
