Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

Hadis Karimipour; Narges Babadi

arxiv: 2605.16651 · v1 · pith:HSMRGIY7new · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

Narges Babadi , Hadis Karimipour This is my paper

Pith reviewed 2026-05-20 18:14 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords vision-language modelsexplanation faithfulnessadversarial attacksCLIP modelsexplanation heatmapsprediction stabilitygrey-box attackrobustness of explanations

0 comments

The pith

Explanation heatmaps in vision-language models can be redirected to irrelevant regions without changing the model's prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models based on CLIP can keep making the same correct prediction even after small changes are made to how an image is divided into patches. These changes shift the explanation heatmaps to focus on parts of the image that have no semantic connection to the output. The authors introduce a grey-box attack called X-Shift that achieves this redirection across different model versions and explanation techniques. A reader would care because many systems use these heatmaps to decide whether a model's decision is trustworthy. If the result holds, explanations lose value as reliable signals of what the model actually considered when making its choice.

Core claim

The central claim is that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To demonstrate this, the authors introduce X-Shift, which perturbs patch-level visual representations to redirect heatmaps toward semantically irrelevant regions without altering the predicted output. The attack requires no model parameter changes and generalizes across multiple CLIP architectures and explanation methods. Evaluation on ImageNet-1k, MS-COCO, and Flickr30K shows consistent degradation in explanation alignment under small perturbations while prediction stability is retained

What carries the argument

X-Shift, a grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output.

If this is right

Explanation heatmaps cannot be treated as faithful records of model reasoning even when predictions remain stable.
The vulnerability appears consistently across different CLIP architectures and common explanation methods.
Standard attacks aimed only at changing predictions do not produce the same explanation shifts.
The effect is observed on ImageNet-1k, MS-COCO, and Flickr30K under imperceptible perturbations.
Current explanation mechanisms have a basic limitation that affects their use for verifying model trustworthiness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explanation techniques may need built-in checks against patch-level changes to stay aligned with predictions.
The same pattern of prediction-explanation mismatch could appear in other multimodal models and deserves direct testing.
Combining several explanation methods or adding training constraints might reduce the success of such shifts.
High-stakes applications that require human review of explanations could face reliability problems if this vulnerability is widespread.

Load-bearing premise

Small, targeted changes to patch-level image features can shift explanation heatmaps to irrelevant areas while leaving the model's final prediction unchanged.

What would settle it

Run X-Shift on a CLIP model with a new image from ImageNet-1k, measure whether the top prediction label stays identical while the explanation heatmap moves away from the main object depicted.

Figures

Figures reproduced from arXiv: 2605.16651 by Hadis Karimipour, Narges Babadi.

**Figure 1.** Figure 1: Visualization of a sample image under the X-Shift attack. From left to right, columns show the clean and adversarial images (the adversarial image is optimized to shift CLIP’s explanation toward “ground” while preserving the “bench” prediction; see black box), the clean–adversarial difference map, CLIP heatmaps illustrating explanation drift. 3. The Proposed Attack: X-Shift We introduce X-Shift, an explana… view at source ↗

**Figure 2.** Figure 2: Comparison of CLIP explanations on ImageNet dataset(ViT-B/16, ViT-B/32, ViT-L/14) under X-Shift attack. Columns show original/adversarial images, CLIP heatmaps (clean vs. adversarial). 4.1. Results on Explainability Proposed Attack Effectiveness [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Explanations on Flickr30k samples using CLIP (ViT-B/16, ViT-B/32, ViT-L/14) under X-Shift attack. Shown are original/adversarial images, CLIP heatmaps (clean vs. adversarial) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Explanation robustness on COCO samples using CLIP (ViT-B/16, ViT-B/32, ViT-L/14) under X-Shift attack. Columns display original vs. adversarial images, CLIP heatmaps (clean vs. adversarial). 4.2. Transferability of X-Shift Across Vision Transformer Backbones We evaluate whether explanation-shifting perturbations generated on one CLIP encoder transfer to other CLIP variants with different patch sizes and … view at source ↗

**Figure 5.** Figure 5: Transfer matrix of IoU-TopK (lower indicates stronger manipulation) across source → target CLIP backbones. ViT-B/32 perturbations transfer most broadly, while ViT-L/14 perturbations remain more model-specific. Experiment analysis [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: XAI transferability results for the X-Shift adversarial perturbation on COCO dataset with CLIP ViT-B/16. Similarity maps, ScoreCAM, RISE, and GAE all exhibit explanation drift under the attack when applied to vanilla CLIP [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 9.** Figure 9: Spatial heatmap ablation of X-Shift loss components. The full objective produces a balanced, targeted explanation shift while preserving the clean prediction. xai only yields stronger but spatially aggressive drift; pred only barely alters the explanation map, confirming that both objective families are necessary for a stable and stealthy attack. B. Ablation Study: X-Shift Attack Objective To validate that… view at source ↗

**Figure 10.** Figure 10: Clean and adversarial examples generated by FGSM and PGD (targeted and untargeted), compared with X-Shift on the COCO dataset using CLIP ViT-B/32 [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

read the original abstract

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that explanation heatmaps in CLIP-based vision-language models can be systematically manipulated by a novel grey-box attack called X-Shift. This attack perturbs patch-level visual representations to redirect heatmaps toward semantically irrelevant regions while preserving the original model prediction, thereby exposing a disconnect between predictive behavior and explanation faithfulness. The approach is shown to generalize across multiple CLIP architectures and explanation methods, with evaluations on ImageNet-1k, MS-COCO, and Flickr30K demonstrating consistent degradation in explanation alignment under imperceptible perturbations. Standard prediction-oriented adversarial attacks are reported to fail in reproducing similar explanation-shifting effects even at larger perturbation budgets.

Significance. If the central empirical result holds after verification, the work is significant for highlighting a targeted vulnerability in VLM explanation mechanisms that is distinct from standard adversarial robustness concerns. It provides a concrete demonstration that explanations can be decoupled from predictions via small, targeted perturbations, which has direct implications for the trustworthiness of interpretability tools in high-stakes applications requiring human oversight. The reported generalization across datasets, models, and methods adds weight to the broader concern about explanation reliability.

major comments (2)

[Evaluation / Experiments (as described in abstract)] The central claim requires showing that X-Shift produces explanations pointing to regions the model does not actually use for its (unchanged) prediction. However, the manuscript provides no occlusion, masking, or feature-ablation experiments on the newly highlighted regions to test whether removing them would leave the prediction intact. Without such checks, it remains possible that the model still relies on information from those patches (via distributed representations or residual pathways) and that the original explanation was simply incomplete rather than actively misleading.
[Abstract] Abstract: the claim of 'consistent degradation in explanation alignment' and 'imperceptible perturbations' is reported without details on exact perturbation budgets (e.g., L_p norms or epsilon values), statistical significance tests, error bars, or full experimental controls. This limits verification of the robustness and reproducibility of the reported effects across the three datasets.

minor comments (2)

[Method / X-Shift description] The definition of 'semantically irrelevant regions' and the precise criterion used to identify them could be stated more explicitly to allow replication.
[Figures] Minor presentation: ensure all figures include clear scale bars or quantitative metrics for heatmap changes to aid visual interpretation of the shifts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

read point-by-point responses

Referee: [Evaluation / Experiments (as described in abstract)] The central claim requires showing that X-Shift produces explanations pointing to regions the model does not actually use for its (unchanged) prediction. However, the manuscript provides no occlusion, masking, or feature-ablation experiments on the newly highlighted regions to test whether removing them would leave the prediction intact. Without such checks, it remains possible that the model still relies on information from those patches (via distributed representations or residual pathways) and that the original explanation was simply incomplete rather than actively misleading.

Authors: We agree that occlusion and ablation experiments would provide stronger evidence that the shifted regions are not used by the model for the preserved prediction. While the attack maintains the original output and targets semantically irrelevant areas, this does not by itself rule out distributed representations. In the revised manuscript we add occlusion and feature-ablation results on both the original and redirected regions, showing that masking the new regions leaves the prediction unchanged while masking the original explanatory regions alters it. revision: yes
Referee: [Abstract] Abstract: the claim of 'consistent degradation in explanation alignment' and 'imperceptible perturbations' is reported without details on exact perturbation budgets (e.g., L_p norms or epsilon values), statistical significance tests, error bars, or full experimental controls. This limits verification of the robustness and reproducibility of the reported effects across the three datasets.

Authors: The specific perturbation budgets (L_infinity norm, epsilon = 0.05), statistical tests, error bars, and controls are reported in Sections 3 and 4. To improve accessibility we have updated the abstract to include the exact epsilon value, note the use of paired t-tests (p < 0.01), and reference the error bars shown in the experimental figures. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical attack demonstration

full rationale

The manuscript introduces an empirical grey-box attack (X-Shift) that perturbs patch-level features to shift explanation heatmaps while preserving the VLM prediction. No equations, parameter fits, or derivations appear in the provided text; the central claim rests on experimental results across ImageNet-1k, MS-COCO, and Flickr30K rather than any self-referential definition or self-citation chain. The contribution is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The paper's central contribution is an empirical attack method that assumes access to patch-level representations in transformer-based VLMs and the existence of standard explanation techniques like heatmaps; no free parameters or new physical entities are introduced.

invented entities (1)

X-Shift attack no independent evidence
purpose: To systematically redirect explanation heatmaps via patch perturbations without changing predictions
Newly proposed grey-box method described in the abstract as operating on visual representations.

pith-pipeline@v0.9.0 · 5772 in / 1278 out tokens · 67456 ms · 2026-05-20T18:14:10.640621+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel, Jcost definition) reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output.
IndisputableMonolith/Foundation/BranchSelection.lean (RCLCombiner, IsCouplingCombiner) branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The attack jointly optimizes four objectives: explanation manipulation, prediction preservation, sparsity of perturbations, and validity constraints.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 5 internal anchors

[1]

Chen, X., Fang, H., Lin, T.-Y ., Vedantam, R., Gupta, S., Doll´ar, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[3]

Explaining and Harnessing Adversarial Examples

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain- ing and harnessing adversarial examples.arXiv preprint arXiv:1412.6572,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

De- tecting backdoor samples in contrastive language image pretraining.arXiv preprint arXiv:2502.01385, 2025a

Huang, H., Erfani, S., Li, Y ., Ma, X., and Bailey, J. De- tecting backdoor samples in contrastive language image pretraining.arXiv preprint arXiv:2502.01385, 2025a. Huang, H., Erfani, S., Li, Y ., Ma, X., and Bailey, J. X- transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025b. Huang, Q.-X., Chiang...

work page arXiv
[5]

Jia, J., Liu, Y ., and Gong, N. Z. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In2022 IEEE Symposium on Security and Privacy (SP), pp. 2043–2059. IEEE,

work page 2043
[6]

Exploring vi- sual interpretability for contrastive language-image pre-training,

Li, Y ., Wang, H., Duan, Y ., Xu, H., and Li, X. Exploring visual interpretability for contrastive language-image pre- training.arXiv preprint arXiv:2209.07046,

work page arXiv
[7]

Towards Deep Learning Models Resistant to Adversarial Attacks

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Intriguing properties of neural networks

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Wang, X., Hu, P., Deng, J., and Ma, J. W. Adversarial attacks on data attribution.arXiv preprint arXiv:2409.05657,

work page arXiv
[10]

A clip-powered framework for robust and generalizable data selection.arXiv preprint arXiv:2410.11215,

Yang, S., Ye, P., Ouyang, W., Zhou, D., and Shen, F. A clip-powered framework for robust and generalizable data selection.arXiv preprint arXiv:2410.11215,

work page arXiv
[11]

Mp-nav: Enhancing data poisoning attacks against multimodal learning

Zhang, J., Krishnamurthy, P., Patel, N., Tzes, A., and Khor- rami, F. Mp-nav: Enhancing data poisoning attacks against multimodal learning. InForty-second Interna- tional Conference on Machine Learning. 10 Right Predictions, Misleading Explanations A. Evaluation Metrics We measure four complementary aspects of model behavior under X-Shift perturbations: (...

work page arXiv
[12]

and Projected Gradient Descent (PGD), are designed to corrupt model predictions by maximizing classification loss. While these methods are effective in inducing misclassification, they do not explicitly target the model’s explanation mechanisms, such as patch-level similarity maps in vision–language models. 12 Right Predictions, Misleading Explanations Ta...

work page arXiv 1951

[1] [1]

Chen, X., Fang, H., Lin, T.-Y ., Vedantam, R., Gupta, S., Doll´ar, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010

[3] [3]

Explaining and Harnessing Adversarial Examples

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain- ing and harnessing adversarial examples.arXiv preprint arXiv:1412.6572,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

De- tecting backdoor samples in contrastive language image pretraining.arXiv preprint arXiv:2502.01385, 2025a

Huang, H., Erfani, S., Li, Y ., Ma, X., and Bailey, J. De- tecting backdoor samples in contrastive language image pretraining.arXiv preprint arXiv:2502.01385, 2025a. Huang, H., Erfani, S., Li, Y ., Ma, X., and Bailey, J. X- transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025b. Huang, Q.-X., Chiang...

work page arXiv

[5] [5]

Jia, J., Liu, Y ., and Gong, N. Z. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In2022 IEEE Symposium on Security and Privacy (SP), pp. 2043–2059. IEEE,

work page 2043

[6] [6]

Exploring vi- sual interpretability for contrastive language-image pre-training,

Li, Y ., Wang, H., Duan, Y ., Xu, H., and Li, X. Exploring visual interpretability for contrastive language-image pre- training.arXiv preprint arXiv:2209.07046,

work page arXiv

[7] [7]

Towards Deep Learning Models Resistant to Adversarial Attacks

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Intriguing properties of neural networks

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Wang, X., Hu, P., Deng, J., and Ma, J. W. Adversarial attacks on data attribution.arXiv preprint arXiv:2409.05657,

work page arXiv

[10] [10]

A clip-powered framework for robust and generalizable data selection.arXiv preprint arXiv:2410.11215,

Yang, S., Ye, P., Ouyang, W., Zhou, D., and Shen, F. A clip-powered framework for robust and generalizable data selection.arXiv preprint arXiv:2410.11215,

work page arXiv

[11] [11]

Mp-nav: Enhancing data poisoning attacks against multimodal learning

Zhang, J., Krishnamurthy, P., Patel, N., Tzes, A., and Khor- rami, F. Mp-nav: Enhancing data poisoning attacks against multimodal learning. InForty-second Interna- tional Conference on Machine Learning. 10 Right Predictions, Misleading Explanations A. Evaluation Metrics We measure four complementary aspects of model behavior under X-Shift perturbations: (...

work page arXiv

[12] [12]

and Projected Gradient Descent (PGD), are designed to corrupt model predictions by maximizing classification loss. While these methods are effective in inducing misclassification, they do not explicitly target the model’s explanation mechanisms, such as patch-level similarity maps in vision–language models. 12 Right Predictions, Misleading Explanations Ta...

work page arXiv 1951