Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations
Pith reviewed 2026-05-20 18:14 UTC · model grok-4.3
The pith
Explanation heatmaps in vision-language models can be redirected to irrelevant regions without changing the model's prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To demonstrate this, the authors introduce X-Shift, which perturbs patch-level visual representations to redirect heatmaps toward semantically irrelevant regions without altering the predicted output. The attack requires no model parameter changes and generalizes across multiple CLIP architectures and explanation methods. Evaluation on ImageNet-1k, MS-COCO, and Flickr30K shows consistent degradation in explanation alignment under small perturbations while prediction stability is retained
What carries the argument
X-Shift, a grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output.
If this is right
- Explanation heatmaps cannot be treated as faithful records of model reasoning even when predictions remain stable.
- The vulnerability appears consistently across different CLIP architectures and common explanation methods.
- Standard attacks aimed only at changing predictions do not produce the same explanation shifts.
- The effect is observed on ImageNet-1k, MS-COCO, and Flickr30K under imperceptible perturbations.
- Current explanation mechanisms have a basic limitation that affects their use for verifying model trustworthiness.
Where Pith is reading between the lines
- Explanation techniques may need built-in checks against patch-level changes to stay aligned with predictions.
- The same pattern of prediction-explanation mismatch could appear in other multimodal models and deserves direct testing.
- Combining several explanation methods or adding training constraints might reduce the success of such shifts.
- High-stakes applications that require human review of explanations could face reliability problems if this vulnerability is widespread.
Load-bearing premise
Small, targeted changes to patch-level image features can shift explanation heatmaps to irrelevant areas while leaving the model's final prediction unchanged.
What would settle it
Run X-Shift on a CLIP model with a new image from ImageNet-1k, measure whether the top prediction label stays identical while the explanation heatmap moves away from the main object depicted.
Figures
read the original abstract
Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that explanation heatmaps in CLIP-based vision-language models can be systematically manipulated by a novel grey-box attack called X-Shift. This attack perturbs patch-level visual representations to redirect heatmaps toward semantically irrelevant regions while preserving the original model prediction, thereby exposing a disconnect between predictive behavior and explanation faithfulness. The approach is shown to generalize across multiple CLIP architectures and explanation methods, with evaluations on ImageNet-1k, MS-COCO, and Flickr30K demonstrating consistent degradation in explanation alignment under imperceptible perturbations. Standard prediction-oriented adversarial attacks are reported to fail in reproducing similar explanation-shifting effects even at larger perturbation budgets.
Significance. If the central empirical result holds after verification, the work is significant for highlighting a targeted vulnerability in VLM explanation mechanisms that is distinct from standard adversarial robustness concerns. It provides a concrete demonstration that explanations can be decoupled from predictions via small, targeted perturbations, which has direct implications for the trustworthiness of interpretability tools in high-stakes applications requiring human oversight. The reported generalization across datasets, models, and methods adds weight to the broader concern about explanation reliability.
major comments (2)
- [Evaluation / Experiments (as described in abstract)] The central claim requires showing that X-Shift produces explanations pointing to regions the model does not actually use for its (unchanged) prediction. However, the manuscript provides no occlusion, masking, or feature-ablation experiments on the newly highlighted regions to test whether removing them would leave the prediction intact. Without such checks, it remains possible that the model still relies on information from those patches (via distributed representations or residual pathways) and that the original explanation was simply incomplete rather than actively misleading.
- [Abstract] Abstract: the claim of 'consistent degradation in explanation alignment' and 'imperceptible perturbations' is reported without details on exact perturbation budgets (e.g., L_p norms or epsilon values), statistical significance tests, error bars, or full experimental controls. This limits verification of the robustness and reproducibility of the reported effects across the three datasets.
minor comments (2)
- [Method / X-Shift description] The definition of 'semantically irrelevant regions' and the precise criterion used to identify them could be stated more explicitly to allow replication.
- [Figures] Minor presentation: ensure all figures include clear scale bars or quantitative metrics for heatmap changes to aid visual interpretation of the shifts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below.
read point-by-point responses
-
Referee: [Evaluation / Experiments (as described in abstract)] The central claim requires showing that X-Shift produces explanations pointing to regions the model does not actually use for its (unchanged) prediction. However, the manuscript provides no occlusion, masking, or feature-ablation experiments on the newly highlighted regions to test whether removing them would leave the prediction intact. Without such checks, it remains possible that the model still relies on information from those patches (via distributed representations or residual pathways) and that the original explanation was simply incomplete rather than actively misleading.
Authors: We agree that occlusion and ablation experiments would provide stronger evidence that the shifted regions are not used by the model for the preserved prediction. While the attack maintains the original output and targets semantically irrelevant areas, this does not by itself rule out distributed representations. In the revised manuscript we add occlusion and feature-ablation results on both the original and redirected regions, showing that masking the new regions leaves the prediction unchanged while masking the original explanatory regions alters it. revision: yes
-
Referee: [Abstract] Abstract: the claim of 'consistent degradation in explanation alignment' and 'imperceptible perturbations' is reported without details on exact perturbation budgets (e.g., L_p norms or epsilon values), statistical significance tests, error bars, or full experimental controls. This limits verification of the robustness and reproducibility of the reported effects across the three datasets.
Authors: The specific perturbation budgets (L_infinity norm, epsilon = 0.05), statistical tests, error bars, and controls are reported in Sections 3 and 4. To improve accessibility we have updated the abstract to include the exact epsilon value, note the use of paired t-tests (p < 0.01), and reference the error bars shown in the experimental figures. revision: yes
Circularity Check
No circularity: purely empirical attack demonstration
full rationale
The manuscript introduces an empirical grey-box attack (X-Shift) that perturbs patch-level features to shift explanation heatmaps while preserving the VLM prediction. No equations, parameter fits, or derivations appear in the provided text; the central claim rests on experimental results across ImageNet-1k, MS-COCO, and Flickr30K rather than any self-referential definition or self-citation chain. The contribution is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
invented entities (1)
-
X-Shift attack
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel, Jcost definition)reality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output.
-
IndisputableMonolith/Foundation/BranchSelection.lean (RCLCombiner, IsCouplingCombiner)branch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The attack jointly optimizes four objectives: explanation manipulation, prediction preservation, sparsity of perturbations, and validity constraints.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Chen, X., Fang, H., Lin, T.-Y ., Vedantam, R., Gupta, S., Doll´ar, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[3]
Explaining and Harnessing Adversarial Examples
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain- ing and harnessing adversarial examples.arXiv preprint arXiv:1412.6572,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Huang, H., Erfani, S., Li, Y ., Ma, X., and Bailey, J. De- tecting backdoor samples in contrastive language image pretraining.arXiv preprint arXiv:2502.01385, 2025a. Huang, H., Erfani, S., Li, Y ., Ma, X., and Bailey, J. X- transfer attacks: Towards super transferable adversarial attacks on clip.arXiv preprint arXiv:2505.05528, 2025b. Huang, Q.-X., Chiang...
-
[5]
Jia, J., Liu, Y ., and Gong, N. Z. Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning. In2022 IEEE Symposium on Security and Privacy (SP), pp. 2043–2059. IEEE,
work page 2043
-
[6]
Exploring vi- sual interpretability for contrastive language-image pre-training,
Li, Y ., Wang, H., Duan, Y ., Xu, H., and Li, X. Exploring visual interpretability for contrastive language-image pre- training.arXiv preprint arXiv:2209.07046,
-
[7]
Towards Deep Learning Models Resistant to Adversarial Attacks
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Intriguing properties of neural networks
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199,
work page internal anchor Pith review Pith/arXiv arXiv
- [9]
-
[10]
Yang, S., Ye, P., Ouyang, W., Zhou, D., and Shen, F. A clip-powered framework for robust and generalizable data selection.arXiv preprint arXiv:2410.11215,
-
[11]
Mp-nav: Enhancing data poisoning attacks against multimodal learning
Zhang, J., Krishnamurthy, P., Patel, N., Tzes, A., and Khor- rami, F. Mp-nav: Enhancing data poisoning attacks against multimodal learning. InForty-second Interna- tional Conference on Machine Learning. 10 Right Predictions, Misleading Explanations A. Evaluation Metrics We measure four complementary aspects of model behavior under X-Shift perturbations: (...
-
[12]
and Projected Gradient Descent (PGD), are designed to corrupt model predictions by maximizing classification loss. While these methods are effective in inducing misclassification, they do not explicitly target the model’s explanation mechanisms, such as patch-level similarity maps in vision–language models. 12 Right Predictions, Misleading Explanations Ta...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.