
arxiv: 2605.22168 · v1 · pith:Z6T47CSGnew · submitted 2026-05-21 · 💻 cs.AI · cs.LG

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Pith reviewed 2026-05-22 06:01 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords Vision-Language Models · Explainable AI · Cross-modal synergy · Shapley Interaction Index · Faithfulness metrics · Multimodal reasoning · XAI evaluation

The pith

VLMs can answer visual questions from text alone due to language priors, so unimodal faithfulness tests give contradictory results and a new joint-contribution metric is needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard ways of testing how well explanations capture VLM reasoning are unreliable. Multimodal datasets carry biases that let models lean on language priors instead of combining vision and text, so perturbing one modality at a time yields rankings that disagree across modalities. To address this, the authors define a metric called Synergistic Faithfulness that uses Shapley interactions to measure the extra value gained from using both modalities together. The paper reports that this metric tracks the exact Shapley Interaction Index closely (ρ = 0.92) while running about 24 times faster. When applied, it indicates that most explainers proposed for VLMs focus too heavily on the image and lag behind adapted attention-based approaches. A reliable check on cross-modal explanations matters for trusting these models in applications where mistakes are costly.

Core claim

Vision-Language Models frequently exhibit cross-modal redundancy because language priors in multimodal datasets let them answer from text alone. As a result, unimodal perturbation metrics penalize faithful explainers and produce contradictory rankings between visual and textual evaluations. The paper introduces Synergistic Faithfulness (F_syn), derived from the Shapley Interaction Index, to isolate the joint Harsanyi dividend between modalities. The paper reports that F_syn tracks the exact index closely (ρ = 0.92) while being 24 times faster to compute. Evaluations of 8 XAI methods across 3 VLM architectures and 3 benchmark datasets indicate that VLM-proposed explainers overemphasize visual salience and underperform adapted attention-based methods at capturing cross-modal synergy.

What carries the argument

Synergistic Faithfulness (F_syn), a metric based on the Shapley Interaction Index that isolates the joint contribution, or Harsanyi dividend, from both visual and textual inputs in VLMs.
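For orientation, the two quantities named here have standard textbook forms. The block below states them for a generic cooperative game with player set N and value function v; it is a sketch of the standard definitions, not the paper's exact formula for F_syn, which this page does not reproduce. The symbols N, v, m, and SII(i, j) are assumed notation.

```latex
% Harsanyi dividend of a coalition S (Moebius transform of v)
m(S) = \sum_{R \subseteq S} (-1)^{|S| - |R|} \, v(R)

% Two-player case with an image player I and a text player T
m(\{I, T\}) = v(\{I, T\}) - v(\{I\}) - v(\{T\}) + v(\emptyset)

% Shapley Interaction Index of a pair {i, j} in an n-player game
\mathrm{SII}(i, j) = \sum_{S \subseteq N \setminus \{i, j\}}
  \frac{(n - |S| - 2)! \, |S|!}{(n - 1)!}
  \bigl[ v(S \cup \{i, j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S) \bigr]
```

With n = 2 the only admissible S is the empty set and its weight is 1, so the pairwise index coincides with the two-player Harsanyi dividend.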

Load-bearing premise

Multimodal datasets inherently contain language priors and modality biases that cause VLMs to exhibit cross-modal redundancy, allowing answers from text alone.

What would settle it

An experiment where unimodal perturbation metrics show positive correlation between visual and textual faithfulness rankings on a dataset without language priors, or where F_syn fails to correlate with the actual joint performance improvement from combined modalities.
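The first half of this test reduces to a rank-agreement statistic between two explainer rankings. A minimal sketch, assuming Kendall's tau-a without ties and using made-up scores (the function name and the numbers are illustrative, not taken from the paper):

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists (assumes no ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both metrics order the two explainers alike.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical faithfulness scores for four explainers (not from the paper).
visual_scores = [0.9, 0.7, 0.4, 0.2]
textual_scores = [0.3, 0.8, 0.1, 0.6]
tau = kendall_tau(visual_scores, textual_scores)
```

A value near +1 would mean the visual and textual metrics agree on the ordering; a value near zero or below is the kind of disagreement the paper calls an evaluation collapse.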

Figures

Figures reproduced from arXiv: 2605.22168 by Joël Roman Ky, Maxime Cordy, Salah Ghamizi.

Figure 1: Limitations of unimodal faithfulness metrics
Figure 2: Ranking instability of explainers when evaluated using isolated unimodal perturbation. The ...
Figure 3: Correlation between the ground-truth SII baseline and the synergistic faithfulness (F_syn). The strong rank alignment is consistently preserved across fundamentally different explainer types and VLM architectures.
[Residue of an adjacent plot: execution time per instance (seconds, log scale) by explainer method (Rollout, TAM, InputxGradients, Random), comparing Exact SII (Downsampled) with F_syn.]
Figure 5: Average explainer rankings across all benchmark instances. Significance indicators denote the results of pairwise Wilcoxon signed-rank tests evaluated against the top-performing method.
[Residue of an adjacent plot: LMM fixed effect vs. the Random explainer baseline for TAM, LLaVACAM, GradCAM, IntegratedGradients, InputxGradients, GradxRollout, Rollout, and AttnLRP.]
Figure 7: Forest plots of the dataset-specific LMM fixed effects ( ...
Figure 8: Unimodal rank instability across datasets. The lines track how each explainer's ranking ...
Figure 9: Decomposed ordinal rankings of explainers across metrics and datasets. The top row ...
Figure 10: Distribution of synergistic faithfulness ( ...
Figure 11: Distribution of textual faithfulness (µ_T^srg) scores evaluated globally and across individual datasets.
Figure 12: Distribution of visual faithfulness (µ_I^srg) scores evaluated globally and across individual datasets.
[Body text captured with the Figure 12 caption:] ... time against Synergistic Faithfulness (F_syn). The global and dataset-specific plots reveal that while attention-based explainers achieve the highest faithfulness, they incur a significant execution cost, occupying the top-right quadrant. Similarly, methods like Integrated Gradients (which requires mult...
Figure 13: Pareto frontier of explainer efficiency vs. synergistic faithfulness ( ...
Figure 14: Example on CVBench (InternVL2-2B). The prompt queries the quantity of "scarfs" (a ...
Figure 15: Example from RePOPE (InternVL-2B). The prompt asks to verify the presence of a ...
Figure 16: Example from CVBench (InternVL2-2B). The prompt requires localizing highly specific, ...
Figure 17: Failure example from MMStar dataset (InternVL).
Original abstract

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other. %(Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets, reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript argues that unimodal perturbation metrics for evaluating post-hoc explainers in Vision-Language Models (VLMs) are flawed because multimodal datasets contain language priors and modality biases, enabling VLMs to answer visual queries from text alone and producing contradictory visual/textual rankings. It introduces Synergistic Faithfulness (F_syn), a metric derived from the Shapley Interaction Index that isolates the joint Harsanyi dividend between modalities. The paper reports that F_syn serves as an accurate surrogate (ρ = 0.92) with 24× computational speedup and, when applied to evaluate 8 XAI methods across 3 VLM architectures and 3 benchmark datasets, shows that VLM-proposed explainers over-index on visual salience while adapted attention-based methods better capture true cross-modal synergy.

Significance. If the derivation of F_syn holds and cleanly isolates cross-modal interactions without residual unimodal contributions, the work offers a practical and more reliable evaluation framework for VLM explainability. The reported computational speedup and multi-model/multi-dataset evaluation are concrete strengths that could aid auditing of VLM reasoning in high-stakes settings. The identification of an evaluation collapse in current unimodal metrics addresses a genuine limitation in multimodal XAI.

major comments (3)
  1. [§3] §3 (definition of F_syn and characteristic function v(S)): The central claim that F_syn 'strictly isolates the joint Harsanyi dividend' requires that unimodal first-order terms cancel exactly when computing the interaction index over mixed continuous visual patches and discrete text tokens. The manuscript must provide the explicit formulation of v(S) (e.g., the perturbation or masking scheme) and demonstrate or verify that marginal contributions contain no residual pure-visual or pure-textual value; without this, the isolation is unverified and the surrogate status of ρ = 0.92 cannot be assessed.
  2. [Results] Results section (surrogate validation): The reported ρ = 0.92 is presented as evidence that F_syn is a 'highly accurate surrogate,' yet if this correlation is computed against the full Shapley computation on the same evaluation instances, it constitutes an empirical fit rather than independent validation. This raises the circularity concern noted in the reader's assessment and must be addressed by clarifying the validation procedure or external grounding.
  3. [Evaluation] Evaluation (comparative claims): The conclusion that 'explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods' is load-bearing for the practical contribution. This ranking depends directly on the soundness of F_syn; any residual unimodal bias in the metric would undermine the comparative result across the 8 methods, 3 architectures, and 3 datasets.
minor comments (3)
  1. [Abstract] Abstract: The sentence beginning 'Evaluating 8 distinct XAI methods ... reveals that' has a subject-verb agreement issue; rephrase for grammatical clarity.
  2. [Abstract] Abstract: The parenthetical reference to Kendall's τ = -0.06 is commented out; either restore the supporting observation with its exact dataset and section reference or remove the claim of contradictory rankings.
  3. [Throughout] Notation: Ensure the symbol F_syn (or mathcal{F}_{syn}) is defined at first use and consistently referenced to the corresponding equation number throughout the text.
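The cancellation requirement in major comment 1 can be illustrated on toy two-player games. The sketch below is not the paper's v(S); the three value functions are hypothetical and only show how the second-order difference behaves for purely additive, redundant, and synergistic modalities.

```python
def pair_interaction(v):
    """Second-order difference v(IT) - v(I) - v(T) + v(empty) for a 2-player game.

    `v` maps a frozenset of active players ("I" = image, "T" = text) to a score.
    """
    both = frozenset({"I", "T"})
    return v(both) - v(frozenset({"I"})) - v(frozenset({"T"})) + v(frozenset())


# Hypothetical toy games; the numbers are illustrative only.
def additive_game(coalition):
    # Each modality contributes on its own; there is no joint effect.
    return 0.25 * ("I" in coalition) + 0.5 * ("T" in coalition)


def redundant_game(coalition):
    # Either modality alone suffices, as with a strong language prior.
    return 1.0 if coalition else 0.0


def synergistic_game(coalition):
    # Only the combination of both modalities yields the answer.
    return 1.0 if coalition == frozenset({"I", "T"}) else 0.0


additive = pair_interaction(additive_game)        # first-order terms cancel
redundant = pair_interaction(redundant_game)      # negative: overlap is double counted
synergistic = pair_interaction(synergistic_game)  # positive: pure joint value
```

In this toy setting the additive game yields exactly zero, which is the property the referee asks the authors to verify for their actual masking scheme.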

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (definition of F_syn and characteristic function v(S)): The central claim that F_syn 'strictly isolates the joint Harsanyi dividend' requires that unimodal first-order terms cancel exactly when computing the interaction index over mixed continuous visual patches and discrete text tokens. The manuscript must provide the explicit formulation of v(S) (e.g., the perturbation or masking scheme) and demonstrate or verify that marginal contributions contain no residual pure-visual or pure-textual value; without this, the isolation is unverified and the surrogate status of ρ = 0.92 cannot be assessed.

    Authors: We agree that additional explicit details will improve rigor. In the revised manuscript we will expand Section 3 with the complete definition of the characteristic function v(S), including the precise perturbation scheme: visual patches are masked by zeroing pixel values or substituting the dataset mean, while text tokens are removed or replaced by a special mask token. We will also insert a short derivation showing that the Shapley Interaction Index subtracts all first-order marginal contributions by definition, leaving only the pure joint Harsanyi dividend. An appendix will contain numerical checks on representative instances confirming that residual unimodal terms fall below a negligible threshold after subtraction. revision: yes

  2. Referee: [Results] Results section (surrogate validation): The reported ρ = 0.92 is presented as evidence that F_syn is a 'highly accurate surrogate,' yet if this correlation is computed against the full Shapley computation on the same evaluation instances, it constitutes an empirical fit rather than independent validation. This raises the circularity concern noted in the reader's assessment and must be addressed by clarifying the validation procedure or external grounding.

    Authors: We acknowledge the concern. The reported correlation was obtained by comparing F_syn against the full Shapley Interaction Index on the same evaluation instances, which is a direct empirical check of approximation fidelity rather than an independent external test. In the revision we will explicitly describe this procedure, state the number of instances used, and note that the high correlation demonstrates the practical accuracy of the surrogate. We will also add a sentence clarifying that this validation is internal to the approximation and that future work could pursue fully held-out or external grounding. revision: yes

  3. Referee: [Evaluation] Evaluation (comparative claims): The conclusion that 'explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods' is load-bearing for the practical contribution. This ranking depends directly on the soundness of F_syn; any residual unimodal bias in the metric would undermine the comparative result across the 8 methods, 3 architectures, and 3 datasets.

    Authors: We agree that the comparative ranking is central and rests on the properties of F_syn. With the expanded formulation, derivation, and validation details added in response to the first two comments, the isolation of cross-modal synergy is now more clearly established. In the revised Evaluation section we will add a short robustness paragraph that ties the observed ranking directly to the clarified properties of F_syn and reports consistency of the ordering across all three datasets and architectures. revision: partial
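The masking scheme described in the first response (zeroed patches, mask-token replacement) can be sketched as a two-player value function. Everything here is an assumption made for illustration: `MASK_TOKEN`, `mask_inputs`, `make_value_function`, and `toy_model` are hypothetical names, the model is a stand-in rather than a real VLM, and all non-selected inputs are simply kept masked.

```python
MASK_TOKEN = "[MASK]"  # assumed placeholder; real tokenizers define their own


def mask_inputs(patches, tokens, keep_patches, keep_tokens):
    """Zero out patches and replace tokens whose indices are not kept."""
    masked_patches = [p if i in keep_patches else 0.0 for i, p in enumerate(patches)]
    masked_tokens = [t if i in keep_tokens else MASK_TOKEN for i, t in enumerate(tokens)]
    return masked_patches, masked_tokens


def make_value_function(model, patches, tokens, top_patches, top_tokens):
    """Build v(S) for coalitions S of {"I", "T"} from a scoring model."""
    def v(coalition):
        keep_p = set(top_patches) if "I" in coalition else set()
        keep_t = set(top_tokens) if "T" in coalition else set()
        return model(*mask_inputs(patches, tokens, keep_p, keep_t))
    return v


def toy_model(patches, tokens):
    # Hypothetical stand-in for a VLM's answer confidence:
    # a text-only term plus a term that needs both modalities.
    visual = sum(patches)
    textual = sum(1 for t in tokens if t != MASK_TOKEN)
    return 0.5 * textual + 0.25 * visual * textual


v = make_value_function(
    toy_model,
    patches=[1.0, 0.0, 2.0],
    tokens=["how", "many", "scarfs"],
    top_patches=[0, 2],
    top_tokens=[2],
)
both = frozenset({"I", "T"})
interaction = v(both) - v(frozenset({"I"})) - v(frozenset({"T"})) + v(frozenset())
```

The text-only term (0.5 per visible token) drops out of `interaction`, leaving only the product term; a numerical check of this kind is what the authors promise for the appendix.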

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract presents F_syn as derived from the established Shapley Interaction Index to isolate the joint Harsanyi dividend, with the reported ρ=0.92 serving as an empirical correlation to an external benchmark rather than a self-referential fit. No equations, self-citations, or definitional reductions are visible in the provided text that would make the metric equivalent to its inputs by construction. The central claim rests on standard cooperative game theory applied to a new multimodal value function, which remains independent of the evaluation results. This is the most common honest finding for papers that introduce metrics grounded in prior mathematical frameworks without load-bearing self-references.
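The surrogate check discussed here amounts to a correlation between F_syn scores and exact SII scores on the same instances. A minimal sketch, assuming ρ denotes Spearman's rank correlation (the page does not say which coefficient is used) and using made-up scores:

```python
def ranks(values):
    """1-based ranks of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical per-instance scores (not from the paper).
exact_sii = [0.10, 0.40, 0.35, 0.80, 0.05]
surrogate = [0.12, 0.30, 0.38, 0.75, 0.02]
rho = spearman_rho(exact_sii, surrogate)
```

A high value computed this way shows approximation fidelity on the sampled instances; it does not by itself validate the exact index as a measure of synergy, which is the distinction the referee draws.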

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the domain assumption that language priors in multimodal data create redundancy that invalidates unimodal metrics; the metric itself is positioned as a direct application of Shapley Interaction Index without additional free parameters stated in the abstract.

axioms (1)
  • domain assumption Multimodal datasets contain language priors and modality biases leading to cross-modal redundancy in VLMs
    Explicitly stated in abstract as the root cause of evaluation collapse in unimodal metrics.





    \nAnswer directly with only the letter inside parentheses, and nothing else. \n

    Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling permutations for shapley value estimation.Journal of Machine Learning Research, 23(43):1–46, 2022. 12 A Reproducibility and implementation details A.1 VLM Configurations and Generation Strategy Model Checkpoints:We evaluate all methods across three distinct Vision-Language Model archi...

  45. [45]

    Player 2 is defined as the subset of top-attributed textual tokens (Tk ⊆T)

    Unimodal target players (Players 1 & 2):Player 1 is defined as the subset of top-attributed visual patches (Ik ⊆I ). Player 2 is defined as the subset of top-attributed textual tokens (Tk ⊆T)

  46. [46]

    How many scarfs are in the image? Select from the following choices. (A) 3 (B) 1 (C) 0 (D) 2 Answer directly with only the letter inside parentheses, and nothing else. Answer :

    Cross-modal background coalitions (Players 3-8):The remaining background image patches ( I\I k) and background text tokens ( T\T k) are independently shuffled and partitioned into C= 6 equal subsets. We form 6 bimodal macro-players by explicitly coupling these modalities. Thus, each background Player c is inherently multimodal: activating it simultaneousl...
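The eight-player construction (two target players plus six bimodal background macro-players) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `partition_background`, `pair_sii`, and `toy_value` are hypothetical names, the index ranges are invented, and the value function is a toy product of visible fractions rather than a VLM output.

```python
import random
from itertools import combinations
from math import factorial


def partition_background(indices, n_groups, rng):
    """Shuffle indices and split them into n_groups near-equal subsets."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[g::n_groups] for g in range(n_groups)]


def pair_sii(v, players, i, j):
    """Exact Shapley Interaction Index of players i and j under value function v."""
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(n - size - 2) * factorial(size) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s))
    return total


rng = random.Random(0)
top_patches, top_tokens = {0, 1}, {0}   # hypothetical top-attributed indices
background_patches = range(2, 14)       # 12 hypothetical background patches
background_tokens = range(1, 7)         # 6 hypothetical background tokens

patch_groups = partition_background(background_patches, 6, rng)
token_groups = partition_background(background_tokens, 6, rng)
# Players 3..8 each couple one patch subset with one token subset.
macro_players = {c + 3: (patch_groups[c], token_groups[c]) for c in range(6)}
players = [1, 2] + sorted(macro_players)


def toy_value(coalition):
    # Toy stand-in for the model score: product of visible fractions.
    visible_patches = len(top_patches) if 1 in coalition else 0
    visible_tokens = len(top_tokens) if 2 in coalition else 0
    for c, (patches, tokens) in macro_players.items():
        if c in coalition:
            visible_patches += len(patches)
            visible_tokens += len(tokens)
    return (visible_patches / 14) * (visible_tokens / 7)


synergy = pair_sii(toy_value, players, 1, 2)
```

With eight players the exact index needs 64 background coalitions times four evaluations each, i.e. all 2^8 = 256 coalitions, which is small enough to enumerate; the cost that matters in practice is one model forward pass per coalition.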