Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability
Pith reviewed 2026-05-22 06:01 UTC · model grok-4.3
The pith
VLMs can answer visual questions from text alone due to language priors, so unimodal faithfulness tests give contradictory results and a new joint-contribution metric is needed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-Language Models frequently exhibit cross-modal redundancy due to language priors in datasets, allowing them to answer using text alone. This causes unimodal perturbation metrics to penalize faithful explainers and produce contradictory rankings between visual and textual evaluations. The paper introduces Synergistic Faithfulness (F_syn), derived from the Shapley Interaction Index, to isolate the joint Harsanyi dividend between modalities. The paper reports that F_syn tracks true synergy closely (ρ = 0.92) while being 24 times faster to compute. Evaluations of 8 XAI methods across 3 VLM architectures and 3 datasets indicate that VLM-proposed explainers overemphasize visual salience and underperform adapted attention-based methods at capturing cross-modal synergy.
What carries the argument
Synergistic Faithfulness (F_syn), a metric based on the Shapley Interaction Index that isolates the joint contribution, or Harsanyi dividend, from both visual and textual inputs in VLMs.
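For orientation, the standard pairwise Shapley Interaction Index for players i and j, in a game with player set N (|N| = n) and value function v, can be written as below. The symbols N, v, i, j and m are notation chosen here for illustration; this summary does not show how the paper specializes the index into F_syn.

```latex
I_{ij}(v) = \sum_{S \subseteq N \setminus \{i,j\}}
  \frac{|S|!\,(n-|S|-2)!}{(n-1)!}
  \Bigl[ v(S \cup \{i,j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S) \Bigr]

% Two-player special case (N = \{1,2\}): the index reduces to the
% Harsanyi dividend m(\{1,2\}) of the pair.
I_{12}(v) = v(\{1,2\}) - v(\{1\}) - v(\{2\}) + v(\emptyset) = m(\{1,2\})
```

The bracketed second-order difference is what removes each player's stand-alone contribution: any part of v that depends on i alone, or on j alone, appears once with a plus sign and once with a minus sign.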
Load-bearing premise
Multimodal datasets inherently contain language priors and modality biases that cause VLMs to exhibit cross-modal redundancy, allowing answers from text alone.
What would settle it
An experiment where unimodal perturbation metrics show positive correlation between visual and textual faithfulness rankings on a dataset without language priors, or where F_syn fails to correlate with the actual joint performance improvement from combined modalities.
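The rank agreement in this test can be measured with Kendall's τ, the statistic mentioned in the abstract. A minimal sketch in Python, assuming one faithfulness score per explainer under each unimodal metric; the function name `kendall_tau` and the example scores are hypothetical and are not values from the paper:

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists (assumes no tied scores)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for a, b in combinations(range(len(x)), 2):
        # Positive product: both metrics order explainers a and b the same way.
        product = (x[a] - x[b]) * (y[a] - y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical faithfulness scores for four explainers (illustrative only).
visual_scores = [0.9, 0.7, 0.4, 0.2]
textual_scores = [0.3, 0.8, 0.1, 0.6]
tau = kendall_tau(visual_scores, textual_scores)
print(f"tau = {tau:.2f}")
```

A τ near zero or below means the two unimodal metrics rank the explainers inconsistently; a clearly positive τ on a dataset without language priors would count against the paper's premise.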
Original abstract
Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other. %(Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets, reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that unimodal perturbation metrics for evaluating post-hoc explainers in Vision-Language Models (VLMs) are flawed because multimodal datasets contain language priors and modality biases, enabling VLMs to answer visual queries from text alone and producing contradictory visual/textual rankings. It introduces Synergistic Faithfulness (F_syn), a metric derived from the Shapley Interaction Index that isolates the joint Harsanyi dividend between modalities. The paper reports that F_syn serves as an accurate surrogate (ρ = 0.92) with 24× computational speedup and, when applied to evaluate 8 XAI methods across 3 VLM architectures and 3 benchmark datasets, shows that VLM-proposed explainers over-index on visual salience while adapted attention-based methods better capture true cross-modal synergy.
Significance. If the derivation of F_syn holds and cleanly isolates cross-modal interactions without residual unimodal contributions, the work offers a practical and more reliable evaluation framework for VLM explainability. The reported computational speedup and multi-model/multi-dataset evaluation are concrete strengths that could aid auditing of VLM reasoning in high-stakes settings. The identification of an evaluation collapse in current unimodal metrics addresses a genuine limitation in multimodal XAI.
major comments (3)
- [§3] §3 (definition of F_syn and characteristic function v(S)): The central claim that F_syn 'strictly isolates the joint Harsanyi dividend' requires that unimodal first-order terms cancel exactly when computing the interaction index over mixed continuous visual patches and discrete text tokens. The manuscript must provide the explicit formulation of v(S) (e.g., the perturbation or masking scheme) and demonstrate or verify that marginal contributions contain no residual pure-visual or pure-textual value; without this, the isolation is unverified and the surrogate status of ρ = 0.92 cannot be assessed.
- [Results] Results section (surrogate validation): The reported ρ = 0.92 is presented as evidence that F_syn is a 'highly accurate surrogate,' yet if this correlation is computed against the full Shapley computation on the same evaluation instances, it constitutes an empirical fit rather than independent validation. This raises the circularity concern noted in the reader's assessment and must be addressed by clarifying the validation procedure or external grounding.
- [Evaluation] Evaluation (comparative claims): The conclusion that 'explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods' is load-bearing for the practical contribution. This ranking depends directly on the soundness of F_syn; any residual unimodal bias in the metric would undermine the comparative result across the 8 methods, 3 architectures, and 3 datasets.
minor comments (3)
- [Abstract] Abstract: The sentence beginning 'Evaluating 8 distinct XAI methods ... reveals that' has a subject-verb agreement issue; rephrase for grammatical clarity.
- [Abstract] Abstract: The parenthetical reference to Kendall's τ = -0.06 is commented out; either restore the supporting observation with its exact dataset and section reference or remove the claim of contradictory rankings.
- [Throughout] Notation: Ensure the symbol F_syn (or mathcal{F}_{syn}) is defined at first use and consistently referenced to the corresponding equation number throughout the text.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating where revisions will strengthen the manuscript.
-
Referee: [§3] §3 (definition of F_syn and characteristic function v(S)): The central claim that F_syn 'strictly isolates the joint Harsanyi dividend' requires that unimodal first-order terms cancel exactly when computing the interaction index over mixed continuous visual patches and discrete text tokens. The manuscript must provide the explicit formulation of v(S) (e.g., the perturbation or masking scheme) and demonstrate or verify that marginal contributions contain no residual pure-visual or pure-textual value; without this, the isolation is unverified and the surrogate status of ρ = 0.92 cannot be assessed.
Authors: We agree that additional explicit details will improve rigor. In the revised manuscript we will expand Section 3 with the complete definition of the characteristic function v(S), including the precise perturbation scheme: visual patches are masked by zeroing pixel values or substituting the dataset mean, while text tokens are removed or replaced by a special mask token. We will also insert a short derivation showing that the Shapley Interaction Index subtracts all first-order marginal contributions by definition, leaving only the pure joint Harsanyi dividend. An appendix will contain numerical checks on representative instances confirming that residual unimodal terms fall below a negligible threshold after subtraction.
revision: yes
-
Referee: [Results] Results section (surrogate validation): The reported ρ = 0.92 is presented as evidence that F_syn is a 'highly accurate surrogate,' yet if this correlation is computed against the full Shapley computation on the same evaluation instances, it constitutes an empirical fit rather than independent validation. This raises the circularity concern noted in the reader's assessment and must be addressed by clarifying the validation procedure or external grounding.
Authors: We acknowledge the concern. The reported correlation was obtained by comparing F_syn against the full Shapley Interaction Index on the same evaluation instances, which is a direct empirical check of approximation fidelity rather than an independent external test. In the revision we will explicitly describe this procedure, state the number of instances used, and note that the high correlation demonstrates the practical accuracy of the surrogate. We will also add a sentence clarifying that this validation is internal to the approximation and that future work could pursue fully held-out or external grounding.
revision: yes
-
Referee: [Evaluation] Evaluation (comparative claims): The conclusion that 'explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods' is load-bearing for the practical contribution. This ranking depends directly on the soundness of F_syn; any residual unimodal bias in the metric would undermine the comparative result across the 8 methods, 3 architectures, and 3 datasets.
Authors: We agree that the comparative ranking is central and rests on the properties of F_syn. With the expanded formulation, derivation, and validation details added in response to the first two comments, the isolation of cross-modal synergy will be more clearly established. In the revised Evaluation section we will add a short robustness paragraph that ties the observed ranking directly to the clarified properties of F_syn and reports consistency of the ordering across all three datasets and architectures.
revision: partial
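The kind of numerical check promised in the first response can be illustrated on a two-player toy game. The sketch below is an assumption-laden illustration, not the paper's v(S): it treats the top-attributed patches and the top-attributed tokens as the two players, masks patches by zeroing and tokens with a placeholder, and keeps all background features visible. The names `masked_value`, `joint_dividend`, `MASK_TOKEN` and both toy models are hypothetical.

```python
from typing import Callable, List, Set

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a masked text token

Model = Callable[[List[float], List[str]], float]


def masked_value(model: Model, patches: List[float], tokens: List[str],
                 top_patches: Set[int], top_tokens: Set[int],
                 show_visual: bool, show_text: bool) -> float:
    """Model score when each top-attributed group is shown or masked.

    Simplifying assumption: background patches and tokens stay visible.
    """
    p = [0.0 if (i in top_patches and not show_visual) else x
         for i, x in enumerate(patches)]
    t = [MASK_TOKEN if (i in top_tokens and not show_text) else tok
         for i, tok in enumerate(tokens)]
    return model(p, t)


def joint_dividend(model: Model, patches: List[float], tokens: List[str],
                   top_patches: Set[int], top_tokens: Set[int]) -> float:
    """v({I,T}) - v({I}) - v({T}) + v(empty) for the two target groups."""
    def v(show_visual: bool, show_text: bool) -> float:
        return masked_value(model, patches, tokens, top_patches, top_tokens,
                            show_visual, show_text)
    return v(True, True) - v(True, False) - v(False, True) + v(False, False)


def additive_model(patches: List[float], tokens: List[str]) -> float:
    """Toy model with no cross-modal interaction."""
    return sum(patches) + 0.1 * sum(tok != MASK_TOKEN for tok in tokens)


def synergistic_model(patches: List[float], tokens: List[str]) -> float:
    """Toy model whose score is a product of visual and textual evidence."""
    return sum(patches) * sum(tok != MASK_TOKEN for tok in tokens)


patches = [0.5, 1.0, 2.0, 0.25]
tokens = ["what", "color", "is", "the", "car"]
top_patches, top_tokens = {2}, {1, 4}

residual = joint_dividend(additive_model, patches, tokens, top_patches, top_tokens)
synergy = joint_dividend(synergistic_model, patches, tokens, top_patches, top_tokens)
print(f"additive residual = {residual:.2e}, synergistic dividend = {synergy:.2f}")
```

For the purely additive toy model the dividend vanishes up to floating-point error, which is the behaviour the referee asks the authors to verify for their actual masking scheme.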
Circularity Check
No significant circularity in derivation chain
full rationale
The abstract presents F_syn as derived from the established Shapley Interaction Index to isolate the joint Harsanyi dividend, with the reported ρ=0.92 serving as an empirical correlation to an external benchmark rather than a self-referential fit. No equations, self-citations, or definitional reductions are visible in the provided text that would make the metric equivalent to its inputs by construction. The central claim rests on standard cooperative game theory applied to a new multimodal value function, which remains independent of the evaluation results. This is the most common honest finding for papers that introduce metrics grounded in prior mathematical frameworks without load-bearing self-references.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multimodal datasets contain language priors and modality biases, leading to cross-modal redundancy in VLMs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Synergistic Faithfulness (F_syn) ... rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Reproducibility and implementation details
A.1 VLM Configurations and Generation Strategy
Model Checkpoints: We evaluate all methods across three distinct Vision-Language Model archi...
Prompt suffix (fragment): "\nAnswer directly with only the letter inside parentheses, and nothing else. \n"
- Unimodal target players (Players 1 & 2): Player 1 is defined as the subset of top-attributed visual patches (I_k ⊆ I). Player 2 is defined as the subset of top-attributed textual tokens (T_k ⊆ T).
- Cross-modal background coalitions (Players 3-8): The remaining background image patches (I \ I_k) and background text tokens (T \ T_k) are independently shuffled and partitioned into C = 6 equal subsets. We form 6 bimodal macro-players by explicitly coupling these modalities. Thus, each background Player c is inherently multimodal: activating it simultaneousl...
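With two target players and six background macro-players, the game described above has eight players, so all 2^8 = 256 coalitions can be enumerated exactly. The sketch below computes the standard pairwise Shapley Interaction Index between the two target players on a made-up value function. It is an illustration under stated assumptions, not the paper's F_syn: `shapley_interaction` and `toy_value` are hypothetical names, players are indexed from 0, and in practice the value function would be a masked VLM score.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet

ValueFn = Callable[[FrozenSet[int]], float]


def shapley_interaction(value: ValueFn, n_players: int, i: int, j: int) -> float:
    """Exact pairwise Shapley Interaction Index via full enumeration."""
    others = [p for p in range(n_players) if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Weight |S|! (n - |S| - 2)! / (n - 1)! for coalitions S of this size.
        weight = (factorial(size) * factorial(n_players - size - 2)
                  / factorial(n_players - 1))
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Second-order difference: removes what i or j contribute alone.
            delta = (value(s | {i, j}) - value(s | {i})
                     - value(s | {j}) + value(s))
            total += weight * delta
    return total


def toy_value(coalition: FrozenSet[int]) -> float:
    """Hypothetical game: players 0 and 1 are the visual and textual targets,
    players 2..7 are the six background macro-players."""
    score = 1.0 * (0 in coalition) + 0.5 * (1 in coalition)
    score += 0.1 * sum(1 for p in coalition if p >= 2)
    if 0 in coalition and 1 in coalition:
        score += 2.0  # pairwise cross-modal term
    if {0, 1, 2} <= coalition:
        score += 3.0  # three-way term that also involves a background player
    return score


sii = shapley_interaction(toy_value, n_players=8, i=0, j=1)
print(f"pairwise interaction between the two target players: {sii:.3f}")
```

In this toy game the purely unimodal and background terms drop out, and the result is 2.0 + 3.0 / 2 = 3.5: the pairwise index picks up the pair's own dividend plus a share of any higher-order dividend that contains both players. How F_syn treats such higher-order terms is not shown in this summary.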