V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

Junjie Hu; Ming Jiang; Qidong Wang

arxiv: 2509.14837 · v2 · submitted 2025-09-18 · 💻 cs.CL

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

Qidong Wang , Junjie Hu , Ming Jiang This is my paper

Pith reviewed 2026-05-18 16:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords visual semantic editingattention modulationcausal interpretabilityvision-language modelsVQA benchmarksmultimodal integrationattention heads

0 comments

The pith

V-SEAM uses concept-level visual edits to identify attention heads that causally shape vision-language model predictions and modulates them to raise VQA accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents V-SEAM to move causal interpretability in vision-language models beyond coarse pixel changes to targeted concept edits at the level of objects, attributes, and relationships. By intervening on images in this semantic way the method locates attention heads whose activity supports or opposes correct predictions. Positive heads cluster within a given semantic level while negative heads appear more broadly shared. An automatic modulation step then adjusts the embeddings of these heads and produces measurable gains on standard VQA benchmarks for both LLaVA and InstructBLIP.

Core claim

V-SEAM combines visual semantic editing with attention-head modulation to reveal that specific heads contribute positively or negatively to VLM predictions at three distinct semantic levels, with positive heads largely shared inside each level and negative heads generalizing across levels; automatic embedding modulation of the identified heads then improves performance on three VQA benchmarks for LLaVA and InstructBLIP.

What carries the argument

V-SEAM framework that performs concept-level visual semantic edits to intervene on inputs and then selects and modulates attention-head embeddings according to their measured positive or negative causal effect.

If this is right

Positive attention heads are shared within each semantic level but differ across object, attribute, and relationship levels.
Negative attention heads generalize across semantic levels rather than staying level-specific.
Automatic modulation of the selected head embeddings raises accuracy on three separate VQA benchmarks for both LLaVA and InstructBLIP.
The approach supplies a causal account of how multimodal integration occurs inside the attention layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same editing-plus-modulation pipeline could be tested on additional VLMs beyond the two reported here to check consistency of head roles.
If semantic levels are handled by partly distinct head groups, future architectures might explicitly route or regularize those groups to reduce cross-level interference.
Causal head identification of this kind offers a route to targeted model editing that avoids retraining entire networks.

Load-bearing premise

The attention heads located by the semantic editing interventions are causally responsible for the observed predictions rather than merely correlated with them.

What would settle it

Modulating the identified heads produces no performance gain or introduces new errors on the three VQA benchmarks while leaving the original predictions unchanged under the same visual edits.

Figures

Figures reproduced from arXiv: 2509.14837 by Junjie Hu, Ming Jiang, Qidong Wang.

**Figure 1.** Figure 1: An example of visual intervention comparisons: Visual non-semantic vs. semantic interventions. VLMs, with particular emphasis on causal intervention methods given their dual benefits: causally unraveling model behaviors and providing systematic pathways for model improvement, like model editing (Zhao et al., 2024; Lin et al., 2025). Existing work on causal interpretability primarily focuses on large lan… view at source ↗

**Figure 2.** Figure 2: Our proposed semantic-level causal interpretability framework. The pipeline starts from semantic-guided [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Logit change analysis for image and query token patching in LLaVA and InstructBLIP. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Layer-wise causal impact of MLP (blue) and self-attention (green) on cross-modal semantic understanding. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Case study of the MLP (blue) and self-attention (green) logit lens in LLaVA on the Action VQA task. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Visual comparison of different perturbation strategies for the object-level task ("Is the sky blue?"). [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Visual comparison of different perturbation strategies for the object-level task ("Is there a sandwich in this [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Visual comparison of different perturbation strategies for the relation-level task ("Is the man standing [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Case of MLP (blue) and self-attention (green) Logit Lens in LLAVA on Material VQA. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Case of MLP (blue) and self-attention (green) Logit Lens in InstructBLIP on Indoor VQA. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Layer-wise causal impact of MLP and self-attention on the Material VQA task [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Layer-wise causal impact of attention and MLP on the Vehicle VQA task [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Layer-wise causal impact of MLP and self-attention on the Animal VQA task [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Layer-wise causal impact of MLP and self-attention on the Spatial VQA task [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗

read the original abstract

Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLaVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

V-SEAM gives a pipeline for concept-level visual edits to attribute and modulate attention heads in VLMs with reported VQA gains, but the causal claims rest on how well the edits isolate semantic levels without side effects.

read the letter

The punchline on this paper is that it gives a concrete method for doing semantic-level visual edits on images fed to VLMs, uses those to score attention heads as positive or negative contributors at the object, attribute, and relationship levels, and then modulates the embeddings of key heads to get accuracy lifts on standard VQA tasks. What the paper does well is lay out a pipeline that goes past simple pixel noise or masking. By editing at the concept level, it aims for more interpretable interventions that tie directly to the semantics the model is reasoning about. The finding that positive heads tend to be shared within a semantic level but differ across levels, while negative heads are more broadly active, is a clear observation that could help map out how these models handle different parts of a scene. Showing results on both LLaVA and InstructBLIP across three benchmarks adds some breadth. The decision to open-source the code and data is also a solid move that lets the community verify the head selection and modulation details. Where it gets softer is on the causal side. The core claim is that the identified heads are causally responsible for the predictions at each level. For that to hold, the visual edits really have to change only the intended concept without creating side effects in other visual features or across semantic categories that might activate the same heads. The stress-test note flags this isolation issue, and without seeing the full methods on how the edits were constructed and validated, it's difficult to rule out that the head attributions are partly driven by correlated changes. On the performance side, the automatic modulation is presented as improving results, but if the heads were chosen using the same intervention data without a separate validation set, the gains could reflect some degree of overfitting to the selection rather than a general causal effect. The abstract does not detail statistical significance or controls, so those would need checking in the full paper. This work is aimed at the interpretability community working on VLMs, particularly those interested in attention mechanisms and causal probing. A reader who wants practical ways to intervene in multimodal models or to boost performance via head modulation would get value from the approach. It deserves a serious referee because the combination of semantic editing, level-specific attribution, and the modulation step is distinct from prior coarse interventions, even if the evidence for strong causality is not yet airtight. I recommend putting it through peer review, with the referees asked to focus on the edit isolation experiments and any held-out validation for the modulation results.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces V-SEAM, a framework combining visual semantic editing at object, attribute, and relationship levels with attention-head modulation to probe causal mechanisms in vision-language models. It identifies heads with positive or negative contributions to predictions, reports patterns of head sharing within and across semantic levels, and claims that automatic modulation of selected heads improves VQA performance for LLaVA and InstructBLIP on three benchmarks.

Significance. If the causal claims are substantiated, the work would advance multimodal interpretability by moving beyond pixel-level perturbations to concept-level visual interventions and by linking identified heads to measurable performance gains. The public release of data and code is a clear strength that supports reproducibility and follow-up studies.

major comments (3)

[§4] §4 (Head Identification via Semantic Edits): The claim that identified heads are causally responsible for predictions at each semantic level rests on the assumption that the visual edits isolate the targeted concept without introducing correlated pixel-level or cross-level changes. No ablation or control experiments (e.g., non-semantic visual perturbations or cross-level edit controls) are reported to rule out these confounds; this directly affects the validity of the positive/negative head attributions.
[§5.2] §5.2 (Automatic Modulation and Benchmark Results): The modulation procedure re-uses heads selected from the same editing interventions that are later used to demonstrate performance gains. Without a held-out validation split or explicit non-causal control conditions (random heads, shuffled attributions), the reported improvements on the three VQA benchmarks may partly reflect selection bias rather than causal modulation, weakening the link between interpretability findings and performance claims.
[§4.3] §4.3 (Head-Sharing Patterns): The observation that positive heads are shared within semantic levels but vary across levels lacks statistical controls for multiple comparisons or baseline comparisons against randomly selected heads; without these, the reported patterns could arise from noise or model idiosyncrasies rather than genuine semantic specialization.

minor comments (2)

[Abstract] The abstract states results on 'three diverse VQA benchmarks' but does not name them; listing the specific datasets (e.g., VQAv2, GQA, OK-VQA) would improve immediate clarity.
[§3] Notation for contribution scores (positive/negative head metrics) is introduced without an explicit equation; adding a short formal definition would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the causal claims in V-SEAM, and we address each point below with plans for revisions to improve the manuscript.

read point-by-point responses

Referee: [§4] §4 (Head Identification via Semantic Edits): The claim that identified heads are causally responsible for predictions at each semantic level rests on the assumption that the visual edits isolate the targeted concept without introducing correlated pixel-level or cross-level changes. No ablation or control experiments (e.g., non-semantic visual perturbations or cross-level edit controls) are reported to rule out these confounds; this directly affects the validity of the positive/negative head attributions.

Authors: We agree that explicit controls are needed to rule out potential confounds and fully support the causal attributions. Our semantic editing method targets specific concepts (objects, attributes, relationships) through precise, localized visual changes intended to isolate the relevant semantics while preserving other elements. However, to directly address this concern, we will add ablation experiments in the revised manuscript, including non-semantic perturbations (such as random pixel noise or unrelated edits) and cross-level edit controls. These will quantify whether the positive/negative head identifications are specific to the targeted semantic interventions. revision: yes
Referee: [§5.2] §5.2 (Automatic Modulation and Benchmark Results): The modulation procedure re-uses heads selected from the same editing interventions that are later used to demonstrate performance gains. Without a held-out validation split or explicit non-causal control conditions (random heads, shuffled attributions), the reported improvements on the three VQA benchmarks may partly reflect selection bias rather than causal modulation, weakening the link between interpretability findings and performance claims.

Authors: We acknowledge the potential selection bias arising from reusing the same interventions for both head selection and performance evaluation. In the revision, we will introduce a held-out validation split to separate head identification from the final modulation experiments. We will also add explicit control conditions using randomly selected heads and shuffled attributions. These additions will strengthen the evidence that the observed VQA improvements for LLaVA and InstructBLIP on the three benchmarks result from causal modulation of the identified heads rather than bias. revision: yes
Referee: [§4.3] §4.3 (Head-Sharing Patterns): The observation that positive heads are shared within semantic levels but vary across levels lacks statistical controls for multiple comparisons or baseline comparisons against randomly selected heads; without these, the reported patterns could arise from noise or model idiosyncrasies rather than genuine semantic specialization.

Authors: We will revise §4.3 to include rigorous statistical controls. This will involve applying corrections for multiple comparisons across the tested heads and semantic levels, as well as baseline comparisons against randomly sampled head sets of equivalent size. These analyses will help establish whether the observed within-level sharing and cross-level variation reflect genuine semantic specialization beyond what would be expected from noise or model-specific artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents V-SEAM as a sequential framework: visual semantic edits at object/attribute/relationship levels are used to measure changes in predictions and thereby identify positive/negative attention heads, followed by a separate automatic modulation step applied to those heads to report benchmark gains on VQA tasks. No equations or self-citations are shown that reduce the identification or modulation steps to a direct fit or re-use of the same quantities by construction. The performance claims rest on empirical application across models and benchmarks rather than a closed definitional loop, and the released code allows external verification independent of any internal fitting procedure described in the manuscript.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard interpretability assumptions about the validity of semantic interventions and the causal role of attention heads; no explicit free parameters or new entities are named in the abstract.

axioms (1)

domain assumption Semantic edits at the concept level produce valid causal interventions that isolate contributions at object, attribute, and relationship levels.
This premise underpins the head identification step described in the abstract.

pith-pipeline@v0.9.0 · 5701 in / 1334 out tokens · 42001 ms · 2026-05-18T16:16:21.261948+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships... automatic method to modulate key head embeddings
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

activation patching... ∆ℓl_τ(x,˜z, y) = ˆℓl_τ(x,˜z, y) − ℓ(x,˜z, y)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 6 internal anchors

[1]

Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, and Iain Barr. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems

work page 2022
[3]

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S \"u nderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674--3683

work page 2018
[4]

Yuxuan Bai, Hao Cheng, Yuwei Gu, and 1 others. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2310.07904

work page arXiv 2023
[5]

Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti. 2024. https://openreview.net/forum?id=s63dtq0mwA Understanding information storage and transfer in multi-modal large language models . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page 2024
[6]

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219

work page 2022
[7]

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. 2024. Lvlm-interpret: An interpretability tool for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8182--8187

work page 2024
[8]

Blair Bilodeau, Natasha Jaques, Pang Wei Koh, and Been Kim. 2024. Impossibility theorems for feature attribution. Proceedings of the National Academy of Sciences, 121(2):e2304406120

work page 2024
[9]

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. 2025. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773

work page arXiv 2025
[10]

Weijie Chen, Yizhe Zhang, Qian Wu, and 1 others. 2024. Internvl: Scaling up vision-language pretraining with multimodal reinforcement learning. arXiv preprint arXiv:2402.00028

work page arXiv 2024
[11]

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS) 36

work page 2023
[13]

Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. 2023. Interpreting clip's image representation via text-based decomposition. arXiv preprint arXiv:2310.05916

work page arXiv 2023
[14]

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767

work page arXiv 2023
[15]

Michal Golovanevsky, William Rudman, Vedant Palit, Carsten Eickhoff, and Ritambhara Singh. 2025. https://aclanthology.org/2025.naacl-long.571/ What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation . In Proceedings of the 2025 Conference of the North American Chapter of the Association for...

work page 2025
[16]

Lisa Anne Hendricks and Aida Nematzadeh. 2021. Probing image-language transformers for verb understanding. arXiv preprint arXiv:2106.09141

work page arXiv 2021
[17]

Kaichen Huang, Jiahao Huo, Yibo Yan, Kun Wang, Yutao Yue, and Xuming Hu. 2024. Miner: Mining the underlying pattern of modality-specific neurons in multimodal large language models. arXiv preprint arXiv:2410.04819

work page arXiv 2024
[18]

Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

work page 2019
[19]

Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu. 2024. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193

work page arXiv 2024
[20]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, and 1 others. 2023. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015--4026

work page 2023
[21]

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 388--395

work page 2004
[22]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and 1 others. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Junnan Li, Dongxu Li, Yixuan Xie, Yixuan Guo, Xiyang Dai, Jianfeng Gao, Jianwei Sang, and Lijuan Wang. 2023 a . Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML)

work page 2023
[24]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023 b . Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, and 1 others. 2025. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516

work page arXiv 2025
[26]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296--26306

work page 2024
[27]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

work page 2023
[28]

Y Liu, Y Zhang, and S Yeung-Levy. 2025. Mechanistic interpretability meets vision language models: Insights and limitations. In The Fourth Blogpost Track at ICLR 2025

work page 2025
[29]

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems, 29

work page 2016
[30]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359--17372

work page 2022
[31]

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. 2024. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149

work page arXiv 2024
[32]

Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. 2023. https://arxiv.org/abs/2308.14179 Towards vision-language mechanistic interpretability: A causal tracing tool for blip . In Proceedings of the ICCV Workshop on Computational Linguistics for Vision and Language (CLVL)

work page arXiv 2023
[33]

o gel, Philipp M \

Ekta Sood, Fabian K \"o gel, Philipp M \"u ller, Dominike Thomas, Mihai B \^a ce, and Andreas Bulling. 2023. Multimodal integration of human-like attention in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2648--2658

work page 2023
[34]

Student. 1908. The probable error of a mean. Biometrika, pages 1--25

work page 1908
[35]

Shimon Ullman. 1987. Visual routines. In Readings in computer vision, pages 298--328. Elsevier

work page 1987
[36]

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388--12401

work page 2020
[37]

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, and 1 others. 2024. Cogvlm: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475--121499

work page 2024
[38]

Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, and Sijie Zhu. 2025. Where do large vision-language models look at when answering questions? arXiv preprint arXiv:2503.13891

work page arXiv 2025
[39]

Fred Zhang and Neel Nanda. 2023. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5579--5588

work page 2021
[41]

Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, and Jieping Ye. 2024. From redundancy to relevance: Enhancing explainability in multimodal large language models. arXiv e-prints, pages arXiv--2406

work page 2024
[42]

Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1--38

work page 2024
[43]

Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. 2024. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, pages 260--278. Springer

work page 2024
[44]

Deyao Zhu, Xiangxin Zhou, Xiang Wang, Xiubo Geng, Fan Liu, and Jiashen Zhu. 2024. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. 2024. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision, pages 195--211. Springer

work page 2024
[46]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page
[47]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, and Iain Barr. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems

work page 2022

[3] [3]

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S \"u nderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674--3683

work page 2018

[4] [4]

Yuxuan Bai, Hao Cheng, Yuwei Gu, and 1 others. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2310.07904

work page arXiv 2023

[5] [5]

Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti. 2024. https://openreview.net/forum?id=s63dtq0mwA Understanding information storage and transfer in multi-modal large language models . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page 2024

[6] [6]

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219

work page 2022

[7] [7]

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. 2024. Lvlm-interpret: An interpretability tool for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8182--8187

work page 2024

[8] [8]

Blair Bilodeau, Natasha Jaques, Pang Wei Koh, and Been Kim. 2024. Impossibility theorems for feature attribution. Proceedings of the National Academy of Sciences, 121(2):e2304406120

work page 2024

[9] [9]

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. 2025. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773

work page arXiv 2025

[10] [10]

Weijie Chen, Yizhe Zhang, Qian Wu, and 1 others. 2024. Internvl: Scaling up vision-language pretraining with multimodal reinforcement learning. arXiv preprint arXiv:2402.00028

work page arXiv 2024

[11] [11]

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS) 36

work page 2023

[13] [13]

Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. 2023. Interpreting clip's image representation via text-based decomposition. arXiv preprint arXiv:2310.05916

work page arXiv 2023

[14] [14]

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767

work page arXiv 2023

[15] [15]

Michal Golovanevsky, William Rudman, Vedant Palit, Carsten Eickhoff, and Ritambhara Singh. 2025. https://aclanthology.org/2025.naacl-long.571/ What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation . In Proceedings of the 2025 Conference of the North American Chapter of the Association for...

work page 2025

[16] [16]

Lisa Anne Hendricks and Aida Nematzadeh. 2021. Probing image-language transformers for verb understanding. arXiv preprint arXiv:2106.09141

work page arXiv 2021

[17] [17]

Kaichen Huang, Jiahao Huo, Yibo Yan, Kun Wang, Yutao Yue, and Xuming Hu. 2024. Miner: Mining the underlying pattern of modality-specific neurons in multimodal large language models. arXiv preprint arXiv:2410.04819

work page arXiv 2024

[18] [18]

Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

work page 2019

[19] [19]

Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu. 2024. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193

work page arXiv 2024

[20] [20]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, and 1 others. 2023. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015--4026

work page 2023

[21] [21]

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 388--395

work page 2004

[22] [22]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and 1 others. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Junnan Li, Dongxu Li, Yixuan Xie, Yixuan Guo, Xiyang Dai, Jianfeng Gao, Jianwei Sang, and Lijuan Wang. 2023 a . Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML)

work page 2023

[24] [24]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023 b . Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, and 1 others. 2025. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516

work page arXiv 2025

[26] [26]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296--26306

work page 2024

[27] [27]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

work page 2023

[28] [28]

Y Liu, Y Zhang, and S Yeung-Levy. 2025. Mechanistic interpretability meets vision language models: Insights and limitations. In The Fourth Blogpost Track at ICLR 2025

work page 2025

[29] [29]

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems, 29

work page 2016

[30] [30]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359--17372

work page 2022

[31] [31]

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. 2024. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149

work page arXiv 2024

[32] [32]

Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. 2023. https://arxiv.org/abs/2308.14179 Towards vision-language mechanistic interpretability: A causal tracing tool for blip . In Proceedings of the ICCV Workshop on Computational Linguistics for Vision and Language (CLVL)

work page arXiv 2023

[33] [33]

o gel, Philipp M \

Ekta Sood, Fabian K \"o gel, Philipp M \"u ller, Dominike Thomas, Mihai B \^a ce, and Andreas Bulling. 2023. Multimodal integration of human-like attention in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2648--2658

work page 2023

[34] [34]

Student. 1908. The probable error of a mean. Biometrika, pages 1--25

work page 1908

[35] [35]

Shimon Ullman. 1987. Visual routines. In Readings in computer vision, pages 298--328. Elsevier

work page 1987

[36] [36]

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388--12401

work page 2020

[37] [37]

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, and 1 others. 2024. Cogvlm: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475--121499

work page 2024

[38] [38]

Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, and Sijie Zhu. 2025. Where do large vision-language models look at when answering questions? arXiv preprint arXiv:2503.13891

work page arXiv 2025

[39] [39]

Fred Zhang and Neel Nanda. 2023. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5579--5588

work page 2021

[41] [41]

Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, and Jieping Ye. 2024. From redundancy to relevance: Enhancing explainability in multimodal large language models. arXiv e-prints, pages arXiv--2406

work page 2024

[42] [42]

Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1--38

work page 2024

[43] [43]

Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. 2024. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, pages 260--278. Springer

work page 2024

[44] [44]

Deyao Zhu, Xiangxin Zhou, Xiang Wang, Xiubo Geng, Fan Liu, and Jiashen Zhu. 2024. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. 2024. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision, pages 195--211. Springer

work page 2024

[46] [46]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page

[47] [47]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page