When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
Pith reviewed 2026-05-19 03:47 UTC · model grok-4.3
The pith
Vision-language models resolve conflicts between internal knowledge and visual input via a small set of attention heads that can be identified and edited.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through logit inspection on the WHOOPS-AHA! dataset of multimodal counterfactual queries that contradict commonsense knowledge, the authors locate a small set of attention heads that mediate cross-modal conflicts. Interventions on these heads steer model outputs toward either parametric knowledge or visual information, and the heads' attention patterns localize the image regions responsible for visual overrides with greater precision than gradient-based attribution.
What carries the argument
A small set of attention heads located via logit inspection; these heads mediate the choice between internal parametric knowledge and incoming visual evidence during conflict resolution.
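The summary above does not give the scoring formula used for logit inspection, so the following is only a minimal sketch under stated assumptions: each head's output vector is projected through the unembedding matrix, and the head is scored by how much it pushes the logit of the visually supported answer relative to the commonsense answer. All names (`head_logit_diff`, `rank_heads`), the toy vectors, and the `top_k` cutoff are hypothetical and are not taken from the paper.

```python
# Toy sketch of per-head logit inspection (all names and numbers are hypothetical).
# Assumption: each head writes a vector into the residual stream, and projecting
# that vector through the unembedding matrix gives its direct logit contribution.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def head_logit_diff(head_output, unembed, visual_tok, parametric_tok):
    """Logit pushed toward the visual token minus logit pushed toward the parametric token."""
    return dot(head_output, unembed[visual_tok]) - dot(head_output, unembed[parametric_tok])

def rank_heads(head_outputs, unembed, visual_tok, parametric_tok, top_k):
    """Return the top_k (layer, head) keys with the largest absolute logit difference."""
    scores = {
        key: head_logit_diff(vec, unembed, visual_tok, parametric_tok)
        for key, vec in head_outputs.items()
    }
    ranked = sorted(scores, key=lambda k: abs(scores[k]), reverse=True)
    return ranked[:top_k], scores

# Toy unembedding rows for the two candidate answer tokens (d_model = 3).
unembed = {"visual": [1.0, 0.0, 0.0], "parametric": [0.0, 1.0, 0.0]}
# Toy per-head outputs at the final token position, keyed by (layer, head).
head_outputs = {
    (10, 3): [2.0, 0.1, 0.5],  # pushes toward the visual answer
    (11, 7): [0.2, 1.5, 0.0],  # pushes toward the parametric answer
    (4, 0): [0.3, 0.3, 0.9],   # roughly neutral
}

top_heads, scores = rank_heads(head_outputs, unembed, "visual", "parametric", top_k=2)
```

In this toy setting a positive score marks a head that favors the visual answer and a negative score marks one that favors stored knowledge. A real model applies a final normalization before unembedding, so per-head scores of this kind are an approximation rather than an exact decomposition.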
If this is right
- Editing the identified heads can increase or decrease reliance on visual input without retraining the full model.
- Attention patterns from these heads give a direct map of which image patches trigger overrides of stored knowledge.
- The same inspection technique may generalize to other multimodal conflict settings beyond the tested dataset.
- Precise head-level control offers a route to reduce hallucinations caused by visual-textual mismatches.
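The editing idea in the list above can be illustrated with a toy additive model. This is a sketch under an explicit assumption, not the paper's procedure: the final logit of each candidate answer is treated as a sum of direct per-head contributions, and an intervention multiplies the contributions of selected heads by a factor `alpha`. The head names, the numbers, and the function `steered_logits` are hypothetical.

```python
# Toy sketch of a head-level intervention (hypothetical names and numbers).
# Assumption: the final logit for each candidate token is a sum of direct
# per-head contributions, and an intervention rescales selected heads by alpha.

def steered_logits(contributions, selected, alpha):
    """Sum per-head logit contributions, multiplying the selected heads by alpha."""
    totals = {}
    for head, per_token in contributions.items():
        scale = alpha if head in selected else 1.0
        for token, value in per_token.items():
            totals[token] = totals.get(token, 0.0) + scale * value
    return totals

def preferred(logits):
    """Return the token with the highest logit."""
    return max(logits, key=logits.get)

contributions = {
    "visual_head": {"visual": 2.0, "parametric": 0.0},
    "parametric_head": {"visual": 0.0, "parametric": 1.5},
    "other_head": {"visual": 0.5, "parametric": 0.5},
}

# No intervention: the visual answer wins.
baseline = steered_logits(contributions, selected=set(), alpha=1.0)
# Zeroing the visual head: the parametric answer wins.
suppressed = steered_logits(contributions, selected={"visual_head"}, alpha=0.0)
```

Setting `alpha` above 1 would amplify the selected heads instead of suppressing them. In a real VLM the edit would be applied to head activations during the forward pass, and the effect on logits would not be strictly additive.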
Where Pith is reading between the lines
- If the same heads prove stable across model scales, targeted editing could become a lightweight safety layer for deployed VLMs.
- The finding suggests that conflict resolution may be localized enough to study in isolation from other capabilities.
- Extending the dataset to include non-commonsense conflicts could test whether the same heads handle factual versus perceptual clashes.
Load-bearing premise
The heads found by logit inspection are causally responsible for resolving the conflict rather than merely correlated with the model's final choice.
What would settle it
An intervention experiment in which editing the reported heads leaves the model's preference for visual input versus internal knowledge unchanged.
Original abstract
Vision-language models (VLMs) increasingly combine visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by introducing WHOOPS-AHA!, a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening in these heads, we can steer the model towards its internal parametric knowledge or the visual information. Our results show that attention patterns on these heads effectively locate image regions that influence visual overrides, providing a more precise attribution compared to gradient-based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the WHOOPS-AHA! dataset of multimodal counterfactual queries that pit VLMs' internal parametric commonsense knowledge against contradictory visual inputs. Using logit inspection, it identifies a small set of attention heads claimed to mediate cross-modal knowledge conflicts. Targeted interventions on these heads are shown to steer model outputs toward either internal knowledge or visual information, while attention patterns within the heads are reported to yield more precise image-region attribution than gradient-based methods.
Significance. If the causal claims are substantiated, the work would contribute a new diagnostic dataset and mechanistic insights into how VLMs resolve knowledge-visual conflicts, with potential applications to reducing hallucinations. The dataset itself and the head-level steering results, if properly controlled, would be useful for the interpretability community.
major comments (2)
- §4.2 (Intervention Experiments): The steering results are presented without control interventions on matched heads (e.g., same-layer heads not selected by the logit criterion or randomly chosen heads). This leaves open whether the observed shifts toward parametric knowledge or visual input are specific to the identified heads or arise from any attention-head editing.
- §3.1 (Logit Inspection): The method identifies heads via marginal logit contribution, yet the manuscript provides no quantitative summary (number of heads retained, consistency across models/queries, or ablation of the inspection threshold) that would demonstrate these heads are the causal locus rather than downstream correlates.
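The control requested for §4.2 can be sketched as a comparison between ablating the identified heads and ablating random sets of the same size. The sketch below is illustrative only: the per-head effects are invented numbers, effects are assumed to add up across heads, and the random sets are drawn from a pooled list of non-selected heads in the same layers rather than matched layer by layer. The names `ablation_shift` and `random_control_shifts` are hypothetical.

```python
import random

# Hypothetical drop in the visual-answer logit margin when a single head is zeroed.
head_effect = {
    (10, 3): 1.9, (11, 7): 1.3,   # heads picked by the logit criterion
    (10, 0): 0.1, (10, 5): -0.2,  # same-layer heads that were not picked
    (11, 1): 0.05, (11, 4): 0.15,
}
identified = [(10, 3), (11, 7)]

def ablation_shift(heads):
    """Absolute change in the margin when the given heads are zeroed (additive assumption)."""
    return abs(sum(head_effect[h] for h in heads))

def random_control_shifts(identified, n_trials, seed):
    """Shifts from ablating random same-size sets drawn from the non-identified heads."""
    rng = random.Random(seed)
    pool = [h for h in head_effect if h not in identified]
    return [ablation_shift(rng.sample(pool, len(identified))) for _ in range(n_trials)]

target_shift = ablation_shift(identified)
control_shifts = random_control_shifts(identified, n_trials=100, seed=0)
mean_control = sum(control_shifts) / len(control_shifts)
```

If the identified heads are specific to the conflict, `target_shift` should sit well above the distribution in `control_shifts`. A comparable shift from random heads would support the referee's concern that any head editing moves the output.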
minor comments (2)
- Table 1: The dataset statistics table would benefit from an additional column reporting inter-annotator agreement or human validation of the counterfactual contradictions.
- Figure 4: The attention-map visualizations lack a quantitative comparison (e.g., IoU or precision-recall against human-annotated regions) to support the claim of superiority over gradient-based attribution.
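The IoU comparison suggested for Figure 4 can be sketched as follows. This is a minimal illustration that assumes the attention map and the human annotation are given on the same patch grid and that the attention map is binarized with a fixed threshold. The grid values, the threshold of 0.2, and the function names are hypothetical.

```python
def binarize(attn, threshold):
    """Turn a patch-level attention map into a 0/1 mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in attn]

def iou(pred, truth):
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    return inter / union if union else 0.0

# Toy 3x3 attention map over image patches (hypothetical values).
attn = [
    [0.05, 0.10, 0.02],
    [0.08, 0.40, 0.30],
    [0.01, 0.03, 0.01],
]
# Toy human-annotated region on the same patch grid.
human_mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]

pred = binarize(attn, 0.2)
score = iou(pred, human_mask)
```

Running the same computation on a binarized gradient-based saliency map would give a directly comparable number. Sweeping the threshold yields a precision-recall curve rather than a single score.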
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
- Referee: §4.2 (Intervention Experiments): The steering results are presented without control interventions on matched heads (e.g., same-layer heads not selected by the logit criterion or randomly chosen heads). This leaves open whether the observed shifts toward parametric knowledge or visual input are specific to the identified heads or arise from any attention-head editing.
  Authors: We agree that control interventions are necessary to substantiate the specificity of the identified attention heads. In the revised version of the manuscript, we will add experiments intervening on randomly selected heads within the same layers as well as on heads that were not selected by the logit inspection criterion. These additional controls will demonstrate that the observed steering effects toward parametric knowledge or visual input are indeed specific to the heads mediating the cross-modal conflicts, rather than a nonspecific effect of editing any attention head. We plan to include these results in an updated §4.2 and the supplementary material.
  Revision: yes
- Referee: §3.1 (Logit Inspection): The method identifies heads via marginal logit contribution, yet the manuscript provides no quantitative summary (number of heads retained, consistency across models/queries, or ablation of the inspection threshold) that would demonstrate these heads are the causal locus rather than downstream correlates.
  Authors: We acknowledge that a quantitative summary would strengthen the claim that the identified heads are the causal locus. In the revised manuscript, we will expand §3.1 to include: (1) the exact number of heads retained for each model (typically a small set of 4-8 heads), (2) consistency metrics across different models and query types in the WHOOPS-AHA! dataset, and (3) an ablation study varying the inspection threshold to show that the selected heads remain stable and effective. This will help differentiate them from potential downstream correlates. We will also report these details in a new table or figure.
  Revision: yes
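One way the threshold ablation promised for §3.1 could be summarized is by measuring how much the selected head set changes as the inspection threshold varies. The sketch below is an assumption about how such a summary might look, not the authors' analysis: heads are selected when the absolute value of a per-head score reaches the threshold, and stability is the Jaccard overlap with the set chosen at a reference threshold. The scores, the thresholds, and the function names are hypothetical.

```python
def select_heads(scores, threshold):
    """Heads whose absolute inspection score reaches the threshold."""
    return {h for h, s in scores.items() if abs(s) >= threshold}

def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical per-head inspection scores keyed by (layer, head).
scores = {(10, 3): 1.9, (11, 7): -1.3, (9, 2): 0.8, (4, 0): 0.1, (12, 5): 0.6}

thresholds = [0.5, 0.7, 1.0]
selections = {t: select_heads(scores, t) for t in thresholds}
reference = selections[0.7]
stability = {t: jaccard(selections[t], reference) for t in thresholds}
```

Overlap values close to 1 across a range of thresholds would indicate that the reported head set is not an artifact of one cutoff. The same overlap measure could be applied across models or query types for the consistency metrics mentioned in the response.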
Circularity Check
No significant circularity: standard methods on new dataset
The paper introduces the WHOOPS-AHA! dataset of multimodal counterfactual queries and applies established logit inspection plus activation patching from prior interpretability literature to locate attention heads. No equations or claims reduce by construction to fitted parameters defined in terms of the target result, no self-citation chains justify uniqueness or ansatzes, and the central findings are empirical observations rather than self-referential derivations. The analysis is therefore self-contained against external benchmarks in mechanistic interpretability.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Attention heads in transformer-based VLMs can be causally responsible for specific behavioral decisions such as modality preference.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening in these heads, we can steer the model..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.