When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

Alberto Cazzaniga; Diego Doimo; Francesco Ortu; Zhijing Jin

arxiv: 2507.13868 · v2 · submitted 2025-07-18 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-19 03:47 UTC · model grok-4.3


keywords: vision-language models · knowledge conflicts · attention heads · hallucinations · counterfactual queries · logit inspection · multimodal reasoning


The pith

Vision-language models resolve conflicts between internal knowledge and visual input via a small set of attention heads that can be identified and edited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often produce unreliable outputs when visual evidence contradicts their stored knowledge. This paper creates a dataset of deliberately conflicting multimodal queries to expose how the models handle such clashes. Logit inspection reveals a compact group of attention heads that decide whether the output follows the image or the model's prior facts. Editing these heads shifts the balance toward one source or the other. Attention maps from the same heads also point to the exact image areas driving visual overrides more cleanly than gradient methods do.

Core claim

Through logit inspection on the WHOOPS-AHA! dataset of multimodal counterfactual queries that contradict commonsense knowledge, the authors locate a small set of attention heads that mediate cross-modal conflicts. Interventions on these heads steer model outputs toward either parametric knowledge or visual information, and the heads' attention patterns localize the image regions responsible for visual overrides with greater precision than gradient-based attribution.
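
As a rough illustration of what logit inspection of a single head involves, the sketch below projects a head's output vector onto the unembedding rows of the factual and counterfactual tokens and counts how often the factual token scores higher. It is a toy under stated assumptions: `dot`, `factual_accuracy`, the 2-dimensional vectors, and the unembedding rows are all hypothetical, and the final layer normalization that a real Logit Lens projection would apply is omitted. It is not the paper's code.

```python
from typing import List, Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def factual_accuracy(head_outputs: List[Vector],
                     fact_rows: List[Vector],
                     cofa_rows: List[Vector]) -> float:
    """Fraction of examples where the head's output, projected onto the
    unembedding rows, scores the factual token above the counterfactual one."""
    wins = 0
    for out, w_fact, w_cofa in zip(head_outputs, fact_rows, cofa_rows):
        if dot(out, w_fact) > dot(out, w_cofa):
            wins += 1
    return wins / len(head_outputs)

# Toy data: three examples, hidden size 2 (all values are made up).
head_outputs = [[1.0, 0.0], [0.5, 0.2], [0.0, 1.0]]
fact_rows = [[1.0, 0.0]] * 3   # hypothetical unembedding row of the factual token
cofa_rows = [[0.0, 1.0]] * 3   # hypothetical unembedding row of the counterfactual token

acc = factual_accuracy(head_outputs, fact_rows, cofa_rows)
```

A head whose score sits far from 0.5 in either direction would be a candidate conflict-mediating head under this reading; a score near 0.5 says the head does not systematically favor either token.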

What carries the argument

A small set of attention heads located via logit inspection; these heads mediate the choice between internal parametric knowledge and incoming visual evidence during conflict resolution.

If this is right

Editing the identified heads can increase or decrease reliance on visual input without retraining the full model.
Attention patterns from these heads give a direct map of which image patches trigger overrides of stored knowledge.
The same inspection technique may generalize to other multimodal conflict settings beyond the tested dataset.
Precise head-level control offers a route to reduce hallucinations caused by visual-textual mismatches.
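
The head editing referred to in this list can be pictured as rescaling part of one row of a head's attention map. The sketch below follows the description in the Figure 4 caption (only the last row is touched; factual heads are edited on text tokens and counterfactual heads on image tokens), but the scaling rule itself, `1 + λ` or `1 - λ` clamped at zero, is an assumption made for illustration, as are the function name `scale_last_row` and the toy attention values. The page does not state the paper's exact formula.

```python
from typing import List

def scale_last_row(attn: List[List[float]],
                   positions: List[int],
                   lam: float,
                   head_kind: str) -> List[List[float]]:
    """Return a copy of `attn` whose last row is rescaled at `positions`.

    head_kind == "factual":        factor = 1 + lam (boosted when lam > 0)
    head_kind == "counterfactual": factor = 1 - lam (boosted when lam < 0)
    The factor is clamped at 0. This scaling rule is an assumption.
    """
    sign = 1.0 if head_kind == "factual" else -1.0
    factor = max(0.0, 1.0 + sign * lam)
    edited = [row[:] for row in attn]   # leave the input untouched
    for p in positions:
        edited[-1][p] *= factor
    return edited

# Toy 3x3 attention map for one head; tokens 0-1 are image tokens, token 2 is text.
attn = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.2, 0.3, 0.5]]

# lam = -0.5 boosts a counterfactual head on its image-token columns.
boosted = scale_last_row(attn, positions=[0, 1], lam=-0.5, head_kind="counterfactual")
```

The sketch does not renormalize the edited row; whether the paper does so is not stated on this page.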

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the same heads prove stable across model scales, targeted editing could become a lightweight safety layer for deployed VLMs.
The finding suggests that conflict resolution may be localized enough to study in isolation from other capabilities.
Extending the dataset to include non-commonsense conflicts could test whether the same heads handle factual versus perceptual clashes.

Load-bearing premise

The heads found by logit inspection are causally responsible for resolving the conflict rather than merely correlated with the model's final choice.

What would settle it

An intervention experiment in which editing the reported heads leaves the model's preference for visual input versus internal knowledge unchanged.

Figures

Figures reproduced from arXiv: 2507.13868 by Alberto Cazzaniga, Diego Doimo, Francesco Ortu, Zhijing Jin.

**Figure 1.** Overview of Our Approach. (Top) We construct prompts that induce a conflict between a vision-language model’s internal factual knowledge and counterfactual visual context. (Bottom) We then analyze which components in the model mediate this tension, identifying attention heads and visual patches that favor factual or visually grounded predictions.

**Figure 2.** Factual Prevalence in Attention and MLP Blocks. The plot shows the factual prevalence of attention and MLP blocks in LLaVA-NeXT across layers, indicating whether each component promotes predictions aligned with factual knowledge or counterfactual visual context. Positive values correspond to blocks favoring the factual (commonsense) continuation. Negative values indicate preference for the counterfactua…

**Figure 3.** Contribution of Attention Heads to Factual and Counterfactual Predictions. (Left) Factual accuracy of individual attention heads in LLaVA-NeXT, based on Logit Lens projections at the final token position. Blue indicates heads that tend to favor the factual token (reflecting inner knowledge), while red indicates heads that favor the counterfactual token (introduced by the visual context). (Right) Mean atten…

**Figure 4.** Intervention on Target Attention Heads. Change in factual accuracy under different levels of intervention strength (λ). For λ < 0, we boost the counterfactual heads (on image tokens) and weaken the factual heads (on text tokens); for λ > 0, we do the opposite. The intervention is applied at the final token position, modifying only the relevant attention values in the last row.

**Figure 6.** Qualitative Examples of Visual Regions Driving Counterfactual Predictions. Highlighted image regions correspond to visual patches identified as most responsible for counterfactual predictions using attention-based attribution. In both examples, the model generates a visually grounded but factually incorrect token (e.g., rainbow, fruit) instead of the commonsense alternative (black, tissue). The highlighte…

**Figure 5.** Ablation of Relevant Pixels. The plot shows the effect of ablating different percentages of image pixels in LLaVA-NeXT. The green line corresponds to pixels selected based on the highest attention from counterfactual heads, while the orange line corresponds to pixels with the highest gradient magnitude with respect to the counterfactual token. The gray line shows a random baseline where pixels are remove…

**Figure 7.** Factual and Counterfactual Contributions of MLP and Attention Blocks in Gemma3. Layerwise deviation from 50% factual accuracy for attention and MLP blocks, as measured by the relative logits of t_fact and t_cofa via Logit Lens. Positive values indicate a bias toward the factual token, while negative values indicate preference for the counterfactual token. Consistent with trends observed in LLaVA-NeXT, atte…

**Figure 8.** Factual and Counterfactual Contributions of Attention Heads for Gemma3. (Left) Factual accuracy of individual attention heads in Gemma3, computed using Logit Lens projections of the final token’s hidden state. Blue indicates heads that more frequently favor the factual token (t_fact), while red indicates those that favor the counterfactual token (t_cofa). As in LLaVA-NeXT, highly polarized heads are concentr…

**Figure 9.** Control Experiment: Intervention on Random Attention Heads. Change in factual accuracy under varying levels of intervention strength (λ) applied to 100 randomly selected attention heads. The results show no substantial deviation from baseline, confirming the specificity of the identified target heads.

**Figure 11.** KL Divergence Between Generated Captions at Different Intervention Strengths in LLaVA-NeXT. Symmetric increase in KL divergence around λ = 0, with rapid divergence until |λ| = 3 and stabilization near |λ| = 10. Higher intervention magnitudes cause substantial shifts in the generated token distribution, indicating degradation in caption quality.
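
For readers unfamiliar with the quantity plotted in Figure 11, the sketch below computes a KL divergence between two discrete token distributions, one standing in for the unedited model and one for an intervened model. The distributions are made up, and the direction of the divergence and any averaging over caption positions are assumptions, since the caption does not specify them.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.
    Terms with p_i == 0 contribute nothing; q_i is floored at `eps` to avoid
    division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

baseline = [0.7, 0.2, 0.1]     # hypothetical next-token distribution at lambda = 0
intervened = [0.4, 0.4, 0.2]   # hypothetical distribution after an intervention

kl_same = kl_divergence(baseline, baseline)      # identical distributions
kl_shift = kl_divergence(baseline, intervened)   # grows as the edit moves the output
```

A value of zero means the intervention left the distribution unchanged; larger values correspond to the degradation the caption describes at high |λ|.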

Original abstract

Vision-language models (VLMs) increasingly combine visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by introducing WHOOPS-AHA!, a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening in these heads, we can steer the model towards its internal parametric knowledge or the visual information. Our results show that attention patterns on these heads effectively locate image regions that influence visual overrides, providing a more precise attribution compared to gradient-based methods.
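
The comparison between attention-based and gradient-based attribution in the last sentence of the abstract can be pictured as ranking image patches by a score and ablating the top-ranked ones. The sketch below is a toy version of that procedure: the helper names `top_fraction_indices` and `ablate`, the four-patch "image", and both score lists are hypothetical, and replacing a patch with zero is only one possible ablation.

```python
from typing import List

def top_fraction_indices(scores: List[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring patches, covering `fraction` of all patches."""
    k = round(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def ablate(patches: List[float], indices: List[int], fill: float = 0.0) -> List[float]:
    """Copy of `patches` with the selected entries replaced by `fill`."""
    chosen = set(indices)
    return [fill if i in chosen else v for i, v in enumerate(patches)]

# Toy image of 4 patches (one intensity each) and two attribution scores per patch.
patches = [0.9, 0.1, 0.5, 0.7]
attention_scores = [0.05, 0.10, 0.60, 0.25]   # hypothetical, from counterfactual heads
gradient_scores = [0.40, 0.30, 0.20, 0.10]    # hypothetical gradient magnitudes

by_attention = ablate(patches, top_fraction_indices(attention_scores, 0.5))
by_gradient = ablate(patches, top_fraction_indices(gradient_scores, 0.5))
```

Feeding each ablated image back to the model and measuring how often the counterfactual token survives is what would separate the two attribution methods; that step needs the model and is left out here.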

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces a useful counterfactual dataset for VLM knowledge conflicts and locates candidate attention heads via logit inspection, but the causal claims for those heads rest on thin evidence without intervention controls.


The main takeaway is that this work builds WHOOPS-AHA!, a dataset of multimodal queries that force conflicts between a model's internal commonsense and the provided image. The authors run logit inspection to surface a small set of attention heads tied to these overrides, then show that editing those heads can shift the output toward either parametric knowledge or the visual input. The attention maps on the identified heads also seem to highlight relevant image regions more cleanly than gradient-based attribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces the WHOOPS-AHA! dataset of multimodal counterfactual queries that pit VLMs' internal parametric commonsense knowledge against contradictory visual inputs. Using logit inspection, it identifies a small set of attention heads claimed to mediate cross-modal knowledge conflicts. Targeted interventions on these heads are shown to steer model outputs toward either internal knowledge or visual information, while attention patterns within the heads are reported to yield more precise image-region attribution than gradient-based methods.

Significance. If the causal claims are substantiated, the work would contribute a new diagnostic dataset and mechanistic insights into how VLMs resolve knowledge-visual conflicts, with potential applications to reducing hallucinations. The dataset itself and the head-level steering results, if properly controlled, would be useful for the interpretability community.

major comments (2)

§4.2 (Intervention Experiments): The steering results are presented without control interventions on matched heads (e.g., same-layer heads not selected by the logit criterion or randomly chosen heads). This leaves open whether the observed shifts toward parametric knowledge or visual input are specific to the identified heads or arise from any attention-head editing.
§3.1 (Logit Inspection): The method identifies heads via marginal logit contribution, yet the manuscript provides no quantitative summary (number of heads retained, consistency across models/queries, or ablation of the inspection threshold) that would demonstrate these heads are the causal locus rather than downstream correlates.
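
The matched control described in the §4.2 comment can be set up with a few lines of sampling code. The sketch below draws, for each target head, a different head from the same layer that is not itself a target. The head indices, the `heads_per_layer` value, and the function name `sample_control_heads` are hypothetical and only illustrate the matching scheme.

```python
import random
from typing import List, Tuple

Head = Tuple[int, int]  # (layer index, head index)

def sample_control_heads(target: List[Head],
                         heads_per_layer: int,
                         seed: int = 0) -> List[Head]:
    """For each target head, draw a head from the same layer that is neither
    a target head nor an already chosen control."""
    rng = random.Random(seed)
    excluded = set(target)
    controls: List[Head] = []
    for layer, _ in target:
        candidates = [(layer, h) for h in range(heads_per_layer)
                      if (layer, h) not in excluded]
        pick = rng.choice(candidates)
        excluded.add(pick)          # avoid drawing the same control twice
        controls.append(pick)
    return controls

target_heads = [(20, 3), (20, 7), (24, 1)]   # hypothetical indices
control_heads = sample_control_heads(target_heads, heads_per_layer=32)
```

Applying the same intervention to `control_heads` and to `target_heads` then gives the specificity comparison the comment asks for.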

minor comments (2)

Table 1: The dataset statistics table would benefit from an additional column reporting inter-annotator agreement or human validation of the counterfactual contradictions.
Figure 4: The attention-map visualizations lack a quantitative comparison (e.g., IoU or precision-recall against human-annotated regions) to support the claim of superiority over gradient-based attribution.
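
The IoU comparison suggested in the last minor comment is simple to compute once an attribution map is binarized. The sketch below thresholds a toy score map and measures its overlap with a toy human-annotated mask. The cutoff of 0.5, the map values, and the helper names are hypothetical, and the page does not indicate that such annotations exist for the dataset.

```python
from typing import List

Mask = List[List[bool]]

def threshold_map(scores: List[List[float]], cutoff: float) -> Mask:
    """Binarize a 2-D score map: True where the score reaches `cutoff`."""
    return [[s >= cutoff for s in row] for row in scores]

def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two boolean masks of identical shape.
    Two empty masks are treated as a perfect match."""
    inter = sum(p and t for rp, rt in zip(pred, truth) for p, t in zip(rp, rt))
    union = sum(p or t for rp, rt in zip(pred, truth) for p, t in zip(rp, rt))
    return inter / union if union else 1.0

attention_map = [[0.1, 0.8, 0.7],
                 [0.0, 0.6, 0.2]]          # hypothetical patch-level scores
annotated = [[False, True, True],
             [False, False, False]]        # hypothetical human-marked region

score = iou(threshold_map(attention_map, 0.5), annotated)
```

Running the same function on a thresholded gradient map would give the paired number needed for the comparison.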

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.


Referee: §4.2 (Intervention Experiments): The steering results are presented without control interventions on matched heads (e.g., same-layer heads not selected by the logit criterion or randomly chosen heads). This leaves open whether the observed shifts toward parametric knowledge or visual input are specific to the identified heads or arise from any attention-head editing.

Authors: We agree that control interventions are necessary to substantiate the specificity of the identified attention heads. In the revised version of the manuscript, we will add experiments intervening on randomly selected heads within the same layers as well as on heads that were not selected by the logit inspection criterion. These additional controls will demonstrate that the observed steering effects toward parametric knowledge or visual input are indeed specific to the heads mediating the cross-modal conflicts, rather than a nonspecific effect of editing any attention head. We plan to include these results in an updated §4.2 and the supplementary material.

Revision: yes

Referee: §3.1 (Logit Inspection): The method identifies heads via marginal logit contribution, yet the manuscript provides no quantitative summary (number of heads retained, consistency across models/queries, or ablation of the inspection threshold) that would demonstrate these heads are the causal locus rather than downstream correlates.

Authors: We acknowledge that a quantitative summary would strengthen the claim that the identified heads are the causal locus. In the revised manuscript, we will expand §3.1 to include: (1) the exact number of heads retained for each model (typically a small set of 4-8 heads), (2) consistency metrics across different models and query types in the WHOOPS-AHA! dataset, and (3) an ablation study varying the inspection threshold to show that the selected heads remain stable and effective. This will help differentiate them from potential downstream correlates. We will also report these details in a new table or figure.

Revision: yes
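
One way to report the threshold ablation promised in this response is the overlap between the head sets selected at different thresholds. The sketch below assumes heads are selected when their factual accuracy deviates from 0.5 by at least a threshold, which is a hypothetical criterion chosen for illustration. The scores, head indices, and function names are likewise made up.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)

def select_heads(scores: Dict[Head, float], threshold: float) -> Set[Head]:
    """Heads whose factual accuracy deviates from 0.5 by at least `threshold`.
    This selection rule is an assumption, not the paper's stated criterion."""
    return {h for h, acc in scores.items() if abs(acc - 0.5) >= threshold}

def jaccard(a: Set[Head], b: Set[Head]) -> float:
    """Size of the intersection divided by size of the union (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical per-head factual accuracies.
scores = {(20, 3): 0.95, (20, 7): 0.10, (24, 1): 0.70, (5, 0): 0.52}

strict = select_heads(scores, 0.30)
loose = select_heads(scores, 0.15)
stability = jaccard(strict, loose)
```

A stability value close to 1 across a range of thresholds would support the claim that the selected set is not an artifact of one cutoff.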

Circularity Check

0 steps flagged

No significant circularity: standard methods on new dataset


The paper introduces the WHOOPS-AHA! dataset of multimodal counterfactual queries and applies established logit inspection plus activation patching from prior interpretability literature to locate attention heads. No equations or claims reduce by construction to fitted parameters defined in terms of the target result, no self-citation chains justify uniqueness or ansatzes, and the central findings are empirical observations rather than self-referential derivations. The analysis is therefore self-contained against external benchmarks in mechanistic interpretability.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new dataset and applies existing mechanistic interpretability tools; it does not introduce new free parameters, axioms beyond standard transformer assumptions, or invented entities.

axioms (1)

Domain assumption: Attention heads in transformer-based VLMs can be causally responsible for specific behavioral decisions such as modality preference. The intervention experiments rest on this standard assumption from mechanistic interpretability.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening in these heads, we can steer the model...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[37]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page
[38]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Miko aj Bi\' n...

work page 2022

[2] [2]

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. https://doi.org/10.48550/ARXIV.2303.08112 Eliciting latent predictions from transformers with the tuned lens . CoRR, abs/2303.08112

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08112 2023

[3] [3]

Hung-Ting Chen, Michael Zhang, and Eunsol Choi. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.146 Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2292--2307, Abu Dhabi, United Arab Emirates. ...

work page doi:10.18653/v1/2022.emnlp-main.146 2022

[4]

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. 2023. https://doi.org/10.48550/ARXIV.2311.03287 Holistic analysis of hallucination in gpt-4v(ision): Bias and interference challenges. CoRR, abs/2311.03287

work page doi:10.48550/arxiv.2311.03287 2023

[5]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. https://doi.org/10.18653/V1/2022.ACL-LONG.581 Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8493–8502. Associat...

work page doi:10.18653/v1/2022.acl-long.581 2022

[6]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7]

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.446 Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 5484–5495...

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021

[8]

Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, and Carsten Eickhoff. 2025a. https://doi.org/10.48550/ARXIV.2505.17127 Pixels versus priors: Controlling knowledge priors in vision-language models through visual counterfacts. CoRR, abs/2505.17127

work page doi:10.48550/arxiv.2505.17127 2025

[9]

Michal Golovanevsky, William Rudman, Vedant Palit, Carsten Eickhoff, and Ritambhara Singh. 2025b. https://aclanthology.org/2025.naacl-long.571/ What do VLMs NOTICE? A mechanistic interpretability pipeline for Gaussian-noise-free text-image corruption and evaluation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the ...

work page 2025

[10]

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2024. https://arxiv.org/abs/2310.14566 Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. Preprint, arXiv:2310.14566

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11]

Nitzan Bitton Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. 2023. https://doi.org/10.1109/ICCV51070.2023.00247 Breaking common sense: Whoops! A vision-and-language benchmark of synthetic and compositional images. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, O...

work page doi:10.1109/iccv51070.2023.00247 2023

[12]

Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt. 2023. https://doi.org/10.48550/arXiv.2307.09476 Overthinking the truth: Understanding how language models process false demonstrations. CoRR, abs/2307.09476

work page doi:10.48550/arxiv.2307.09476 2023

[13]

Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe Diao, Yong Lin, and Tong Zhang. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.904 The instinctive bias: Spurious images lead to illusion in MLLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16163–16177, Miami, Florida, USA. Associ...

work page doi:10.18653/v1/2024.emnlp-main.904 2024

[14]

Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. 2024. https://doi.org/10.18653/v1/2024.findings-acl.70 Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models. In Findings of the Association for Computational Linguistics: ...

work page doi:10.18653/v1/2024.findings-acl.70 2024

[15]

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Luca...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.19786 2025

[16]

Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomáš Kočiský, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. https://openreview.net/forum?id=73OmmrCfSyy Mind the gap: Assessing temporal generalization in neural language...

work page 2021

[17]

Tiep Le, Vasudev Lal, and Phillip Howard. 2023. http://papers.nips.cc/paper_files/paper/2023/hash/e14e4cb8266184ceb234973dfe07faed-Abstract-Datasets_and_Benchmarks.html Coco-counterfactuals: Automatically constructed counterfactual examples for image-text pairs. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Infor...

work page 2023

[18]

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2023. https://arxiv.org/abs/2311.16922 Mitigating object hallucinations in large vision-language models through visual contrastive decoding. Preprint, arXiv:2311.16922

work page arXiv 2023

[19]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. https://proceedings.mlr.press/v162/li22n.html BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12888–12...

work page 2022

[20]

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2024a. https://openreview.net/forum?id=J44HfH4JCg Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations

work page 2024

[21]

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024b. https://llava-vl.github.io/blog/2024-01-30-llava-next/ Llava-next: Improved reasoning, ocr, and world knowledge

work page 2024

[22]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc

work page 2023

[23]

Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. 2025. https://arxiv.org/abs/2403.11116 Phd: A chatgpt-prompted visual hallucination evaluation dataset. Preprint, arXiv:2403.11116

work page arXiv 2025

[24]

Xiaoyuan Liu, Wenxuan Wang, Youliang Yuan, Jen-tse Huang, Qiuzhi Liu, Pinjia He, and Zhaopeng Tu. 2024c. https://arxiv.org/abs/2410.08145 Insight over sight? Exploring the vision-knowledge conflicts in multimodal llms. Preprint, arXiv:2410.08145

work page arXiv 2024

[25]

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2024d. https://arxiv.org/abs/2306.05499 Prompt injection attack against llm-integrated applications. Preprint, arXiv:2306.05499

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26]

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.565 Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic....

work page doi:10.18653/v1/2021.emnlp-main.565 2021

[27]

Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A. Smith. 2022. https://doi.org/10.18653/v1/2022.naacl-main.435 Time waits for no one! Analysis and challenges of temporal misalignment. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...

work page doi:10.18653/v1/2022.naacl-main.435 2022

[28]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. http://papers.nips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html Locating and editing factual associations in GPT. In NeurIPS

work page 2022

[29]

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. https://openreview.net/pdf?id=9XFSbDPmdW Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net

work page 2023

[30]

Nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens interpreting gpt: the logit lens. Accessed: Nov 2023

work page 2020

[31]

Francesco Ortu, Zhijing Jin, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, and Bernhard Schölkopf. 2024. https://doi.org/10.18653/V1/2024.ACL-LONG.458 Competition of mechanisms: Tracing how language models handle facts and counterfactuals. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

work page doi:10.18653/v1/2024.acl-long.458 2024

[32]

Chameleon Team. 2024. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33]

Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. 2024. https://openreview.net/forum?id=ptvV5HGTNN Resolving knowledge conflicts in large language models. In First Conference on Language Modeling

work page 2024

[34]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...

work page 2020

[35]

Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.486 Knowledge conflicts for LLMs: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8541–8565, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.486 2024

[36]

Qinan Yu, Jack Merullo, and Ellie Pavlick. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.615 Characterizing mechanisms for factual recall in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9924–9959, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.emnlp-main.615 2023
