
arxiv: 2507.13868 · v2 · submitted 2025-07-18 · 💻 cs.CV · cs.AI

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

Pith reviewed 2026-05-19 03:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · knowledge conflicts · attention heads · hallucinations · counterfactual queries · logit inspection · multimodal reasoning

The pith

Vision-language models resolve conflicts between internal knowledge and visual input via a small set of attention heads that can be identified and edited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often produce unreliable outputs when visual evidence contradicts their stored knowledge. This paper creates a dataset of deliberately conflicting multimodal queries to expose how the models handle such clashes. Logit inspection reveals a compact group of attention heads that decide whether the output follows the image or the model's prior facts. Editing these heads shifts the balance toward one source or the other. Attention maps from the same heads also point to the exact image areas driving visual overrides more cleanly than gradient methods do.

Core claim

Through logit inspection on the WHOOPS-AHA! dataset of multimodal counterfactual queries that contradict commonsense knowledge, the authors locate a small set of attention heads that mediate cross-modal conflicts. Interventions on these heads steer model outputs toward either parametric knowledge or visual information, and the heads' attention patterns localize the image regions responsible for visual overrides with greater precision than gradient-based attribution.

What carries the argument

A small set of attention heads located via logit inspection; these heads mediate the choice between internal parametric knowledge and incoming visual evidence during conflict resolution.
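
As a rough illustration of the kind of logit inspection described here, the sketch below scores one attention head by projecting its output onto the unembedding vectors of the factual and counterfactual tokens, then counting how often the factual token wins. It is a simplified toy, not the authors' code: the function names, the plain dot-product projection (no final layer norm), and the example vectors are all assumptions made for illustration.

```python
from typing import Sequence


def head_logit(head_out: Sequence[float], unembed_vec: Sequence[float]) -> float:
    """Logit-lens style projection: dot product of a head's output with a
    token's unembedding vector (simplified; no final layer norm)."""
    return sum(h * u for h, u in zip(head_out, unembed_vec))


def factual_accuracy(head_outputs, fact_vecs, cofa_vecs) -> float:
    """Percentage of examples where the head scores t_fact above t_cofa."""
    wins = sum(
        head_logit(out, f) > head_logit(out, c)
        for out, f, c in zip(head_outputs, fact_vecs, cofa_vecs)
    )
    return 100.0 * wins / len(head_outputs)


# Toy data (hypothetical): 4 examples, 2-dimensional head outputs.
head_outputs = [[1.0, 0.0], [0.6, 0.4], [0.0, 1.0], [1.0, 0.2]]
fact_vecs = [[1.0, 0.0]] * 4  # unembedding vector of t_fact for each example
cofa_vecs = [[0.0, 1.0]] * 4  # unembedding vector of t_cofa for each example

acc = factual_accuracy(head_outputs, fact_vecs, cofa_vecs)
deviation = acc - 50.0  # > 0: head leans factual; < 0: head leans counterfactual
print(acc, deviation)
```

Under these assumptions, heads with a large positive or negative deviation would be the candidates for the "small set" the paper describes; how the authors actually rank and threshold heads is not stated in this summary.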

If this is right

  • Editing the identified heads can increase or decrease reliance on visual input without retraining the full model.
  • Attention patterns from these heads give a direct map of which image patches trigger overrides of stored knowledge.
  • The same inspection technique may generalize to other multimodal conflict settings beyond the tested dataset.
  • Precise head-level control offers a route to reduce hallucinations caused by visual-textual mismatches.
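
One hedged way to picture the head-level edit follows the description in the Figure 4 caption: λ < 0 boosts counterfactual heads on image tokens and weakens factual heads on text tokens, λ > 0 does the opposite, and only the last attention row is modified. The scaling factors `(1 - lam)` and `(1 + lam)`, the clipping at zero, the absence of renormalization, and the function name are assumptions for illustration; the paper's exact update rule is not given in this summary.

```python
from typing import List, Sequence


def intervene_last_row(
    attn_row: Sequence[float],
    is_image: Sequence[bool],
    head_kind: str,
    lam: float,
) -> List[float]:
    """Rescale the final-token attention row of one head.

    Assumed rule (hypothetical, not taken from the paper):
      - "cofa" heads: image-token weights are multiplied by (1 - lam)
      - "fact" heads: text-token weights are multiplied by (1 + lam)
    Factors are clipped at 0 and the row is not renormalized.
    """
    if head_kind == "cofa":
        factor, on_image = max(0.0, 1.0 - lam), True
    elif head_kind == "fact":
        factor, on_image = max(0.0, 1.0 + lam), False
    else:
        return list(attn_row)  # head is not a target: leave it untouched
    return [w * factor if img == on_image else w for w, img in zip(attn_row, is_image)]


row = [0.2, 0.3, 0.5]        # attention over [image, image, text] tokens
mask = [True, True, False]   # True marks image tokens

boosted = intervene_last_row(row, mask, "cofa", lam=-1.0)   # image weights doubled
weakened = intervene_last_row(row, mask, "fact", lam=-1.0)  # text weight set to zero
print(boosted, weakened)
```

With `lam = 0.0` both factors equal 1, so the row is returned unchanged, which matches the baseline point in an intervention sweep.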

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same heads prove stable across model scales, targeted editing could become a lightweight safety layer for deployed VLMs.
  • The finding suggests that conflict resolution may be localized enough to study in isolation from other capabilities.
  • Extending the dataset to include non-commonsense conflicts could test whether the same heads handle factual versus perceptual clashes.

Load-bearing premise

The heads found by logit inspection are causally responsible for resolving the conflict rather than merely correlated with the model's final choice.

What would settle it

An intervention experiment in which editing the reported heads leaves the model's preference for visual input versus internal knowledge unchanged.
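
A test of this kind is easier to interpret next to a control group of heads that were not selected; the Figure 9 caption describes such a control on 100 randomly chosen heads. The sketch below shows one possible way to draw layer-matched random controls. The layer-matching scheme, the `(layer, head)` indexing, and the example targets are assumptions for illustration, not the paper's procedure.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def sample_matched_controls(
    target_heads: Sequence[Head], heads_per_layer: int, seed: int = 0
) -> List[Head]:
    """Draw, for each layer, as many non-target heads as there are targets."""
    rng = random.Random(seed)
    targets = set(target_heads)
    per_layer = Counter(layer for layer, _ in targets)
    controls: List[Head] = []
    for layer in sorted(per_layer):
        pool = [(layer, h) for h in range(heads_per_layer) if (layer, h) not in targets]
        # random.sample raises ValueError if the pool is smaller than the request
        controls.extend(rng.sample(pool, per_layer[layer]))
    return controls


targets = [(20, 3), (20, 7), (25, 1)]  # hypothetical target heads
controls = sample_matched_controls(targets, heads_per_layer=32)
print(controls)
```

Applying the same edit to `targets` and to `controls` and comparing the change in factual accuracy would show whether the effect is specific to the selected heads.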

Figures

Figures reproduced from arXiv: 2507.13868 by Alberto Cazzaniga, Diego Doimo, Francesco Ortu, Zhijing Jin.

Figure 1: Overview of Our Approach. (Top) We construct prompts that induce a conflict between a vision-language model’s internal factual knowledge and counterfactual visual context. (Bottom) We then analyze which components in the model mediate this tension, identifying attention heads and visual patches that favor factual or visually grounded predictions.
Figure 2: Factual Prevalence in Attention and MLP Blocks. The plot shows the factual prevalence of attention and MLP blocks in LLaVA-NeXT across layers, indicating whether each component promotes predictions aligned with factual knowledge or counterfactual visual context. Positive values correspond to blocks favoring the factual (commonsense) continuation. Negative values indicate preference for the counterfactua…
Figure 3: Contribution of Attention Heads to Factual and Counterfactual Predictions. (Left) Factual accuracy of individual attention heads in LLaVA-NeXT, based on Logit Lens projections at the final token position. Blue indicates heads that tend to favor the factual token (reflecting inner knowledge), while red indicates heads that favor the counterfactual token (introduced by the visual context). (Right) Mean atten…
Figure 4: Intervention on Target Attention Heads. Change in factual accuracy under different levels of intervention strength (λ). For λ < 0, we boost the counterfactual heads (on image tokens) and weaken the factual heads (on text tokens); for λ > 0, we do the opposite. The intervention is applied at the final token position, modifying only the relevant attention values in the last row.
Figure 6: Qualitative Examples of Visual Regions Driving Counterfactual Predictions. Highlighted image regions correspond to visual patches identified as most responsible for counterfactual predictions using attention-based attribution. In both examples, the model generates a visually grounded but factually incorrect token (e.g., rainbow, fruit) instead of the commonsense alternative (black, tissue). The highlighte…
Figure 5: Ablation of Relevant Pixels. The plot shows the effect of ablating different percentages of image pixels in LLaVA-NeXT. The green line corresponds to pixels selected based on the highest attention from counterfactual heads, while the orange line corresponds to pixels with the highest gradient magnitude with respect to the counterfactual token. The gray line shows a random baseline where pixels are remove…
Figure 7: Factual and Counterfactual Contributions of MLP and Attention Blocks in Gemma3. Layer-wise deviation from 50% factual accuracy for attention and MLP blocks, as measured by the relative logits of t_fact and t_cofa via Logit Lens. Positive values indicate a bias toward the factual token, while negative values indicate preference for the counterfactual token. Consistent with trends observed in LLaVA-NeXT, atte…
Figure 8: Factual and Counterfactual Contributions of Attention Heads for Gemma3. (Left) Factual accuracy of individual attention heads in Gemma3, computed using Logit Lens projections of the final token’s hidden state. Blue indicates heads that more frequently favor the factual token (t_fact), while red indicates those that favor the counterfactual token (t_cofa). As in LLaVA-NeXT, highly polarized heads are concentr…
Figure 9: Control Experiment: Intervention on Random Attention Heads. Change in factual accuracy under varying levels of intervention strength (λ) applied to 100 randomly selected attention heads. The results show no substantial deviation from baseline, confirming the specificity of the identified target heads.
[Plot residue from an adjacent figure; recoverable labels: axes "Number of Attention Heads" and "Factual Accuracy (%)", legend "Model": Gemma3, LLaVA-NeXT.]
Figure 11: KL Divergence Between Generated Captions at Different Intervention Strengths in LLaVA-NeXT. Symmetric increase in KL divergence around λ = 0, with rapid divergence until |λ| = 3 and stabilization near |λ| = 10. Higher intervention magnitudes cause substantial shifts in the generated token distribution, indicating degradation in caption quality.
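
The pixel-ablation comparison described in the Figure 5 caption (attention-ranked pixels versus gradient-ranked pixels versus a random baseline) can be sketched as a top-fraction selection followed by masking. This is a minimal toy under stated assumptions: the function names, the flat list of per-pixel scores, and the zero fill value are hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence, Set


def top_fraction(scores: Sequence[float], fraction: float) -> Set[int]:
    """Indices of the highest-scoring `fraction` of pixels (or patches)."""
    k = int(round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def random_fraction(n: int, fraction: float, seed: int = 0) -> Set[int]:
    """Random baseline: the same number of indices, chosen uniformly."""
    k = int(round(fraction * n))
    return set(random.Random(seed).sample(range(n), k))


def ablate(pixels: Sequence[float], selected: Set[int], fill: float = 0.0) -> List[float]:
    """Replace the selected pixels with a constant fill value."""
    return [fill if i in selected else p for i, p in enumerate(pixels)]


# Hypothetical per-pixel scores, e.g. attention from counterfactual heads.
attn_scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.08, 0.03, 0.02]
pixels = [1.0] * 8

by_attention = top_fraction(attn_scores, 0.25)
by_chance = random_fraction(len(pixels), 0.25)
print(ablate(pixels, by_attention), sorted(by_chance))
```

The same `top_fraction` call could rank pixels by gradient magnitude instead, so the three selection strategies differ only in the score list passed in.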
Original abstract

Vision-language models (VLMs) increasingly combine visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by introducing WHOOPS-AHA!, a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening in these heads, we can steer the model towards its internal parametric knowledge or the visual information. Our results show that attention patterns on these heads effectively locate image regions that influence visual overrides, providing a more precise attribution compared to gradient-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the WHOOPS-AHA! dataset of multimodal counterfactual queries that pit VLMs' internal parametric commonsense knowledge against contradictory visual inputs. Using logit inspection, it identifies a small set of attention heads claimed to mediate cross-modal knowledge conflicts. Targeted interventions on these heads are shown to steer model outputs toward either internal knowledge or visual information, while attention patterns within the heads are reported to yield more precise image-region attribution than gradient-based methods.

Significance. If the causal claims are substantiated, the work would contribute a new diagnostic dataset and mechanistic insights into how VLMs resolve knowledge-visual conflicts, with potential applications to reducing hallucinations. The dataset itself and the head-level steering results, if properly controlled, would be useful for the interpretability community.

major comments (2)
  1. [§4.2] §4.2 (Intervention Experiments): The steering results are presented without control interventions on matched heads (e.g., same-layer heads not selected by the logit criterion or randomly chosen heads). This leaves open whether the observed shifts toward parametric knowledge or visual input are specific to the identified heads or arise from any attention-head editing.
  2. [§3.1] §3.1 (Logit Inspection): The method identifies heads via marginal logit contribution, yet the manuscript provides no quantitative summary (number of heads retained, consistency across models/queries, or ablation of the inspection threshold) that would demonstrate these heads are the causal locus rather than downstream correlates.
minor comments (2)
  1. [Table 1] Table 1: The dataset statistics table would benefit from an additional column reporting inter-annotator agreement or human validation of the counterfactual contradictions.
  2. [Figure 4] Figure 4: The attention-map visualizations lack a quantitative comparison (e.g., IoU or precision-recall against human-annotated regions) to support the claim of superiority over gradient-based attribution.
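
The IoU comparison suggested in the second minor comment could be computed roughly as below: threshold a patch-level attention map into a binary mask and measure its overlap with a human-annotated region. The threshold value, the 3×3 toy maps, and the function names are assumptions for illustration; the paper summary does not say that such annotations exist.

```python
from typing import List, Sequence


def binarize(attn: Sequence[Sequence[float]], threshold: float) -> List[List[bool]]:
    """Turn a patch-level attention map into a binary mask."""
    return [[v >= threshold for v in row] for row in attn]


def iou(pred: Sequence[Sequence[bool]], truth: Sequence[Sequence[bool]]) -> float:
    """Intersection over union of two equally shaped boolean masks."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += int(p and t)
            union += int(p or t)
    return inter / union if union else 0.0


# Hypothetical 3x3 attention map and annotated region.
attn_map = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.0],
    [0.1, 0.0, 0.0],
]
annotated = [
    [True, True, False],
    [False, False, False],
    [False, False, False],
]

score = iou(binarize(attn_map, threshold=0.5), annotated)
print(round(score, 3))
```

Running the same function on a thresholded gradient map would give a directly comparable number for the gradient-based baseline.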

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (Intervention Experiments): The steering results are presented without control interventions on matched heads (e.g., same-layer heads not selected by the logit criterion or randomly chosen heads). This leaves open whether the observed shifts toward parametric knowledge or visual input are specific to the identified heads or arise from any attention-head editing.

    Authors: We agree that control interventions are necessary to substantiate the specificity of the identified attention heads. In the revised version of the manuscript, we will add experiments intervening on randomly selected heads within the same layers as well as on heads that were not selected by the logit inspection criterion. These additional controls will demonstrate that the observed steering effects toward parametric knowledge or visual input are indeed specific to the heads mediating the cross-modal conflicts, rather than a nonspecific effect of editing any attention head. We plan to include these results in an updated §4.2 and the supplementary material. revision: yes

  2. Referee: [§3.1] §3.1 (Logit Inspection): The method identifies heads via marginal logit contribution, yet the manuscript provides no quantitative summary (number of heads retained, consistency across models/queries, or ablation of the inspection threshold) that would demonstrate these heads are the causal locus rather than downstream correlates.

    Authors: We acknowledge that a quantitative summary would strengthen the claim that the identified heads are the causal locus. In the revised manuscript, we will expand §3.1 to include: (1) the exact number of heads retained for each model (typically a small set of 4-8 heads), (2) consistency metrics across different models and query types in the WHOOPS-AHA! dataset, and (3) an ablation study varying the inspection threshold to show that the selected heads remain stable and effective. This will help differentiate them from potential downstream correlates. We will also report these details in a new table or figure. revision: yes
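
The threshold ablation mentioned in this response could be summarized with a set-overlap score between the heads selected at different thresholds. The sketch below is one hedged way to do that. It assumes heads are selected by how far their factual accuracy deviates from 50% (in line with how the Figure 3 and Figure 7 captions describe head scores), and the head scores, thresholds, and function names are invented for illustration.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def select_heads(factual_acc: Dict[Head, float], threshold: float) -> Set[Head]:
    """Keep heads whose factual accuracy deviates from 50% by at least
    `threshold` percentage points (assumed selection criterion)."""
    return {h for h, acc in factual_acc.items() if abs(acc - 50.0) >= threshold}


def jaccard(a: Set[Head], b: Set[Head]) -> float:
    """Jaccard overlap of two head sets; 1.0 means identical selections."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical per-head factual accuracies (%).
scores = {(20, 3): 92.0, (20, 7): 12.0, (25, 1): 71.0, (10, 4): 55.0, (5, 0): 48.0}

baseline = select_heads(scores, threshold=20.0)
for thr in (5.0, 20.0, 30.0):
    chosen = select_heads(scores, thr)
    print(thr, len(chosen), round(jaccard(baseline, chosen), 3))
```

An overlap that stays close to 1.0 across a range of thresholds would indicate that the selected set is not an artifact of one cutoff.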

Circularity Check

0 steps flagged

No significant circularity: standard methods on new dataset

full rationale

The paper introduces the WHOOPS-AHA! dataset of multimodal counterfactual queries and applies established logit inspection plus activation patching from prior interpretability literature to locate attention heads. No equations or claims reduce by construction to fitted parameters defined in terms of the target result, no self-citation chains justify uniqueness or ansatzes, and the central findings are empirical observations rather than self-referential derivations. The analysis is therefore self-contained against external benchmarks in mechanistic interpretability.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new dataset and applies existing mechanistic interpretability tools; it does not introduce new free parameters, axioms beyond standard transformer assumptions, or invented entities.

axioms (1)
  • domain assumption: Attention heads in transformer-based VLMs can be causally responsible for specific behavioral decisions such as modality preference.
    The intervention experiments rest on this standard assumption from mechanistic interpretability.

pith-pipeline@v0.9.0 · 5653 in / 1266 out tokens · 31781 ms · 2026-05-19T03:47:58.511764+00:00 · methodology



Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 7 internal anchors

  1. [1]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Biń...

  2. [2]

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. https://doi.org/10.48550/ARXIV.2303.08112 Eliciting latent predictions from transformers with the tuned lens . CoRR, abs/2303.08112

  3. [3]

    Hung-Ting Chen, Michael Zhang, and Eunsol Choi. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.146 Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2292--2307, Abu Dhabi, United Arab Emirates. ...

  4. [4]

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. 2023. https://doi.org/10.48550/ARXIV.2311.03287 Holistic analysis of hallucination in gpt-4v(ision): Bias and interference challenges . CoRR, abs/2311.03287

  5. [5]

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. https://doi.org/10.18653/V1/2022.ACL-LONG.581 Knowledge neurons in pretrained transformers . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022 , pages 8493--8502. Associat...

  6. [6]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

  7. [7]

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.446 Transformer feed-forward layers are key-value memories . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021 , pages 5484--5495...

  8. [8]

    Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, and Carsten Eickhoff. 2025a. https://doi.org/10.48550/ARXIV.2505.17127 Pixels versus priors: Controlling knowledge priors in vision-language models through visual counterfacts. CoRR, abs/2505.17127

  9. [9]

    Michal Golovanevsky, William Rudman, Vedant Palit, Carsten Eickhoff, and Ritambhara Singh. 2025b. https://aclanthology.org/2025.naacl-long.571/ What do VLMs NOTICE? A mechanistic interpretability pipeline for Gaussian-noise-free text-image corruption and evaluation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the ...

  10. [10]

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2024. https://arxiv.org/abs/2310.14566 Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models . Preprint, arXiv:2310.14566

  11. [11]

    Nitzan Bitton Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. 2023. https://doi.org/10.1109/ICCV51070.2023.00247 Breaking common sense: Whoops! A vision-and-language benchmark of synthetic and compositional images . In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, O...

  12. [12]

    Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt. 2023. https://doi.org/10.48550/arXiv.2307.09476 Overthinking the truth: Understanding how language models process false demonstrations. CoRR, abs/2307.09476

  13. [13]

    Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe Diao, Yong Lin, and Tong Zhang. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.904 The instinctive bias: Spurious images lead to illusion in MLLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16163--16177, Miami, Florida, USA. Associ...

  14. [14]

    Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. 2024. https://doi.org/10.18653/v1/2024.findings-acl.70 Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models . In Findings of the Association for Computational Linguistics: ...

  15. [15]

    Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Luca...

  16. [16]

    Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomáš Kočiský, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. https://openreview.net/forum?id=73OmmrCfSyy Mind the gap: Assessing temporal generalization in neural language...

  17. [17]

    Tiep Le, Vasudev Lal, and Phillip Howard. 2023. http://papers.nips.cc/paper_files/paper/2023/hash/e14e4cb8266184ceb234973dfe07faed-Abstract-Datasets_and_Benchmarks.html Coco-counterfactuals: Automatically constructed counterfactual examples for image-text pairs. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Infor...

  18. [18]

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2023. https://arxiv.org/abs/2311.16922 Mitigating object hallucinations in large vision-language models through visual contrastive decoding . Preprint, arXiv:2311.16922

  19. [19]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. https://proceedings.mlr.press/v162/li22n.html BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12888--12...

  20. [20]

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2024a. https://openreview.net/forum?id=J44HfH4JCg Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations

  21. [21]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024b. https://llava-vl.github.io/blog/2024-01-30-llava-next/ Llava-next: Improved reasoning, ocr, and world knowledge

  22. [22]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf Visual instruction tuning . In Advances in Neural Information Processing Systems, volume 36, pages 34892--34916. Curran Associates, Inc

  23. [23]

    Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. 2025. https://arxiv.org/abs/2403.11116 Phd: A chatgpt-prompted visual hallucination evaluation dataset . Preprint, arXiv:2403.11116

  24. [24]

    Xiaoyuan Liu, Wenxuan Wang, Youliang Yuan, Jen tse Huang, Qiuzhi Liu, Pinjia He, and Zhaopeng Tu. 2024c. https://arxiv.org/abs/2410.08145 Insight over sight? exploring the vision-knowledge conflicts in multimodal llms. Preprint, arXiv:2410.08145

  25. [25]

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2024d. https://arxiv.org/abs/2306.05499 Prompt injection attack against llm-integrated applications. Preprint, arXiv:2306.05499

  26. [26]

    Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.565 Entity-based knowledge conflicts in question answering . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052--7063, Online and Punta Cana, Dominican Republic....

  27. [27]

    Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A. Smith. 2022. https://doi.org/10.18653/v1/2022.naacl-main.435 Time waits for no one! analysis and challenges of temporal misalignment . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...

  28. [28]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. http://papers.nips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html Locating and editing factual associations in GPT. In NeurIPS

  29. [29]

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. https://openreview.net/pdf?id=9XFSbDPmdW Progress measures for grokking via mechanistic interpretability . In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net

  30. [30]

    Nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens interpreting gpt: the logit lens . Accessed: Nov 2023

  31. [31]

    Francesco Ortu, Zhijing Jin, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, and Bernhard Schölkopf. 2024. https://doi.org/10.18653/V1/2024.ACL-LONG.458 Competition of mechanisms: Tracing how language models handle facts and counterfactuals. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

  32. [32]

    Chameleon Team. 2024. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818

  33. [33]

    Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. 2024. https://openreview.net/forum?id=ptvV5HGTNN Resolving knowledge conflicts in large language models . In First Conference on Language Modeling

  34. [34]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...

  35. [35]

    Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.486 Knowledge conflicts for LLMs: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8541--8565, Miami, Florida, USA. Association for Computational Linguistics

  36. [36]

    Qinan Yu, Jack Merullo, and Ellie Pavlick. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.615 Characterizing mechanisms for factual recall in language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9924--9959, Singapore. Association for Computational Linguistics
