
arxiv: 2509.14837 · v2 · submitted 2025-09-18 · 💻 cs.CL

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

Pith reviewed 2026-05-18 16:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords visual semantic editing · attention modulation · causal interpretability · vision-language models · VQA benchmarks · multimodal integration · attention heads

The pith

V-SEAM uses concept-level visual edits to identify attention heads that causally shape vision-language model predictions and modulates them to raise VQA accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents V-SEAM to move causal interpretability in vision-language models beyond coarse pixel changes to targeted concept edits at the level of objects, attributes, and relationships. By intervening on images at this semantic level, the method locates attention heads whose activity supports or opposes correct predictions. Positive heads cluster within a given semantic level, while negative heads appear more broadly shared. An automatic modulation step then adjusts the embeddings of these heads and produces measurable gains on standard VQA benchmarks for both LLaVA and InstructBLIP.

Core claim

V-SEAM combines visual semantic editing with attention-head modulation to reveal that specific heads contribute positively or negatively to VLM predictions at three distinct semantic levels, with positive heads largely shared inside each level and negative heads generalizing across levels; automatic embedding modulation of the identified heads then improves performance on three VQA benchmarks for LLaVA and InstructBLIP.

What carries the argument

V-SEAM framework that performs concept-level visual semantic edits to intervene on inputs and then selects and modulates attention-head embeddings according to their measured positive or negative causal effect.
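
As a rough illustration of the select-then-modulate idea, the sketch below scores each head by how much the correct-answer logit drops when that head is ablated, then assigns a scaling factor by sign. Everything here is an assumption made for illustration: the ablation-based score, the function names (`head_scores`, `split_heads`, `modulation_factors`), the `alpha` parameter, and the toy numbers are hypothetical and are not taken from the paper or its released code.

```python
from statistics import mean


def head_scores(logit_intact, logit_ablated):
    """Mean drop in the correct-answer logit when each head is ablated.

    logit_intact: list of floats, one per example (all heads active).
    logit_ablated: dict mapping a (layer, head) id to a list of floats,
        the same logit with that single head ablated.
    A positive score means the head supported the correct answer.
    """
    return {
        head: mean(i - a for i, a in zip(logit_intact, ablated))
        for head, ablated in logit_ablated.items()
    }


def split_heads(scores, threshold=0.0):
    """Partition heads into positive and negative contributors."""
    positive = sorted(h for h, s in scores.items() if s > threshold)
    negative = sorted(h for h, s in scores.items() if s < -threshold)
    return positive, negative


def modulation_factors(scores, alpha=0.5, threshold=0.0):
    """Hypothetical scaling: amplify positive heads, damp negative ones."""
    factors = {}
    for head, s in scores.items():
        if s > threshold:
            factors[head] = 1.0 + alpha
        elif s < -threshold:
            factors[head] = 1.0 - alpha
        else:
            factors[head] = 1.0
    return factors


# Toy data (made up): three examples, three heads.
intact = [2.0, 1.0, 3.0]
ablated = {
    (0, 1): [1.0, 0.0, 2.0],  # ablation hurts, so the head is positive
    (0, 2): [2.5, 1.5, 3.5],  # ablation helps, so the head is negative
    (1, 0): [2.0, 1.0, 3.0],  # ablation changes nothing
}
scores = head_scores(intact, ablated)
positive, negative = split_heads(scores)
factors = modulation_factors(scores, alpha=0.5)
print(positive, negative, factors)
```

In a real model the factors would be applied to the per-head outputs inside the attention layers; how that is done depends on the model implementation and is left out of this sketch.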

If this is right

  • Positive attention heads are shared within each semantic level but differ across object, attribute, and relationship levels.
  • Negative attention heads generalize across semantic levels rather than staying level-specific.
  • Automatic modulation of the selected head embeddings raises accuracy on three separate VQA benchmarks for both LLaVA and InstructBLIP.
  • The approach supplies a causal account of how multimodal integration occurs inside the attention layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same editing-plus-modulation pipeline could be tested on additional VLMs beyond the two reported here to check consistency of head roles.
  • If semantic levels are handled by partly distinct head groups, future architectures might explicitly route or regularize those groups to reduce cross-level interference.
  • Causal head identification of this kind offers a route to targeted model editing that avoids retraining entire networks.

Load-bearing premise

The attention heads located by the semantic editing interventions are causally responsible for the observed predictions rather than merely correlated with them.

What would settle it

Modulating the identified heads produces no performance gain or introduces new errors on the three VQA benchmarks while leaving the original predictions unchanged under the same visual edits.

Figures

Figures reproduced from arXiv: 2509.14837 by Junjie Hu, Ming Jiang, Qidong Wang.

Figure 1: An example of visual intervention comparisons: Visual non-semantic vs. semantic interventions.
Figure 2: Our proposed semantic-level causal interpretability framework. The pipeline starts from semantic-guided …
Figure 3: Logit change analysis for image and query token patching in LLaVA and InstructBLIP.
Figure 4: Layer-wise causal impact of MLP (blue) and self-attention (green) on cross-modal semantic understanding.
Figure 5: Case study of the MLP (blue) and self-attention (green) logit lens in LLaVA on the Action VQA task.
Figure 6: Visual comparison of different perturbation strategies for the object-level task ("Is the sky blue?").
Figure 7: Visual comparison of different perturbation strategies for the object-level task ("Is there a sandwich in this …
Figure 8: Visual comparison of different perturbation strategies for the relation-level task ("Is the man standing …
Figure 9: Case of MLP (blue) and self-attention (green) Logit Lens in LLaVA on Material VQA.
Figure 10: Case of MLP (blue) and self-attention (green) Logit Lens in InstructBLIP on Indoor VQA.
Figure 11: Layer-wise causal impact of MLP and self-attention on the Material VQA task
Figure 12: Layer-wise causal impact of attention and MLP on the Vehicle VQA task
Figure 13: Layer-wise causal impact of MLP and self-attention on the Animal VQA task
Figure 14: Layer-wise causal impact of MLP and self-attention on the Spatial VQA task

Original abstract

Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLaVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces V-SEAM, a framework combining visual semantic editing at object, attribute, and relationship levels with attention-head modulation to probe causal mechanisms in vision-language models. It identifies heads with positive or negative contributions to predictions, reports patterns of head sharing within and across semantic levels, and claims that automatic modulation of selected heads improves VQA performance for LLaVA and InstructBLIP on three benchmarks.

Significance. If the causal claims are substantiated, the work would advance multimodal interpretability by moving beyond pixel-level perturbations to concept-level visual interventions and by linking identified heads to measurable performance gains. The public release of data and code is a clear strength that supports reproducibility and follow-up studies.

major comments (3)
  1. [§4] §4 (Head Identification via Semantic Edits): The claim that identified heads are causally responsible for predictions at each semantic level rests on the assumption that the visual edits isolate the targeted concept without introducing correlated pixel-level or cross-level changes. No ablation or control experiments (e.g., non-semantic visual perturbations or cross-level edit controls) are reported to rule out these confounds; this directly affects the validity of the positive/negative head attributions.
  2. [§5.2] §5.2 (Automatic Modulation and Benchmark Results): The modulation procedure re-uses heads selected from the same editing interventions that are later used to demonstrate performance gains. Without a held-out validation split or explicit non-causal control conditions (random heads, shuffled attributions), the reported improvements on the three VQA benchmarks may partly reflect selection bias rather than causal modulation, weakening the link between interpretability findings and performance claims.
  3. [§4.3] §4.3 (Head-Sharing Patterns): The observation that positive heads are shared within semantic levels but vary across levels lacks statistical controls for multiple comparisons or baseline comparisons against randomly selected heads; without these, the reported patterns could arise from noise or model idiosyncrasies rather than genuine semantic specialization.
minor comments (2)
  1. [Abstract] The abstract states results on 'three diverse VQA benchmarks' but does not name them; listing the specific datasets (e.g., VQAv2, GQA, OK-VQA) would improve immediate clarity.
  2. [§3] Notation for contribution scores (positive/negative head metrics) is introduced without an explicit equation; adding a short formal definition would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the causal claims in V-SEAM, and we address each point below with plans for revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Head Identification via Semantic Edits): The claim that identified heads are causally responsible for predictions at each semantic level rests on the assumption that the visual edits isolate the targeted concept without introducing correlated pixel-level or cross-level changes. No ablation or control experiments (e.g., non-semantic visual perturbations or cross-level edit controls) are reported to rule out these confounds; this directly affects the validity of the positive/negative head attributions.

    Authors: We agree that explicit controls are needed to rule out potential confounds and fully support the causal attributions. Our semantic editing method targets specific concepts (objects, attributes, relationships) through precise, localized visual changes intended to isolate the relevant semantics while preserving other elements. However, to directly address this concern, we will add ablation experiments in the revised manuscript, including non-semantic perturbations (such as random pixel noise or unrelated edits) and cross-level edit controls. These will quantify whether the positive/negative head identifications are specific to the targeted semantic interventions. revision: yes

  2. Referee: [§5.2] §5.2 (Automatic Modulation and Benchmark Results): The modulation procedure re-uses heads selected from the same editing interventions that are later used to demonstrate performance gains. Without a held-out validation split or explicit non-causal control conditions (random heads, shuffled attributions), the reported improvements on the three VQA benchmarks may partly reflect selection bias rather than causal modulation, weakening the link between interpretability findings and performance claims.

    Authors: We acknowledge the potential selection bias arising from reusing the same interventions for both head selection and performance evaluation. In the revision, we will introduce a held-out validation split to separate head identification from the final modulation experiments. We will also add explicit control conditions using randomly selected heads and shuffled attributions. These additions will strengthen the evidence that the observed VQA improvements for LLaVA and InstructBLIP on the three benchmarks result from causal modulation of the identified heads rather than bias. revision: yes

  3. Referee: [§4.3] §4.3 (Head-Sharing Patterns): The observation that positive heads are shared within semantic levels but vary across levels lacks statistical controls for multiple comparisons or baseline comparisons against randomly selected heads; without these, the reported patterns could arise from noise or model idiosyncrasies rather than genuine semantic specialization.

    Authors: We will revise §4.3 to include rigorous statistical controls. This will involve applying corrections for multiple comparisons across the tested heads and semantic levels, as well as baseline comparisons against randomly sampled head sets of equivalent size. These analyses will help establish whether the observed within-level sharing and cross-level variation reflect genuine semantic specialization beyond what would be expected from noise or model-specific artifacts. revision: yes
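
The random-baseline comparison and multiple-comparison correction discussed in the third response could take roughly the following shape. This is a minimal sketch under stated assumptions, not the authors' planned analysis: the Jaccard overlap measure, the Monte Carlo baseline, the Holm step-down correction, the function names, and the 32 × 32 head grid are all choices made here for illustration.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of head ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def random_overlap_pvalue(set_a, set_b, all_heads, n_draws=2000, seed=0):
    """Monte Carlo p-value: how often do random head sets of the same
    sizes overlap at least as much as the observed sets?"""
    rng = random.Random(seed)
    observed = jaccard(set_a, set_b)
    hits = 0
    for _ in range(n_draws):
        rand_a = rng.sample(all_heads, len(set_a))
        rand_b = rng.sample(all_heads, len(set_b))
        if jaccard(rand_a, rand_b) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_draws + 1)


def holm_adjust(pvalues):
    """Holm step-down adjustment for a list of p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical 32-layer, 32-head model; the two "tasks" share 8 of 10 heads.
all_heads = [(layer, head) for layer in range(32) for head in range(32)]
task_a = all_heads[:10]
task_b = all_heads[2:12]
observed, p = random_overlap_pvalue(task_a, task_b, all_heads, n_draws=500)
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(observed, p, adjusted)
```

One p-value would be computed per pair of tasks or levels, and the full list passed through `holm_adjust` before any within-level or cross-level sharing pattern is reported as significant.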

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents V-SEAM as a sequential framework: visual semantic edits at object/attribute/relationship levels are used to measure changes in predictions and thereby identify positive/negative attention heads, followed by a separate automatic modulation step applied to those heads to report benchmark gains on VQA tasks. No equations or self-citations are shown that reduce the identification or modulation steps to a direct fit or re-use of the same quantities by construction. The performance claims rest on empirical application across models and benchmarks rather than a closed definitional loop, and the released code allows external verification independent of any internal fitting procedure described in the manuscript.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard interpretability assumptions about the validity of semantic interventions and the causal role of attention heads; no explicit free parameters or new entities are named in the abstract.

axioms (1)
  • Domain assumption: Semantic edits at the concept level produce valid causal interventions that isolate contributions at object, attribute, and relationship levels.
    This premise underpins the head identification step described in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 6 internal anchors

  1. [1]

    Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

  2. [2]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, and Iain Barr. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems

  3. [3]

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674--3683

  4. [4]

    Yuxuan Bai, Hao Cheng, Yuwei Gu, and others. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2310.07904

  5. [5]

    Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti. 2024. Understanding information storage and transfer in multi-modal large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=s63dtq0mwA

  6. [6]

    Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219

  7. [7]

    Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. 2024. Lvlm-interpret: An interpretability tool for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8182--8187

  8. [8]

    Blair Bilodeau, Natasha Jaques, Pang Wei Koh, and Been Kim. 2024. Impossibility theorems for feature attribution. Proceedings of the National Academy of Sciences, 121(2):e2304406120

  9. [9]

    Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. 2025. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773

  10. [10]

    Weijie Chen, Yizhe Zhang, Qian Wu, and others. 2024. Internvl: Scaling up vision-language pretraining with multimodal reinforcement learning. arXiv preprint arXiv:2402.00028

  11. [11]

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600

  12. [12]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS) 36

  13. [13]

    Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. 2023. Interpreting clip's image representation via text-based decomposition. arXiv preprint arXiv:2310.05916

  14. [14]

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767

  15. [15]

    Michal Golovanevsky, William Rudman, Vedant Palit, Carsten Eickhoff, and Ritambhara Singh. 2025. What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation. https://aclanthology.org/2025.naacl-long.571/ In Proceedings of the 2025 Conference of the North American Chapter of the Association for...

  16. [16]

    Lisa Anne Hendricks and Aida Nematzadeh. 2021. Probing image-language transformers for verb understanding. arXiv preprint arXiv:2106.09141

  17. [17]

    Kaichen Huang, Jiahao Huo, Yibo Yan, Kun Wang, Yutao Yue, and Xuming Hu. 2024. Miner: Mining the underlying pattern of modality-specific neurons in multimodal large language models. arXiv preprint arXiv:2410.04819

  18. [18]

    Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

  19. [19]

    Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu. 2024. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193

  20. [20]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, and others. 2023. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015--4026

  21. [21]

    Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 388--395

  22. [22]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and others. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

  23. [23]

    Junnan Li, Dongxu Li, Yixuan Xie, Yixuan Guo, Xiyang Dai, Jianfeng Gao, Jianwei Sang, and Lijuan Wang. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML)

  24. [24]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023b. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355

  25. [25]

    Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, and others. 2025. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516

  26. [26]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296--26306

  27. [27]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  28. [28]

    Y Liu, Y Zhang, and S Yeung-Levy. 2025. Mechanistic interpretability meets vision language models: Insights and limitations. In The Fourth Blogpost Track at ICLR 2025

  29. [29]

    Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems, 29

  30. [30]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359--17372

  31. [31]

    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. 2024. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149

  32. [32]

    Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. 2023. Towards vision-language mechanistic interpretability: A causal tracing tool for blip. In Proceedings of the ICCV Workshop on Computational Linguistics for Vision and Language (CLVL). https://arxiv.org/abs/2308.14179

  33. [33]

    Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, and Andreas Bulling. 2023. Multimodal integration of human-like attention in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2648--2658

  34. [34]

    Student. 1908. The probable error of a mean. Biometrika, pages 1--25

  35. [35]

    Shimon Ullman. 1987. Visual routines. In Readings in computer vision, pages 298--328. Elsevier

  36. [36]

    Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388--12401

  37. [37]

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, and others. 2024. Cogvlm: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475--121499

  38. [38]

    Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, and Sijie Zhu. 2025. Where do large vision-language models look at when answering questions? arXiv preprint arXiv:2503.13891

  39. [39]

    Fred Zhang and Neel Nanda. 2023. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042

  40. [40]

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5579--5588

  41. [41]

    Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, and Jieping Ye. 2024. From redundancy to relevance: Enhancing explainability in multimodal large language models. arXiv e-prints, pages arXiv--2406

  42. [42]

    Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1--38

  43. [43]

    Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. 2024. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In European Conference on Computer Vision, pages 260--278. Springer

  44. [44]

    Deyao Zhu, Xiangxin Zhou, Xiang Wang, Xiubo Geng, Fan Liu, and Jiashen Zhu. 2024. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592

  45. [45]

    Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. 2024. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision, pages 195--211. Springer
