pith. machine review for the scientific record.

arxiv: 2605.13737 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL

Recognition: unknown

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:03 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords omnimodal LLMs · representation-action gap · multimodal grounding · premise-perception mismatch · IMAVB benchmark · conflict detection · hidden state probes

The pith

Omnimodal LLMs encode premise-perception mismatches in hidden states but almost never reject the conflicting claims in their outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests basic grounding in models that process video, audio, and text together: whether they notice when a question's text contradicts their own sensory input. It introduces a 500-clip benchmark called IMAVB that crosses target modality (vision or audio) with premise condition (standard or misleading). Across eight open models and Gemini, hidden states reliably flag the mismatch, yet the models answer as if the false premise were true. The gap shows up in two behavioral patterns, under-rejection of misleading questions and over-rejection that also harms normal comprehension; it resists prompt changes and is worse for audio than for vision. A simple probe-guided adjustment to the output logits improves rejection without retraining.
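
The hidden-state evidence in that summary comes from probing. A minimal sketch of how such a mismatch probe might be trained, assuming per-question hidden states have already been extracted; the file names, layer choice, and pooling are illustrative, not the paper's exact setup:

```python
# Minimal sketch: train a linear probe to detect premise-perception mismatch
# from hidden states. Paths, layer, and pooling are assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hidden_states = np.load("hidden_states.npy")   # (n_questions, hidden_dim), e.g. last-token state at one layer
labels = np.load("premise_labels.npy")         # (n_questions,): 1 = misleading premise, 0 = standard

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, labels, test_size=0.2, stratify=labels, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe mismatch-detection accuracy: {probe.score(X_te, y_te):.3f}")
```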

Core claim

Hidden states in omnimodal LLMs reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs, revealing a representation-action gap that is modality-asymmetric and prompt-resistant.

What carries the argument

The Representation-Action Gap: internal hidden-state encoding of premise-perception conflict versus behavioral failure to reject the false premise during answer generation.

If this is right

  • A probe-guided logit adjustment that re-injects the mismatch signal improves rejection rates across models (a sketch of the general mechanism follows this list).
  • Audio grounding lags behind vision in both detection and rejection.
  • Seven prompt variants fail to close the gap, indicating the issue is not easily fixed at inference time.
  • Models split into under-rejection (accepting false premises) and over-rejection (rejecting valid questions too) patterns.
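
The probe-guided logit adjustment (PGLA) named in the first bullet is not specified in this summary, so the following is only a sketch of the general idea under assumed inputs: a trained linear mismatch probe and a hand-chosen set of rejection-initiating token ids, both hypothetical.

```python
import torch

def probe_guided_logit_adjustment(
    logits: torch.Tensor,            # (vocab_size,) next-token logits
    hidden_state: torch.Tensor,      # (hidden_dim,) state fed to the mismatch probe
    probe_w: torch.Tensor,           # (hidden_dim,) linear probe weights
    probe_b: float,                  # probe bias
    rejection_token_ids: list[int],  # ids of tokens that can start a rejection ("No", "Actually", ...)
    alpha: float = 2.0,              # adjustment strength; value is hypothetical
) -> torch.Tensor:
    """Sketch of the general mechanism: re-inject the probe's mismatch signal
    into decoding by boosting rejection-initiating tokens in proportion to the
    probe's confidence that the premise conflicts with the sensory input.
    Not the paper's exact PGLA formulation."""
    mismatch_prob = torch.sigmoid(hidden_state @ probe_w + probe_b)
    adjusted = logits.clone()
    adjusted[rejection_token_ids] += alpha * mismatch_prob
    return adjusted
```

Because the adjustment happens only at decoding time, no weights change, which matches the "without retraining" framing in the pith above.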

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bottleneck for omnimodal grounding is in how encoded mismatches are translated into output decisions rather than in perception itself.
  • Similar gaps may appear in other agentic settings where internal state must control external actions.
  • Interventions that directly use hidden-state probes could serve as lightweight safety checks for multimodal systems.

Load-bearing premise

That the IMAVB clips and questions cleanly separate grounding failures from confounds in clip choice, phrasing, or training data, and that probe detection of hidden-state signals accurately tracks what the model functionally knows.

What would settle it

A direct comparison showing that the accuracy of a linear probe on hidden states for mismatch detection does not predict the model's actual rejection rate on the same misleading-premise questions better than random guessing.
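
As a concrete form of that comparison, one could correlate per-model probe accuracy with per-model rejection rate on the misleading split; the sketch below uses placeholder numbers, not values from the paper.

```python
# Sketch of the settling comparison: does per-model probe accuracy predict
# per-model rejection rate on the same misleading-premise questions?
# All numbers below are placeholders, not results from the paper.
import numpy as np
from scipy.stats import spearmanr

probe_accuracy = np.array([0.91, 0.88, 0.93, 0.85])  # hidden-state mismatch detection, per model
rejection_rate = np.array([0.07, 0.12, 0.05, 0.21])  # behavioral rejection on the misleading split, per model

rho, p_value = spearmanr(probe_accuracy, rejection_rate)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A per-item version would condition on the same questions: probe prediction
# per question versus whether the model actually rejects that question.
```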

Figures

Figures reproduced from arXiv: 2605.13737 by Fanyi Pu, Kaichen Zhang, Shuo Sun, Trung Nguyen Quang, Yiming Gao, Ziwei Liu.

Figure 1
Figure 1: Overview of the Representation–Action Gap on IMAVB.
Figure 2
Figure 2: IMAVB three-pass annotation pipeline.
Figure 3
Figure 3: A 2 × 2 IMAVB sample. Rows target vision and audio; columns hold the standard or misleading premise; the same stimulus (filmstrip and audio cue above) drives both columns of a row. Misleading variants swap exactly one premise detail (red). Red boxes in the filmstrip mark the woman in the red dress (frames 6–8). More examples in Appendix O.
Figure 4
Figure 4: Annotation tool interface for standard questions. The annotator views the video, reads the question with six options (A–F), and rates question clarity, answer correctness, and timestamp correctness on a 1–4 Likert scale. The full rubric is shown inline beneath each radio group so raters never have to recall the scale. Metadata shows answer timestamp, modality, and question category.
Figure 5
Figure 5: Annotation tool interface for misleading questions. In addition to the three 1–4 scales used for standard items, the tool displays the misleading category and a human-readable description of the detail swap. The annotator additionally flags whether the misleading premise is valid (Yes/No).
Figure 6
Figure 6: Accuracy by video duration bin, per split. Standard accuracy degrades with duration.
Figure 7
Figure 7: Accuracy by answer evidence position ratio. Misleading detection is independent of answer evidence position.
Figure 8
Figure 8: Example 1. Frames 1–5: a man in a dark industrial setting approaches machinery. Frames 6–10: a creature on a monitor wears a yellow hard hat (red boxes). Visual swap: hard-hat colour (yellow → blue). Audio swap: music style (late-1980s video games → 1950s jazz club).
Figure 9
Figure 9: Example 2. Frames 1–5: a car drives at night. Frames 6–10: interior shot of a man and a woman; red boxes on frames 8–9 mark the brown suit jacket, a small-region detail that makes even the standard variant challenging. Visual swap: jacket colour (brown → blue). Audio insertion: a fabricated “loud and angry tone” is added to the woman’s speech, though she actually speaks softly.
read the original abstract

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

Empirical benchmark study with no circular derivation steps

full rationale

The paper introduces the IMAVB benchmark and reports empirical measurements of hidden-state probe accuracy versus output rejection rates across models. No equations, fitted parameters, or self-referential definitions are used; the Representation-Action Gap is presented as a direct observational result from applying probes to existing models on curated clips. The 2x2 design and PGLA intervention are experimental interventions, not derivations that reduce to their own inputs by construction. This is a standard empirical analysis with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that probe-detected hidden-state signals constitute functional encoding of mismatches and that the IMAVB clips provide an unbiased test of grounding.

axioms (1)
  • domain assumption Probe-based detection of hidden-state mismatches accurately reflects functional encoding of premise-perception conflicts.
    The paper interprets hidden-state probe results as evidence that perception occurs, which underpins the claim that the bottleneck is in translation rather than perception.
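
One standard way to stress this axiom, not taken from the paper, is a shuffled-label control, a simplified analogue of the control tasks of Hewitt and Liang [44]: if a probe trained on randomly permuted labels scores nearly as well, high probe accuracy reflects probe capacity rather than what the hidden states encode. A minimal sketch, with placeholder file paths:

```python
# Sketch of a shuffled-label control for the probe (assumption check, not from
# the paper). If selectivity (real minus control accuracy) is small, high probe
# accuracy says more about probe capacity than about what hidden states encode.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

hidden_states = np.load("hidden_states.npy")   # placeholder path
labels = np.load("premise_labels.npy")         # placeholder path

control_labels = np.random.default_rng(0).permutation(labels)  # random relabeling

selectivity = probe_accuracy(hidden_states, labels) - probe_accuracy(hidden_states, control_labels)
print(f"selectivity = {selectivity:.3f}")
```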

pith-pipeline@v0.9.0 · 5584 in / 1236 out tokens · 43761 ms · 2026-05-14T18:03:22.061288+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

99 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2026

  2. [2]

    A new era of intelligence with Gemini 3, 2026

    Google Blog. A new era of intelligence with Gemini 3, 2026. URL https://blog.google/products-and-platforms/products/gemini/gemini-3/

  3. [3]

    Qwen3-omni technical report, 2025

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, et al. Qwen3-omni technical report, 2025

  4. [4]

    The internal state of an llm knows when it’s lying, 2023

    Amos Azaria and Tom M. Mitchell. The internal state of an llm knows when it’s lying, 2023

  5. [5]

    Discovering latent knowledge in language models without supervision, 2022

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022

  6. [6]

    The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2023

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2023

  7. [7]

    Representation engineering: A top-down approach to ai transparency, 2023

    Andy Zou, Long Phan, Sarah Chen, James Campbell, et al. Representation engineering: A top-down approach to ai transparency, 2023

  8. [8]

    Inference-time intervention: Eliciting truthful answers from a language model, 2023

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023

  9. [9]

    LLMs know more than they show: On the intrinsic representation of LLM hallucinations, 2024

    Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLMs know more than they show: On the intrinsic representation of LLM hallucinations, 2024

  10. [10]

    Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

    Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, and Kurt Stockinger. Arbitration failure, not perceptual blindness: How vision-language models resolve visual-linguistic conflicts. arXiv preprint arXiv:2604.09364, 2026

  11. [11]

    Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

    Haruka Kawasaki, Ryota Tanaka, and Kyosuke Nishida. Responses fall short of understanding: Revealing the gap between internal representations and responses in visual document understanding. arXiv preprint arXiv:2604.04411, 2026

  12. [12]

    Diagnosing knowledge conflict in multimodal long-chain reasoning. arXiv preprint arXiv:2602.14518, 2026

    Jing Tang, Kun Wang, Haolang Lu, Hongjin Chen, KaiTao Chen, Zhongxiang Sun, Qiankun Li, Lingjuan Lyu, Guoshun Nan, and Zhigang Zeng. Diagnosing knowledge conflict in multimodal long-chain reasoning. arXiv preprint arXiv:2602.14518, 2026

  13. [13]

    Crepe: Open-domain question answering with false presuppositions

    Xinyan Yu, Sewon Min, Luke Zettlemoyer, and Hannaneh Hajishirzi. Crepe: Open-domain question answering with false presuppositions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10457–10480, 2023

  14. [14]

    (QA)²: Question answering with questionable assumptions

    Najoung Kim, Phu Mon Htut, Samuel R. Bowman, and Jackson Petty. (QA)²: Question answering with questionable assumptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8466–8487, 2023

  15. [15]

    Discovering language model behaviors with model-written evaluations

    Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, 2023

  16. [16]

    Towards understanding sycophancy in language models, 2023

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models, 2023

  17. [17]

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. Simple synthetic data reduces sycophancy in large language models, 2023

  18. [18]

    Have the vlms lost confidence? a study of sycophancy in vlms, 2024

    Shuo Li, Tao Ji, Xiaoran Fan, Linsheng Lu, Leyi Yang, Yuming Yang, Zhiheng Xi, Rui Zheng, Yuran Wang, Xiaohui Zhao, Tao Gui, Qi Zhang, and Xuanjing Huang. Have the vlms lost confidence? a study of sycophancy in vlms, 2024

  19. [19]

    Sycophancy in vision-language models: A systematic analysis and an inference-time mitigation framework, 2024

    Yunpu Zhao, Rui Zhang, Junbin Xiao, et al. Sycophancy in vision-language models: A systematic analysis and an inference-time mitigation framework, 2024

  20. [20]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  21. [21]

    Mohobench: Assessing honesty of multimodal large language models via unanswerable visual questions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 2026

    Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, and Xing Xie. Mohobench: Assessing honesty of multimodal large language models via unanswerable visual questions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 2026

  22. [22]

    Mvbench: A comprehensive multi-modal video understanding benchmark, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2023

  23. [23]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023

    Munan Ning, Bin Zhu, Yujia Xie, et al. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023

  24. [24]

    MMAU: A massive multi-task audio understanding and reasoning benchmark

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TeVAZXr3yv

  25. [25]

    Air-bench: Benchmarking large audio-language models via generative comprehension

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

  26. [26]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

  27. [27]

    Omnibench: Towards the future of universal omni-language models, 2024

    Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, et al. Omnibench: Towards the future of universal omni-language models, 2024

  28. [28]

    Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

    Ziwei Zhou, Rui Wang, Zuxuan Wu, and Yu-Gang Jiang. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

  29. [29]

    Worldsense: Evaluating real-world omnimodal understanding for multimodal llms, 2025

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms, 2025

  30. [30]

    Avhbench: A cross-modal hallucination benchmark for audio-visual large language models, 2024

    Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. Avhbench: A cross-modal hallucination benchmark for audio-visual large language models, 2024

  31. [31]

    Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding

    Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, and Ziwei Liu. Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20626–20636, 2025

  32. [32]

    Binge society (youtube channel), 2026

    Binge Society. Binge society (youtube channel), 2026. URL https://www.youtube.com/@bingesociety

  33. [33]

    Boxoffice movie scenes (youtube channel), 2026

    Boxoffice Movie Scenes. Boxoffice movie scenes (youtube channel), 2026. URL https://www.youtube.com/@BoxofficeMoviesScenes

  34. [34]

    Condensed movies: Story based retrieval with contextual embeddings

    Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. Condensed movies: Story based retrieval with contextual embeddings. In Proceedings of the Asian Conference on Computer Vision, 2020

  35. [35]

    Ola: Pushing the frontiers of omni-modal language model. arXiv preprint arXiv:2502.04328, 2025

    Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model. arXiv preprint arXiv:2502.04328, 2025

  36. [36]

    Omnivinci: Enhancing architecture and data for omni-modal understanding llm, 2025

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, et al. Omnivinci: Enhancing architecture and data for omni-modal understanding llm, 2025

  37. [37]

    Qwen2.5-omni technical report, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, et al. Qwen2.5-omni technical report, 2025

  38. [38]

    Minicpm-v: A gpt-4v level mllm on your phone, 2024

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, et al. Minicpm-v: A gpt-4v level mllm on your phone, 2024

  39. [39]

    Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data, 2025

    Yunxin Li, Xinyu Chen, Shenyuan Jiang, et al. Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data, 2025

  40. [40]

    Baichuan-omni-1.5 technical report, 2025

    Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report, 2025

  41. [41]

    video-salmonn 2: Caption-enhanced audio-visual large language models, 2025

    Changli Tang, Yixuan Li, Yudong Yang, et al. video-salmonn 2: Caption-enhanced audio-visual large language models, 2025

  42. [42]

    LMMs-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025. doi: 10.18653/v1/2025.findings-naacl.51. URL https://aclan...

  43. [43]

    Eliciting latent predictions from transformers with the tuned lens, 2023

    Nora Belrose, Igor Ostrovsky, Lev McKinney, et al. Eliciting latent predictions from transformers with the tuned lens, 2023

  44. [44]

    Designing and interpreting probes with control tasks, 2019

    John Hewitt and Percy Liang. Designing and interpreting probes with control tasks, 2019

  45. [45]

    Amnesic probing: Behavioral explanation with amnesic counterfactuals, 2020

    Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals, 2020

  46. [46]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024

  47. [47]

    Avcd: Mitigating hallucinations in audio-visual large language models through contrastive decoding, 2025

    Chaeyoung Jung, Youngjoon Jang, and Joon Son Chung. Avcd: Mitigating hallucinations in audio-visual large language models through contrastive decoding, 2025

  48. [48]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024

  49. [49]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  50. [50]

    The revolution of multimodal large language models: A survey, 2024

    Davide Caffagni, Federico Cocchi, Luca Barsellotti, et al. The revolution of multimodal large language models: A survey, 2024

  51. [51]

    VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs, 2024

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs, 2024

  52. [52]

    Connector-s: A survey of connectors in multi-modal large language models

    Xun Zhu, Zheng Zhang, Xi Chen, Yiming Shi, Miao Li, and Ji Wu. Connector-s: A survey of connectors in multi-modal large language models. In Proceedings of IJCAI-25, 2025

  53. [53]

    Modality laziness: Everybody’s business is nobody’s business, 2022

    Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Yue Wang, Yang Yuan, and Hang Zhao. Modality laziness: Everybody’s business is nobody’s business, 2022. URL https://openreview.net/forum?id=1eGFH6yYAJn

  54. [54]

    The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024

    Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, et al. The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024

  55. [55]

    A survey of hallucination in large foundation models, 2023

    Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models, 2023

  56. [56]

    The troubling emergence of hallucination in large language models: An extensive definition, quantification, and prescriptive remediations

    Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, et al. The troubling emergence of hallucination in large language models: An extensive definition, quantification, and prescriptive remediations. In EMNLP, 2023

  57. [57]

    Cross-modal information flow in multimodal large language models, 2024

    Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models, 2024

  58. [58]

    Diagnosing and mitigating modality interference in multimodal large language models, 2025

    Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, and Zhe Zhao. Diagnosing and mitigating modality interference in multimodal large language models, 2025

  59. [59]

    Mllms are deeply affected by modality bias, 2025

    Xu Zheng, Chenfei Liao, Yuqian Fu, et al. Mllms are deeply affected by modality bias, 2025

  60. [60]

    Assessing modality bias in video question answering benchmarks with multimodal large language models, 2024

    Jean Park, Kuk Jin Jang, Basam Alasaly, et al. Assessing modality bias in video question answering benchmarks with multimodal large language models, 2024

  61. [61]

    Benchmarking gaslighting negation attacks against multimodal large language models, 2025

    Bin Zhu, Yinxuan Gui, Huiyan Qi, Jingjing Chen, Chong-Wah Ngo, and Ee-Peng Lim. Benchmarking gaslighting negation attacks against multimodal large language models, 2025

  62. [62]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

  63. [63]

    Mitigating hallucinations in large vision-language models with instruction contrastive decoding

    Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In Findings of ACL, 2024

  64. [64]

    Fork-merge decoding: Enhancing multimodal understanding in audio-visual large language models. arXiv preprint arXiv:2505.20873, 2025

    Chaeyoung Jung, Youngjoon Jang, Jongmin Choi, and Joon Son Chung. Fork-merge decoding: Enhancing multimodal understanding in audio-visual large language models. arXiv preprint arXiv:2505.20873, 2025

  65. [65]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024

  66. [66]

    Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2024

    Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, and Peilin Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2024

  67. [67]

    Dola: Decoding by contrasting layers improves factuality in large language models, 2023

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models, 2023

  68. [68]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024

  69. [69]

    Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S. Ryoo. Understanding long videos with multimodal language models, 2024

  70. [70]

    From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding, 2024

    Heqing Zou, Tianze Luo, et al. From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding, 2024

  71. [71]

    Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

  72. [72]

    How to protect yourself from 5g radiation? investigating llm responses to implicit misinformation

    Ruohao Guo, Wei Xu, and Alan Ritter. How to protect yourself from 5g radiation? investigating llm responses to implicit misinformation. In EMNLP, 2025

  73. [73]

    Multihoax: A dataset of multi-hop false-premise questions

    Mohammadamin Shafiei, Hamidreza Saffari, and Nafise Sadat Moosavi. Multihoax: A dataset of multi-hop false-premise questions. In Findings of ACL, 2025

  74. [74]

    Exploring response uncertainty in MLLMs: An empirical evaluation under misleading scenarios, 2024

    Yunkai Dang, Mengxi Gao, Yibo Yan, et al. Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios, 2024


Showing first 74 references.