pith. machine review for the scientific record.

arxiv: 2605.13737 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CL

Recognition: unknown

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:03 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords omnimodal LLMs · representation-action gap · multimodal grounding · premise-perception mismatch · IMAVB benchmark · conflict detection · hidden state probes

The pith

Omnimodal LLMs encode premise-perception mismatches in hidden states but almost never reject the conflicting claims in their outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests basic grounding in models that process video, audio, and text together: whether they notice when a question's text contradicts their own sensory input. It introduces a 500-clip benchmark called IMAVB that crosses target modality (vision or audio) with premise condition (standard or misleading). Across eight open models and Gemini, hidden states reliably flag the mismatch, yet the models answer as if the false premise were true. The gap shows up in two behavioral patterns, under-rejection of misleading questions and over-rejection that also harms normal comprehension; it resists prompt changes and is worse for audio than for vision. A simple probe-guided adjustment to the output logits improves rejection without retraining.
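
The hidden-state evidence in that summary comes from probing. A minimal sketch of how such a mismatch probe might be trained, assuming per-question hidden states have already been extracted; the file names, layer choice, and pooling are illustrative, not the paper's exact setup:

```python
# Minimal sketch: train a linear probe to detect premise-perception mismatch
# from hidden states. Paths, layer, and pooling are assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hidden_states = np.load("hidden_states.npy")   # (n_questions, hidden_dim), e.g. last-token state at one layer
labels = np.load("premise_labels.npy")         # (n_questions,): 1 = misleading premise, 0 = standard

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, labels, test_size=0.2, stratify=labels, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe mismatch-detection accuracy: {probe.score(X_te, y_te):.3f}")
```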

Core claim

Hidden states in omnimodal LLMs reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs, revealing a representation-action gap that is modality-asymmetric and prompt-resistant.

What carries the argument

The Representation-Action Gap: internal hidden-state encoding of premise-perception conflict versus behavioral failure to reject the false premise during answer generation.

If this is right

  • A probe-guided logit adjustment that re-injects the mismatch signal improves rejection rates across models (a sketch of the general mechanism follows this list).
  • Audio grounding lags behind vision in both detection and rejection.
  • Seven prompt variants fail to close the gap, indicating the issue is not easily fixed at inference time.
  • Models split into under-rejection (accepting false premises) and over-rejection (rejecting valid questions too) patterns.
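
The probe-guided logit adjustment (PGLA) named in the first bullet is not specified in this summary, so the following is only a sketch of the general idea under assumed inputs: a trained linear mismatch probe and a hand-chosen set of rejection-initiating token ids, both hypothetical.

```python
import torch

def probe_guided_logit_adjustment(
    logits: torch.Tensor,            # (vocab_size,) next-token logits
    hidden_state: torch.Tensor,      # (hidden_dim,) state fed to the mismatch probe
    probe_w: torch.Tensor,           # (hidden_dim,) linear probe weights
    probe_b: float,                  # probe bias
    rejection_token_ids: list[int],  # ids of tokens that can start a rejection ("No", "Actually", ...)
    alpha: float = 2.0,              # adjustment strength; value is hypothetical
) -> torch.Tensor:
    """Sketch of the general mechanism: re-inject the probe's mismatch signal
    into decoding by boosting rejection-initiating tokens in proportion to the
    probe's confidence that the premise conflicts with the sensory input.
    Not the paper's exact PGLA formulation."""
    mismatch_prob = torch.sigmoid(hidden_state @ probe_w + probe_b)
    adjusted = logits.clone()
    adjusted[rejection_token_ids] += alpha * mismatch_prob
    return adjusted
```

Because the adjustment happens only at decoding time, no weights change, which matches the "without retraining" framing in the pith above.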

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bottleneck for omnimodal grounding is in how encoded mismatches are translated into output decisions rather than in perception itself.
  • Similar gaps may appear in other agentic settings where internal state must control external actions.
  • Interventions that directly use hidden-state probes could serve as lightweight safety checks for multimodal systems.

Load-bearing premise

That the IMAVB clips and questions cleanly separate grounding failures from confounds in clip choice, phrasing, or training data, and that probe detection of hidden-state signals accurately tracks what the model functionally knows.

What would settle it

A direct comparison showing that the accuracy of a linear probe on hidden states for mismatch detection does not predict the model's actual rejection rate on the same misleading-premise questions better than random guessing.
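
As a concrete form of that comparison, one could correlate per-model probe accuracy with per-model rejection rate on the misleading split; the sketch below uses placeholder numbers, not values from the paper.

```python
# Sketch of the settling comparison: does per-model probe accuracy predict
# per-model rejection rate on the same misleading-premise questions?
# All numbers below are placeholders, not results from the paper.
import numpy as np
from scipy.stats import spearmanr

probe_accuracy = np.array([0.91, 0.88, 0.93, 0.85])  # hidden-state mismatch detection, per model
rejection_rate = np.array([0.07, 0.12, 0.05, 0.21])  # behavioral rejection on the misleading split, per model

rho, p_value = spearmanr(probe_accuracy, rejection_rate)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A per-item version would condition on the same questions: probe prediction
# per question versus whether the model actually rejects that question.
```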

Figures

Figures reproduced from arXiv: 2605.13737 by Fanyi Pu, Kaichen Zhang, Shuo Sun, Trung Nguyen Quang, Yiming Gao, Ziwei Liu.

Figure 1
Figure 1: Overview of the Representation–Action Gap on IMAVB.
Figure 2
Figure 2: IMAVB three-pass annotation pipeline.
Figure 3
Figure 3: A 2 × 2 IMAVB sample. Rows target vision and audio; columns hold the standard or misleading premise; the same stimulus (filmstrip and audio cue above) drives both columns of a row. Misleading variants swap exactly one premise detail (red). Red boxes in the filmstrip mark the woman in the red dress (frames 6–8). More examples in Appendix O.
Figure 4
Figure 4: Annotation tool interface for standard questions. The annotator views the video, reads the question with six options (A–F), and rates question clarity, answer correctness, and timestamp correctness on a 1–4 Likert scale. The full rubric is shown inline beneath each radio group so raters never have to recall the scale. Metadata shows answer timestamp, modality, and question category.
Figure 5
Figure 5: Annotation tool interface for misleading questions. In addition to the three 1–4 scales used for standard items, the tool displays the misleading category and a human-readable description of the detail swap. The annotator additionally flags whether the misleading premise is valid (Yes/No).
Figure 6
Figure 6: Accuracy by video duration bin, per split. Standard accuracy degrades with duration.
Figure 7
Figure 7: Accuracy by answer evidence position ratio. Misleading detection is independent of answer evidence position.
Figure 8
Figure 8: Example 1. Frames 1–5: a man in a dark industrial setting approaches machinery. Frames 6–10: a creature on a monitor wears a yellow hard hat (red boxes). Visual swap: hard-hat colour (yellow → blue). Audio swap: music style (late-1980s video games → 1950s jazz club).
Figure 9
Figure 9: Example 2. Frames 1–5: a car drives at night. Frames 6–10: interior shot of a man and a woman; red boxes on frames 8–9 mark the brown suit jacket, a small-region detail that makes even the standard variant challenging. Visual swap: jacket colour (brown → blue). Audio insertion: a fabricated “loud and angry tone” is added to the woman’s speech, though she actually speaks softly.
read the original abstract

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

Empirical benchmark study with no circular derivation steps

full rationale

The paper introduces the IMAVB benchmark and reports empirical measurements of hidden-state probe accuracy versus output rejection rates across models. No equations, fitted parameters, or self-referential definitions are used; the Representation-Action Gap is presented as a direct observational result from applying probes to existing models on curated clips. The 2x2 design and PGLA intervention are experimental interventions, not derivations that reduce to their own inputs by construction. This is a standard empirical analysis with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that probe-detected hidden-state signals constitute functional encoding of mismatches and that the IMAVB clips provide an unbiased test of grounding.

axioms (1)
  • domain assumption Probe-based detection of hidden-state mismatches accurately reflects functional encoding of premise-perception conflicts.
    The paper interprets hidden-state probe results as evidence that perception occurs, which underpins the claim that the bottleneck is in translation rather than perception.
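
One standard way to stress this axiom, not taken from the paper, is a shuffled-label control, a simplified analogue of the control tasks of Hewitt and Liang [44]: if a probe trained on randomly permuted labels scores nearly as well, high probe accuracy reflects probe capacity rather than what the hidden states encode. A minimal sketch, with placeholder file paths:

```python
# Sketch of a shuffled-label control for the probe (assumption check, not from
# the paper). If selectivity (real minus control accuracy) is small, high probe
# accuracy says more about probe capacity than about what hidden states encode.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

hidden_states = np.load("hidden_states.npy")   # placeholder path
labels = np.load("premise_labels.npy")         # placeholder path

control_labels = np.random.default_rng(0).permutation(labels)  # random relabeling

selectivity = probe_accuracy(hidden_states, labels) - probe_accuracy(hidden_states, control_labels)
print(f"selectivity = {selectivity:.3f}")
```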

pith-pipeline@v0.9.0 · 5584 in / 1236 out tokens · 43761 ms · 2026-05-14T18:03:22.061288+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

99 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2026

  2. [2]

    A new era of intelligence with Gemini 3, 2026

    Google Blog. A new era of intelligence with Gemini 3, 2026. URL https://blog.google/products-and-platforms/products/gemini/gemini-3/

  3. [3]

    Qwen3-omni technical report, 2025

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, et al. Qwen3-omni technical report, 2025

  4. [4]

    The internal state of an llm knows when it’s lying, 2023

    Amos Azaria and Tom M. Mitchell. The internal state of an llm knows when it’s lying, 2023

  5. [5]

    Discovering latent knowledge in language models without supervision, 2022

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022

  6. [6]

    The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2023

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2023

  7. [7]

    Representation engineering: A top-down approach to ai transparency, 2023

    Andy Zou, Long Phan, Sarah Chen, James Campbell, et al. Representation engineering: A top-down approach to ai transparency, 2023

  8. [8]

    Inference-time intervention: Eliciting truthful answers from a language model, 2023

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023

  9. [9]

    LLMs know more than they show: On the intrinsic representation of LLM hallucinations, 2024

    Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLMs know more than they show: On the intrinsic representation of LLM hallucinations, 2024

  10. [10]

    Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

    Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, and Kurt Stockinger. Arbitration failure, not perceptual blindness: How vision-language models resolve visual-linguistic conflicts. arXiv preprint arXiv:2604.09364, 2026

  11. [11]

    Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

    Haruka Kawasaki, Ryota Tanaka, and Kyosuke Nishida. Responses fall short of understanding: Revealing the gap between internal representations and responses in visual document understanding. arXiv preprint arXiv:2604.04411, 2026

  12. [12]

    Diagnosing knowledge conflict in multimodal long-chain reasoning. arXiv preprint arXiv:2602.14518, 2026

    Jing Tang, Kun Wang, Haolang Lu, Hongjin Chen, KaiTao Chen, Zhongxiang Sun, Qiankun Li, Lingjuan Lyu, Guoshun Nan, and Zhigang Zeng. Diagnosing knowledge conflict in multimodal long-chain reasoning. arXiv preprint arXiv:2602.14518, 2026

  13. [13]

    Crepe: Open-domain question answering with false presuppositions

    Xinyan Yu, Sewon Min, Luke Zettlemoyer, and Hannaneh Hajishirzi. Crepe: Open-domain question answering with false presuppositions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10457–10480, 2023

  14. [14]

    (QA)²: Question answering with questionable assumptions

    Najoung Kim, Phu Mon Htut, Samuel R. Bowman, and Jackson Petty. (QA)²: Question answering with questionable assumptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8466–8487, 2023

  15. [15]

    Discovering language model behaviors with model-written evaluations

    Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, 2023

  16. [16]

    Towards understanding sycophancy in language models, 2023

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models, 2023

  17. [17]

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. Simple synthetic data reduces sycophancy in large language models, 2023

  18. [18]

    Have the vlms lost confidence? a study of sycophancy in vlms, 2024

    Shuo Li, Tao Ji, Xiaoran Fan, Linsheng Lu, Leyi Yang, Yuming Yang, Zhiheng Xi, Rui Zheng, Yuran Wang, Xiaohui Zhao, Tao Gui, Qi Zhang, and Xuanjing Huang. Have the vlms lost confidence? a study of sycophancy in vlms, 2024

  19. [19]

    Sycophancy in vision-language models: A systematic analysis and an inference-time mitigation framework, 2024

    Yunpu Zhao, Rui Zhang, Junbin Xiao, et al. Sycophancy in vision-language models: A systematic analysis and an inference-time mitigation framework, 2024

  20. [20]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  21. [21]

    Mohobench: Assessing honesty of multimodal large language models via unanswerable visual questions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 2026

    Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, and Xing Xie. Mohobench: Assessing honesty of multimodal large language models via unanswerable visual questions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 2026

  22. [22]

    Mvbench: A comprehensive multi-modal video understanding benchmark, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2023

  23. [23]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023

    Munan Ning, Bin Zhu, Yujia Xie, et al. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023

  24. [24]

    MMAU: A massive multi-task audio understanding and reasoning benchmark

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TeVAZXr3yv

  25. [25]

    Air-bench: Benchmarking large audio-language models via generative comprehension

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

  26. [26]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

  27. [27]

    Omnibench: Towards the future of universal omni-language models, 2024

    Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, et al. Omnibench: Towards the future of universal omni-language models, 2024

  28. [28]

    Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

    Ziwei Zhou, Rui Wang, Zuxuan Wu, and Yu-Gang Jiang. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

  29. [29]

    Worldsense: Evaluating real-world omnimodal understanding for multimodal llms, 2025

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms, 2025

  30. [30]

    Avhbench: A cross-modal hallucination benchmark for audio-visual large language models, 2024

    Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. Avhbench: A cross-modal hallucination benchmark for audio-visual large language models, 2024

  31. [31]

    Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding

    Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, and Ziwei Liu. Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20626–20636, 2025

  32. [32]

    Binge society (youtube channel), 2026

    Binge Society. Binge society (youtube channel), 2026. URL https://www.youtube.com/@bingesociety

  33. [33]

    Boxoffice movie scenes (youtube channel), 2026

    Boxoffice Movie Scenes. Boxoffice movie scenes (youtube channel), 2026. URL https://www.youtube.com/@BoxofficeMoviesScenes

  34. [34]

    Condensed movies: Story based retrieval with contextual embeddings

    Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. Condensed movies: Story based retrieval with contextual embeddings. In Proceedings of the Asian Conference on Computer Vision, 2020

  35. [35]

    Ola: Pushing the frontiers of omni-modal language model. arXiv preprint arXiv:2502.04328, 2025

    Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model. arXiv preprint arXiv:2502.04328, 2025

  36. [36]

    Omnivinci: Enhancing architecture and data for omni-modal understanding llm, 2025

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, et al. Omnivinci: Enhancing architecture and data for omni-modal understanding llm, 2025

  37. [37]

    Qwen2.5-omni technical report, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, et al. Qwen2.5-omni technical report, 2025

  38. [38]

    Minicpm-v: A gpt-4v level mllm on your phone, 2024

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, et al. Minicpm-v: A gpt-4v level mllm on your phone, 2024

  39. [39]

    Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data, 2025

    Yunxin Li, Xinyu Chen, Shenyuan Jiang, et al. Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data, 2025

  40. [40]

    Baichuan-omni-1.5 technical report, 2025

    Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report, 2025

  41. [41]

    video-salmonn 2: Caption-enhanced audio-visual large language models, 2025

    Changli Tang, Yixuan Li, Yudong Yang, et al. video-salmonn 2: Caption-enhanced audio-visual large language models, 2025

  42. [42]

    LMMs-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025. doi: 10.18653/v1/2025.findings-naacl.51. URL https://aclan...

  43. [43]

    Eliciting latent predictions from transformers with the tuned lens, 2023

    Nora Belrose, Igor Ostrovsky, Lev McKinney, et al. Eliciting latent predictions from transformers with the tuned lens, 2023

  44. [44]

    Designing and interpreting probes with control tasks, 2019

    John Hewitt and Percy Liang. Designing and interpreting probes with control tasks, 2019

  45. [45]

    Amnesic probing: Behavioral explanation with amnesic counterfactuals, 2020

    Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals, 2020

  46. [46]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024

  47. [47]

    Avcd: Mitigating hallucinations in audio-visual large language models through contrastive decoding, 2025

    Chaeyoung Jung, Youngjoon Jang, and Joon Son Chung. Avcd: Mitigating hallucinations in audio-visual large language models through contrastive decoding, 2025

  48. [48]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024

  49. [49]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  50. [50]

    The revolution of multimodal large language models: A survey, 2024

    Davide Caffagni, Federico Cocchi, Luca Barsellotti, et al. The revolution of multimodal large language models: A survey, 2024

  51. [51]

    VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs, 2024

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs, 2024

  52. [52]

    Connector-s: A survey of connectors in multi-modal large language models

    Xun Zhu, Zheng Zhang, Xi Chen, Yiming Shi, Miao Li, and Ji Wu. Connector-s: A survey of connectors in multi-modal large language models. In Proceedings of IJCAI-25, 2025

  53. [53]

    Modality laziness: Everybody’s business is nobody’s business, 2022

    Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Yue Wang, Yang Yuan, and Hang Zhao. Modality laziness: Everybody’s business is nobody’s business, 2022. URL https://openreview.net/forum?id=1eGFH6yYAJn

  54. [54]

    The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024

    Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, et al. The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio, 2024

  55. [55]

    A survey of hallucination in large foundation models, 2023

    Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models, 2023

  56. [56]

    The troubling emergence of hallucination in large language models: An extensive definition, quantification, and prescriptive remediations

    Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, et al. The troubling emergence of hallucination in large language models: An extensive definition, quantification, and prescriptive remediations. In EMNLP, 2023

  57. [57]

    Cross-modal information flow in multimodal large language models, 2024

    Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models, 2024

  58. [58]

    Diagnosing and mitigating modality interference in multimodal large language models, 2025

    Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, and Zhe Zhao. Diagnosing and mitigating modality interference in multimodal large language models, 2025

  59. [59]

    Mllms are deeply affected by modality bias, 2025

    Xu Zheng, Chenfei Liao, Yuqian Fu, et al. Mllms are deeply affected by modality bias, 2025

  60. [60]

    Assessing modality bias in video question answering benchmarks with multimodal large language models, 2024

    Jean Park, Kuk Jin Jang, Basam Alasaly, et al. Assessing modality bias in video question answering benchmarks with multimodal large language models, 2024

  61. [61]

    Benchmarking gaslighting negation attacks against multimodal large language models, 2025

    Bin Zhu, Yinxuan Gui, Huiyan Qi, Jingjing Chen, Chong-Wah Ngo, and Ee-Peng Lim. Benchmarking gaslighting negation attacks against multimodal large language models, 2025

  62. [62]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

  63. [63]

    Mitigating hallucinations in large vision-language models with instruction contrastive decoding

    Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In Findings of ACL, 2024

  64. [64]

    Fork-merge decoding: Enhancing multimodal understanding in audio-visual large language models. arXiv preprint arXiv:2505.20873, 2025

    Chaeyoung Jung, Youngjoon Jang, Jongmin Choi, and Joon Son Chung. Fork-merge decoding: Enhancing multimodal understanding in audio-visual large language models. arXiv preprint arXiv:2505.20873, 2025

  65. [65]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024

  66. [66]

    Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2024

    Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, and Peilin Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models, 2024

  67. [67]

    Dola: Decoding by contrasting layers improves factuality in large language models, 2023

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models, 2023

  68. [68]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024

  69. [69]

    Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S. Ryoo. Understanding long videos with multimodal language models, 2024

  70. [70]

    From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding, 2024

    Heqing Zou, Tianze Luo, et al. From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding, 2024

  71. [71]

    Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

  72. [72]

    How to protect yourself from 5g radiation? investigating llm responses to implicit misinformation

    Ruohao Guo, Wei Xu, and Alan Ritter. How to protect yourself from 5g radiation? investigating llm responses to implicit misinformation. In EMNLP, 2025

  73. [73]

    Multihoax: A dataset of multi-hop false-premise questions

    Mohammadamin Shafiei, Hamidreza Saffari, and Nafise Sadat Moosavi. Multihoax: A dataset of multi-hop false-premise questions. In Findings of ACL, 2025

  74. [74]

    Exploring response uncertainty in MLLMs: An empirical evaluation under misleading scenarios, 2024

    Yunkai Dang, Mengxi Gao, Yibo Yan, et al. Exploring response uncertainty in mllms: An empirical evaluation under misleading scenarios, 2024


Showing first 74 references.