pith. machine review for the scientific record.

arxiv: 2604.17375 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.AI

Recognition: unknown

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · hallucination · text overlay · benchmark · mixture of experts · video understanding · multimodal models · disentanglement

The pith

When on-screen text contradicts video visuals, existing vision-language models systematically hallucinate by favoring the text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models prioritize embedded on-screen text over actual video content whenever the two conflict, producing what the authors call text overlay-induced hallucination. This issue matters for any video understanding task that involves real-world footage with captions, labels, or subtitles, because it can make model outputs unreliable even when the underlying visuals are clear. To measure the problem, the authors build VisualTextTrap, a benchmark of 6,057 human-validated video samples drawn from public datasets and annotated across temporal, action, object, and spatial dimensions on a five-level hallucination scale. They then introduce VTHM-MoE, a mixture-of-experts model that uses a dual-encoder setup and four pre-trained dimension-specific experts together with adaptive token routing to detect and override text-vision contradictions. If these claims hold, models can gain resistance to the failure mode while keeping their accuracy on clean videos.

Core claim

When embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. The authors define this as Text Overlay-Induced Hallucination (TOIH). They construct VisualTextTrap, the first comprehensive benchmark for this failure mode, from public datasets via a hybrid VLM-assisted generation pipeline followed by manual verification, yielding 6,057 samples annotated on 88 fine-grained attributes across four dimensions, with hallucination intensity scored L1 to L5. They further propose VTHM-MoE, a Vision-Text Disentanglement framework built on a dual-encoder architecture whose four dimension-specialized expert modules (Temporal, Action, Object, and Spatial) are pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content; an Adaptive Token Routing Strategy then allocates experts dynamically, conferring resistance to TOIH while preserving performance on uncontaminated videos.
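
To make the benchmark's structure concrete, here is a minimal sketch of what a single VisualTextTrap-style record could look like. Only the four dimensions, the 88-attribute annotation scheme, and the L1-L5 intensity scale come from the paper; every field name, the example attribute, and the sample values below are illustrative assumptions, not the authors' schema.

```python
# Hypothetical sketch of a VisualTextTrap-style record; field names are assumptions.
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class HallucinationIntensity(IntEnum):
    """Five-level semantic contradiction between overlay text and visual reality (L1-L5)."""
    L1 = 1  # mild mismatch
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5  # direct, unambiguous contradiction


@dataclass
class VisualTextTrapSample:
    video_id: str                      # source clip from a public dataset
    dimension: str                     # one of: "temporal", "action", "object", "spatial"
    attribute: str                     # one of the 88 fine-grained attributes, e.g. "Action Direction"
    overlay_text: str                  # generated text burned onto the frames
    question: str                      # video QA question probing the targeted attribute
    options: List[str]                 # candidate answers (truthful and hallucinated descriptions)
    correct_option: int                # index of the visually grounded answer
    intensity: HallucinationIntensity  # L1-L5 contradiction level
    human_verified: bool = True        # released samples passed manual verification


# Invented example, for illustration only.
sample = VisualTextTrapSample(
    video_id="clip_000123",
    dimension="action",
    attribute="Action Direction",
    overlay_text="The player passes the ball to the left.",
    question="In which direction is the ball passed?",
    options=["to the left", "to the right"],
    correct_option=1,                  # the visuals show a pass to the right
    intensity=HallucinationIntensity.L4,
)
```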

What carries the argument

VTHM-MoE, a dual-encoder mixture-of-experts framework with four pre-trained dimension-specialized experts for temporal, action, object, and spatial reasoning plus an adaptive token routing strategy that routes inputs to experts best able to resolve text-vision discrepancies.
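
As a rough illustration of the kind of machinery this describes, the sketch below routes fused vision-text tokens to a small set of dimension experts with top-k gating. It is a generic mixture-of-experts pattern written in PyTorch under stated assumptions; the module names, dimensions, and gating rule are not the authors' VTHM-MoE implementation.

```python
# Minimal, hypothetical sketch of dual-stream token routing over four
# dimension-specialized experts (temporal, action, object, spatial).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DimensionExpert(nn.Module):
    """One expert: a small feed-forward block over fused vision-text tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TokenRoutedMoE(nn.Module):
    """Routes each fused token to the top-k experts judged most relevant to the
    text-vision discrepancy sensed by the router (an assumed gating rule)."""
    def __init__(self, dim: int = 512, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([DimensionExpert(dim) for _ in range(num_experts)])
        self.router = nn.Linear(2 * dim, num_experts)  # sees both streams to sense conflict
        self.fuse = nn.Linear(2 * dim, dim)
        self.top_k = top_k

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens, text_tokens: (batch, seq, dim) from the two parallel encoders
        joint = torch.cat([visual_tokens, text_tokens], dim=-1)
        fused = self.fuse(joint)
        weights, idx = self.router(joint).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(fused)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = (idx[..., k] == e).unsqueeze(-1).float()  # tokens sent to expert e
                out = out + routed * weights[..., k : k + 1] * expert(fused)
        return out


moe = TokenRoutedMoE()
vis = torch.randn(2, 32, 512)   # stand-in tokens from the visual encoder
txt = torch.randn(2, 32, 512)   # stand-in tokens from the OCR/text encoder
print(moe(vis, txt).shape)      # torch.Size([2, 32, 512])
```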

If this is right

  • Standard VLMs will continue to produce incorrect answers on video QA whenever overlay text conflicts with visuals.
  • VTHM-MoE maintains accuracy on videos that lack conflicting text.
  • The adaptive routing strategy allows dynamic allocation of expert modules based on detected cross-modal discrepancies.
  • Performance gains appear across temporal, action, object, and spatial reasoning tasks in the presence of text overlays.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same text-vision conflict could appear in static image tasks with overlaid captions, suggesting the benchmark approach might transfer beyond video.
  • Real-world systems that process news footage or instructional videos could incorporate similar expert routing to reduce errors in live settings.
  • Extending the routing mechanism to handle text that changes over time might address more dynamic overlay scenarios the current benchmark does not test.

Load-bearing premise

That the hybrid VLM-assisted text generation, followed by manual verification, yields samples that faithfully represent real-world contradictions, and that the four dimension-specialized experts generalize robustly to new videos without overfitting.

What would settle it

Run both baseline VLMs and the VTHM-MoE model on a fresh collection of videos containing deliberately created on-screen text that directly contradicts the visible scene, then measure whether hallucination rates remain high for baselines but drop for VTHM-MoE.
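
A minimal harness for that test might look like the sketch below: burn a contradictory caption onto held-out clips, query a model on the clean and contaminated versions, and report how often the answer flips toward the overlay. The `model.ask` interface, the sample fields, and the flip-rate definition are assumptions for illustration, not the paper's protocol; only the overall comparison mirrors the test described above.

```python
# Hypothetical harness: compare answer flips on clean vs. text-contaminated clips.
import cv2  # OpenCV, used here only to burn overlay text onto frames


def add_contradictory_overlay(frames, text):
    """Return copies of the frames with the contradictory caption rendered near the bottom."""
    out = []
    for frame in frames:
        f = frame.copy()
        h, _ = f.shape[:2]
        cv2.putText(f, text, (10, h - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
        out.append(f)
    return out


def flip_rate(model, samples):
    """Fraction of samples where the answer is correct on the clean clip but switches
    to the overlay-supported answer once the contradictory text is added."""
    flipped = 0
    for s in samples:  # each sample: frames, question, overlay_text, visual_answer, text_answer
        clean = model.ask(s["frames"], s["question"])
        dirty = model.ask(add_contradictory_overlay(s["frames"], s["overlay_text"]), s["question"])
        if clean == s["visual_answer"] and dirty == s["text_answer"]:
            flipped += 1
    return flipped / max(len(samples), 1)


# Expected pattern if the paper's claims hold (placeholders, not reported results):
# flip_rate(baseline_vlm, fresh_samples) stays high; flip_rate(vthm_moe, fresh_samples)
# drops substantially while clean-video accuracy is preserved.
```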

Figures

Figures reproduced from arXiv: 2604.17375 by Cui Yakun, Sirui Han, TianTian Geng, Xingqun Qi, Yike Guo, Yuyao Zhang.

Figure 1. Illustration of Text Overlay-Induced Hallucination (TOIH) across four evaluation dimensions and its mitigation.
Figure 2. Overview of the VisualTextTrap benchmark. Distribution of hallucination (Halluc.) attributes (a), dataset sources and …
Figure 3. Overview of VTHM-MoE. Module (a) encodes video via parallel OCR and Visual encoders, producing …
Figure 4. Qualitative analysis of VTHM-MoE's routing behavior across four TOIH scenarios. For each case (a-d), we present …
Figure 5. Accuracy across LLaVA-Video [70], VideoMME [13] and TemporalBench [6] datasets (389 valid groups containing all three conditions). Each column corresponds to an attribute: (1) Body Part; (2) Action Direction; (3) Motion Traj.; (4) Action Result; (5) Rel. Position; (6) Spec. Verb; (7) Step Order; (8) Interaction; (9) Appear Order. Columns are sorted by Text-Contradictory Δ in descending order. T-Fr = Text…
Figure 6. Accuracy comparison under Neutral (Text-Free) …
Figure 8. Overview of the data construction pipeline for …
Figure 9. Accuracy on paired Positive/Negative video samples. Positive: original videos with no overlay text. Negative: same …
Figure 11. Data distribution statistics of the training set. (a) …
Figure 12. Distribution of inconsistency label argmax across …
Figure 13. A sample from the Action Perceptual-tier subset. The overlay text introduces a conflict on the …
Figure 14. A sample from the Action Semantic-tier subset. The overlay text targets the …
Figure 15. A sample from the Spatial Semantic-tier subset. The overlay text conflicts on the …
Figure 16. A sample from the Spatial Perceptual-tier subset. The overlay text targets the …
Figure 17. A sample from the Action Reasoning-tier subset. The overlay text introduces a conflict on the …
Figure 18. A sample from the Temporal Perceptual-tier subset. The overlay text targets the …
Figure 19. A sample from the Spatial Reasoning-tier subset. The overlay text conflicts on the …
Figure 20. Distribution of fine-grained video understanding …
Figure 21. Chain-of-Thought for TOIH Mitigation.
Figure 23. Overall VQA accuracy on three benchmarks …
Figure 22. Ablation study on the number of selected patches …
Figure 24. Ablation study on loss coefficients. (a) …
Figure 25. Prompt template used to classify each video question into one of four fine-grained evaluation dimensions via …
Figure 26. Prompt template used to generate action-level hallucinated descriptions for incorrect options via Claude-Sonnet4.6.
Figure 27. Prompt template used to generate truthful descriptions for correct options via Claude-Sonnet4.6. The model combines …
Figure 28. Prompt template used to generate object-level hallucinated descriptions for incorrect options via Claude-Sonnet4.6.
Figure 29. Prompt template used to generate truthful object descriptions for correct options via Claude-Sonnet4.6. The model …
Figure 30. Prompt template used to generate spatial-level hallucinated descriptions for incorrect options via Claude-Sonnet4.6.
Figure 31. Prompt template used to generate truthful spatial descriptions for correct options via Claude-Sonnet4.6. The model …
Figure 32. Prompt template used to generate temporal-level hallucinated descriptions for incorrect options via Claude-Sonnet4.6.
Figure 33. Prompt template used to generate truthful temporal descriptions for correct options via Claude-Sonnet4.6. The …
Figure 34. Prompt template used to evaluate the semantic conflict between the correct answer and a hallucinated description …
Figure 35. Prompt template for estimating the multi-dimensional information ratio needed to answer a video question with a …
read the original abstract

Recent advances in Vision-Language Models (VLMs) have substantially enhanced their ability across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark, including large-scale human-validated samples with specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely-used public datasets using a scalable hybrid pipeline of VLMs assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1--L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments conducted on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, outperforming state-of-the-art counterparts with diverse video question answering tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies Text Overlay-Induced Hallucination (TOIH) as a vulnerability in VLMs, where on-screen text contradicting visual content in videos causes models to prioritize textual semantics over visual evidence. It introduces VisualTextTrap, a benchmark of 6,057 human-validated samples constructed from public datasets via a hybrid VLM-assisted text generation pipeline followed by manual verification; samples are annotated across four dimensions (Temporal, Action, Object, Spatial) with 88 fine-grained attributes and hallucination intensity on an L1-L5 scale. The authors also propose VTHM-MoE, a dual-encoder Mixture-of-Experts architecture with four dimension-specialized experts pre-trained on cross-modal discrepancies and an Adaptive Token Routing Strategy for dynamic allocation. Experiments on the benchmark are claimed to show that VTHM-MoE outperforms state-of-the-art methods on video QA tasks while preserving performance on uncontaminated videos.

Significance. If the central claims hold, the work addresses a practically relevant robustness gap in VLMs for real-world video understanding involving text overlays (e.g., subtitles, signs, or captions). The scale of the benchmark, its multi-dimensional structure, and the intensity scaling provide a useful evaluation resource. The proposed disentanglement framework via specialized experts and routing is a concrete architectural contribution. Credit is due for the human-validated scale and the explicit focus on preserving clean-video performance.

major comments (2)
  1. [Benchmark construction pipeline] Benchmark construction pipeline (described in the abstract and methods): the hybrid VLM-assisted text generation step risks introducing model-specific biases into the contradiction samples, as the assisting VLMs may share training data or failure modes with the evaluated models. This directly affects the load-bearing claim that VLMs 'systematically' hallucinate on independent real-world overlays. Manual verification is noted but lacks reported inter-annotator agreement, blinding protocols, or quantitative naturalness metrics, leaving open whether the 6,057 samples (and L1-L5 intensities) faithfully represent unbiased contradictions.
  2. [Experiments] Experiments section: the abstract asserts 'extensive experiments' and outperformance on 'diverse video question answering tasks' with no quantitative results, error bars, ablation tables, or baseline details provided, even at a high level. Without these, it is impossible to verify the claimed superiority of VTHM-MoE or the robustness of the four-expert design and Adaptive Token Routing Strategy beyond the benchmark samples.
minor comments (2)
  1. [Benchmark description] The L1-L5 intensity scale is introduced but would benefit from an explicit table or equation defining the semantic contradiction thresholds for each level.
  2. [Benchmark construction] Ensure all source public datasets are explicitly cited with version or split information in the benchmark construction section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the detailed and constructive feedback. We agree that clarifying the benchmark construction process and providing more explicit experimental details will improve the manuscript. We address each major comment below and will incorporate the suggested revisions.

read point-by-point responses
  1. Referee: [Benchmark construction pipeline] Benchmark construction pipeline (described in the abstract and methods): the hybrid VLM-assisted text generation step risks introducing model-specific biases into the contradiction samples, as the assisting VLMs may share training data or failure modes with the evaluated models. This directly affects the load-bearing claim that VLMs 'systematically' hallucinate on independent real-world overlays. Manual verification is noted but lacks reported inter-annotator agreement, blinding protocols, or quantitative naturalness metrics, leaving open whether the 6,057 samples (and L1-L5 intensities) faithfully represent unbiased contradictions.

    Authors: We acknowledge the referee's valid concern about potential biases from VLM-assisted generation. In the revised version, we will expand the methods section to specify the exact VLMs employed, the diversification steps across multiple models and datasets to minimize shared failure modes, and the selection criteria for public video sources. For the manual verification, we will add inter-annotator agreement metrics (e.g., Fleiss' kappa; an agreement computation of this kind is sketched after this exchange), describe the blinding and independent annotation protocols, and include quantitative naturalness and realism scores from human raters. These changes will more rigorously support the benchmark's validity and the systematic nature of TOIH. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract asserts 'extensive experiments' and outperformance on 'diverse video question answering tasks' with no quantitative results, error bars, ablation tables, or baseline details provided, even at a high level. Without these, it is impossible to verify the claimed superiority of VTHM-MoE or the robustness of the four-expert design and Adaptive Token Routing Strategy beyond the benchmark samples.

    Authors: We agree that the current presentation would benefit from greater transparency in reporting. The experiments section contains comparative results on video QA tasks, but we will revise it to include full quantitative tables with mean performance and standard error bars over multiple seeds, detailed ablation studies isolating the contribution of each dimension-specific expert and the adaptive routing mechanism, and explicit baseline implementations. We will also update the abstract with key numerical highlights (e.g., accuracy gains on VisualTextTrap while preserving clean-video performance). These additions will enable direct verification of the claims. revision: yes
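
Rebuttal point 1 promises inter-annotator agreement on the L1-L5 intensity labels. As a concrete illustration of what such a report could contain, here is a minimal Fleiss' kappa computation; the toy rating matrix is invented for demonstration, and only the metric itself is standard.

```python
# Minimal Fleiss' kappa over L1-L5 intensity labels from multiple annotators.
import numpy as np


def fleiss_kappa(ratings: np.ndarray) -> float:
    """ratings[i, j] = number of annotators who assigned category j to item i
    (assumes the same number of raters per item)."""
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]
    p_j = ratings.sum(axis=0) / (n_items * n_raters)                  # category proportions
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)


# Three hypothetical annotators scoring five samples on the L1-L5 scale (columns = L1..L5).
toy = np.array([
    [0, 0, 3, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 0, 0, 3, 0],
    [3, 0, 0, 0, 0],
    [0, 0, 1, 2, 0],
])
print(round(fleiss_kappa(toy), 3))  # moderate agreement on this invented matrix
```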

Circularity Check

0 steps flagged

No circularity detected; benchmark and model are independently constructed

full rationale

The paper defines TOIH as a new phenomenon, constructs VisualTextTrap from public datasets via a hybrid VLM-assisted generation pipeline plus manual verification to produce 6,057 human-validated samples, and introduces VTHM-MoE as a novel dual-encoder architecture with four pre-trained dimension-specialized experts and adaptive token routing. No equations, fitted parameters, or self-citations are shown that reduce the central claims or performance results to the inputs by construction. The evaluation is presented as external testing on the new benchmark rather than a tautological loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is incomplete. The work relies on standard VLM evaluation assumptions and introduces new components (dimension experts, adaptive routing) without detailing underlying mathematical axioms or fitted parameters.

pith-pipeline@v0.9.0 · 5619 in / 1239 out tokens · 39700 ms · 2026-05-10T06:37:43.521359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

88 extracted references · 33 canonical work pages · 9 internal anchors

  1. [1]

    Rana AlShaikh, Norah Al-Malki, and Maida Almasre. 2024. The implementation of the cognitive theory of multimedia learning in the design and evaluation of an AI educational video assistant utilizing large language models.Heliyon10, 3 (2024), e25361. doi:10.1016/j.heliyon.2024.e25361

  2. [2]

    Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and Michael L Littman. 2019. Deep reinforcement learning from policy-dependent human feedback.arXiv preprint arXiv:1902.04257(2019)

  3. [3]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

  4. [4]

    Luca M. Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. 2024. Visual cognition in multimodal large language models. arXiv:2311.16093 [cs.LG] https://arxiv.org/abs/2311.16093

  5. [5]

    Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, and Rita Cucchiara. 2025. Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models.arXiv preprint arXiv:2512.15885(2025)

  6. [6]

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. 2024. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818 (2024)

  7. [7]

    Christel Chappuis, Valérie Zermatten, Sylvain Lobry, Bertrand Le Saux, and Devis Tuia. 2022. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1372–1381

  8. [8]

    Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, and Lu Cheng. 2025. MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs. arXiv:2511.14159 [cs.CV] https://arxiv.org/abs/2511.14159

  9. [9]

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. 2024. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476(2024)

  10. [10]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  11. [11]

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. 2024. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding.Advances in Neural Information Processing Systems 37 (2024), 89098–89124

  12. [12]

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. 2024. Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. arXiv:2501.03230 [cs.AI] https://arxiv.org/abs/2501.03230

  13. [13]

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. 2025. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24108–24118

  14. [14]

    Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, and Qingming Huang. 2025. Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation. arXiv:2503.19622 [cs.CV] https://arxiv.org/abs/2503.19622

  15. [15]

    Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, and Guangtao Zhai. 2024. LMM-VQA: Advancing Video Quality Assessment With Large Multimodal Models. IEEE Transactions on Circuits and Systems for Video Technology 35 (2024), 11083–11096. https://api.semanticscholar.org/CorpusID:271956910

  16. [16]

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. 2024. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognitio...

  17. [17]

    Iryna Hartsock and Ghulam Rasool. 2024. Vision-language models for medical report generation and visual question answering: A review.Frontiers in artificial intelligence7 (2024), 1430984

  18. [18]

    Lingyi Hong, Wenchao Chen, Zhongying Liu, Wei Zhang, Pinxue Guo, Zhaoyu Chen, and Wenqiang Zhang. 2023. Lvos: A benchmark for long-term video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13480–13492

  19. [19]

    Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, and Jian Yang. 2025. Exploiting multimodal spatial-temporal patterns for video object tracking. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3581–3589

  20. [20]

    Raisa Islam and Owana Marzia Moushi. 2025. Gpt-4o: The cutting-edge advancement in multimodal llm. In Intelligent Computing-Proceedings of the Computing Conference. Springer, 47–60

  21. [21]

    Nihal Jain, Dejiao Zhang, Wasi Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, et al. 2023. ContraCLM: Contrastive learning for causal language model. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6436–6459

  22. [22]

    Ziheng Jia, Zicheng Zhang, Xiaorong Zhu, Chunyi Li, Jinliang Han, Xiaohong Liu, Guangtao Zhai, and Xiongkuo Min. 2026. Scaling-up perceptual video quality assessment. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 22292–22300

  23. [23]

    Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, and Shikun Zhang. 2024. Hallucination augmented contrastive learning for multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 27036–27046

  24. [24]

    Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. 2024. A survey of reinforcement learning from human feedback.Transactions on Machine Learning Research(2024)

  25. [25]

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. 2023. RLAIF: Scaling reinforcement learning from human feedback with AI feedback.

  27. [27]

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2024. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13872–13882

  28. [28]

    Chaoyu Li, Eun Woo Im, and Pooyan Fazli. 2025. VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding. arXiv:2412.03735 [cs.CV] https://arxiv.org/abs/2412.03735

  29. [29]

    Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, and Si Liu. 2025. Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8592–8603

  30. [30]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. 2024. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22195–22206

  31. [31]

    Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori B Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023. Contrastive decoding: Open-ended text generation as optimization. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers). 12286–12312

  32. [32]

    Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, and Mike Zheng Shou. 2024. Parrot Captions Teach CLIP to Spot Text. arXiv:2312.14232 [cs.CV] https://arxiv.org/abs/2312.14232

  33. [33]

    Dan Liu, Fanrong Meng, Jinpeng Mi, Mao Ye, Qingdu Li, and Jianwei Zhang. 2025. SAM-Net: Semantic-assisted multimodal network for action recognition in RGB-D videos. Pattern Recognition 168 (2025), 111725

  35. [35]

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. 2024. Tempcompass: Do video llms really understand videos?. InFindings of the Association for Computational Linguistics: ACL 2024. 8731–8772

  36. [36]

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. 2025. Videomind: A chain-of-lora agent for long video reasoning.arXiv e-prints(2025), arXiv–2503

  37. [37]

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. 2026. Textmonkey: An ocr-free large multimodal model for understanding document.IEEE Transactions on Pattern Analysis and Machine Intelligence(2026)

  38. [38]

    Shichen Lu, Tongtian Yue, Longteng Guo, Handong Li, Xingjian He, Si Liu, and Jing Liu. 2025. ViPE: Visual Perception in Parameter Space for Efficient Video- Language Understanding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 17775–17786

  39. [39]

    Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. 2024. Moelora: Contrastive learning guided mixture of experts on parameter- efficient fine-tuning for large language models.arXiv preprint arXiv:2402.12851 (2024)

  40. [40]

    Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, and Junnan Li. 2025. VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation. arXiv:2411.13281 [cs.CV] https://arxiv.org/abs/2411.13281

  41. [41]

    Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. 2025. Argus: Vision-centric reasoning with grounded chain-of-thought. InProceedings of the Computer Vision and Pattern Recognition Conference. 14268–14280

  42. [42]

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. 2025. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint (2025)

  43. [43]

    Ali Najafi and Onur Varol. 2024. Turkishbertweet: Fast and reliable large language model for social media analysis.Expert Systems with Applications255 (2024), 124737

  44. [44]

    Rayyan Najam and Safiullah Faizullah. 2023. Analysis of recent deep learning techniques for Arabic handwritten-text OCR and post-OCR correction.Applied Sciences13, 13 (2023), 7568

  45. [45]

    Heinrich Peters, Sandra C Matz, Michele Gelfand, et al . 2024. Large language models can infer psychological dispositions of social media users.PNAS nexus3, 6 (2024), pgae231

  46. [46]

    Nhi Pham and Michael Schott. 2024. H-pope: Hierarchical polling-based probing evaluation of hallucinations in large vision-language models.arXiv preprint arXiv:2411.04077(2024)

  47. [47]

    Manish Sanwal. 2024. Evaluating Large Language Models Using Contrast Sets: An Experimental Approach.arXiv preprint arXiv:2404.01569(2024)

  48. [48]

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. 2024. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems 37 (2024), 8612–8642

  49. [49]

    Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. 2023. Prompting large language models with answer heuristics for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 14974–14983

  50. [50]

    Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, et al. 2025. Mme-videoocr: Evaluating ocr-based capabilities of multimodal llms in video scenarios.arXiv preprint arXiv:2505.21333(2025)

  51. [51]

    Mustafa Shukor, Corentin Dancette, and Matthieu Cord. 2023. ep-alm: Efficient perceptual augmentation of language models. InProceedings of the IEEE/CVF International Conference on Computer Vision. 22056–22069

  52. [52]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. 2025. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)

  54. [54]

    Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, and Min Yang. 2025. SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse. arXiv:2512.18671 [cs.CV] https://arxiv.org/abs/2512.18671

  55. [55]

    Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Eduard Zamfir, Chao Ma, Danda Paudel, Luc Van Gool, and Radu Timofte. 2025. Xtrack: Multi- modal training boosts rgb-x video object trackers. InProceedings of the IEEE/CVF International Conference on Computer Vision. 5734–5744

  56. [56]

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. 2025. Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology(2025)

  57. [57]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)

  58. [58]

    Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. 2024. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning. arXiv:2412.14164 [cs.CV] https://arxiv.org/abs/2412.14164

  59. [59]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  60. [60]

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. 2025. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision. 22958–22967

  61. [61]

    Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. 2024. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models. arXiv preprint arXiv:2406.16338 (2024)

  63. [63]

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. 2024. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems37 (2024), 28828–28857

  64. [64]

    Hanbo Wu, Xin Ma, and Yibin Li. 2025. Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos. Signal Processing: Image Communication 131 (2025), 117244

  65. [65]

    Yuanchen Wu, Lu Zhang, Hang Yao, Junlong Du, Ke Yan, Shouhong Ding, Yunsheng Wu, and Xiaoqiang Li. 2025. Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception. arXiv:2504.20468 [cs.CV] https://arxiv.org/abs/2504.20468

  66. [66]

    Cui Yakun, Peng Qi, Fushuo Huo, Hang Du, Weijie Shi, Juntao Dai, Zhenghao Zhu, Sirui Han, and Yike Guo. 2025. Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection. arXiv preprint arXiv:2510.24816 (2025)

  67. [67]

    Cui Yakun, Yanting Zhang, Zhu Lei, Jian Xie, Zhizhuo Kou, Hang Du, Zhenghao Zhu, and Sirui Han. 2026. MMFCTUB: Multi-Modal Financial Credit Table Understanding Benchmark.arXiv preprint arXiv:2601.04643(2026)

  68. [68]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

  69. [69]

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10632–10643

  71. [71]

    Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. MentaLLaMA: interpretable mental health analysis on social media with large language models. InProceedings of the ACM Web Conference

  72. [72]

    Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. 2025. Vca: Video curious agent for long video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision. 20168–20179

  73. [73]

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. 2025. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding. arXiv:2501.13106 [cs.CV] https://arxiv.org/abs/2501.13106

  74. [74]

    Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, and Chao Ma. 2025. Vtimecot: Thinking by drawing for video temporal grounding and reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision. 24203–24213

  75. [75]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 2024. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713(2024)

  76. [76]

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. 2025. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13691–13701

  77. [77]

    Xinyi Zhou, Ashish Sharma, Amy X Zhang, and Tim Althoff. 2024. Correcting misinformation on social media with a large language model.arXiv preprint arXiv:2403.11169(2024)

  78. [78]

    Yujin Zhou, Pengcheng Wen, Jiale Chen, Boqin Yin, Han Zhu, Jiaming Ji, Juntao Dai, Chi-Min Chan, and Sirui Han. 2026. What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 29071–29079

  79. [79]

    Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia. 2024. Apollo: An Exploration of Video Understanding in Large Multimodal Models. arXiv:2412.10360 [cs.CV] https://arxiv.org/abs/2412.10360


Showing first 80 references.