When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
Pith reviewed 2026-05-10 06:37 UTC · model grok-4.3
The pith
When on-screen text contradicts video visuals, existing vision-language models systematically hallucinate by favoring the text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. The authors define this as Text Overlay-Induced Hallucination (TOIH). They construct VisualTextTrap, the first comprehensive benchmark for this failure mode, from public datasets via a hybrid VLM-assisted generation pipeline followed by manual verification, yielding 6,057 samples annotated on 88 fine-grained attributes across four dimensions, with hallucination intensity scored L1 to L5. They further propose VTHM-MoE, a Vision-Text Disentanglement framework built on a dual-encoder architecture whose four dimension-specialized expert modules (spanning Temporal, Action, Object, and Spatial reasoning) are pre-trained to identify cross-modal discrepancies, with an Adaptive Token Routing Strategy allocating tokens to experts dynamically.
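The construction pipeline is described only at this level of detail, so below is a hypothetical sketch of its shape: an assisting VLM proposes overlay text that contradicts a clip's ground-truth description, and the candidate is queued for rendering and manual verification. The CandidateSample fields and the propose_contradiction callable are illustrative placeholders, not the authors' code.

```python
# Hypothetical pipeline sketch; names and fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CandidateSample:
    video_id: str
    dimension: str      # one of: temporal, action, object, spatial
    overlay_text: str   # VLM-proposed text that contradicts the visuals
    intensity: str      # later assigned L1-L5 during manual verification

def build_candidates(clips, propose_contradiction):
    """clips: iterable of (video_id, caption, dimension) tuples.
    propose_contradiction: any callable wrapping an assisting VLM that returns
    overlay text contradicting the caption along the given dimension."""
    candidates = []
    for video_id, caption, dimension in clips:
        overlay = propose_contradiction(caption, dimension)
        candidates.append(CandidateSample(video_id, dimension, overlay, intensity="unlabeled"))
    return candidates

# Next stages (not shown): render each overlay onto the video frames, then send
# the candidate to human annotators for verification and intensity labeling.
```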
What carries the argument
VTHM-MoE, a dual-encoder mixture-of-experts framework with four pre-trained dimension-specialized experts for temporal, action, object, and spatial reasoning plus an adaptive token routing strategy that routes inputs to experts best able to resolve text-vision discrepancies.
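The abstract does not spell out the routing mechanics, so here is a minimal sketch, assuming top-1 token routing over four small MLP experts with a gated residual update. The class names (DimensionExpert, AdaptiveTokenRouter) and every design choice below are assumptions for illustration, not the paper's implementation.

```python
# Minimal MoE routing sketch (PyTorch); all design choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimensionExpert(nn.Module):
    """One dimension expert (Temporal, Action, Object, or Spatial) as a small MLP."""
    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class AdaptiveTokenRouter(nn.Module):
    """Sends each fused vision-text token to the expert with the highest gate score."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(DimensionExpert(dim) for _ in range(num_experts))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim), assumed fused from the two encoders upstream
        scores = F.softmax(self.gate(tokens), dim=-1)   # (B, T, num_experts)
        top_score, top_idx = scores.max(dim=-1)         # top-1 routing per token
        routed = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                         # (B, T) tokens assigned to expert e
            if mask.any():
                routed[mask] = expert(tokens[mask])
        # Gated residual: keep the original token and add the chosen expert's correction.
        return tokens + top_score.unsqueeze(-1) * routed
```

In practice a load-balancing term and the per-dimension pre-training the paper describes would sit on top of this skeleton.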
If this is right
- Standard VLMs will continue to produce incorrect answers on video QA whenever overlay text conflicts with visuals.
- VTHM-MoE maintains accuracy on videos that lack conflicting text.
- The adaptive routing strategy allows dynamic allocation of expert modules based on detected cross-modal discrepancies.
- Performance gains appear across temporal, action, object, and spatial reasoning tasks in the presence of text overlays.
Where Pith is reading between the lines
- The same text-vision conflict could appear in static image tasks with overlaid captions, suggesting the benchmark approach might transfer beyond video.
- Real-world systems that process news footage or instructional videos could incorporate similar expert routing to reduce errors in live settings.
- Extending the routing mechanism to handle text that changes over time might address more dynamic overlay scenarios the current benchmark does not test.
Load-bearing premise
The hybrid VLM-assisted text generation followed by manual verification creates samples that faithfully represent real-world contradictions, and the four dimension-specialized experts generalize robustly to new videos without overfitting.
What would settle it
Run both baseline VLMs and the VTHM-MoE model on a fresh collection of videos containing deliberately created on-screen text that directly contradicts the visible scene, then measure whether hallucination rates remain high for baselines but drop for VTHM-MoE.
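A minimal sketch of how that measurement could be scored, assuming each benchmark sample records the misleading overlay answer and an L1-L5 intensity tag; the field names are illustrative, not the benchmark's actual schema.

```python
# Sketch of a per-intensity hallucination-rate metric; field names are assumptions.
from collections import defaultdict

def hallucination_rate(predictions, samples):
    """Fraction of answers that echo the contradicting overlay text rather than
    the visual ground truth, broken down by annotated intensity level (L1-L5)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, sample in zip(predictions, samples):
        level = sample["intensity"]                                   # e.g., "L3"
        totals[level] += 1
        if pred.strip().lower() == sample["overlay_answer"].strip().lower():
            hits[level] += 1                                          # followed the misleading text
    return {lvl: hits[lvl] / totals[lvl] for lvl in sorted(totals)}
```

If the core claim holds, these rates stay high for baseline VLMs at every level (and likely rise with intensity) but drop for VTHM-MoE, while accuracy on control videos without conflicting text stays flat.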
Original abstract
Recent advances in Vision-Language Models (VLMs) have substantially enhanced their ability across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark, including large-scale human-validated samples with specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely-used public datasets using a scalable hybrid pipeline of VLM-assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1-L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments conducted on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, outperforming state-of-the-art counterparts across diverse video question answering tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies Text Overlay-Induced Hallucination (TOIH) as a vulnerability in VLMs, where on-screen text contradicting visual content in videos causes models to prioritize textual semantics over visual evidence. It introduces VisualTextTrap, a benchmark of 6,057 human-validated samples constructed from public datasets via a hybrid VLM-assisted text generation pipeline followed by manual verification; samples are annotated across four dimensions (Temporal, Action, Object, Spatial) with 88 fine-grained attributes and hallucination intensity on an L1-L5 scale. The authors also propose VTHM-MoE, a dual-encoder Mixture-of-Experts architecture with four dimension-specialized experts pre-trained on cross-modal discrepancies and an Adaptive Token Routing Strategy for dynamic allocation. Experiments on the benchmark are claimed to show that VTHM-MoE outperforms state-of-the-art methods on video QA tasks while preserving performance on uncontaminated videos.
Significance. If the central claims hold, the work addresses a practically relevant robustness gap in VLMs for real-world video understanding involving text overlays (e.g., subtitles, signs, or captions). The scale of the benchmark, its multi-dimensional structure, and the intensity scaling provide a useful evaluation resource. The proposed disentanglement framework via specialized experts and routing is a concrete architectural contribution. Credit is due for the human-validated scale and the explicit focus on preserving clean-video performance.
Major comments (2)
- [Benchmark construction pipeline] Benchmark construction pipeline (described in the abstract and methods): the hybrid VLM-assisted text generation step risks introducing model-specific biases into the contradiction samples, as the assisting VLMs may share training data or failure modes with the evaluated models. This directly affects the load-bearing claim that VLMs 'systematically' hallucinate on independent real-world overlays. Manual verification is noted but lacks reported inter-annotator agreement, blinding protocols, or quantitative naturalness metrics, leaving open whether the 6,057 samples (and L1-L5 intensities) faithfully represent unbiased contradictions.
- [Experiments] Experiments section: the abstract asserts 'extensive experiments' and outperformance on 'diverse video question answering tasks' with no quantitative results, error bars, ablation tables, or baseline details provided even at a high level. Without these, it is impossible to verify the claimed superiority of VTHM-MoE or the robustness of the four-expert design and Adaptive Token Routing Strategy beyond the benchmark samples.
Minor comments (2)
- [Benchmark description] The L1-L5 intensity scale is introduced but would benefit from an explicit table or equation defining the semantic contradiction thresholds for each level; a hypothetical mapping is sketched after this list.
- [Benchmark construction] Ensure all source public datasets are explicitly cited with version or split information in the benchmark construction section.
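To illustrate what such a definition could look like (purely hypothetical, not taken from the paper), one option is to bin a continuous semantic-contradiction score into the five levels:

```python
# Hypothetical mapping from a contradiction score in [0, 1] to the L1-L5 scale.
# The score source (e.g., 1 minus a text-visual similarity) and the cut points
# are illustrative assumptions, not the paper's definition.
def intensity_level(contradiction_score: float) -> str:
    for upper, level in [(0.2, "L1"), (0.4, "L2"), (0.6, "L3"), (0.8, "L4")]:
        if contradiction_score <= upper:
            return level
    return "L5"
```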
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and constructive feedback. We agree that clarifying the benchmark construction process and providing more explicit experimental details will improve the manuscript. We address each major comment below and will incorporate the suggested revisions.
Point-by-point responses
- Referee: [Benchmark construction pipeline] Benchmark construction pipeline (described in the abstract and methods): the hybrid VLM-assisted text generation step risks introducing model-specific biases into the contradiction samples, as the assisting VLMs may share training data or failure modes with the evaluated models. This directly affects the load-bearing claim that VLMs 'systematically' hallucinate on independent real-world overlays. Manual verification is noted but lacks reported inter-annotator agreement, blinding protocols, or quantitative naturalness metrics, leaving open whether the 6,057 samples (and L1-L5 intensities) faithfully represent unbiased contradictions.
Authors: We acknowledge the referee's valid concern about potential biases from VLM-assisted generation. In the revised version, we will expand the methods section to specify the exact VLMs employed, the diversification steps across multiple models and datasets to minimize shared failure modes, and the selection criteria for public video sources. For the manual verification, we will add inter-annotator agreement metrics (e.g., Fleiss' kappa; a minimal computation is sketched after these responses), describe the blinding and independent annotation protocols, and include quantitative naturalness and realism scores from human raters. These changes will more rigorously support the benchmark's validity and the systematic nature of TOIH. revision: yes
- Referee: [Experiments] Experiments section: the abstract asserts 'extensive experiments' and outperformance on 'diverse video question answering tasks' with no quantitative results, error bars, ablation tables, or baseline details provided even at a high level. Without these, it is impossible to verify the claimed superiority of VTHM-MoE or the robustness of the four-expert design and Adaptive Token Routing Strategy beyond the benchmark samples.
Authors: We agree that the current presentation would benefit from greater transparency in reporting. The experiments section contains comparative results on video QA tasks, but we will revise it to include full quantitative tables with mean performance and standard error bars over multiple seeds, detailed ablation studies isolating the contribution of each dimension-specific expert and the adaptive routing mechanism, and explicit baseline implementations. We will also update the abstract with key numerical highlights (e.g., accuracy gains on VisualTextTrap while preserving clean-video performance). These additions will enable direct verification of the claims. revision: yes
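Since the first response commits to reporting Fleiss' kappa, a small self-contained sketch of that computation follows; the toy rating matrix and category names are illustrative only.

```python
# Fleiss' kappa from scratch (no external package assumed); numbers are toy data.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts: (n_samples, n_categories); each row sums to the number of raters."""
    n_raters = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / counts.sum()                          # category proportions
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()                    # observed vs. chance agreement
    return float((p_bar - p_e) / (1 - p_e))

# Example: 3 annotators judging 4 candidate samples as "valid contradiction" vs. "not valid".
ratings = np.array([[3, 0], [2, 1], [3, 0], [0, 3]])
print(round(fleiss_kappa(ratings), 3))   # prints 0.625 for this toy matrix
```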
Circularity Check
No circularity detected; benchmark and model are independently constructed
Full rationale
The paper defines TOIH as a new phenomenon, constructs VisualTextTrap from public datasets via a hybrid VLM-assisted generation pipeline plus manual verification to produce 6,057 human-validated samples, and introduces VTHM-MoE as a novel dual-encoder architecture with four pre-trained dimension-specialized experts and adaptive token routing. No equations, fitted parameters, or self-citations are shown that reduce the central claims or performance results to the inputs by construction. The evaluation is presented as external testing on the new benchmark rather than a tautological loop.