arXiv: 2411.16771 · v3 · submitted 2024-11-25 · 💻 cs.CV

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

Pith reviewed 2026-05-23 16:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords: vision language models, hallucinations, video benchmark, temporal hallucinations, caption ordering, VLLM evaluation, multimodal models

The pith

VidHal benchmark shows vision LLMs struggle to rank video captions by hallucination level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VidHal, a benchmark built from videos spanning common temporal aspects to test hallucinations in vision large language models (VLLMs). Each video is paired with multiple captions that represent graduated levels of hallucination, and models must complete a caption ordering task that ranks them by increasing hallucinatory extent. This setup targets nuanced spatiotemporal errors that image-focused methods overlook. Experiments on a range of models show that they often fail to order the captions correctly, exposing clear limitations in handling video content. The work positions the benchmark as a tool to drive better evaluation and mitigation of temporal hallucinations.

Core claim

VidHal evaluates video-based hallucinations in VLLMs by supplying videos paired with captions that vary in hallucination severity across temporal aspects, then requiring models to rank the captions by hallucinatory extent. Comprehensive tests across multiple models demonstrate that existing VLLMs exhibit significant limitations in generating responses without such hallucinations.

What carries the argument

The caption ordering task, which requires models to rank captions by their degree of hallucination for each video.
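
As a rough illustration of how a predicted ranking could be compared with the intended one, the sketch below scores a single video with three captions (M = 3, as stated in the Figure 4 excerpt further down). The caption labels and both scoring functions are assumptions made for illustration; they are not the metrics reported in the paper.

```python
from itertools import combinations


def exact_match(predicted, reference):
    """True when the predicted ranking equals the reference ranking."""
    return list(predicted) == list(reference)


def pairwise_agreement(predicted, reference):
    """Fraction of caption pairs placed in the same relative order."""
    position = {caption: i for i, caption in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


# Hypothetical labels, ordered from least to most hallucinated.
reference = ["A", "B", "C"]
# A model that swaps the two hallucinated captions.
predicted = ["A", "C", "B"]

print(exact_match(predicted, reference))
print(pairwise_agreement(predicted, reference))
```

Exact match gives no credit for a near-miss, while the pairwise score separates a single swap from a fully reversed ranking, which is closer to the fine-grained signal the task is meant to provide.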

If this is right

  • Existing VLLMs produce responses containing significant temporal hallucinations when describing video content.
  • Standard evaluation methods miss the nuanced spatiotemporal errors that arise in video responses.
  • The benchmark supports targeted development of VLLMs that reduce hallucination in video settings.
  • Further work should pursue more complete assessment of VLLM hallucination across dynamic inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of the ordering task could push training methods to explicitly reduce temporal inconsistencies in generated descriptions.
  • The graduated-caption approach may transfer to measuring hallucinations in other time-based modalities such as audio sequences.
  • Models succeeding on this task might also show stronger general temporal reasoning when combining vision and language.

Load-bearing premise

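The generated captions (an anchor caption derived from video metadata plus hallucinatory captions produced by GPT-4o, per the Figure 2 caption) accurately reflect distinct and ordered levels of hallucination without bias or inconsistency.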

What would settle it

Human raters reordering the captions in a way that frequently disagrees with the benchmark's intended ranking across many videos.
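
One way such a check could be tallied is sketched below: take the most common ordering among raters for each video and count how often it differs from the benchmark's intended ranking. The video IDs, caption labels, ratings, and the majority-vote rule are all hypothetical choices for illustration, not the paper's validation protocol.

```python
from collections import Counter


def majority_order(rater_orders):
    """Most common ordering among raters for one video (ties: first seen)."""
    counts = Counter(tuple(order) for order in rater_orders)
    return counts.most_common(1)[0][0]


def disagreement_rate(benchmark, ratings):
    """Share of videos whose majority human order differs from the benchmark."""
    differing = sum(
        1
        for video_id, intended in benchmark.items()
        if majority_order(ratings[video_id]) != tuple(intended)
    )
    return differing / len(benchmark)


# Hypothetical data: two videos, three raters each.
benchmark = {"v1": ["A", "B", "C"], "v2": ["A", "B", "C"]}
ratings = {
    "v1": [["A", "B", "C"], ["A", "B", "C"], ["A", "C", "B"]],
    "v2": [["B", "A", "C"], ["B", "A", "C"], ["A", "B", "C"]],
}

print(disagreement_rate(benchmark, ratings))
```

A rate near zero would support the premise; a high rate across many videos would indicate that the intended ranking does not match human judgment.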

Figures

Figures reproduced from arXiv: 2411.16771 by Mohan Kankanhalli, Wey Yeh Choong, Yangyang Guo.

Figure 1: Multiple-Choice Question Answering (MCQA) per…
Figure 2: Overview of our VIDHAL benchmark construction pipeline. Using direction as an example from the five selected aspects, we begin by sourcing relevant video instances from existing datasets. Next, the anchor (positive) caption is generated from the original video metadata. Finally, GPT-4o is employed to generate hallucinatory captions at varying levels. based approaches, which we argue are less effective in c…
Figure 4: Visual illustration of relative caption ordering task in VIDHAL. The final ordering is parsed based on VLLM responses for each pair order queried. 3.4. Dataset Statistics and Human Validation Our VIDHAL benchmark consists of a total of 1,000 video instances. Using our automatic annotation pipeline, each video instance is tagged with M = 3 captions. As shown in…
Figure 3: Human agreement on hallucination levels in the V…
Figure 5: Aspect-aware results of VLLMs for the (Left) naive and…
Figure 7: Overlapping ratios of model predictions under single…
Figure 10: Qualitative examples of VLLM responses on the caption ordering tasks, for the…
Figure 9: Hallucination misalignment (HM) scores on V…
Figure 8: (Top) Invalid response rates across all models. VLLMs…
Figure 10: illustrates the distribution of public dataset sources contributing to the visual instances in VIDHAL. Additionally, Figures 11 and 12 depict the distribution of temporal aspects across VIDHAL and the ground truth answers for the MCQA and caption ordering tasks, respectively. One can observe that both temporal aspects and ground truth options are uniformly distributed across our benchmark. The distribut…
Figure 12: Distribution of (Left) correct answer options for the…
Figure 14: Specific skills and corresponding questions from the Perception Test dataset chosen for V…
Figure 15: Evaluation tasks in MVBench aligned with temporal…
Figure 16: Prompts used for generating the anchor caption from…
Figure 17: Prompt for generating aspect-specific hallucinatory captions based on anchor captions and in-context examples.
Figure 18: Definitions incorporated into the prompt for generating hallucinatory captions for each aspect, with separate definitions provided…
Figure 19: In-context examples for Size sub-aspect. Original Caption: 1 : A circle shaped block is placed in a wooden box. Hallucinated Captions: 2 : A square shaped block is placed in a wooden box. 3 : A star shaped block is placed in a wooden box. Original Caption: 1 : Cubes are transforming into cylinders. Hallucinated Captions: 2 : Cubes are transforming into cones. 3 : Cubes are transforming into spheres. Origi…
Figure 20: In-context examples for Shape sub-aspect. Original Caption: 1 : A leaf with holes turns green to red. Hallucinated Captions: 2 : A leaf with holes turns from green to orange. 3 : A leaf with holes turns from yellow to orange. Original Caption: 1 : A yellow ball bounces on the ground, and lands in the pool. Hallucinated Captions: 2 : A red ball bounces on the ground, and lands in the pool. 3 : A blue ball …
Figure 21: In-context examples for the Color sub-aspect. Original Caption: 1 : The man wearing a jacket performed three backflips. Hallucinated Captions: 2 : The man wearing a jacket performed four backflips. 3 : The man wearing a jacket performed five backflips. Original Caption: 1 : Four birds perched on the wire. Hallucinated Captions: 2 : Five birds perched on the wire. 3 : Six birds perched on the wire. Origina…
Figure 25: In-context examples for the Event Order aspect. Original Caption: 1 : The people are cooking in the video. Hallucinated Captions: 2 : The people are chopping in the video. 3 : The people are washing in the video. Original Caption: 1 : A car is driving down the road. Hallucinated Captions: 2 : A car is reversing down the road. 3 : A car is being repaired along the road. Original Caption: 1 : A dog is diggi…
Figure 26: In-context examples for the Action aspect. Original Caption: 1 : An eagle is flying from left to right diagonally upwards. Hallucinated Captions: 2 : An eagle is flying from left to right horizontally. 3 : An eagle is flying from left to right diagonally downwards. Original Caption: 1 : The car drives forward and makes a right turn. Hallucinated Captions: 2 : The car drives forward and continues driving s…
Figure 27: In-context examples for the Direction aspect. 9.2. Relative Order Parsing Prompting the VLLM to predict the order of captions based on their hallucinatory level in the relative caption ordering task involves asking a series of paired questions derived from different caption combinations. However, providing the model with all possible pairs at once may result in cyclic and non-transitive orderings. To addr…
Figure 28: Prompt template for evaluating the quality of generated captions for the GPT-4o, Gemini-1.5 Flash, and LLaMA3 (70B) models.
Figure 29: Question prompts for evaluating caption quality based on the three assessment criteria. Prompts with the placeholder…
Figure 30: Qualitative examples of video instances and their corresponding generated captions in the V…
Figure 31: Pipeline for validating the quality of generated caption orders in VidHal. For each instance, human annotators are provided…
Figure 32: Prompt template for the MCQA and relative caption ordering evaluation tasks.
Figure 33: Prompt template for the naive caption ordering evaluation task.
Figure 34: Decision tree for determining the final caption order based on VLLM responses to paired questions in the relative caption…
Figure 35: Distribution of results of VLLMs across varied input…
Figure 36: (Top) Averaged consensus score of each respective…
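
The Figure 27 excerpt above notes that answers to pairwise questions can form cyclic, non-transitive orderings. The sketch below shows one simple way to detect that case for a small caption set; it is an illustrative assumption, not the paper's decision tree (Figure 34), whose details are not included in the text shown here. The function name and the input format are hypothetical.

```python
from itertools import combinations


def order_from_pairs(captions, less_hallucinated):
    """Return captions from least to most hallucinated, or None if cyclic.

    less_hallucinated maps each pair (a, b), in the order produced by
    itertools.combinations, to the caption judged less hallucinated.
    """
    wins = {caption: 0 for caption in captions}
    for a, b in combinations(captions, 2):
        wins[less_hallucinated[(a, b)]] += 1
    # Transitive answers give distinct win counts 0, 1, ..., n - 1.
    if sorted(wins.values()) != list(range(len(captions))):
        return None
    return sorted(captions, key=lambda caption: -wins[caption])


captions = ["A", "B", "C"]  # hypothetical caption labels
consistent = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"}
cyclic = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"}

print(order_from_pairs(captions, consistent))
print(order_from_pairs(captions, cyclic))
```

With three captions, a cycle shows up as every caption winning exactly one comparison, so no ranking can be read off the answers.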
Original abstract

Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs regarding hallucination generation. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) extensive development of advanced VLLMs to alleviate this problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VidHal, a benchmark for temporal hallucinations in Vision Large Language Models (VLLMs). It constructs the benchmark by bootstrapping video instances across temporal aspects, creates captions representing varying hallucination levels for each video, and proposes a caption ordering task in which VLLMs rank captions by hallucinatory extent. Experiments on a range of models are reported to uncover significant limitations in existing VLLMs regarding hallucination generation.

Significance. If the benchmark construction is shown to be reliable, VidHal would address a clear gap: most hallucination work targets static images, while video inputs introduce richer spatiotemporal dynamics that current metrics do not capture. The caption-ordering task supplies a fine-grained, ordinal evaluation signal that could support more nuanced model comparisons and motivate targeted mitigation research. The manuscript correctly identifies the need for holistic VLLM evaluation focused on temporal hallucination.

major comments (2)
  1. [Abstract / benchmark construction] Abstract and benchmark-construction description: the central claim that the caption ordering task enables 'fine-grained evaluation' of hallucinatory extent rests on the assumption that the human-created captions accurately encode ordered hallucination levels. No details are supplied on the assignment procedure (temporal error taxonomy, grounding against video ground truth, or controls for length/style confounds), nor on inter-annotator agreement. This directly affects the validity of all reported model rankings and the conclusion of 'significant limitations.'
  2. [Experiments] Experimental section: the claim that results 'uncover significant limitations' requires that the ordering task measures genuine hallucination differences rather than superficial cues. Without reported validation of the caption levels or experimental controls (e.g., caption-length balancing, reference-video verification), the performance differences cannot be confidently attributed to hallucination sensitivity.
minor comments (1)
  1. The abstract states that VidHal is 'specially designed' and that 'extensive experiments' were conducted, yet the provided text gives no concrete counts of videos, captions per video, or model list; these numbers should appear in the main text or a table for reproducibility.
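
The caption-length confound raised in the major comments can be probed with a length-only baseline: if word count alone reveals the intended order much more often than chance, models could succeed without attending to the video. The sketch below is a hypothetical check, not an analysis from the paper. The first instance reuses the backflip captions from the Figure 21 excerpt; the second instance is invented for illustration.

```python
def length_baseline_accuracy(instances):
    """Share of instances whose reference order is strictly increasing in word count.

    Each instance lists its captions from least to most hallucinated.
    """
    hits = 0
    for reference_order in instances:
        lengths = [len(caption.split()) for caption in reference_order]
        # Strictly increasing lengths mean length alone reveals the order.
        hits += all(x < y for x, y in zip(lengths, lengths[1:]))
    return hits / len(instances)


instances = [
    [  # captions quoted in the Figure 21 excerpt (equal word counts)
        "The man wearing a jacket performed three backflips.",
        "The man wearing a jacket performed four backflips.",
        "The man wearing a jacket performed five backflips.",
    ],
    [  # invented example in which length leaks the order
        "A dog runs.",
        "A dog runs very fast.",
        "A dog runs very fast up a hill.",
    ],
]

print(length_baseline_accuracy(instances))
```

For three captions with distinct lengths, a random ordering is strictly increasing one time in six, so a baseline accuracy well above that level would point to a length cue.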

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of benchmark validity that we will address through revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / benchmark construction] Abstract and benchmark-construction description: the central claim that the caption ordering task enables 'fine-grained evaluation' of hallucinatory extent rests on the assumption that the human-created captions accurately encode ordered hallucination levels. No details are supplied on the assignment procedure (temporal error taxonomy, grounding against video ground truth, or controls for length/style confounds), nor on inter-annotator agreement. This directly affects the validity of all reported model rankings and the conclusion of 'significant limitations.'

    Authors: We agree that the manuscript would benefit from explicit details on the caption creation process to support the claim of ordered hallucination levels. In the revised manuscript, we will expand the relevant sections to describe the temporal error taxonomy, the grounding procedure against video ground truth, controls for length and style confounds, and inter-annotator agreement statistics from the annotation process. revision: yes

  2. Referee: [Experiments] Experimental section: the claim that results 'uncover significant limitations' requires that the ordering task measures genuine hallucination differences rather than superficial cues. Without reported validation of the caption levels or experimental controls (e.g., caption-length balancing, reference-video verification), the performance differences cannot be confidently attributed to hallucination sensitivity.

    Authors: We acknowledge that additional validation and controls would strengthen attribution of results to hallucination sensitivity. In revision, we will add explicit discussion of any existing controls (such as caption-length balancing and reference verification) and, where needed, report further analyses or validation steps to confirm that performance differences reflect hallucination rather than superficial cues. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark construction paper with no derivations or self-referential steps

Full rationale

This is a benchmark introduction paper. The abstract and provided text describe constructing VidHal by bootstrapping videos and creating captions, then proposing a caption ordering task. No equations, fitted parameters, predictions, or derivation chains exist. No self-citations are invoked to justify core claims, and the evaluation approach does not reduce to its inputs by construction. The central claim of uncovering VLLM limitations rests on the benchmark's design rather than any circular reduction. This matches the default expectation of no significant circularity for such papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the domain assumption that VLLMs exhibit hallucinations in video settings and that the constructed captions can represent graduated hallucination levels; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption VLLMs are prone to hallucinations, particularly with video inputs due to spatiotemporal dynamics
    Stated as motivation in the abstract for creating the benchmark.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

    cs.CV 2025-10 conditional novelty 7.0

    XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.

  2. Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 2 Pith papers · 17 internal anchors

  1. [1]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. CoRR, abs/2404.18930, 2024. 1, 2

  2. [2]

    Videocon: Robust video- language alignment via contrast captions

    Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover. Videocon: Robust video- language alignment via contrast captions. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13927–13937. IEEE, 2024. 2, 3, 4, 5

  3. [3]

    Revisiting the ”video” in video-language understanding

    Shyamal Buch, Crist ´obal Eyzaguirre, Adrien Gaidon, Jia- jun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the ”video” in video-language understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 2907–2917. IEEE, 2022. 7

  4. [4]

    Visdiahalbench: A visual dialogue benchmark for di- agnosing hallucination in large vision-language models

    Qingxing Cao, Junhao Cheng, Xiaodan Liang, and Liang Lin. Visdiahalbench: A visual dialogue benchmark for di- agnosing hallucination in large vision-language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics , pages 12161–12176. Associa- tion for Computational Linguistics, 2024. 2

  5. [5]

    Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering

    Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering. CoRR, abs/2311.14906, 2023. 1, 2, 4, 13

  6. [6]

    Fouhey, and Joyce Chai

    Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, and Joyce Chai. Multi-object hallucination in vision-language models. CoRR, abs/2407.06192, 2024. 2

  7. [7]

    Unified hallucination detection for multi- modal large language models

    Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xi- aoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multi- modal large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics , pages 3235–3252. Association for Computational Linguis- tics, 2024. 2

  8. [8]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video- llms. CoRR, abs/2406.07476, 2024. 2, 6

  9. [9]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems , 2023. 2

  10. [10]

    Hallu-pi: Evaluating hallucination in multi-modal large language models within perturbed inputs

    Peng Ding, Jingyu Wu, Jun Kuang, Dan Ma, Xuezhi Cao, Xunliang Cai, Shi Chen, Jiajun Chen, and Shujian Huang. Hallu-pi: Evaluating hallucination in multi-modal large language models within perturbed inputs. CoRR, abs/2408.01355, 2024. 1

  11. [11]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur ´elien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi `ere, B...

  12. [12]

    Multi-modal hallucination control by visual information grounding

    Alessandro Favero, Luca Zancato, Matthew Trager, Sid- dharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, and Stefano Soatto. Multi-modal hallucination control by visual information grounding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14303–14312. IEEE, 2024. 1, 2

  13. [13]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. CoRR, abs/240...

  14. [14]

    Chat- rec: Towards interactive and explainable llms-augmented recommender system,

    Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. Chat-rec: Towards interactive and explainable llms-augmented recommender system. CoRR, abs/2303.14524, 2023. 5

  15. [15]

    DAMRO: dive into the attention mechanism of LVLM to reduce object hallucination

    Xuan Gong, Tianshi Ming, Xinpeng Wang, and Zhihua Wei. DAMRO: dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the Confer- ence on Empirical Methods in Natural Language Process- ing, pages 7696–7712. Association for Computational Lin- guistics, 2024. 1

  16. [16]

    Hal- lusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision- language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hal- lusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision- language models. In IEEE/CVF Conference on Computer Vision and Pattern Recog...

  17. [17]

    OPERA: alleviating hallucination in multi- 9 modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Con- ghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: alleviating hallucination in multi- 9 modal large language models via over-trust penalty and retrospection-allocation. In IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 13418–13427. IEEE, 2024. 1, 2

  18. [18]

    Cumulated gain- based evaluation of IR techniques

    Kalervo J ¨arvelin and Jaana Kek ¨al¨ainen. Cumulated gain- based evaluation of IR techniques. ACM Trans. Inf. Syst. , 20(4):422–446, 2002. 5

  19. [19]

    Hallucination augmented contrastive learning for multimodal large language model

    Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, and Shikun Zhang. Hallucination augmented contrastive learning for multimodal large language model. In IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 27026–27036. IEEE, 2024. 2

  20. [20]

    Hal-eval: A uni- versal and fine-grained hallucination evaluation framework for large vision language models

    Chaoya Jiang, Wei Ye, Mengfan Dong, Hongrui Jia, Haiyang Xu, Ming Yan, Ji Zhang, and Shikun Zhang. Hal-eval: A uni- versal and fine-grained hallucination evaluation framework for large vision language models. CoRR, abs/2402.15721,

  21. [21]

    Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, and Stefano Soatto. THRONE: an object-based hallucination benchmark for the free-form generations of large vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27218–27228. IEEE, 2024. 1, 2

  22. [22]

    Berg, and Mohit Bansal

    Jie Lei, Tamara L. Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 487–507. Association for Computational Linguistics, 2023. 7

  23. [23]

    Mitigating object hal- lucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hal- lucinations in large vision-language models through visual contrastive decoding. In IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 13872–13882. IEEE, 2024. 1, 2

  24. [24]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. CoRR, abs/2307.16125, 2023. 2, 4, 5

  25. [25]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. InIn- ternational Conference on Machine Learning, pages 12888– 12900. PMLR, 2022. 2

  26. [26]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning , pages 19730–19742. PMLR, 2023. 2, 14

  27. [27]

    VideoChat: Chat-Centric Video Understanding

    Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. CoRR, abs/2305.06355,

  28. [28]

    Mvbench: A comprehensive multi- modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi- modal video understanding benchmark. In IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22195–22206. IEEE, 2024. 1, 2, 4, 5, 6, 13

  29. [29]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the Confer- ence on Empirical Methods in Natural Language Processing, pages 292–305. Association for Computational Linguistics,

  30. [30]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024. 6

  31. [31]

    Mitigating hallucination in large multi-modal models via robust instruction tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Ya- coob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representa- tions. OpenReview.net, 2024. 2

  32. [32]

    Models see hallucinations: Eval- uating the factuality in video captioning

    Hui Liu and Xiaojun Wan. Models see hallucinations: Eval- uating the factuality in video captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 11807–11823. Association for Computa- tional Linguistics, 2023. 4

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023. 1, 2

  34. [34]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2024. 1, 2

  35. [35]

    Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 2

  36. [36]

    A Survey on Hallucination in Large Vision-Language Models

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiu- tian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. CoRR, abs/2402.00253, 2024. 2

  37. [37]

    Phd: A prompted visual hallucination evaluation dataset

    Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. Phd: A prompted visual hallucination evaluation dataset. CoRR, abs/2403.11116, 2024. 1

  38. [38]

    Paying more attention to image: A training-free method for alleviating hallucination in lvlms

    Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. arXiv preprint arXiv:2407.21771, 2024. 1, 2

  39. [39]

    Tempcompass: Do video llms really understand videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In Findings of the Association for Computational Linguistics, pages 8731–. Association for Computational Linguistics, 2024. 1, 2, 4, 13

  41. [41]

    Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens

    Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13151–13160. IEEE, 2024. 1

  42. [42]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. CoRR, abs/2311.16103, 2023. 1, 2, 4, 5

  43. [43]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,

  44. [44]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

  45. [45]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adrià Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alexandre Fréchette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, ...

  46. [46]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanz...

  47. [47]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4035–4045. Association for Computational Linguistics, 2018. 1, 3

  48. [48]

    CSTA: cnn-based spatiotemporal attention for video summarization

    Jaewon Son, Jaehun Park, and Kwangsu Kim. CSTA: cnn-based spatiotemporal attention for video summarization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18847–18856. IEEE, 2024. 7

  49. [49]

    PandaGPT: One Model To Instruction-Follow Them All

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. CoRR, abs/2305.16355, 2023. 2

  50. [50]

    Aligning large multimodal models with factually augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics, pages 13088–13110. Association for Computational Linguistics, 2...

  51. [51]

    Avhbench: A cross-modal hallucination benchmark for audio-visual large language models

    Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. Avhbench: A cross-modal hallucination benchmark for audio-visual large language models. arXiv preprint arXiv:2410.18325, 2024

  52. [52]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 1, 2, 3, 4, 8

  53. [53]

    Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models

    Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models. CoRR, abs/2406.16338, 2024. 2, 3, 8

  54. [54]

    Haloquest: A visual hallucination dataset for advancing multimodal reasoning

    Zhecan Wang, Garrett Bingham, Adams Yu, Quoc V. Le, Thang Luong, and Golnaz Ghiasi. Haloquest: A visual hallucination dataset for advancing multimodal reasoning. CoRR, abs/2407.15680, 2024. 2

  55. [55]

    Toward a stable, fair, and comprehensive evaluation of object hallucination in large vision-language models

    Hongliang Wei, Xingtao Wang, Xianqi Zhang, Xiaopeng Fan, and Debin Zhao. Toward a stable, fair, and comprehensive evaluation of object hallucination in large vision-language models. In The Annual Conference on Neural Information Processing Systems, 2024. 2

  56. [56]

    EFUF: efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models

    Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, and Xinyu Dai. EFUF: efficient fine-grained unlearning framework for mitigating hallucinations in multimodal large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1167–1181. Association for Computational Li...

  57. [57]

    Mitigating object hallucination via concentric causal attention

    Yun Xing, Yiheng Li, Ivan Laptev, and Shijian Lu. Mitigating object hallucination via concentric causal attention. In The Annual Conference on Neural Information Processing Systems, 2024. 1

  58. [58]

    Hallucination is Inevitable: An Innate Limitation of Large Language Models

    Ziwei Xu, Sanjay Jain, and Mohan S. Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. CoRR, abs/2401.11817, 2024. 1

  59. [59]

    Vript: A video is worth thousands of words

    Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words. In Advances in Neural Information Processing Systems, 2024. 2, 3, 5

  60. [60]

    mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. CoRR, abs/2408.04840, 2024. 2, 6

  61. [61]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178, 2023

  62. [62]

    mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. CoRR, abs/2311.04257, 2023. 2

  63. [63]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. CoRR, abs/2310.16045, 2023. 2

  64. [64]

    HELPD: mitigating hallucination of lvlms by hierarchical feedback learning with vision-enhanced penalty decoding

    Fan Yuan, Chi Qin, Xiaogang Xu, and Piji Li. HELPD: mitigating hallucination of lvlms by hierarchical feedback learning with vision-enhanced penalty decoding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1768–1785. Association for Computational Linguistics, 2024. 2

  65. [65]

    Less is more: Mitigating multimodal hallucination from an EOS decision perspective

    Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 11766–11781. Association for Computational Linguistics, 2024. 1

  66. [66]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 543–553. Association for Computational Linguistics, 2023. 2

  67. [67]

    Llava-next: A strong zero-shot video understanding model

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 6

  68. [68]

    Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

    Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. CoRR, abs/2311.16839, 2023. 2

  69. [69]

    Investigating and mitigating the multimodal hallucination snowballing in large vision-language models

    Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, and Bing Qin. Investigating and mitigating the multimodal hallucination snowballing in large vision-language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 11991–12011. Association for Computational Li...

  70. [70]

    Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality

    Guanyu Zhou, Yibo Yan, Xin Zou, Kun Wang, Aiwei Liu, and Xuming Hu. Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality. CoRR, abs/2410.04780, 2024. 1

  71. [71]

    Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

    Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. CoRR, abs/2402.11411, 2024. 2

  72. [72]

    Analyzing and mitigating object hallucination in large vision-language models

    Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. In The International Conference on Learning Representations. OpenReview.net, 2024. 2

  73. [73]

    Calibrated self-rewarding vision language models

    Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, and Huaxiu Yao. Calibrated self-rewarding vision language models. In Advances in Neural Information Processing Systems, 2024. 2

  74. [74]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The International Conference on Learning Representations. OpenReview.net, 2024. 2

  75. [75]

    Combating visual question answering hallucinations via robust multi-space co-debias learning

    Jiawei Zhu, Yishu Liu, Huanjia Zhu, Hui Lin, Yuncheng Jiang, Zheng Zhang, and Bingzhi Chen. Combating visual question answering hallucinations via robust multi-space co-debias learning. In ACM Multimedia 2024, 2024. 2

  76. [76]

    IBD: alleviating hallucinations in large vision-language models via image-biased decoding

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. IBD: alleviating hallucinations in large vision-language models via image-biased decoding. CoRR, abs/2402.18476, 2024. 1

  77. [77]

    Game on tree: Visual hallucination mitigation via coarse-to-fine view tree and game theory

    Xianwei Zhuang, Zhihong Zhu, Zhanpeng Chen, Yuxin Xie, Liming Liang, and Yuexian Zou. Game on tree: Visual hallucination mitigation via coarse-to-fine view tree and game theory. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17984–18003. Association for Computational Linguistics, 2024. 1

  78. [78]

    Benchmark Construction Details: Dataset Statistics

    Benchmark Construction Details. 7.1. Dataset Statistics. Figure 10 illustrates the distribution of public dataset sources contributing to the visual instances in VidHal. Additionally, Figures 11 and 12 depict the distribution of temporal aspects across VidHal and the ground truth answers for the MCQA and caption ordering tasks, respectively. One can o...

  79. [79]

    You are given one or more questions targeted at content of a video and their corresponding answers

    Separate in-context examples are provided for each Attribute subaspect of Shape, Size, Color, Count, and State Change to account for their distinct natures. You are given one or more questions targeted at content of a video and their corresponding answers. You are tasked with generating an appropriate and informative single line caption for the video...

  80. [80]

    Human Validation Details. 8.1. Human Validation Process. As varying hallucination levels are a distinctive feature of our benchmark, we prioritize validating the robustness of caption ordering produced by our annotation pipeline. Each anchor caption is derived from the original video metadata, making it the most accurate reflection of the video content. Our...
