MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

Jiale Li; Mengyuan Liu; Sihan Chen

arxiv: 2607.01117 · v1 · pith:S3Q2VMNKnew · submitted 2026-07-01 · 💻 cs.CV

Pith reviewed 2026-07-02 13:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords: motion hallucination, VideoLLMs, benchmark, action recognition, sequential inference, video understanding, hallucination evaluation

The pith

VideoLLMs hallucinate absent human motions most often when over-inferring from partial sequences, despite recognizing present actions well.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MoHallBench to measure motion hallucination in VideoLLMs, where models describe human movements not visible in the input video. It isolates three sources—co-occurrence priors, sequential inference, and similarity confusion—using 11,306 clips and 40,493 question-answer pairs across binary, multiple-choice, and generative formats. A bi-directional questioning protocol with bias-aware metrics tests ten open-source models and finds that strong performance on positive action recognition does not predict resistance to adversarial negative cases. Sequential inference emerges as the dominant failure mode, with stronger priors and finer similarity further increasing error rates.

Core claim

MoHallBench demonstrates that motion hallucination in VideoLLMs is driven by three distinct sources and shows a clear decoupling from standard action recognition accuracy, with sequential inference hallucination proving most severe as models over-infer expected motion outcomes from incomplete visual cues.

What carries the argument

MoHallBench benchmark with its bi-directional questioning protocol that evaluates co-occurrence priors, sequential inference, and similarity confusion while controlling for affirmation bias.
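The bias-aware metrics are not defined in the material summarized here, so the following is only a minimal sketch of one plausible way to score a bi-directional pair. It assumes each clip has one positive question (about an action that is present) and one adversarial negative question (about an action that is absent), and it credits a clip only when both are answered correctly. The names `PairResult`, `score_pairs`, and the metric keys are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model answers for one clip (hypothetical structure, not from the paper)."""
    said_yes_to_positive: bool  # question about an action that IS in the clip
    said_yes_to_negative: bool  # question about an action that is NOT in the clip


def score_pairs(results):
    """Return positive accuracy, negative accuracy, pair accuracy, and yes-rate."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one question pair")
    pos_correct = sum(r.said_yes_to_positive for r in results)
    neg_correct = sum(not r.said_yes_to_negative for r in results)
    # A pair counts only if the model says "yes" to the positive AND "no" to the negative.
    both_correct = sum(
        r.said_yes_to_positive and not r.said_yes_to_negative for r in results
    )
    yes_answers = sum(r.said_yes_to_positive + r.said_yes_to_negative for r in results)
    return {
        "acc_positive": pos_correct / n,
        "acc_negative": neg_correct / n,
        "acc_pair": both_correct / n,
        "yes_rate": yes_answers / (2 * n),  # 0.5 would indicate no affirmation bias
    }


# A model that always answers "yes" looks perfect on positives alone.
always_yes = [PairResult(True, True) for _ in range(4)]
metrics = score_pairs(always_yes)
print(metrics)
```

Under this scoring, an always-yes model gets full marks on positive recognition and zero pair accuracy, which is the kind of decoupling the benchmark is reported to expose.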

If this is right

Models excelling at positive action recognition still produce high hallucination rates on adversarial negatives.
Sequential inference causes the highest hallucination rates because models extrapolate expected future motions from partial cues.
Stronger co-occurrence priors and finer-grained action similarity both increase hallucination frequency across settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that penalize over-inference from partial motion sequences could reduce the dominant failure mode.
The benchmark design could be adapted to test whether similar decoupling appears in object or event hallucination tasks.
Mitigation methods might prioritize explicit checks against co-occurrence statistics during generation.

Load-bearing premise

The questions and protocol isolate the three hallucination sources without introducing new biases that would alter the observed decoupling between recognition and resistance.

What would settle it

A follow-up test on the same ten models using new video clips and questions that reverses the reported ranking of hallucination severity or eliminates the recognition-hallucination decoupling.

Figures

Figures reproduced from arXiv: 2607.01117 by Jiale Li, Mengyuan Liu, Sihan Chen.

**Figure 1.** Overview of MoHallBench. We evaluate motion hallucination in VideoLLMs along three complementary axes: co…

**Figure 2.** Overview of the MoHallBench pipeline. We curate adversarial samples from multi-source videos along three axes, …

**Figure 3.** Dataset statistics of MoHallBench. From left to right: …

**Figure 4.** Top-10 co-occurrence structures for OAH, SAH, and AAH. For AAH, the horizontal-axis action is denoted by …

**Figure 5.** Performance stratified by PMI in high-𝑃 regions. Lower-PMI bands show more severe hallucination under both OA and SA settings. Aerial Maneuver → Entry. In our annotations, aerial execution dominates around 50%–70% progress, while water entry occurs near the end. As truncation moves later, performance collapses rapidly: average AccNS falls from 80.39% (10%) to 13.54% (40%), then to 5.86% (50%), and approach…

**Figure 6.** Prompt used to induce the semantic hierarchy over the 500 HAA500 action categories.

**Figure 7.** Prompt used to judge whether a generated SIH description entails the withheld terminal action.

**Figure 8.** Illustration of explicit endpoint hallucination in the open-ended SIH setting. For a diving clip whose terminal action …

**Figure 9.** Clip-duration distributions of the OAH subset (left) …
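Figure 5 stratifies results by PMI (pointwise mutual information). As a rough illustration, PMI in the standard Church and Hanks form is log2 of P(x, y) / (P(x) P(y)). The sketch below estimates it from co-occurrence counts. How the paper actually estimates these probabilities is not stated in the captions, so treating clips as the counting unit, along with the function name `pmi` and the example counts, is an assumption.

```python
import math


def pmi(joint_count, count_x, count_y, total):
    """Pointwise mutual information in bits, estimated from raw counts.

    joint_count: clips containing both x and y (assumed counting unit)
    count_x, count_y: clips containing x, clips containing y
    total: total number of clips
    """
    if total <= 0 or count_x <= 0 or count_y <= 0:
        raise ValueError("counts must be positive")
    if joint_count == 0:
        return float("-inf")  # the pair never co-occurs
    # P(x,y) / (P(x) P(y)) simplifies to joint * total / (count_x * count_y).
    return math.log2(joint_count * total / (count_x * count_y))


# Illustrative numbers only: a context seen in 100 of 1000 clips and an
# action seen in 50, co-occurring in 20 clips.
print(pmi(20, 100, 50, 1000))  # 2.0: four times more often than chance
print(pmi(5, 100, 50, 1000))   # 0.0: exactly what independence predicts
```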

Original abstract

Video Large Language Models (VideoLLMs) have shown strong progress in video understanding, yet they still suffer from hallucinations that are inconsistent with visual evidence. Existing benchmarks mainly focus on object hallucination or coarse action perception, leaving a key video-specific problem underexplored: motion hallucination, in which models infer human motions that are absent from the video. We present MoHallBench, a benchmark for diagnosing motion hallucination in VideoLLMs. MoHallBench systematically evaluates three major sources of hallucination: co-occurrence priors, sequential inference, and similarity confusion. It contains 11,306 video clips and 40,493 question-answer pairs, covering binary-choice, multiple-choice, and generative settings. We further introduce a bi-directional questioning protocol with bias-aware metrics to reduce affirmation bias in binary evaluation. Experiments on ten recent open-source VideoLLMs reveal a clear decoupling between action recognition and hallucination resistance, as models that perform well on positive action recognition often fail on adversarial negatives. Among all settings, sequential inference hallucination is the most severe, showing that current models tend to over-infer expected outcomes from partial motion cues. Our analyses further confirm that stronger priors and finer-grained similarity substantially amplify hallucination. We hope MoHallBench can facilitate future evaluation and mitigation of motion hallucination in VideoLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

MoHallBench gives a concrete taxonomy and scale for motion hallucination but the decoupling result hinges on unshown details of question construction.

The main takeaway is that this paper supplies MoHallBench, a benchmark with 11k video clips and 40k QA pairs that splits motion hallucination into co-occurrence priors, sequential inference, and similarity confusion, plus a bi-directional protocol meant to limit affirmation bias.

It does useful work by focusing on a video-specific failure mode that object-hallucination benchmarks skip. Running ten recent open-source VideoLLMs and finding that strong positive action recognition often fails on adversarial negatives, with sequential inference the worst case, is a clear observation worth having on record. The note that stronger priors and finer similarity increase problems also tracks with what one would expect.

The soft spot is the lack of any description, even in the abstract, of how the questions were written, how negatives were chosen to isolate each source, or what checks were done on symmetry and annotator agreement. Without that, the reported decoupling could partly reflect benchmark construction rather than model behavior alone. The stress-test concern about post-hoc selection or language confounds lands because the abstract supplies no counter-evidence.

This is for people who build or evaluate VideoLLMs and want diagnostics beyond coarse action or object tasks. A reader looking for evaluation ideas gets the taxonomy and protocol sketch; someone needing validated numbers will wait for the methods section.

It deserves peer review. The gap it names is real and the scale is large enough that referees can usefully press on the isolation claims.

Referee Report

2 major / 2 minor

Summary. The paper introduces MoHallBench, a benchmark with 11,306 video clips and 40,493 QA pairs to diagnose motion hallucination in VideoLLMs across three sources (co-occurrence priors, sequential inference, similarity confusion) in binary-choice, multiple-choice, and generative settings. It proposes a bi-directional questioning protocol with bias-aware metrics to reduce affirmation bias and reports experiments on ten open-source VideoLLMs showing decoupling between positive action recognition and hallucination resistance on adversarial negatives, with sequential inference hallucination being the most severe.

Significance. If the benchmark design is validated, this addresses an underexplored video-specific hallucination type beyond existing object or coarse-action benchmarks. The dataset scale, coverage of multiple settings, and empirical observation of decoupling plus the dominance of sequential inference represent a useful empirical contribution for guiding VideoLLM evaluation. The paper supplies no indication of released code or data, but the benchmark construction itself is the primary deliverable.

major comments (2)

[§3 and §4] §3 (Benchmark Construction) and §4 (Bi-directional Protocol): The decoupling result and source-specific severity claims rest on the protocol accurately isolating co-occurrence priors, sequential inference, and similarity confusion. The manuscript supplies no details on video selection criteria, negative-example crafting rules, symmetry of bidirectional pairs, inter-annotator validation of source attribution, or controls for language-only confounds, which is load-bearing for interpreting the pattern as a property of the models rather than an artifact of benchmark design.
[§5] §5 (Experiments and Results): The reported decoupling between action recognition and hallucination resistance (and the ranking of sequential inference as most severe) lacks ablations or statistical controls for potential post-hoc selection effects or unmeasured biases in the 40,493 QA pairs; without these, the central experimental claim cannot be fully assessed from the provided evidence.

minor comments (2)

[Abstract] The abstract states the protocol 'reduces affirmation bias' but does not define the bias-aware metrics; a brief definition or reference to the relevant subsection would improve clarity.
[§3] Consider adding a table or figure in §3 summarizing the distribution of the 40,493 pairs across the three hallucination sources, settings, and video clip categories for transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important areas where additional transparency and validation are needed to support the benchmark's claims. We address each point below and commit to revisions that incorporate the requested details and analyses without altering the core contributions.

Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Bi-directional Protocol): The decoupling result and source-specific severity claims rest on the protocol accurately isolating co-occurrence priors, sequential inference, and similarity confusion. The manuscript supplies no details on video selection criteria, negative-example crafting rules, symmetry of bidirectional pairs, inter-annotator validation of source attribution, or controls for language-only confounds, which is load-bearing for interpreting the pattern as a property of the models rather than an artifact of benchmark design.

Authors: We agree that the current manuscript lacks sufficient detail on these aspects of benchmark construction, which are necessary to substantiate that the observed patterns reflect model behavior rather than design artifacts. In the revised version, we will expand §3 and §4 with: explicit criteria for selecting the 11,306 video clips from source datasets; step-by-step rules used to craft negatives for each of the three hallucination sources; verification that bidirectional pairs maintain symmetry in question structure and answer options; inter-annotator agreement metrics (e.g., Cohen's kappa) for source attribution during QA pair creation; and analyses or controls (such as language-only baselines) to address potential confounds. These additions will directly support the validity of the decoupling and severity rankings.
Revision: yes
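The inter-annotator agreement check mentioned in this response can be illustrated with Cohen's kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. This is a minimal sketch: the source labels `"co"`, `"seq"`, and `"sim"` and the two annotation lists are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each annotator's marginal rate, summed over labels.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical source attributions for four QA pairs.
annotator_1 = ["co", "seq", "sim", "seq"]
annotator_2 = ["co", "seq", "seq", "seq"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.556
```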
Referee: [§5] §5 (Experiments and Results): The reported decoupling between action recognition and hallucination resistance (and the ranking of sequential inference as most severe) lacks ablations or statistical controls for potential post-hoc selection effects or unmeasured biases in the 40,493 QA pairs; without these, the central experimental claim cannot be fully assessed from the provided evidence.

Authors: We acknowledge that the experimental section would benefit from explicit ablations and statistical safeguards to rule out selection effects or biases in the QA pairs. In the revision, we will add: (i) ablation studies using random subsampling of the 40,493 pairs and re-computation of decoupling metrics; (ii) statistical tests (e.g., paired t-tests or bootstrap confidence intervals) comparing action-recognition accuracy against hallucination rates on adversarial negatives; and (iii) controls for potential post-hoc biases, such as stratification by video length or source. These will be reported alongside the existing results on the ten VideoLLMs to strengthen the claims regarding source-specific severity.
Revision: yes
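One of the safeguards named in this response, a bootstrap confidence interval, could look roughly like the sketch below. It resamples clips with replacement and measures the gap between positive-question accuracy and negative-question accuracy, so an interval that excludes zero would support the decoupling claim for that model. The function name `bootstrap_gap_ci`, the per-clip 0/1 encoding, and the percentile method are assumptions, not the authors' stated procedure.

```python
import random


def bootstrap_gap_ci(pos_correct, neg_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(pos_correct) - mean(neg_correct).

    pos_correct / neg_correct: per-clip 0/1 outcomes on the positive and the
    adversarial negative question, paired by clip index.
    """
    if len(pos_correct) != len(neg_correct) or not pos_correct:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(pos_correct)
    gaps = []
    for _ in range(n_resamples):
        # Resample clip indices so each positive stays paired with its negative.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(pos_correct[i] - neg_correct[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up outcomes for ten clips: strong on positives, weak on negatives.
pos = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
neg = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(bootstrap_gap_ci(pos, neg))
```

Resampling clip indices rather than individual answers keeps each positive question tied to its negative counterpart, which matches the paired design of the protocol.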

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external model evaluations

This is a benchmark construction paper with no mathematical derivation chain, fitted parameters, or predictions. The central claims consist of empirical measurements on ten external open-source VideoLLMs using a newly constructed dataset of 11,306 clips and 40,493 QA pairs. The bi-directional protocol and bias-aware metrics are described as design choices to mitigate affirmation bias, but they do not reduce any reported result to a quantity defined inside the paper by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The observed decoupling between action recognition and hallucination resistance is a direct experimental outcome on independent models, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that the three listed sources are the primary drivers of motion hallucination and that the new benchmark isolates them cleanly.

axioms (1)

Domain assumption: Co-occurrence priors, sequential inference, and similarity confusion are the three major sources of motion hallucination in VideoLLMs.
Abstract states these as the systematic evaluation targets of MoHallBench.

invented entities (1)

MoHallBench benchmark and bi-directional questioning protocol (no independent evidence)
purpose: To diagnose and measure motion hallucination
New artifact introduced by the paper with no independent evidence outside this work.

Judge prompt text (extracted alongside the references; likely the Figure 7 entailment prompt)

- This is a semantic entailment task, not keyword matching.
- Return ENTAILED only when MODEL_OUTPUT explicitly states or strongly entails that NEGATIVE_ACTION occurred.
- Return NOT_ENTAILED when MODEL_OUTPUT describes only preparation, attempt, approach, partial execution, a related action, a broader action, or a description that does not support that NEGATIVE_ACTION actually happened.
- Return UNCERTAIN when the text is ambiguous or insufficient to determine whether NEGATIVE_ACTION actually happened.
- Do not assume completion from common knowledge or likely event sequences unless it is strongly entailed by the wording.
- Judge only from the provided text.

Be conservative. If there is reasonable uncertainty between ENTAILED and NOT_ENTAILED, choose UNCERTAIN.

Label mapping:
- ENTAILED → is_neg = true
- NOT_ENTAILED → is_neg = false
- UNCERTAIN → is_neg = false

Output valid JSON only in exactly this format: {"is_neg": true, "label": "ENTAILED", "reason": "one-sentence explanation"} or {"is_neg": false, "label": "N...

[1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binko...

2022

[2]

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 2425–2433. doi:10.1109/ICCV.2015.279

[3]

Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, and Jinwoo Choi. 2025. MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations. In CVPR. Computer Vision Foundation / IEEE, 13744–13753.

[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical...

doi:10.48550/arxiv.2502.13923

[5]

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of Multimodal Large Language Models: A Survey. CoRR abs/2404.18930 (2024). arXiv:2404.18930 doi:10.48550/ARXIV.2404.18930

[6]

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Dutt Jain, Miguel Martin, Huiyu Wang, Hanoona Abdul Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman H. ...

doi:10.48550/arxiv.2504.13180 (2025)

[7]

Jihoon Chung, Cheng-hsin Wuu, Hsuan-ru Yang, Yu-Wing Tai, and Chi-Keung Tang. 2021. HAA500: Human-Centric Atomic Action Dataset with Curated Videos. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 13445–13454. doi:10.1109/ICCV48922.2021.01321

[8]

Kenneth Ward Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Comput. Linguistics 16, 1 (1990), 22–29.

[9]

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, and Ranjay Krishna. 2026. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding. CoRR abs/2601.10611 (2026). arXiv:2601.10611 doi:10.48550/ARXIV.2601.10611

[11]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (E...

[13]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. CoRR abs/2412.19437 (2024). arXiv:2412.19437 doi:10.48550/ARXIV.2412.19437

[14]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 6325–6334. doi:10.1109/CVPR.2017.670

[16]

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. 2025. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Visio...

doi:10.1109/cvpr52734.2025.00791

[17]

Ming Kong, Xianzhou Zeng, Luyuan Chen, Yadong Li, Bo Yan, and Qiang Zhu. 2025. MHBench: Demystifying Motion Hallucination in VideoLLMs. In Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2025, Philadelphia, PA, USA, February 25 - March 4, 2025, Toby Walsh, Julie Shah...

doi:10.1609/aaai.v39i4.32463

[19]

Chaoyu Li, Eun Woo Im, and Pooyan Fazli. 2025. VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 13723–13733. doi:10.1109/CVPR52734.2025.01281

[20]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research), Andreas Krause, Emma Brunskill, Kyunghyun C...

[21]

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. 2024. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. IEEE, 22195–22206. doi:10.1109/CVPR52733.2...

doi:10.1109/cvpr52733.2024.02095

[22]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computati...

doi:10.18653/v1/2023.emnlp-main.20

[23]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Lev...

[24]

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024. A Survey on Hallucination in Large Vision-Language Models. CoRR abs/2402.00253 (2024). arXiv:2402.00253 doi:10.48550/ARXIV.2402.00253

[25]

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. 2023. Valley: Video Assistant with Large Language model Enhanced abilitY. CoRR abs/2306.07207 (2023).

[26]

Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. In ACL (1). Association for Computational Linguistics, 12585–12602.

[28]

Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. 2020. Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 1046–1056. do...

[29]

OpenAI. 2026. GPT-5.4 Thinking System Card. OpenAI. https://openai.com/index/gpt-5-4-thinking-system-card/ Official system card.

[30]

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object Hallucination in Image Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Associati...

doi:10.18653/v1/d18-1437

[31]

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. Intra- and Inter-Action Understanding via Temporal Action Parsing. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 727–736. doi:10.1109/CVPR42600.2020.00081

[33]

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I (Lecture Notes in Computer Science), Bastian Leibe, J...

doi:10.1007/978-3-319-46448-0_31

[34]

Qwen Team. 2025. Qwen3-VL Technical Report. CoRR abs/2511.21631 (2025). arXiv:2511.21631 doi:10.48550/ARXIV.2511.21631

[35]

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. 2023. An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. arXiv preprint arXiv:2311.07397 (2023).

[36]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. CoRR abs/2409.12191 (2024). arXiv:2...

[37]

Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. 2024. VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models. CoRR abs/2406.16338 (2024). arXiv:2406.16338 doi:10.48550/ARXIV.2406.16338

[39]

Jinglin Xu, Yongming Rao, Xumin Yu, Guangyi Chen, Jie Zhou, and Jiwen Lu. 2022. FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 2939–2948. doi:10.1109/CVPR52688.2022.00296

[41]

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning D...

doi:10.48550/arxiv.2509.18154 (2025)

[42]

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. 2025. Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding. CoRR abs/2501.07888 (2025). arXiv:2501.07888 doi:10.48550/ARXIV.2501.07888

[43]

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. 2025. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding. CoRR abs/2501.13106 (2025). arXiv:2501.13106 doi:10.48550/ARXIV.2501.13106

[44]

Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In EMNLP (Demos). Association for Computational Linguistics, 543–553.

[45]

Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Jingjing Chen, and Yu-Gang Jiang. 2024. EventHallusion: Diagnosing Event Hallucinations in Video LLMs. CoRR abs/2409.16597 (2024). arXiv:2409.16597 doi:10.48550/ARXIV.2409.16597

[47]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. CoRR abs/2506.05176 (2025). arXiv:2506.05176 doi:10.48550/ARXIV.2506.05176

[48]

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 2025. LLaVA-Video: Video Instruction Tuning With Synthetic Data. Trans. Mach. Learn. Res. 2025 (2025). https://openreview.net/forum?id=EElFGvt39K

[49]

Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. 2019. HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 8667–8677. doi:10.1109/ICCV.2019.00876

[50]

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. doi:10.48550/arxiv.2504.10479 (2025)

- This is a semantic entailment task, not keyword matching.
- Return ENTAILED only when MODEL_OUTPUT explicitly states or strongly entails that NEGATIVE_ACTION occurred.
- Return NOT_ENTAILED when MODEL_OUTPUT describes only preparation, attempt, approach, partial execution, a related action, a broader action, or a description that does not support that NEGATIVE_ACTION actually happened.
- Return UNCERTAIN when the text is ambiguous or insufficient to determine whether NEGATIVE_ACTION actually happened.
- Do not assume completion from common knowledge or likely event sequences unless it is strongly entailed by the wording.
- Judge only from the provided text.

Be conservative. If there is reasonable uncertainty between ENTAILED and NOT_ENTAILED, choose UNCERTAIN.

Label mapping:
- ENTAILED → is_neg = true
- NOT_ENTAILED → is_neg = false
- UNCERTAIN → is_neg = false

Output valid JSON only in exactly this format: {"is_neg": true, "label": "ENTAILED", "reason": "one-sentence explanation"} or {"is_neg": false, "label": "N...
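The label mapping in the judge instructions above lends itself to a mechanical consistency check on each reply. The following is a minimal sketch, assuming the judge returns a single JSON object with the three fields named in the prompt. The helper name `parse_judge_output`, the `LABEL_TO_IS_NEG` table, and the error handling are hypothetical and are not taken from the benchmark's code.

```python
import json

# Label mapping as stated in the judge prompt: only ENTAILED counts as
# the negative (absent) action having been described.
LABEL_TO_IS_NEG = {"ENTAILED": True, "NOT_ENTAILED": False, "UNCERTAIN": False}


def parse_judge_output(raw: str) -> dict:
    """Parse one judge reply and check it against the label mapping.

    Hypothetical helper for illustration only; the name and the choice
    to raise ValueError on inconsistent replies are assumptions.
    """
    record = json.loads(raw)  # raises json.JSONDecodeError on non-JSON text
    label = record.get("label")
    if label not in LABEL_TO_IS_NEG:
        raise ValueError(f"unknown label: {label!r}")
    expected = LABEL_TO_IS_NEG[label]
    # Reject replies whose is_neg flag contradicts their own label.
    if record.get("is_neg") is not expected:
        raise ValueError(f"is_neg={record.get('is_neg')!r} contradicts label {label}")
    return {"is_neg": expected, "label": label, "reason": record.get("reason", "")}


if __name__ == "__main__":
    reply = '{"is_neg": false, "label": "UNCERTAIN", "reason": "Only the approach is described."}'
    print(parse_judge_output(reply))
```

Rejecting a reply whose `is_neg` flag disagrees with its label, rather than silently repairing it, is one possible design choice; it keeps malformed judge outputs visible instead of folding them into the scores.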