arxiv: 2605.22907 · v1 · pith:WLCEDYMSnew · submitted 2026-05-21 · 💻 cs.CV

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

Pith reviewed 2026-05-25 05:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · benchmark · multimodal large language models · continuous reasoning · ultra-long context · omni-modal understanding · fine-grained perception

The pith

Bottlenecks in current MLLMs for long videos extend beyond retrieval to continuous reasoning, fine-grained perception, and non-verbal omni-modal understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoOdyssey to test models on videos where questions demand watching long unbroken segments rather than isolated clips. It defines continuous certificate length as the key measure of required viewing time and builds two subsets covering visual-only and audio-visual cases across 11 domains with average video lengths of 109 minutes. Evaluations using five levels of certificate length from seconds to hours show that models fail at sustained integration of information, detailed observation, and handling non-verbal audio-visual cues. This setup aims to expose the real cognitive demands of real-world ultra-long video tasks. If accurate, it implies that simply extending context windows will not resolve the core limitations.

Core claim

The paper claims that VideoOdyssey, with its extreme durations and multi-level continuous certificates averaging 16 minutes for the visual subset and 12.8 minutes for the audio-visual subset, demonstrates that current MLLMs struggle with continuous reasoning across varying context lengths, fine-grained perception, and non-verbal omni-modal understanding rather than simple retrieval alone.

What carries the argument

Continuous certificate length: the video length a human must continuously watch to definitively answer a question. It structures question design, subset creation, and diagnostic levels so that models are forced into long temporal integration.

If this is right

  • Models must demonstrate integration of information over continuous spans of 12 to 16 minutes on average to succeed.
  • Performance varies systematically across five certificate levels, allowing targeted diagnosis of length-dependent failures.
  • Omni-modal models require explicit handling of synchronized non-verbal audio-visual cues beyond speech.
  • Progress depends on mechanisms for fine-grained perception sustained across long temporal contexts rather than retrieval alone.
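The per-level diagnosis mentioned in the list above can be sketched as a small aggregation. This is a minimal illustration, not the paper's evaluation code: the level cut points in `LEVEL_UPPER_BOUNDS_MIN` and the function names are assumptions, since the source only says the five levels run from seconds to hours.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical level boundaries in minutes. The source only states that the
# five levels run "from seconds to hours", so these cut points are assumptions.
LEVEL_UPPER_BOUNDS_MIN = [0.5, 5.0, 20.0, 60.0]


def certificate_level(cert_minutes: float) -> int:
    """Map a continuous certificate length (minutes) to a level in 1..5."""
    if cert_minutes < 0:
        raise ValueError("certificate length must be non-negative")
    # bisect_right counts how many boundaries are <= cert_minutes.
    return bisect_right(LEVEL_UPPER_BOUNDS_MIN, cert_minutes) + 1


def accuracy_by_level(records):
    """Compute accuracy per level from (cert_minutes, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cert_minutes, is_correct in records:
        level = certificate_level(cert_minutes)
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

A falling accuracy curve across levels would be consistent with a length-dependent failure, though it would not by itself rule out other explanations such as question difficulty varying by level.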

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The leveled structure could support staged training where models first master short certificates before scaling to hour-long ones.
  • It points to a need for architectures with explicit long-term memory retention separate from context window size.
  • Real-world applications such as video surveillance or lecture analysis may require similar certificate-based evaluation to ensure reliability.
  • Extending the benchmark to interactive or multi-agent video scenarios could test whether the identified bottlenecks persist under dynamic conditions.

Load-bearing premise

The questions are written so that correct answers require watching the full continuous certificate length without being solvable through partial viewing, question phrasing, or video selection biases.

What would settle it

A model that answers the benchmark questions correctly after processing only short isolated segments, or human annotators who answer correctly without watching the full certificate length, would show the metric does not capture the intended cognitive load.
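One way to picture the truncation test described above is the sketch below. It is an assumption-laden illustration: `truncated_window` and `truncation_gap` are hypothetical helpers, and keeping only the leading part of the window is just one of several possible truncation schemes.

```python
def truncated_window(start_s: float, end_s: float, keep_fraction: float):
    """Return the leading part of a certificate window, in seconds.

    Keeping the leading part is an assumed truncation scheme; a real study
    might also try trailing or randomly placed sub-windows.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if end_s <= start_s:
        raise ValueError("window must have positive length")
    return start_s, start_s + keep_fraction * (end_s - start_s)


def truncation_gap(full_correct, truncated_correct):
    """Accuracy on full windows minus accuracy on truncated windows.

    Both arguments are lists of booleans aligned by question.
    """
    if len(full_correct) != len(truncated_correct) or not full_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(full_correct)
    return sum(full_correct) / n - sum(truncated_correct) / n
```

Under these assumptions, a gap near zero would suggest that the questions can be answered without the full certificate, while a large positive gap would support the premise that the full span is needed.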

Figures

Figures reproduced from arXiv: 2605.22907 by Haichen He, Jiayi Zhou, Kaiyang Zhou, Sifeng Shang, Yihan Hu, Yuanhan Zhang.

Figure 1: Continuous certificate length across various video datasets. Based on this metric, we introduce VideoOdyssey, a pioneering benchmark specifically designed for ultra-long-context and omni-modal video understanding. VideoOdyssey features three key characteristics: 1) Extreme video duration and domain diversity: We collected 100 ultra-long videos from public platforms, spanning 11 domains and 54 fine-graine…
Figure 2: Examples from our benchmark. In VideoOdyssey-V, the model needs to consistently attend to detailed visual cues across an ultra-long time span, performing OCR-based counting tasks. In VideoOdyssey-AV, the model needs to build a continuous logical chain of events over this massive time span, leveraging audio-visual cues to infer character relationships. ultra-short windows. Furthermore, our analysis shows th…
Figure 3: Statistics of VideoOdyssey. (a) VideoOdyssey contains 11 domains and 54 subcategories. (b) VideoOdyssey-V contains 1618 QA pairs across 14 tasks to assess model capability in four dimensions. (c) VideoOdyssey-AV contains 1062 QA pairs across 18 tasks to assess model performance in four dimensions. (d) All videos exceed 60 minutes, with the longest over 4 hours. (e) VideoOdyssey-AV features three audio types…
Figure 4: Performance of MLLMs across five continuous certificate length levels on …
Figure 5: Impact of certificate window (CW) on selected models across different continuous certificate …
Figure 6: Impact of different inputs for selected models across three audio types on …
Figure 7: Illustrates the balanced answer distribution for both VideoOdyssey-V and VideoOdyssey-AV
Figure 8: Performance across different video domains on …
Figure 9: Performance across different video domains on …
Figure 10: Performance of MLLMs under different input modalities on …
Figure 11: Human performance across continuous certificate length levels
Figure 12: Case of localization error. * indicates using the certificate window, whereas no asterisk …
Figure 13: Cases of fine-grained perception error. * indicates using the certificate window, whereas …
Figure 14: Cases of long-context reasoning error. * indicates using the certificate window, whereas …
Figure 15: Case of cross-modal integration error. * indicates using the certificate window, whereas …
Figure 16: Case of non-verbal audio perception error. * indicates using the certificate window, …
Original abstract

Real-world long video understanding requires models to perform continuous tracking, information integration and memory retention over massive temporal spans within extreme video durations. Mastering this intense cognitive load constitutes the fundamental bottleneck in long video understanding. While existing benchmarks have driven progress by scaling up video duration, their evaluation tasks often require comprehending only short and isolated video segments, falling short of capturing the challenge of ultra-long-context reasoning. To measure this cognitive load, we emphasize continuous certificate length, defined as the video length a human must continuously watch to definitively answer a given question. Driven by this metric, we introduce VideoOdyssey, a benchmark specifically designed for ultra-long-context and omni-modal video understanding. VideoOdyssey is characterized by three key features: 1) Extreme video duration and diversity: spanning 11 domains and 54 subcategories with an average video duration of 109 minutes; 2) Comprehensive evaluation scenarios: offering two subsets to address different research focuses, i.e., VideoOdyssey-V for probing the limits of visual understanding in MLLMs, and VideoOdyssey-AV for evaluating synchronized audio-visual understanding for omni-modal models; 3) Ultra-long and multi-level continuous certificates: extending the average continuous certificate to 16 minutes for VideoOdyssey-V and 12.8 minutes for VideoOdyssey-AV. Crucially, we design 5 granular levels from seconds to hours, providing a comprehensive diagnostic tool to evaluate models across varying context lengths and cognitive loads. Extensive evaluations show that bottlenecks of current MLLMs extend beyond simple retrieval to include struggles with continuous reasoning across varying context lengths, fine-grained perception, and non-verbal omni-modal understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VideoOdyssey, a benchmark for ultra-long-context and omni-modal video understanding. It features videos averaging 109 minutes across 11 domains and 54 subcategories, two subsets (VideoOdyssey-V and VideoOdyssey-AV), and defines 'continuous certificate length' (avg. 16 min for V, 12.8 min for AV) with five granular levels from seconds to hours. Evaluations on MLLMs are reported to show that bottlenecks extend beyond retrieval to continuous reasoning across context lengths, fine-grained perception, and non-verbal omni-modal understanding.

Significance. If the continuous certificate length metric is validated and the evaluations are methodologically sound, the benchmark could provide a useful diagnostic for long-video MLLM limitations that existing shorter-segment benchmarks do not capture, potentially guiding targeted improvements in memory and integration capabilities.

major comments (2)
  1. [Abstract] Abstract (definition of continuous certificate length): the central claim that performance drops across context lengths isolate 'continuous reasoning' load depends on the untested premise that questions require the full specified duration; no ablation, human study, or truncation experiment is described showing that shorter segments render questions unanswerable.
  2. [Evaluation section] Evaluation/results section: the abstract states that 'extensive evaluations show' specific bottlenecks, yet the manuscript supplies no task examples, data statistics, question construction details, or validation procedures for the certificate lengths, preventing assessment of whether the empirical findings support the conclusions.
minor comments (1)
  1. [Abstract] The abstract lacks any numerical data statistics (e.g., total questions, distribution across levels) or example questions, which would improve clarity even if present in later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our methodology while committing to revisions that improve transparency without altering the core claims.

  1. Referee: [Abstract] Abstract (definition of continuous certificate length): the central claim that performance drops across context lengths isolate 'continuous reasoning' load depends on the untested premise that questions require the full specified duration; no ablation, human study, or truncation experiment is described showing that shorter segments render questions unanswerable.

    Authors: Continuous certificate length is established via a human annotation protocol in which annotators determine the shortest continuous video segment that allows definitive question resolution; this process directly encodes the requirement that shorter segments are insufficient. We acknowledge that the initial manuscript did not include explicit truncation ablations or additional human studies beyond the annotation itself. We will expand the methods section with a fuller description of the annotation guidelines and consider adding supporting truncation analyses where feasible. revision: partial

  2. Referee: [Evaluation section] Evaluation/results section: the abstract states that 'extensive evaluations show' specific bottlenecks, yet the manuscript supplies no task examples, data statistics, question construction details, or validation procedures for the certificate lengths, preventing assessment of whether the empirical findings support the conclusions.

    Authors: The manuscript contains sections on benchmark construction that report video statistics, domain coverage, and the multi-level certificate design, along with question generation guidelines. To address the concern about accessibility and completeness, we will insert concrete task examples, additional summary statistics, and an explicit subsection detailing the certificate-length validation steps directly into the evaluation section. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark metric is explicitly defined without reduction to inputs or self-citations

The paper is a benchmark introduction with no equations, derivations, fitted parameters, or predictions. The continuous certificate length is defined directly as 'the video length a human must continuously watch to definitively answer a given question' and used to curate questions; this is a design choice, not a self-referential loop or fitted input renamed as output. No self-citation chains or uniqueness theorems are invoked as load-bearing. The evaluation is purely empirical on the new dataset, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces a benchmark without mathematical derivations, fitted parameters, or new postulated entities; it rests on the domain assumption that video questions can be engineered to demand sustained temporal integration.

axioms (1)
  • domain assumption: Video understanding tasks can be constructed such that answering requires continuous attention to long temporal spans rather than isolated segments.
    This premise underpins the continuous certificate length metric and the claim that current benchmarks fall short.

pith-pipeline@v0.9.0 · 5858 in / 1278 out tokens · 26712 ms · 2026-05-25T05:50:10.362082+00:00 · methodology

Passages extracted from the paper

As illustrated in the figures, there is a clear negative correlation between the continuous certificate length and human accuracy. In VideoOdyssey-V, human performance starts at a high of 90.7% for extremely short evidence lengths (<0.5 mins) but drops strictly and significantly to 74.5% for evidence lengths exceeding 60 minutes. A similar declining trend...

When evaluating this model using the ground-truth certificate window, the sampling strategy is as follows:
  • Window length < 128 seconds: We densely extract frames at a rate of 1 fps to preserve maximum temporal granularity.
  • Window length ≥ 128 seconds: We uniformly sample 128 frames across the entire duration of the window to provide a comprehensive ove...
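The two-branch sampling rule quoted above can be sketched as a timestamp generator. This is a hedged illustration rather than the authors' implementation: the function name `sample_timestamps` is hypothetical, and placing each uniformly sampled frame at the midpoint of its bin is an assumption, since the source does not say where within the window the 128 frames fall.

```python
import math


def sample_timestamps(window_start_s: float, window_end_s: float,
                      max_frames: int = 128) -> list[float]:
    """Return frame timestamps (seconds) inside a certificate window.

    Windows shorter than max_frames seconds are sampled densely at 1 fps;
    longer windows get max_frames uniformly spaced frames.
    """
    length = window_end_s - window_start_s
    if length <= 0:
        raise ValueError("window must have positive length")
    if length < max_frames:
        # Dense branch: one frame per second, starting at the window start.
        return [window_start_s + float(i) for i in range(math.ceil(length))]
    # Uniform branch: split the window into max_frames equal bins and take
    # the midpoint of each bin (midpoint placement is an assumption).
    step = length / max_frames
    return [window_start_s + (i + 0.5) * step for i in range(max_frames)]
```

For a one-hour window this yields 128 frames spaced about 28 seconds apart, which makes concrete how sparse the visual evidence becomes at the longest certificate levels.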