arxiv: 2501.02955 · v2 · submitted 2025-01-06 · 💻 cs.CV

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Pith reviewed 2026-05-23 05:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords video motion understanding · vision language models · fine-grained motion · benchmark · Through-Encoder Fusion · video comprehension · model evaluation

The pith

Vision language models perform poorly on fine-grained motion understanding in videos, with a new benchmark and fusion method showing paths to partial improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates MotionBench to measure how well vision language models grasp detailed movements in video clips. The benchmark uses six kinds of motion questions from many different video sources. Current models do not do well on these tests. The authors also test ways to pack more video information into the models using a Through-Encoder Fusion technique. Using more frames per video and this fusion approach leads to better results on motion tasks, but models are still far from perfect.

Core claim

The paper establishes MotionBench as a benchmark for fine-grained motion comprehension in VLMs using six primary categories of motion-oriented questions from diverse sources. It finds that existing VLMs perform poorly. It introduces the Through-Encoder (TE) Fusion method for efficient video feature compression and shows that higher frame rate inputs and TE Fusion improve motion understanding, though substantial room for enhancement exists.

What carries the argument

The MotionBench benchmark consisting of six motion-oriented question categories, along with the Through-Encoder (TE) Fusion method that enables better video feature compression for limited sequence lengths in language models.

If this is right

  • Higher frame rate video inputs improve fine-grained motion understanding in VLMs.
  • The TE Fusion method provides an efficient way to incorporate more frames without exceeding sequence limits.
  • Current improvements still leave substantial room for further advances in motion perception.
  • Video understanding models should prioritize motion-level perception in addition to other capabilities.
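
The sequence-limit point can be made concrete with a little arithmetic. The sketch below is a hedged illustration, not the authors' code. It uses the configuration values quoted in the supplementary excerpt on this page (input resolution 224, patch size 14, 2 × 2 post spatial downsample, sequence length 4096) and assumes that every group of K frames is compressed to the token count of a single frame. The `text_budget` of 256 tokens and both function names are hypothetical.

```python
def tokens_per_frame(resolution: int, patch_size: int, spatial_downsample: int) -> int:
    """Visual tokens produced for one frame after patchifying and spatial downsampling."""
    patches_per_side = resolution // patch_size
    tokens_per_side = patches_per_side // spatial_downsample
    return tokens_per_side * tokens_per_side


def max_frames(sequence_len: int, text_budget: int, per_frame: int, k: int) -> int:
    """Frames that fit when each group of k frames costs one frame's worth of tokens."""
    visual_budget = sequence_len - text_budget  # tokens left for video (text_budget is assumed)
    groups = visual_budget // per_frame
    return groups * k


per_frame = tokens_per_frame(resolution=224, patch_size=14, spatial_downsample=2)
for k in (1, 2, 4, 8):
    frames = max_frames(sequence_len=4096, text_budget=256, per_frame=per_frame, k=k)
    print(f"K={k}: {per_frame} tokens per group, up to {frames} frames")
```

Under these assumptions the frame budget grows linearly with K, which is why the compression ratio and the input frame rate are studied together.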

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on this benchmark might translate to better performance in downstream tasks like video action prediction.
  • Developers could explore combining TE Fusion with other compression techniques for even better results.
  • Expanding the benchmark to include more complex motion scenarios could reveal additional weaknesses.

Load-bearing premise

The motion-oriented questions and videos in the benchmark measure fine-grained motion comprehension separately from other skills like recognizing objects or understanding language.

What would settle it

If a vision language model achieves high accuracy on MotionBench without relying on increased frame rates or the TE Fusion method, it would challenge the claim that these are necessary for better motion understanding.

Figures

Figures reproduced from arXiv: 2501.02955 by Jie Tang, Lefan Wang, Shiyu Huang, Weihan Wang, Wenyi Hong, Xiaotao Gu, Yean Cheng, Yuxiao Dong, Zhuoyi Yang.

Figure 1: State-of-the-art video understanding models strug...
Figure 3: Basic statistics of MotionBench
Figure 4: Example of dynamic information annotation
Figure 5: Summarization of prevalent paradigms for video compression and our proposed Through-Encoder Fusion (TE Fusion). Here we...
Figure 6: Model performance variation with respect to different compression ratios
Figure 7: The absolute number and the proportion of questions
Original abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MotionBench, a benchmark with six primary categories of motion-oriented questions drawn from diverse real-world video sources, to evaluate fine-grained motion comprehension in VLMs. It reports that existing VLMs perform poorly on these tasks, proposes a Through-Encoder (TE) Fusion method for efficient video feature compression within LLM sequence limits, and shows that higher frame-rate inputs combined with TE Fusion yield improvements, while noting substantial remaining room for enhancement.

Significance. If the benchmark validly isolates motion-specific understanding, the work would usefully highlight a targeted limitation in current video VLMs and demonstrate an efficient architectural tweak (TE Fusion) that improves motion perception; the emphasis on fine-grained motion could usefully steer future model development.

major comments (3)
  1. [Benchmark construction and evaluation sections] No controls (single-frame baselines, frame-shuffled ablations, or object-recognition-only variants) are reported to confirm that the six question categories require motion perception rather than static appearance, object recognition, or language priors. Without such ablations the central performance-gap and TE-Fusion-gain claims cannot be attributed specifically to motion understanding.
  2. [Experimental results and setup] The manuscript provides no information on question validation procedures, inter-annotator agreement, baseline selection rationale, or statistical significance testing of the reported accuracy improvements. These omissions make the soundness of the performance claims difficult to assess.
  3. [TE Fusion proposal (method and experiments sections)] While presented as novel, the description lacks sufficient implementation details, direct comparisons to alternative fusion or compression techniques, and targeted ablations isolating its benefit for motion features versus general video understanding.
minor comments (2)
  1. [Abstract] Abstract contains minor phrasing issues (e.g., 'VLM's ability' should read 'VLMs' abilities'; 'reviewing VLM architectures' is unclear).
  2. [Results tables] Tables reporting model accuracies should include standard deviations or confidence intervals and specify the number of videos/questions per category.
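
The controls asked for in major comment 1 could be wired up roughly as follows. This is a minimal sketch under stated assumptions, not the benchmark's evaluation code: frames are stand-in strings, `toy_model` is a hypothetical model that only answers correctly when it sees ordered frames, and `category_a` is a placeholder category name. The four answer options follow the supplementary note that each question should have 4 options.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Frames = List[str]  # stand-in for decoded frames; real code would hold image arrays
Model = Callable[[Frames, str, Sequence[str]], int]  # returns the index of the chosen option


def single_frame(frames: Frames) -> Frames:
    """Keep only the middle frame, removing all temporal information."""
    return [frames[len(frames) // 2]]


def shuffled(frames: Frames, seed: int = 0) -> Frames:
    """Return the same frames in a random order, destroying temporal ordering."""
    out = list(frames)
    random.Random(seed).shuffle(out)
    return out


def accuracy_by_category(model: Model, items: List[dict],
                         transform: Callable[[Frames], Frames] = lambda f: f) -> Dict[str, float]:
    """Per-category multiple-choice accuracy after applying `transform` to each clip."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model(transform(item["frames"]), item["question"], item["options"])
        hits[item["category"]] += int(pred == item["answer"])
        totals[item["category"]] += 1
    return {category: hits[category] / totals[category] for category in totals}


def toy_model(frames: Frames, question: str, options: Sequence[str]) -> int:
    """Hypothetical stand-in: picks option 0 only when it sees frames in ascending order."""
    ordered = len(frames) > 1 and frames == sorted(frames)
    return 0 if ordered else 1


items = [{
    "frames": ["f0", "f1", "f2", "f3"],
    "question": "Which way does the hand move?",
    "options": ["left to right", "right to left", "up", "down"],
    "answer": 0,
    "category": "category_a",  # placeholder; the six real category names are not listed here
}]

print("full clip:     ", accuracy_by_category(toy_model, items))
print("single frame:  ", accuracy_by_category(toy_model, items, single_frame))
print("shuffled clip: ", accuracy_by_category(toy_model, items, shuffled))
```

If a real model scores about the same on the single-frame or shuffled variants as on the full clips, the questions are probably answerable from appearance or language priors, which is the referee's concern.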

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

  1. Referee: [Benchmark construction and evaluation sections] No controls (single-frame baselines, frame-shuffled ablations, or object-recognition-only variants) are reported to confirm that the six question categories require motion perception rather than static appearance, object recognition, or language priors. Without such ablations the central performance-gap and TE-Fusion-gain claims cannot be attributed specifically to motion understanding.

    Authors: We agree that the absence of these controls limits the ability to isolate motion-specific understanding. In the revised manuscript we will add single-frame baselines, frame-shuffled ablations, and object-recognition-only variants. These will be reported alongside the existing results to demonstrate that the performance gaps and TE-Fusion gains are attributable to motion perception rather than static appearance or language priors. revision: yes

  2. Referee: [Experimental results and setup] The manuscript provides no information on question validation procedures, inter-annotator agreement, baseline selection rationale, or statistical significance testing of the reported accuracy improvements. These omissions make the soundness of the performance claims difficult to assess.

    Authors: We will expand the relevant sections to document the question validation procedures, report inter-annotator agreement, provide the rationale for baseline model selection, and include statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) for the accuracy improvements. revision: yes

  3. Referee: [TE Fusion proposal (method and experiments sections)] While presented as novel, the description lacks sufficient implementation details, direct comparisons to alternative fusion or compression techniques, and targeted ablations isolating its benefit for motion features versus general video understanding.

    Authors: We will augment the method section with additional implementation details (hyperparameters, exact fusion equations, and computational overhead). We will also add direct comparisons against alternative fusion/compression methods and targeted ablations that isolate the benefit of TE Fusion on motion features versus general video understanding. revision: yes
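
For the bootstrap confidence intervals mentioned in the second response, one common recipe is a paired percentile bootstrap over questions. The sketch below is an illustration under assumptions, not a procedure taken from the paper: the function name, the resample count, and the toy correctness vectors are all hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(correct_a: List[int], correct_b: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed accuracy gain of B over A, CI lower bound, CI upper bound)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same questions")
    n = len(correct_a)
    rng = random.Random(seed)
    # Per-question difference keeps the pairing: +1 if only B is right, -1 if only A is.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample questions with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Toy data: 1 = question answered correctly, 0 = answered incorrectly.
baseline = [1] * 20 + [0] * 20
with_fusion = [1] * 18 + [0] * 2 + [1] * 10 + [0] * 10

gain, lower, upper = paired_bootstrap_ci(baseline, with_fusion)
print(f"accuracy gain = {gain:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support the claim that the gain is not a resampling artifact. Resampling whole questions, rather than the two accuracy figures separately, is what makes the comparison paired.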

Circularity Check

0 steps flagged

No circularity; empirical benchmark and experiments are self-contained

The paper introduces MotionBench as a new evaluation set with six motion-oriented question categories and proposes TE Fusion as an architectural change, then reports direct experimental accuracies on VLMs. No equations, fitted parameters, or derivations appear in the provided text; performance numbers are obtained by running models on the collected videos rather than by any reduction to self-defined quantities or self-citation chains. The central claims therefore rest on external model behavior and data collection rather than on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark and architecture paper with no explicit free parameters, new axioms, or invented entities; relies on standard assumptions of VLM evaluation and video feature extraction.

pith-pipeline@v0.9.0 · 5755 in / 1028 out tokens · 42009 ms · 2026-05-23T05:42:21.900696+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PushupBench: Your VLM is not good at counting pushups

    cs.CV 2026-04 unverdicted novelty 7.0

    VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.

  2. CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

  3. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  4. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  5. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 5 Pith papers · 17 internal anchors

  54. [55]

    1, 3 11 MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Supplementary Material

  55. [56]

    3 and all ablated models in Tab

    Training Details Here we provide the detailed training hyperparameters for both TE Fusion in Tab. 3 and all ablated models in Tab. 4 and Fig. 6. Configurations Total steps 10,000 Warmup steps 1,000 Global batch size 768 Learning rate 8e-6 Minimal learning rate 1e-6 Learning rate decay cosine Optimizer Adam Adam ϵ 1e-8 Adam β1 0.9 Adam β2 0.95 Precision bf...

  56. [57]

    Model Details To maintain a fair comparison, all model architectures are ablated with the same backbone, GLM-4V , with its model configuration as follows: Assume the temporal compression ratio be K, The spe- cific feature of each ablated architecture is:

  57. [58]

    After the visual encoder, the tokens of K frames are concatenated along the hidden-size dimen- sion, downsampled and projected to the output dimen- sion

    TE-Fusion (ours): Before the visual encoder, we con- catenate every neighboring K frames into one sequence, and conduct self-attention across each K frames to fuse temporal feature. After the visual encoder, the tokens of K frames are concatenated along the hidden-size dimen- sion, downsampled and projected to the output dimen- sion

  58. [59]

    The model configurations of all ablated architectures

    Qwen2-VL: The neighboring K frames are concatenated along the channel dimension and patchified into one fea-
    VLM decoder:
    Layers: 40
    Hidden size: 4096
    Attention heads: 32
    Num query groups: 2
    FFN hidden size: 13696
    Sequence len: 4096
    Position embedding: RoPE
    Normalization: RMSNorm
    Visual encoder:
    Input resolution: 224
    Patch size: 14
    Post spatial downsample: 2 × 2
    Layer...
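
As a rough back-of-the-envelope check of why temporal compression matters, the configuration values above give the per-frame visual token count and the number of frames that fit into the decoder context. The sketch assumes no extra class token, that each group of K frames is compressed to one frame's worth of tokens, and an optional hypothetical `text_tokens` budget; none of these details are stated in the source, and the function names are invented here.

```python
# Values taken from the model configuration listed above.
INPUT_RESOLUTION = 224
PATCH_SIZE = 14
SPATIAL_DOWNSAMPLE = 2   # 2 x 2 post spatial downsample
SEQUENCE_LEN = 4096


def tokens_per_frame() -> int:
    """Visual tokens for one frame after patchification and downsampling."""
    patches_per_side = INPUT_RESOLUTION // PATCH_SIZE   # 224 / 14 = 16
    side = patches_per_side // SPATIAL_DOWNSAMPLE       # 16 / 2 = 8
    return side * side                                  # 8 * 8 = 64


def max_frames(k: int, text_tokens: int = 0) -> int:
    """Frames that fit in the context when K frames share one token group.

    Assumption: a group of K frames costs the same as one uncompressed frame.
    """
    visual_budget = SEQUENCE_LEN - text_tokens
    groups = visual_budget // tokens_per_frame()
    return groups * k


if __name__ == "__main__":
    print(tokens_per_frame())          # tokens per frame
    print(max_frames(1), max_frames(4))
```

Under these assumptions, a compression ratio of K multiplies the number of input frames that fit in a fixed sequence length by K.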

  59. [60]

    Kangaroo: This approach is the most similar to TE Fusion, except that every frame is encoded independently within the visual encoder; the frame features are then concatenated along the hidden-size dimension to perform temporal downsampling (with an MLP layer).
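
The difference from TE Fusion can be made concrete with a toy example. The sketch below assumes scalar tokens and a hypothetical `toy_encoder` that adds the sequence mean to every token as a crude stand-in for self-attention; all function names are illustrative, the frame count is assumed divisible by K, and the MLP downsample is omitted.

```python
from typing import List


def toy_encoder(sequence: List[float]) -> List[float]:
    """Hypothetical stand-in for self-attention: each token is shifted by
    the mean of whatever sequence it is encoded with."""
    mean = sum(sequence) / len(sequence)
    return [x + mean for x in sequence]


def encode_independently(frames: List[List[float]], k: int) -> List[List[List[float]]]:
    """Kangaroo-style: each frame is encoded alone, then every K frames are
    stacked per token position (concatenation along the hidden dimension)."""
    encoded = [toy_encoder(frame) for frame in frames]
    return [
        [list(token) for token in zip(*encoded[start:start + k])]
        for start in range(0, len(encoded), k)
    ]


def encode_jointly(frames: List[List[float]], k: int) -> List[List[List[float]]]:
    """TE-Fusion-style: every K frames pass through the encoder as one
    sequence, then are stacked per token position in the same way."""
    fused = []
    for start in range(0, len(frames), k):
        group = frames[start:start + k]
        n = len(group[0])
        encoded = toy_encoder([x for frame in group for x in frame])
        per_frame = [encoded[i * n:(i + 1) * n] for i in range(k)]
        fused.append([list(token) for token in zip(*per_frame)])
    return fused


if __name__ == "__main__":
    frames = [[1.0, 3.0], [10.0, 30.0]]
    print(encode_independently(frames, k=2))
    print(encode_jointly(frames, k=2))
```

In the independent variant, the first frame's features do not change when the second frame changes; in the joint variant they do, which is the cross-frame interaction the description attributes to TE Fusion.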

  60. [61]

    The features of every K frames are combined into a sequence to fuse temporal information within the QFormer

    QFormer: After going through the visual encoder, the video feature is passed through a QFormer (learned from scratch). The features of every K frames are combined into a sequence to fuse temporal information within the QFormer. From the experiment, we found that the QFormer, although lightweight, is hard to optimize and has difficulty modeling temporal relationships durin...

  61. [62]

    PLLaVA

    PLLaVA: This approach is similar to Kangaroo. Instead of fusion with the MLP layer, PLLaVA adopts a simple adaptive pooling. To avoid possible information loss, we conduct the pooling operation after the spatial downsample module.
    The pseudo-code below further illustrates all ablated architectures.
    def forward(): ’’’ The pseudo-code of the forwar...

  62. [63]

    QA Construction Process for Videos with Intricate Interactions
    Here we illustrate the QA generation process corresponding to Fig. 4.
    9.1. Step 1: Video caption annotation
    For videos with intricate interactions, it is impractical to directly annotate the whole video clip, since the total complexity and quantity of the motions are too large. Therefor...

  63. [64]

    Each question should have 4 options.

  64. [65]

    It may draw from multiple highly related content dimensions

    For each question, combine one dimension from the Content Dimension and one from the Question Logic Dimension. It may draw from multiple highly related content dimensions

  65. [66]

    Select the most fitting dimension combination for each video and avoid repeated combinations where possible

    Focus only on representative and prominent events or actions to keep options clear and unique without being overly detailed or tricky. Select the most fitting dimension combination for each video and avoid repeated combinations where possible

  66. [67]

    The worker holds a long, thin tool,

    Given possible ambiguities in some descriptions, ensure the answer is unique and clear to avoid deductions. • Ambiguity Example 1: Temporal ambiguity. If a description reads, “On the left, a woman in a khaki suit faces right, nodding her head while speaking. In the middle, a group faces the camera, and a man in a white shirt pulls a chair leftward to sit,...

  67. [68]

    slightly bent elbow

    Choose only prominent events or actions, avoiding minor or indeterminate details. Ensure each answer is unique and clear. • Minor Example: If “slightly bent elbow” isn’t mentioned, it does not necessarily mean it did not happen; if the video says “the mouth moved slightly a few times,” it cannot be determined the interval and number of these movemen...

  68. [69]

    based on the description

    Pretend you’re viewing the video, avoiding terms like “based on the description” or expressions related to the description text, including questions, options, and explanations

  69. [70]

    Aim for at least 4 questions to focus beyond appearance

  70. [71]

    Keep questions to around six, focusing only on representative events or actions and ensuring options are clear, unique, and straightforward

  71. [72]

    first frame description

    Questions should focus on dynamic actions only. The “first frame description” is supplementary and should not guide question design

  72. [73]

    Categorization System Content Dimension Below is the Content Dimension in the video classification system:

    The video dynamic information description does not contain causal or other logical relationships, therefore, do not involve logical relationships in the title. Categorization System Content Dimension Below is the Content Dimension in the video classification system:

  73. [74]

    Detailed actions of individuals 1.2

    Human Dynamics: 1.1. Detailed actions of individuals 1.2. Interaction among multiple people 1.3. Emotional states and their changes 1.4. Position and its changes (Location, Angle, etc.)

  74. [75]

    Movement trajectory 2.2

    Object Dynamics: 2.1. Movement trajectory 2.2. State changes

  75. [76]

    Detailed actions 3.2

    Animal Dynamics: 3.1. Detailed actions 3.2. Position and its changes (Location, Angle, etc.)

  76. [77]

    Camera movement

    Camera Movement: 4.1. Camera movement

  77. [78]

    individuals 5.2

    Appearance Characteristics: 5.1. individuals 5.2. objects 5.3. environment Question Logic Dimension Below is the Question Logic Dimension in the video classification system:

  78. [79]

    Whether a movement occurs

  79. [80]

    Sequence between multiple movements

  80. [81]

    Ensure it can be parsed by json.loads() without returning anything outside the list

    Appearance description and judgment
    Response Format
    Return only a Python list, where each element is a dictionary representing a question. Ensure it can be parsed by json.loads() without returning anything outside the list.
    9.3. VLM Filtering
    To avoid overly simple QAs that do not utilize motion comprehension capability, we use various image VLMs to pre...
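
A response that follows the stated format could be checked along the lines below. This is only a sketch: the source specifies a list of dictionaries parseable by `json.loads()` and four options per question, but it does not give the dictionary schema, so the `"options"` key and the function name `parse_questions` are hypothetical.

```python
import json
from typing import Any, Dict, List


def parse_questions(raw: str, num_options: int = 4) -> List[Dict[str, Any]]:
    """Parse a model response and check the structural rules of the prompt.

    Raises ValueError (json.JSONDecodeError is a subclass) on any violation.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("top-level value must be a list")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"element {index} is not a dictionary")
        # Hypothetical key: the source does not name the fields.
        options = item.get("options")
        if not isinstance(options, list) or len(options) != num_options:
            raise ValueError(f"element {index} must have {num_options} options")
    return data


if __name__ == "__main__":
    sample = '[{"question": "What does the person do?", "options": ["A", "B", "C", "D"]}]'
    print(len(parse_questions(sample)))
```

Any text outside the list, such as a leading explanation, makes `json.loads()` fail, which is why the prompt asks for the list alone.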

Showing first 80 references.