MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Jie Tang; Lefan Wang; Shiyu Huang; Weihan Wang; Wenyi Hong; Xiaotao Gu; Yean Cheng; Yuxiao Dong; Zhuoyi Yang

arxiv: 2501.02955 · v2 · submitted 2025-01-06 · 💻 cs.CV

Pith reviewed 2026-05-23 05:42 UTC · model grok-4.3

keywords: video motion understanding · vision language models · fine-grained motion · benchmark · Through-Encoder Fusion · video comprehension · model evaluation

The pith

Vision language models perform poorly on fine-grained motion understanding in videos, with a new benchmark and fusion method showing paths to partial improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates MotionBench to measure how well vision language models grasp detailed movements in video clips. The benchmark uses six kinds of motion questions from many different video sources. Current models do not do well on these tests. The authors also test ways to pack more video information into the models using a Through-Encoder Fusion technique. Using more frames per video and this fusion approach leads to better results on motion tasks, but models are still far from perfect.

Core claim

The paper establishes MotionBench as a benchmark for fine-grained motion comprehension in VLMs using six primary categories of motion-oriented questions from diverse sources. It finds that existing VLMs perform poorly. It introduces the Through-Encoder (TE) Fusion method for efficient video feature compression and shows that higher frame rate inputs and TE Fusion improve motion understanding, though substantial room for enhancement exists.

What carries the argument

The MotionBench benchmark consisting of six motion-oriented question categories, along with the Through-Encoder (TE) Fusion method that enables better video feature compression for limited sequence lengths in language models.

If this is right

Higher frame rate video inputs improve fine-grained motion understanding in VLMs.
The TE Fusion method provides an efficient way to incorporate more frames without exceeding sequence limits.
Current improvements still leave substantial room for further advances in motion perception.
Video understanding models should prioritize motion-level perception in addition to other capabilities.
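
To make the sequence-limit trade-off concrete, here is a rough sketch of the token arithmetic. It assumes a simple form of temporal compression in which each group of adjacent frames is reduced to one frame's worth of tokens. The function names, the grouping rule, and every number used with them are illustrative assumptions, not figures or definitions from the paper.

```python
def visual_tokens(num_frames: int, tokens_per_frame: int, frames_per_group: int) -> int:
    """Visual tokens passed to the LLM under simple temporal grouping.

    Assumption (not from the paper): every group of `frames_per_group`
    adjacent frames is compressed to one frame's worth of tokens.
    """
    if min(num_frames, tokens_per_frame, frames_per_group) < 1:
        raise ValueError("all arguments must be positive integers")
    groups = -(-num_frames // frames_per_group)  # ceiling division
    return groups * tokens_per_frame


def max_frames(token_budget: int, tokens_per_frame: int, frames_per_group: int) -> int:
    """Largest frame count whose compressed tokens fit within `token_budget`."""
    if min(token_budget, tokens_per_frame, frames_per_group) < 1:
        raise ValueError("all arguments must be positive integers")
    return (token_budget // tokens_per_frame) * frames_per_group
```

With hypothetical values of 100 tokens per frame and a 1,600-token visual budget, `max_frames` gives 16 frames without grouping and 64 frames when four frames share one group, which is the sense in which compression buys a higher frame rate at a fixed sequence length.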

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on this benchmark might translate to better performance in downstream tasks like video action prediction.
Developers could explore combining TE Fusion with other compression techniques for even better results.
Expanding the benchmark to include more complex motion scenarios could reveal additional weaknesses.

Load-bearing premise

The motion-oriented questions and videos in the benchmark measure fine-grained motion comprehension separately from other skills like recognizing objects or understanding language.

What would settle it

If a vision language model achieves high accuracy on MotionBench without relying on increased frame rates or the TE Fusion method, it would challenge the claim that these are necessary for better motion understanding.

Figures

Figures reproduced from arXiv: 2501.02955 by Jie Tang, Lefan Wang, Shiyu Huang, Weihan Wang, Wenyi Hong, Xiaotao Gu, Yean Cheng, Yuxiao Dong, Zhuoyi Yang.

**Figure 3.** Basic statistics of MotionBench

**Figure 4.** Example of dynamic information annotation

**Figure 5.** Summarization of prevalent paradigms for video compression and our proposed Through-Encoder Fusion (TE Fusion). Here we

**Figure 6.** Model performance variation with respect to different compression ratios

**Figure 7.** The absolute number and the proportion of questions

Original abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MotionBench and TE Fusion flag a real gap in video VLM motion handling, but the benchmark's ability to isolate motion from static or language cues is untested in the reported experiments.

The paper's core offering is MotionBench, built around six motion-oriented question categories drawn from diverse video sources, plus the Through-Encoder Fusion approach for packing higher frame rates into limited LLM context. These are presented as distinct from prior video benchmarks and fusion techniques. The work shows existing VLMs score low on the new questions and that both higher frame rates and TE Fusion produce measurable lifts, while still leaving clear headroom for improvement. That combination is useful for anyone tracking where video VLMs fall short on temporal detail relevant to robotics or analysis tasks. The experimental framing is direct and the fusion method appears lightweight enough to be practical. The citation pattern follows standard VLM literature without obvious circularity.

The main softness is that the abstract and reported results give no sign of the controls needed to tie performance differences specifically to motion perception. Single-frame baselines, frame-order shuffles, or object-only variants are not mentioned, so it remains possible that models are succeeding or failing on appearance, object identity, or language priors instead. If those checks are absent from the full paper, the claimed motion deficit and the attributed gains from TE Fusion rest on an assumption that has not been stress-tested.

The paper is aimed at researchers who evaluate or extend video VLMs and who want a dedicated motion probe. Readers focused on benchmark construction will find the category breakdown worth examining, but anyone using the numbers for model comparison will need the missing ablations before treating the scores as motion-specific. It is worth sending to referees so they can verify the experimental details and any unreported controls; the topic is relevant enough that a cleaned-up version could usefully influence evaluation practice.

Referee Report

3 major / 2 minor

Summary. The paper introduces MotionBench, a benchmark with six primary categories of motion-oriented questions drawn from diverse real-world video sources, to evaluate fine-grained motion comprehension in VLMs. It reports that existing VLMs perform poorly on these tasks, proposes a Through-Encoder (TE) Fusion method for efficient video feature compression within LLM sequence limits, and shows that higher frame-rate inputs combined with TE Fusion yield improvements, while noting substantial remaining room for enhancement.

Significance. If the benchmark validly isolates motion-specific understanding, the work would highlight a targeted limitation in current video VLMs and demonstrate an efficient architectural change (TE Fusion) that improves motion perception; its emphasis on fine-grained motion could steer future model development.

major comments (3)

[Benchmark construction and evaluation sections] No controls (single-frame baselines, frame-shuffled ablations, or object-recognition-only variants) are reported to confirm that the six question categories require motion perception rather than static appearance, object recognition, or language priors. Without such ablations the central performance-gap and TE-Fusion-gain claims cannot be attributed specifically to motion understanding.
[Experimental results and setup] The manuscript provides no information on question validation procedures, inter-annotator agreement, baseline selection rationale, or statistical significance testing of the reported accuracy improvements. These omissions make the soundness of the performance claims difficult to assess.
[TE Fusion proposal (method and experiments sections)] While presented as novel, the description lacks sufficient implementation details, direct comparisons to alternative fusion or compression techniques, and targeted ablations isolating its benefit for motion features versus general video understanding.
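
The two input-level controls named in the first comment could be constructed along these lines. This is a minimal sketch, not anything taken from the manuscript: the function names are hypothetical, and a clip is assumed to be an ordered sequence of per-frame objects of any kind (arrays, file paths, and so on).

```python
import random
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def single_frame_control(frames: Sequence[Frame]) -> List[Frame]:
    """Keep only the middle frame, which removes all temporal information."""
    if not frames:
        raise ValueError("need at least one frame")
    return [frames[len(frames) // 2]]


def shuffled_control(frames: Sequence[Frame], seed: int = 0) -> List[Frame]:
    """Return the same frames in a seeded random order.

    Appearance is preserved while temporal order is destroyed. A fixed
    seed keeps the permutation reproducible across models.
    """
    rng = random.Random(seed)
    shuffled = list(frames)  # copy so the caller's clip is left untouched
    rng.shuffle(shuffled)
    return shuffled
```

If accuracy on a question category barely moves under either control, that category is probably answerable from appearance or language priors rather than from motion.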

minor comments (2)

[Abstract] Abstract contains minor phrasing issues (e.g., 'VLM's ability' should read 'VLMs' abilities'; 'reviewing VLM architectures' is unclear).
[Results tables] Tables reporting model accuracies should include standard deviations or confidence intervals and specify the number of videos/questions per category.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Benchmark construction and evaluation sections] No controls (single-frame baselines, frame-shuffled ablations, or object-recognition-only variants) are reported to confirm that the six question categories require motion perception rather than static appearance, object recognition, or language priors. Without such ablations the central performance-gap and TE-Fusion-gain claims cannot be attributed specifically to motion understanding.

Authors: We agree that the absence of these controls limits the ability to isolate motion-specific understanding. In the revised manuscript we will add single-frame baselines, frame-shuffled ablations, and object-recognition-only variants. These will be reported alongside the existing results to test whether the performance gaps and TE-Fusion gains are attributable to motion perception rather than static appearance or language priors. (revision: yes)

Referee: [Experimental results and setup] The manuscript provides no information on question validation procedures, inter-annotator agreement, baseline selection rationale, or statistical significance testing of the reported accuracy improvements. These omissions make the soundness of the performance claims difficult to assess.

Authors: We will expand the relevant sections to document the question validation procedures, report inter-annotator agreement, provide the rationale for baseline model selection, and include statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) for the accuracy improvements. (revision: yes)

Referee: [TE Fusion proposal (method and experiments sections)] While presented as novel, the description lacks sufficient implementation details, direct comparisons to alternative fusion or compression techniques, and targeted ablations isolating its benefit for motion features versus general video understanding.

Authors: We will augment the method section with additional implementation details (hyperparameters, exact fusion equations, and computational overhead). We will also add direct comparisons against alternative fusion/compression methods and targeted ablations that isolate the benefit of TE Fusion on motion features versus general video understanding. (revision: yes)
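
As an illustration of the bootstrap option mentioned in the second response, the following is a minimal sketch of a paired percentile bootstrap for the accuracy difference between two models scored on the same questions. The function name, defaults, and input format (one boolean per question, per model) are assumptions made for this example and do not describe the authors' procedure.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[bool],
    correct_b: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for accuracy(B) - accuracy(A) on shared questions."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-question difference: +1 if only B is right, -1 if only A is right.
    diffs = [int(b) - int(a) for a, b in zip(correct_a, correct_b)]
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

An interval that excludes zero suggests the accuracy gap is unlikely to be a sampling artifact of the question set. Resampling questions in pairs matters here because both models see the same items, so their errors are correlated.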

Circularity Check

0 steps flagged

No circularity; empirical benchmark and experiments are self-contained

The paper introduces MotionBench as a new evaluation set with six motion-oriented question categories and proposes TE Fusion as an architectural change, then reports direct experimental accuracies on VLMs. No equations, fitted parameters, or derivations appear in the provided text; performance numbers are obtained by running models on the collected videos rather than by any reduction to self-defined quantities or self-citation chains. The central claims therefore rest on external model behavior and data collection rather than on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark and architecture paper with no explicit free parameters, new axioms, or invented entities; relies on standard assumptions of VLM evaluation and video feature extraction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types... propose a novel and efficient Through-Encoder (TE) Fusion method."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "TE Fusion... applies deep fusion throughout the visual encoder... higher frame rate inputs and TE Fusion yield improvements"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PushupBench: Your VLM is not good at counting pushups
cs.CV · 2026-04 · unverdicted · novelty 7.0

VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
cs.CV · 2026-01 · unverdicted · novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

Kimi K2.5: Visual Agentic Intelligence
cs.CL · 2026-02 · unverdicted · novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

EasyVideoR1: Easier RL for Video Understanding
cs.CV · 2026-04 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Seed1.5-VL Technical Report
cs.CV · 2025-05 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 5 Pith papers · 17 internal anchors

[1]

Llama 3 model card

AI@Meta. Llama 3 model card. 2024. 7

work page 2024
[2]

Sportsslomo: A new bench- mark and baselines for human-centric video frame interpola- tion

Jiaben Chen and Huaizu Jiang. Sportsslomo: A new bench- mark and baselines for human-centric video frame interpola- tion. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 6475–6486,

work page
[3]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, pages 13320–13331,

work page
[4]

Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering

Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering. arXiv preprint arXiv:2311.14906, 2023. 3

work page arXiv 2023
[5]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jia- peng Luo, Zheng Ma, et al. Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,

work page
[7]

Towards event-oriented long video under- standing

Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen. Towards event-oriented long video under- standing. arXiv preprint arXiv:2406.14129, 2024. 2

work page arXiv 2024
[8]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 1, 2, 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Short film dataset (sfd): A benchmark for story- level video understanding.arXiv preprint arXiv:2406.10221,

Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, and Ivan Laptev. Short film dataset (sfd): A benchmark for story- level video understanding.arXiv preprint arXiv:2406.10221,

work page arXiv
[10]

Chatglm: A family of large language mod- els from glm-130b to glm-4 all tools, 2024

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

work page 2024
[11]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fr¨und, Peter Yianilos, Moritz Mueller- Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017. 12

work page 2017
[12]

Ra- makrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Mar- tin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ra- makrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier,...

work page 2022
[13]

A dataset for medical instructional video classification and question answering

Deepak Gupta, Kush Attal, and Dina Demner-Fushman. A dataset for medical instructional video classification and question answering. Scientific Data, 10(1):158, 2023. 2, 6

work page 2023
[14]

A dataset for medical instructional video classi- fication and question answering

Deepak Kumar Gupta, Kush Attal, and Dina Demner- Fushman. A dataset for medical instructional video classi- fication and question answering. Scientific Data, 10, 2022. 4

work page 2022
[15]

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Jun- hui Ji, Zhao Xue, et al. Cogvlm2: Visual language mod- els for image and video understanding. arXiv preprint arXiv:2408.16500, 2024. 1, 2, 3, 6, 7 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim

Y . Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim. Tgif-qa: Toward spatio-temporal reasoning in vi- sual question answering. In CVPR, 2017. 12

work page 2017
[17]

An image grid can be worth a video: Zero- shot video question answering using a vlm

Wonkyun Kim, Changin Choi, Wonseok Lee, and Won- jong Rhee. An image grid can be worth a video: Zero- shot video question answering using a vlm. arXiv preprint arXiv:2403.18406, 2024. 6

work page arXiv 2024
[18]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Aria: An open multimodal native mixture-of- experts model

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of- experts model. arXiv preprint arXiv:2410.05993, 2024. 1

work page arXiv 2024
[20]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 3, 6, 7

work page 2023
[21]

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Y . Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. ArXiv, abs/2211.09552, 2022. 12

work page arXiv 2022
[22]

VideoChat: Chat-Centric Video Understanding

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ArXiv, abs/2305.06355,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 1, 2, 3, 7

work page 2024
[24]

Videovista: A versatile bench- mark for video understanding and reasoning

Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, and Min Zhang. Videovista: A versatile bench- mark for video understanding and reasoning. arXiv preprint arXiv:2406.11303, 2024. 1

work page arXiv 2024
[25]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 3

work page 2024
[26]

Kangaroo: A powerful video-language model supporting long-context video input

Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xi- aoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542,

work page arXiv
[27]

St-llm: Large language models are effective tem- poral learners

Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective tem- poral learners. In European Conference on Computer Vision, pages 1–18. Springer, 2025. 1

work page 2025
[28]

TempCompass: Do Video LLMs Really Understand Videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Ji- wen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024. 1, 7

work page arXiv 2024
[30]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. 2024. 12

work page 2024
[31]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. In NeurIPS, 2023. 3

work page 2023
[32]

Yi-vl-34b, 2024

NousResearch. Yi-vl-34b, 2024. 7

work page 2024
[33]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 3

work page 2024
[37]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiao- han Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024. 1, 2, 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Internvideo2: Scaling video foundation mod- els for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation mod- els for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024. 3, 6

work page arXiv 2024
[41]

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context inter- leaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021. 12

work page 2021
[43]

Msr-vtt: A large video description dataset for bridging video and language

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In 10 Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016. 2

work page 2016
[44]

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024. 1, 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Just ask: Learning to answer questions from millions of narrated videos

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Just ask: Learning to answer questions from millions of narrated videos. In ICCV, 2021. 12

work page 2021
[46]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 7

[47]

Clevrer: Collision events for video representation and reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. In ICLR, 2020. 12

[48]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 2

[49]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 3

[50]

Llava-next: A strong zero-shot video understanding model

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 3, 6, 7

[51]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 1

[52]

Ha-vid: A human assembly video dataset for comprehensive assembly knowledge understanding

Hao Zheng, Regina Lee, and Yuqian Lu. Ha-vid: A human assembly video dataset for comprehensive assembly knowledge understanding, 2023. 4, 6

[53]

Ha-vid: a human assembly video dataset for comprehensive assembly knowledge understanding

Hao Zheng, Regina Lee, and Yuqian Lu. Ha-vid: a human assembly video dataset for comprehensive assembly knowledge understanding. Advances in Neural Information Processing Systems, 36, 2024. 2

[54]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024. 1, 3

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Supplementary Material

Training Details

Here we provide the detailed training hyperparameters for both TE Fusion in Tab. 3 and all ablated models in Tab. 4 and Fig. 6.

Configurations
Total steps: 10,000
Warmup steps: 1,000
Global batch size: 768
Learning rate: 8e-6
Minimal learning rate: 1e-6
Learning rate decay: cosine
Optimizer: Adam
Adam ϵ: 1e-8
Adam β1: 0.9
Adam β2: 0.95
Precision: bf...

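The schedule implied by the configuration above (1,000 warmup steps, then cosine decay from 8e-6 to a floor of 1e-6 over 10,000 total steps) can be sketched as follows. This is a minimal illustration, not the authors' training code: the linear shape of the warmup and the helper name `lr_at_step` are assumptions, since the source lists only the hyperparameter values.

```python
import math

# Values taken from the hyperparameter list above.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 1_000
PEAK_LR = 8e-6
MIN_LR = 1e-6


def lr_at_step(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumption: warmup is linear from 0 to PEAK_LR; the source states
    only the number of warmup steps, not their shape.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    # Cosine decay from PEAK_LR down to MIN_LR.
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at the end of warmup and reaches the minimal learning rate exactly at the final step.
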
Model Details

To maintain a fair comparison, all model architectures are ablated with the same backbone, GLM-4V, with its model configuration as follows. Let the temporal compression ratio be K. The specific features of each ablated architecture are:

TE-Fusion (ours): Before the visual encoder, we concatenate every K neighboring frames into one sequence and conduct self-attention across each group of K frames to fuse temporal features. After the visual encoder, the tokens of the K frames are concatenated along the hidden-size dimension, downsampled, and projected to the output dimension.

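The token flow described for TE-Fusion can be traced with simple shape bookkeeping. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name, the `hidden` and `out_dim` parameters, and the placement of the 2 × 2 spatial downsample after the hidden-size concatenation are hypothetical. The defaults for input resolution (224), patch size (14), and spatial downsample (2 × 2) are the values listed in the model configuration.

```python
def te_fusion_shapes(num_frames, k, hidden, out_dim,
                     resolution=224, patch=14, spatial_down=2):
    """Return (groups, tokens, channels) shapes per stage (hypothetical helper)."""
    if num_frames % k != 0:
        raise ValueError("num_frames must be divisible by the compression ratio k")
    groups = num_frames // k              # one group per K neighboring frames
    tokens = (resolution // patch) ** 2   # patch tokens per frame
    return {
        # K frames form one sequence, so self-attention spans all K frames.
        "encoder_sequence": (groups, k * tokens, hidden),
        # After the encoder, tokens of the K frames are stacked along hidden size.
        "hidden_concat": (groups, tokens, k * hidden),
        # Assumed 2 x 2 spatial downsample, then projection to the output dimension.
        "output": (groups, tokens // spatial_down ** 2, out_dim),
    }
```

For example, `te_fusion_shapes(16, 4, hidden=8, out_dim=16)` gives 4 groups whose encoder sequences hold 1,024 tokens each, and 64 output tokens per group under the assumed downsample.
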
Qwen2-VL: The neighboring K frames are concatenated along the channel dimension and patchified into one fea-

The model configurations of all ablated architectures:

VLM decoder
Layers: 40
Hidden size: 4096
Attention heads: 32
Num query groups: 2
FFN hidden size: 13696
Sequence len: 4096
Position embedding: RoPE
Normalization: RMSNorm

Visual encoder
Input resolution: 224
Patch size: 14
Post spatial downsample: 2 × 2
Layer...

Kangaroo: This approach is the most similar to TE Fusion, except that every frame is computed independently within the visual encoder, and the frame features are concatenated along the hidden-size dimension to perform temporal downsampling (with an MLP layer).

QFormer: After going through the visual encoder, the video feature is passed through a QFormer (learned from scratch). The features of every K frames are combined into a sequence to fuse temporal information within the QFormer. From the experiments, we found that, though lightweight, the QFormer is hard to optimize and model temporal relationships durin...

PLLaVA: This approach is similar to Kangaroo. Instead of fusing with the MLP layer, PLLaVA adopts a simple adaptive pooling. To avoid possible information loss, we conduct the pooling operation after the spatial downsample module.

The pseudo-code below further illustrates all ablated architectures.

def forward():
    ''' The pseudo-code of the forwar...

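The temporal pooling used in the PLLaVA-style baseline can be illustrated with a small pure-Python sketch. It assumes average pooling and a frame count divisible by K, in which case adaptive average pooling along time reduces to averaging each group of K neighboring frames. The function name and the list-of-lists feature layout are hypothetical and chosen only for illustration.

```python
def temporal_avg_pool(frames, k):
    """Average every K neighboring frame feature vectors (hypothetical helper).

    frames: list of T feature vectors, each a list of floats of equal length.
    Returns T // k pooled vectors.
    """
    if len(frames) % k != 0:
        raise ValueError("number of frames must be divisible by k")
    pooled = []
    for start in range(0, len(frames), k):
        group = frames[start:start + k]
        # Element-wise mean over the K frames in this group.
        pooled.append([sum(values) / k for values in zip(*group)])
    return pooled
```

Unlike the MLP-based fusion in the Kangaroo-style variant, this operation has no learned parameters, which matches the description of it as a simple pooling step.
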
QA Construction Process for Videos with Intricate Interactions

Here we illustrate the QA generation process corresponding to Fig. 4.

9.1. Step 1: Video caption annotation

For videos with intricate interactions, it is impractical to directly annotate the whole video clip, since the total complexity and quantity of the motions are too large. Therefor...

Each question should have 4 options.

For each question, combine one dimension from the Content Dimension and one from the Question Logic Dimension. It may draw from multiple highly related content dimensions.

Focus only on representative and prominent events or actions to keep options clear and unique without being overly detailed or tricky. Select the most fitting dimension combination for each video and avoid repeated combinations where possible.

Given possible ambiguities in some descriptions, ensure the answer is unique and clear to avoid deductions.
• Ambiguity Example 1: Temporal ambiguity. If a description reads, “On the left, a woman in a khaki suit faces right, nodding her head while speaking. In the middle, a group faces the camera, and a man in a white shirt pulls a chair leftward to sit,...

Choose only prominent events or actions, avoiding minor or indeterminate details. Ensure each answer is unique and clear.
• Minor Example: If “slightly bent elbow” isn’t mentioned, it does not necessarily mean it did not happen; if the video says “the mouth moved slightly a few times,” it cannot be determined the interval and number of these movemen...

Pretend you’re viewing the video, avoiding terms like “based on the description” or expressions related to the description text, including questions, options, and explanations.

Aim for at least 4 questions to focus beyond appearance.

Keep questions to around six, focusing only on representative events or actions and ensuring options are clear, unique, and straightforward.

Questions should focus on dynamic actions only. The “first frame description” is supplementary and should not guide question design.

The video dynamic information description does not contain causal or other logical relationships, therefore, do not involve logical relationships in the title.

Categorization System

Content Dimension

Below is the Content Dimension in the video classification system:

Human Dynamics:
1.1. Detailed actions of individuals
1.2. Interaction among multiple people
1.3. Emotional states and their changes
1.4. Position and its changes (Location, Angle, etc.)

Object Dynamics:
2.1. Movement trajectory
2.2. State changes

Animal Dynamics:
3.1. Detailed actions
3.2. Position and its changes (Location, Angle, etc.)

Camera Movement:
4.1. Camera movement

Appearance Characteristics:
5.1. individuals
5.2. objects
5.3. environment

Question Logic Dimension

Below is the Question Logic Dimension in the video classification system:

Whether a movement occurs
Sequence between multiple movements
Appearance description and judgment

Response Format

Return only a Python list, where each element is a dictionary representing a question. Ensure it can be parsed by json.loads() without returning anything outside the list.

9.3. VLM Filtering

To avoid overly simple QAs that do not utilize motion comprehension capability, we use various image VLMs to pre...

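The response format required by the prompt (a list of question dictionaries that must survive `json.loads()`, with 4 options per question) can be checked mechanically. The sketch below is an illustration only: the helper name `parse_questions` and the key name `options` are assumptions, since the source does not show the dictionary schema.

```python
import json


def parse_questions(raw: str, num_options: int = 4):
    """Parse and validate a model response (hypothetical helper and schema)."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("response must be a list")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("each element must be a dictionary")
        # Assumed key name; the source does not specify the schema.
        options = item.get("options")
        if not isinstance(options, list) or len(options) != num_options:
            raise ValueError(f"each question needs {num_options} options")
    return data
```

A check of this kind rejects responses that wrap the list in extra prose or that supply the wrong number of options, before any downstream filtering is applied.
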
