MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Pith reviewed 2026-05-23 05:42 UTC · model grok-4.3
The pith
Vision language models perform poorly on fine-grained motion understanding in videos, with a new benchmark and fusion method showing paths to partial improvement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes MotionBench as a benchmark for fine-grained motion comprehension in VLMs using six primary categories of motion-oriented questions from diverse sources. It finds that existing VLMs perform poorly. It introduces the Through-Encoder (TE) Fusion method for efficient video feature compression and shows that higher frame rate inputs and TE Fusion improve motion understanding, though substantial room for enhancement exists.
What carries the argument
The MotionBench benchmark consisting of six motion-oriented question categories, along with the Through-Encoder (TE) Fusion method that enables better video feature compression for limited sequence lengths in language models.
If this is right
- Higher frame rate video inputs improve fine-grained motion understanding in VLMs.
- The TE Fusion method provides an efficient way to incorporate more frames without exceeding sequence limits.
- Current improvements still leave substantial room for further advances in motion perception.
- Video understanding models should prioritize motion-level perception in addition to other capabilities.
Where Pith is reading between the lines
- Success on this benchmark might translate to better performance in downstream tasks like video action prediction.
- Developers could explore combining TE Fusion with other compression techniques for even better results.
- Expanding the benchmark to include more complex motion scenarios could reveal additional weaknesses.
Load-bearing premise
The motion-oriented questions and videos in the benchmark measure fine-grained motion comprehension separately from other skills like recognizing objects or understanding language.
What would settle it
If a vision language model achieves high accuracy on MotionBench without relying on increased frame rates or the TE Fusion method, it would challenge the claim that these are necessary for better motion understanding.
Original abstract
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MotionBench, a benchmark with six primary categories of motion-oriented questions drawn from diverse real-world video sources, to evaluate fine-grained motion comprehension in VLMs. It reports that existing VLMs perform poorly on these tasks, proposes a Through-Encoder (TE) Fusion method for efficient video feature compression within LLM sequence limits, and shows that higher frame-rate inputs combined with TE Fusion yield improvements, while noting substantial remaining room for enhancement.
Significance. If the benchmark validly isolates motion-specific understanding, the work would usefully highlight a targeted limitation in current video VLMs and demonstrate an efficient architectural tweak (TE Fusion) that improves motion perception; the emphasis on fine-grained motion could usefully steer future model development.
Major comments (3)
- [Benchmark construction and evaluation sections] No controls (single-frame baselines, frame-shuffled ablations, or object-recognition-only variants) are reported to confirm that the six question categories require motion perception rather than static appearance, object recognition, or language priors. Without such ablations, the central performance-gap and TE-Fusion-gain claims cannot be attributed specifically to motion understanding.
- [Experimental results and setup] The manuscript provides no information on question validation procedures, inter-annotator agreement, baseline selection rationale, or statistical significance testing of the reported accuracy improvements. These omissions make the soundness of the performance claims difficult to assess.
- [TE Fusion proposal (method and experiments sections)] While presented as novel, the description lacks sufficient implementation details, direct comparisons to alternative fusion or compression techniques, and targeted ablations isolating its benefit for motion features versus general video understanding.
Minor comments (2)
- [Abstract] The abstract contains minor phrasing issues (e.g., 'VLM's ability' should read 'VLMs' abilities'; 'reviewing VLM architectures' is unclear).
- [Results tables] Tables reporting model accuracies should include standard deviations or confidence intervals and specify the number of videos/questions per category.
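The controls requested in the first major comment can be sketched as simple transformations of a clip's frame list. The following is a minimal illustration, not the paper's evaluation code: the function names, the choice of the middle frame, and the `motion_gap` helper are all assumptions made for this example.

```python
import random
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def single_frame_control(frames: Sequence[Frame]) -> List[Frame]:
    """Replace the clip with its middle frame repeated, removing all motion cues.

    Picking the middle frame is an assumption; any fixed frame would do.
    """
    middle = frames[len(frames) // 2]
    return [middle] * len(frames)


def frame_shuffled_control(frames: Sequence[Frame], seed: int = 0) -> List[Frame]:
    """Keep every frame but randomize temporal order (content kept, order destroyed)."""
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def motion_gap(acc_full: float, acc_control: float) -> float:
    """Accuracy lost when motion information is removed (hypothetical summary metric)."""
    return acc_full - acc_control


# Toy clip: integers stand in for decoded frames.
clip = list(range(8))
static_clip = single_frame_control(clip)
shuffled_clip = frame_shuffled_control(clip, seed=0)
```

If a model scores about as well on `static_clip` or `shuffled_clip` inputs as on the original clips, the questions are likely answerable from appearance or language priors alone, which is the concern the referee raises.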
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Benchmark construction and evaluation sections] No controls (single-frame baselines, frame-shuffled ablations, or object-recognition-only variants) are reported to confirm that the six question categories require motion perception rather than static appearance, object recognition, or language priors. Without such ablations, the central performance-gap and TE-Fusion-gain claims cannot be attributed specifically to motion understanding.
  Authors: We agree that the absence of these controls limits the ability to isolate motion-specific understanding. In the revised manuscript we will add single-frame baselines, frame-shuffled ablations, and object-recognition-only variants. These will be reported alongside the existing results to demonstrate that the performance gaps and TE-Fusion gains are attributable to motion perception rather than static appearance or language priors.
  Revision: yes
- Referee: [Experimental results and setup] The manuscript provides no information on question validation procedures, inter-annotator agreement, baseline selection rationale, or statistical significance testing of the reported accuracy improvements. These omissions make the soundness of the performance claims difficult to assess.
  Authors: We will expand the relevant sections to document the question validation procedures, report inter-annotator agreement, provide the rationale for baseline model selection, and include statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) for the accuracy improvements.
  Revision: yes
- Referee: [TE Fusion proposal (method and experiments sections)] While presented as novel, the description lacks sufficient implementation details, direct comparisons to alternative fusion or compression techniques, and targeted ablations isolating its benefit for motion features versus general video understanding.
  Authors: We will augment the method section with additional implementation details (hyperparameters, exact fusion equations, and computational overhead). We will also add direct comparisons against alternative fusion/compression methods and targeted ablations that isolate the benefit of TE Fusion on motion features versus general video understanding.
  Revision: yes
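The bootstrap confidence intervals mentioned in the second response could look roughly like the sketch below. It is a generic paired percentile bootstrap over per-question correctness, written for illustration only; the function name, the resample count, and the toy data are assumptions and do not come from the paper.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(b) - accuracy(a) on the same questions."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_b - acc_a)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-question correctness (1 = correct); NOT results from the paper.
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 20
improved = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0] * 20
ci_low, ci_high = paired_bootstrap_ci(baseline, improved)
```

An interval that excludes zero would suggest the accuracy gain is unlikely to be resampling noise. Resampling questions in pairs matters here because both models are scored on the same items.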
Circularity Check
No circularity; empirical benchmark and experiments are self-contained
The paper introduces MotionBench as a new evaluation set with six motion-oriented question categories and proposes TE Fusion as an architectural change, then reports direct experimental accuracies on VLMs. No equations, fitted parameters, or derivations appear in the provided text; performance numbers are obtained by running models on the collected videos rather than by any reduction to self-defined quantities or self-citation chains. The central claims therefore rest on external model behavior and data collection rather than on any of the enumerated circular patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types... propose a novel and efficient Through-Encoder (TE) Fusion method.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: TE Fusion... applies deep fusion throughout the visual encoder... higher frame rate inputs and TE Fusion yield improvements
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- PushupBench: Your VLM is not good at counting pushups
  VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.
- CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
  CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Supplementary Material
Training Details
Here we provide the detailed training hyperparameters for both TE Fusion in Tab. 3 and all ablated models in Tab. 4 and Fig. 6.
Configurations
- Total steps: 10,000
- Warmup steps: 1,000
- Global batch size: 768
- Learning rate: 8e-6
- Minimal learning rate: 1e-6
- Learning rate decay: cosine
- Optimizer: Adam
- Adam ϵ: 1e-8
- Adam β1: 0.9
- Adam β2: 0.95
- Precision: bf...
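The listed step counts and learning rates are enough to sketch the schedule they describe. The snippet below is an illustration under two assumptions that the text does not state: warmup is linear, and the cosine decay runs from the end of warmup to the final step, ending at the minimal learning rate.

```python
import math

# Values taken from the configuration list above.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 1_000
PEAK_LR = 8e-6
MIN_LR = 1e-6


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


lr_mid = learning_rate(5_500)  # halfway through the decay phase
```

Under these assumptions the rate halfway through the decay phase sits at the midpoint between the peak and the minimum, about 4.5e-6.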
Model Details
To maintain a fair comparison, all model architectures are ablated with the same backbone, GLM-4V, with its model configuration as follows.
The model configurations of all ablated architectures:
VLM decoder
- Layers: 40
- Hidden size: 4096
- Attention heads: 32
- num query groups: 2
- FFN hidden size: 13696
- Sequence len: 4096
- Position embedding: RoPE
- Normalization: RMSNorm
Visual encoder
- Input resolution: 224
- Patch size: 14
- Post spatial downsample: 2 × 2
- Layer...
Let the temporal compression ratio be K. The specific features of each ablated architecture are:
- TE-Fusion (ours): Before the visual encoder, we concatenate every neighboring K frames into one sequence and conduct self-attention across each K frames to fuse temporal features. After the visual encoder, the tokens of K frames are concatenated along the hidden-size dimension, downsampled, and projected to the output dimension.
- Qwen2-VL: The neighboring K frames are concatenated along the channel dimension and patchified into one fea...
- Kangaroo: This approach is the most similar to TE Fusion, except that every frame is computed independently within the visual encoder and concatenated along the hidden-size dimension to perform temporal downsampling (with an MLP layer).
- QFormer: After going through the visual encoder, the video feature is passed through a QFormer (learned from scratch). Every K frames' features are combined into a sequence to fuse temporal information within the QFormer. From the experiment, we found that, though lightweight, the QFormer is hard to optimize and model temporal relationships durin...
- PLLaVA: This approach is similar to Kangaroo. Instead of fusing with the MLP layer, PLLaVA adopts simple adaptive pooling. To avoid possible information loss, we conduct the pooling operation after the spatial downsample module.
The pseudo-code below further illustrates all ablated architectures.
def forward(): ''' The pseudo-code of the forwar...
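The configuration values and fusion descriptions above imply a simple token budget, which the sketch below works through. It is not the authors' pseudo-code. It assumes that a group of K frames ends up with as many LLM tokens as a single spatially downsampled frame (which is what concatenating along the hidden-size dimension suggests), and the `text_budget` parameter is a hypothetical knob for tokens reserved for the prompt.

```python
# Values taken from the configuration listed above.
INPUT_RESOLUTION = 224
PATCH_SIZE = 14
SPATIAL_DOWNSAMPLE = (2, 2)
LLM_SEQUENCE_LEN = 4096


def patches_per_frame() -> int:
    """Number of visual-encoder patches for one frame."""
    side = INPUT_RESOLUTION // PATCH_SIZE
    return side * side


def encoder_sequence_len(k: int, joint: bool) -> int:
    """Sequence length seen by visual-encoder self-attention.

    joint=True models a TE-Fusion-style encoder where K neighboring frames share
    one sequence; joint=False models per-frame encoding (e.g. Kangaroo, PLLaVA).
    """
    return patches_per_frame() * (k if joint else 1)


def llm_tokens_per_group() -> int:
    """LLM tokens for one group of K frames (assumed equal to one downsampled frame)."""
    return patches_per_frame() // (SPATIAL_DOWNSAMPLE[0] * SPATIAL_DOWNSAMPLE[1])


def max_frames(k: int, text_budget: int = 0) -> int:
    """Upper bound on input frames that fit in the LLM sequence at compression ratio K."""
    groups = (LLM_SEQUENCE_LEN - text_budget) // llm_tokens_per_group()
    return groups * k


frames_without_compression = max_frames(1)
frames_with_k4 = max_frames(4)
```

Under these assumptions, a compression ratio of 4 raises the frame ceiling from 64 to 256 within the same 4096-token sequence, at the cost of a four times longer attention sequence inside the visual encoder when frames are fused jointly.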
QA Construction Process for Videos with Intricate Interactions
Here we illustrate the QA generation process corresponding to Fig. 4.
9.1. Step1: Video caption annotation
For videos with intricate interactions, it is impractical to directly annotate the whole video clip, since the total complexity and quantity of the motions are too large. Therefor...
- Each question should have 4 options.
- For each question, combine one dimension from the Content Dimension and one from the Question Logic Dimension. It may draw from multiple highly related content dimensions.
- Focus only on representative and prominent events or actions to keep options clear and unique without being overly detailed or tricky. Select the most fitting dimension combination for each video and avoid repeated combinations where possible.
- Given possible ambiguities in some descriptions, ensure the answer is unique and clear to avoid deductions.
  • Ambiguity Example 1: Temporal ambiguity. If a description reads, “On the left, a woman in a khaki suit faces right, nodding her head while speaking. In the middle, a group faces the camera, and a man in a white shirt pulls a chair leftward to sit,...
- Choose only prominent events or actions, avoiding minor or indeterminate details. Ensure each answer is unique and clear.
  • Minor Example: If “slightly bent elbow” isn’t mentioned, it does not necessarily mean it did not happen; if the video says “the mouth moved slightly a few times,” it cannot be determined the interval and number of these movemen...
- Pretend you’re viewing the video, avoiding terms like “based on the description” or expressions related to the description text, including questions, options, and explanations.
- Aim for at least 4 questions to focus beyond appearance.
- Keep questions to around six, focusing only on representative events or actions and ensuring options are clear, unique, and straightforward.
- Questions should focus on dynamic actions only. The “first frame description” is supplementary and should not guide question design.
- The video dynamic information description does not contain causal or other logical relationships, therefore, do not involve logical relationships in the title.
Categorization System
Content Dimension
Below is the Content Dimension in the video classification system:
1. Human Dynamics:
   1.1. Detailed actions of individuals
   1.2. Interaction among multiple people
   1.3. Emotional states and their changes
   1.4. Position and its changes (Location, Angle, etc.)
2. ...
3. Animal Dynamics:
   3.1. Detailed actions
   3.2. Position and its changes (Location, Angle, etc.)
4. ...
5. Appearance Characteristics:
   5.1. individuals
   5.2. objects
   5.3. environment
Question Logic Dimension
Below is the Question Logic Dimension in the video classification system:
- Whether a movement occurs
- Sequence between multiple movements
- Appearance description and judgment
Response Format
Return only a Python list, where each element is a dictionary representing a question. Ensure it can be parsed by json.loads() without returning anything outside the list.
9.3. VLM Filtering
To avoid over simple QAs that do not utilize motion comprehension capability, we use various image VLMs to pre...
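The Response Format rule above (a bare list of dictionaries that json.loads() can parse) and the four-option rule from the prompt can be checked mechanically. The sketch below is one way to do so. The field names "question", "options", and "answer" are hypothetical, since the excerpt does not show the actual keys used in the paper's prompt.

```python
import json
from typing import Any, Dict, List


def parse_questions(response_text: str) -> List[Dict[str, Any]]:
    """Parse a model response and enforce the list-of-dicts, four-option format."""
    data = json.loads(response_text)  # raises ValueError on any text outside the list
    if not isinstance(data, list):
        raise ValueError("top-level value must be a list")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"element {i} is not a dictionary")
        options = item.get("options")  # "options" is an assumed key name
        if not isinstance(options, list) or len(options) != 4:
            raise ValueError(f"element {i} must have exactly 4 options")
    return data


def is_valid_response(response_text: str) -> bool:
    """True if the response satisfies the format; json.JSONDecodeError is a ValueError."""
    try:
        parse_questions(response_text)
    except ValueError:
        return False
    return True


example = (
    '[{"question": "What does the person do first?", '
    '"options": ["A", "B", "C", "D"], "answer": "A"}]'
)
parsed = parse_questions(example)
```

Rejecting responses with any text around the list keeps the downstream pipeline simple: a response either loads cleanly or is regenerated.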