MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang · 2025 · cs.CV · arXiv 2501.02955

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open full Pith review browse 5 citing papers arXiv PDF

abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .

citation-role summary

dataset 2 baseline 1

citation-polarity summary

use dataset 2 baseline 1

representative citing papers

PushupBench: Your VLM is not good at counting pushups

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

cs.CV · 2026-01-30 · unverdicted · novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

Kimi K2.5: Visual Agentic Intelligence

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

citing papers explorer

Showing 5 of 5 citing papers.

PushupBench: Your VLM is not good at counting pushups cs.CV · 2026-04-25 · unverdicted · none · ref 5 · internal anchor
VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning cs.CV · 2026-01-30 · unverdicted · none · ref 19 · internal anchor
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 26 · internal anchor
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 14 · internal anchor
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 48 · internal anchor
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer