The NextMotionQA benchmark reveals that VLMs have critical gaps in fine-grained human motion understanding: they align with experts on coarse judgments (κ=0.70) but not on fine-grained ones (κ=0.10).
arXiv preprint arXiv:2603.13500
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
UMo presents a sparse MoE-based unified model for real-time co-speech avatar animation that claims superior quality under latency constraints via keyframe-centric design and multi-stage audio-augmented training.
citing papers explorer
- NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
The NextMotionQA benchmark reveals that VLMs have critical gaps in fine-grained human motion understanding: they align with experts on coarse judgments (κ=0.70) but not on fine-grained ones (κ=0.10).
- UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
UMo presents a sparse MoE-based unified model for real-time co-speech avatar animation that claims superior quality under latency constraints via keyframe-centric design and multi-stage audio-augmented training.