NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
arXiv preprint arXiv:2411.17335 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Text2BFM aligns language with a frozen BFM via a text-aligned variational behavioral bottleneck to generate long motions by decoding latents into policy actions.
AnyMo is a masked-modeling framework for any-modality human motion generation trained on the new OmniHuMo dataset of 5,000+ hours of multimodal motion sequences.
UMo presents a sparse MoE-based unified model for real-time co-speech avatar animation that claims superior quality under latency constraints via keyframe-centric design and multi-stage audio-augmented training.
citing papers explorer
-
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
-
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
AnyMo is a masked-modeling framework for any-modality human motion generation trained on the new OmniHuMo dataset of 5,000+ hours of multimodal motion sequences.