The KIT motion-language dataset. Big Data, 4(4):236–252
3 Pith papers cite this work.
Fields: cs.CV (3)
Years: 2026 (3)
Roles: dataset (2)
citing papers explorer
- CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
  CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop, using 3D-to-2D projection and a dual-branch diffusion model with 3D-2D cross-attention.
- MotionHiFlow: Text-to-motion via hierarchical flow matching
  MotionHiFlow generates text-aligned 3D human motions using hierarchical flow matching across temporal scales, cross-scale transitions, a Text-Motion Diffusion Transformer, and a topology-aware Motion VAE, achieving state-of-the-art results on HumanML3D and KIT-ML.
- Next-Scale Autoregressive Models for Text-to-Motion Generation