18 Pith papers cite this work.
representative citing papers
Scaling motion tracking models along size, data volume, and compute produces a foundation model for natural, robust humanoid whole-body control with downstream uses in kinematic planning and vision-language-action models.
ShowMak3r reconstructs dynamic TV show scenes from video using 3D actor localization, shot matching, and expression fitting to enable new camera views and scene edits.
citing papers explorer
- Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors
Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.
- Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation
STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.
- TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
- AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
AIGaitor is claimed to be the first end-to-end, on-device pipeline for monocular motion capture and deep-learning gait analysis, demonstrated on consumer smartphones.
- Non-intrusive Body Composition Assessment from Full-body mmWave Scans
Multi-task learning on synthetic mmWave point clouds derived from clinical images enables regression of visceral adipose tissue (VAT) and body fat percentage (BFP) from real mmWave scans, with MAEs of 1.0 L and 3.2%, respectively.
- X-Morph: Human Motion Priors for Scalable Robot Learning Across Morphologies
X-Morph retargets human motions to kinematically plausible references for multiple legged morphologies, trains privileged RL trackers, and distills them into deployable policies that generalize and enable teleoperation and text-conditioned generation.
- MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning
New benchmark diagnoses directional, attributional, and temporal hallucinations in multimodal motion comparison models and demonstrates gains from explicit measurement verification.
- OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation
OpenHLM is an empirical recipe yielding a whole-body humanoid VLA model that outperforms GR00T N1.6 and Ψ0 baselines on long-horizon tasks using less than half the demonstration time.
- SOMA: From Surface Observations to Muscle Anatomy
SOMA recovers spatio-temporal muscle behavior from multi-view RGB surface data, presented as the first method to do so from RGB observations, and introduces the SKIM soft-tissue deformation dataset.
- Learning Neural Deformation Representation for 4D Dynamic Shape Generation
A part-based neural deformation model disentangles motion and shape spaces in a diffusion-based 4D generator, outperforming prior work on unconditional and conditional 4D shape tasks.
- MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images
MuNet is an end-to-end graph convolutional network using 2-manifold graphs and a mutualistic training mechanism that jointly optimizes 3D human mesh recovery and clothed reconstruction, reporting state-of-the-art results on six benchmarks.
- Latent Dynamics for Full Body Avatar Animation
The work augments pose-conditioned 3D Gaussian avatars with a residual latent evolved by a transformer decoder that decomposes updates into driving, restoring, and dissipative forces to produce history-dependent, temporally coherent full-body animations.
- 3D Scene-Adaptive Trajectory-Controllable Human Image Animation with Camera Movement
Presents a scene-adaptive 3D human image animation framework using ground-adaptive motion retargeting and viewpoint-adaptive latent fusion to control human and camera trajectories, claiming improvements on two benchmarks.
- SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation
SparseStreet applies node-based learnable pruning followed by static background compression to 3D Gaussian Splatting, reporting up to 80% reduction in primitives with minimal quality loss on Waymo and nuScenes street scene data.
- Millimeter-wave Imaging for Anthropometric Body Measurement
Optimization-based weighted registration of SMPL to noisy mmWave point clouds with foot-ground and pose constraints extracts anthropometric measurements contactlessly through clothing.
- SmoCap: Unified Scale-Pose Canonicalization with Proxy-Mapped Trust-Region QP
SmoCap performs unified scale-pose canonicalization for motion capture by solving constrained trust-region QPs with analytical proxy-mapped Jacobians in a sparse control subspace.