SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
Pith reviewed 2026-05-22 12:22 UTC · model grok-4.3
The pith
Scaling model size, data volume, and compute in motion tracking produces a generalist humanoid controller for natural whole-body movements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scaling along three axes (network size, dataset volume drawn from 700 hours of motion capture, and compute) creates a foundation model for motion tracking. The resulting controller delivers natural, robust whole-body humanoid control, improves steadily with more resources, generalizes to unseen motions, and supports downstream uses such as real-time kinematic planning for navigation and a unified interface for VR teleoperation and vision-language-action models.
What carries the argument
Motion tracking treated as a scalable task: large motion-capture datasets supply dense supervision, which the policy uses to acquire general human motion priors without manual reward engineering.
If this is right
- Tracking performance rises steadily as compute and data diversity grow.
- Policies generalize to motions absent from the training set.
- A real-time kinematic planner can bridge motion tracking to tasks such as navigation, enabling natural and interactive whole-body control.
- A single policy handles both VR teleoperation and vision-language-action models via a shared token space.
- The same controller supports coordinated hand and foot actions in autonomous loco-manipulation driven by vision-language inputs.
Where Pith is reading between the lines
- This scaling route could reduce reliance on per-task reward engineering across many robot behaviors.
- The learned motion priors might transfer to humanoids of different sizes or proportions if the underlying patterns prove body-agnostic.
- Direct coupling to language models could let robots follow spoken instructions that combine locomotion with object handling.
- Further increases in data and model size may close remaining gaps between simulated and real-world agility.
Load-bearing premise
That large-scale motion-capture data alone supplies enough signal to learn control policies that remain robust and generalizable on humanoid robots.
What would settle it
A head-to-head comparison showing that the largest scaled model performs no better than smaller versions when evaluated on a broad set of complex, previously unseen whole-body motion sequences.
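One way to make this test concrete is to fit a power law to tracking error as a function of model size and check whether the fitted exponent is distinguishable from zero. The sketch below is illustrative only: the helper name `fit_power_law`, the intermediate model sizes, and the error values are assumptions generated from a known synthetic law, not numbers reported in the paper. Only the endpoints (1.2M and 42M parameters) come from the abstract.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**(-b) in log-log space.

    Returns (a, b). An exponent b close to zero would mean that larger
    models track no better than smaller ones.
    """
    if len(sizes) != len(errors) or len(sizes) < 2:
        raise ValueError("need at least two matching (size, error) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic placeholder data (NOT from the paper): errors are generated
# from a known law with a = 2.0 and b = 0.1 so the fit can be checked.
sizes = [1.2e6, 5e6, 16e6, 42e6]  # 5e6 and 16e6 are hypothetical midpoints
errors = [2.0 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, errors)
print(f"fitted a = {a:.3f}, fitted exponent b = {b:.3f}")
```

Applied to real held-out tracking errors, the same fit would give a single number to compare across seeds; the falsifying outcome described above corresponds to an exponent near zero.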
Original abstract
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
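As a quick consistency check on the dataset figures quoted in the abstract, the frame count and capture duration imply an average frame rate. The sketch below is a back-of-the-envelope calculation, not something the paper states; the function name `implied_frame_rate` is hypothetical, and because the abstract gives "100M+" frames the result is only a lower bound.

```python
def implied_frame_rate(frames: float, hours: float) -> float:
    """Average frames per second implied by a frame count and a duration in hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return frames / (hours * 3600.0)


# Figures quoted in the abstract; 100M is a lower bound ("100M+").
MIN_FRAMES = 100e6
HOURS = 700

min_fps = implied_frame_rate(MIN_FRAMES, HOURS)
print(f"implied average rate: at least {min_fps:.1f} frames/s")
```

The result is roughly 40 frames per second or more, which is a plausible order of magnitude for motion-capture data; the actual sampling rate is not given in the abstract.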
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SONIC, a scaled foundation model for humanoid whole-body control obtained by training a motion-tracking policy on large-scale motion-capture data. It scales network size from 1.2M to 42M parameters, dataset volume to over 100M frames drawn from 700 hours of mocap, and compute to 21k GPU hours. The central claim is that this scaling produces a generalist controller capable of natural, robust whole-body movements that generalizes to unseen motions; downstream utility is shown via a real-time kinematic planner for navigation and a unified token space enabling VR teleoperation and VLA-driven loco-manipulation.
Significance. If the scaling results and generalization claims hold, the work would be significant for robotics by demonstrating that dense mocap supervision can serve as a scalable pre-training task for humanoid control, yielding policies that avoid manual reward engineering. The explicit three-axis scaling study, the reported steady performance gains, and the practical interfaces for downstream tasks (kinematic planner and unified token space) constitute concrete strengths that could influence future foundation-model efforts in humanoid robotics.
major comments (2)
- [Abstract] Abstract: the statement that 'performance improves steadily with compute and data diversity' is presented without any quantitative metrics, baselines, error bars, or ablation tables. Because this scaling hypothesis is the load-bearing empirical claim, the absence of supporting numbers leaves the central result only partially substantiated.
- [Abstract] Abstract and §4 (presumed results section): the claim that 'learned policies generalize to unseen motions' is stated without details on how the unseen test set was constructed, what tracking-error metrics were used, or comparisons against smaller-scale baselines. These omissions directly affect evaluation of the generalization property asserted in the abstract.
minor comments (1)
- [Abstract] Abstract: the phrase 'unified token space' is introduced without a brief definition or pointer to the section that explains how motion, VR, and VLA tokens are represented in the same space.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential significance of scaling motion tracking for humanoid control. We address each major comment below, referencing the relevant sections of the manuscript and indicating revisions where appropriate.
- Referee: [Abstract] Abstract: the statement that 'performance improves steadily with compute and data diversity' is presented without any quantitative metrics, baselines, error bars, or ablation tables. Because this scaling hypothesis is the load-bearing empirical claim, the absence of supporting numbers leaves the central result only partially substantiated.
  Authors: We agree that the abstract would be strengthened by including representative quantitative results. The full manuscript substantiates the scaling hypothesis in Section 4 with ablation studies across model sizes (1.2M to 42M parameters), data volumes, and compute budgets. Figure 3 and Table 2 report steady reductions in mean per-joint position error (MPJPE) and velocity error with increasing scale, including error bars from three random seeds and direct comparisons to smaller baselines. We have revised the abstract to incorporate key metrics illustrating these trends while respecting length limits.
  Revision: yes
- Referee: [Abstract] Abstract and §4 (presumed results section): the claim that 'learned policies generalize to unseen motions' is stated without details on how the unseen test set was constructed, what tracking-error metrics were used, or comparisons against smaller-scale baselines. These omissions directly affect evaluation of the generalization property asserted in the abstract.
  Authors: Section 4.3 of the manuscript details the unseen test set construction: it comprises held-out motion sequences from the AMASS dataset (different performers and activity categories) plus custom captures not used in training, totaling approximately 10% of the data. Tracking error is quantified via MPJPE and angular velocity error, as defined in Section 3.2. Figure 5 and the accompanying text provide direct comparisons showing that the 42M model reduces error on these unseen motions by 12-18% relative to the 1.2M baseline. To improve clarity, we have added a brief description of the test-set construction and primary metrics to the abstract.
  Revision: yes
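For readers unfamiliar with the metric named in these responses, MPJPE is commonly computed as the Euclidean distance between predicted and reference joint positions, averaged over all joints and frames. The sketch below shows that common form together with a relative error reduction of the kind quoted above. It is an assumption about the general technique, not the paper's exact definition (which may, for example, use root-relative positions); the function names and toy coordinates are hypothetical.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred and target are nested sequences shaped [frames][joints][3].
    Returns the Euclidean distance averaged over every joint in every frame.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of joints")
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction of new_error relative to baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Toy data (hypothetical, in metres): one frame, two joints.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.03, 0.04), (1.0, 0.0, 0.0)]]

# Joint distances are about 0.05 and 0.0, so the mean is about 0.025.
print(mpjpe(pred, target))
```

Under this definition, a 12-18% reduction means `relative_reduction(baseline, scaled)` falls between 0.12 and 0.18 when both errors are measured on the same held-out motions.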
Circularity Check
No significant circularity identified
The paper reports empirical scaling outcomes from training on external motion-capture datasets (100M+ frames, 700 hours) with explicit variation in model size (1.2M–42M parameters) and compute (21k GPU hours). Performance gains and generalization are measured directly against held-out motions and downstream tasks; no equations, fitted parameters, or self-citations are invoked as load-bearing derivations that reduce the claimed results to the inputs by construction. The argument is self-contained against observable benchmarks and does not rely on self-definitional loops or renamed known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: dense supervision from diverse motion-capture data acquires human motion priors without manual reward engineering.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours)."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Motion tracking leverages human motion capture data, which provides dense, frame-by-frame supervision without reward engineering."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 15 Pith papers
- CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation
  CEER proposes a compliant end-effector and root control interface that unifies loco-manipulation for humanoids via a distilled low-level policy and hierarchical planners.
- VOFA: Visual Object Goal Pushing with Force-Adaptive Control for Humanoids
  VOFA combines a high-level visuomotor policy with a low-level force-adaptive controller to let humanoids push objects up to 17 kg to arbitrary goals using only noisy onboard vision, achieving over 80% real-world success.
- ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
  ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
- Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
  A diffusion-based motion generator combined with an RL motion tracker enables terrain-aware whole-body locomotion on a humanoid robot by adapting reference motions online from perception.
- CLAW: Composable Language-Annotated Whole-body Motion Generation
  CLAW composes motion primitives from a kinematic planner, tracks them with a low-level controller in MuJoCo to produce physically grounded trajectories, and generates segment- and trajectory-level language annotations...
- Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation
  Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.
- HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
  HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
- HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
  HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...
- Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
  A modular system uses motion matching to compose long-horizon human skill chains, trains RL experts, and distills them into a depth-based policy that lets a Unitree G1 humanoid autonomously climb, vault, and roll over...
- HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model
  HAIC enables robust humanoid interactions with underactuated objects by predicting their dynamics from proprioceptive history and using a world model for adaptive control.
- TeleGate: Whole-Body Humanoid Teleoperation via Gated Expert Selection with Motion Prior
  TeleGate achieves high-precision real-time whole-body teleoperation of humanoid robots by dynamically gating between expert policies and using a VAE motion prior to infer future intent from history, outperforming dist...
- HoloMotion-1 Technical Report
  HoloMotion-1 trains a large Mixture-of-Experts Transformer policy on a hybrid corpus of video-reconstructed and MoCap motions to achieve robust zero-shot whole-body tracking that transfers directly to real humanoid robots.
- HoloMotion-1 Technical Report
  HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
- Learning Versatile Humanoid Manipulation with Touch Dreaming
  HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
- Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots
  Tree Learning uses root-branch parameter inheritance and multi-modal adaptation to enable continual multi-skill learning in humanoid robots, achieving higher rewards and 100% retention versus joint training in Unity s...