Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Chenghuai Lin; Dairu Liu; He Wang; Jilong Wang; Li Yi; Sikai Liang; Wenyao Zhang; Xinqiang Yu; Xuchuan Chen; Yu Guan

arXiv: 2606.03985 · v1 · pith:LVNGJWCJnew · submitted 2026-06-02 · cs.RO · cs.AI · cs.CV

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Pith reviewed 2026-06-28 09:59 UTC · model grok-4.3

classification cs.RO · cs.AI · cs.CV

keywords Humanoid-GPT · zero-shot motion tracking · generative Transformer · whole-body control · motion corpus · scaling data and model · retargeted mocap · causal attention

The pith

A GPT-style Transformer pre-trained on a 2B-frame unified motion corpus tracks dynamic humanoid behaviors and generalizes zero-shot to unseen motions and tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that scaling both the size of a retargeted mocap training set to billions of frames and the capacity of a causal-attention Transformer produces a single generative model for whole-body humanoid control. Prior shallow MLP trackers were limited by scarce data and faced a direct trade-off between agility and generalization to new motions. If the scaling result holds, one model could handle both highly dynamic tracking and novel control tasks without task-specific retraining or additional fine-tuning. A sympathetic reader would care because this removes the need to collect new demonstrations or retrain for each new behavior, which has been a practical bottleneck for humanoid robots.

Core claim

Humanoid-GPT is a GPT-style Transformer with causal attention trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings; scaling both data volume and model capacity produces a single generative model that simultaneously tracks highly dynamic and complex motions while achieving robust zero-shot generalization to unseen motions and control tasks.

What carries the argument

The GPT-style Transformer with causal attention, used as a generative model for whole-body motion tracking and conditioned on the large unified retargeted motion corpus.
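
To make "causal attention" concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`causal_mask`, `causal_attention`), the single head, and the absence of learned projections are simplifications introduced here.

```python
import math


def causal_mask(seq_len):
    """Lower-triangular mask: step i may attend to steps 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head attention where each time step sees only the past and present.

    queries, keys, values: lists of float vectors, one vector per time step.
    (Hypothetical helper for illustration; no learned weights are involved.)
    """
    seq_len = len(queries)
    dim = len(queries[0])
    mask = causal_mask(seq_len)
    outputs = []
    for i in range(seq_len):
        # Keep only the positions the mask allows for step i.
        allowed = [j for j in range(seq_len) if mask[i][j]]
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        weights = softmax(scores)
        # Weighted sum of the allowed value vectors.
        out = [
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy inputs: three time steps, two features each.
toy_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_out = causal_attention(toy_q, toy_k, toy_v)
```

Because of the mask, changing a later frame in `toy_v` cannot change the output at an earlier step, which is the property that lets such a model be run step by step at control time.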

If this is right

A single model replaces the previous collection of task-specific trackers.
Scaling data volume and model size directly improves both tracking accuracy on dynamic motions and zero-shot performance on novel tasks.
Extensive scaling analyses confirm that performance continues to rise with additional data and parameters.
The same architecture supports both motion tracking and downstream control without separate training stages.
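
One common way to summarize a scaling analysis like the one described above is a power-law fit in log-log space. The sketch below is illustrative only: the functional form error ≈ a · N^b, the helper name `fit_power_law`, and every data point are assumptions made here. The numbers are synthetic placeholders, not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit on (log x, log y).

    Returns (a, b). Hypothetical helper for illustration.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic placeholder curve (NOT data from the paper): error falls as frames grow.
frame_counts = [1e7, 1e8, 1e9, 2e9]
tracking_errors = [5.0 * n ** -0.2 for n in frame_counts]

coeff, exponent = fit_power_law(frame_counts, tracking_errors)
```

A negative fitted exponent means error keeps falling as data grows. Fitting the same form separately against parameter count, with data held fixed, is one way to separate the two factors.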

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the corpus coherence assumption holds across more sensor modalities, the same scaling recipe could apply to vision-based or force-based humanoid control.
The approach suggests that language-conditioned variants could allow instruction-driven motion generation without retraining the core tracker.
Hardware deployment may become simpler if one model handles both locomotion and manipulation tasks at deployment time.

Load-bearing premise

The retargeted mocap data from many different sources forms one coherent distribution that does not introduce biases blocking generalization to motions never seen in training.

What would settle it

A controlled experiment in which the model is tested on a motion sequence whose kinematics cannot be expressed as a retargeting of any frame in the 2B-frame corpus and shows clear failure to track or generalize.

Original abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Humanoid-GPT scales a GPT-style model on a 2B-frame retargeted mocap corpus to claim zero-shot tracking of dynamic motions, but the unification step lacks checks that would rule out source artifacts.

The core point is that a single Transformer trained on this scale of unified motion data can handle highly dynamic tracking and generalize to unseen tasks without task-specific fine-tuning. That scaling result is the main new element.

The work brings together major public mocap sets plus in-house recordings into one training distribution and applies a causal GPT architecture to whole-body humanoid control. The scaling analyses they run are useful for showing how performance improves with data and capacity, and the experiments appear to cover a range of dynamic behaviors.

The soft spot is the assumption that retargeting disparate sources produces a single coherent distribution. Without reported metrics on cross-source kinematic overlap or bias, the zero-shot claims could partly reflect artifacts from the retargeting process rather than broad generalization. The abstract frames the outcome as empirical scaling success, but that leaves the central assumption untested in the provided summary.

This paper is for robotics groups working on learned whole-body controllers and anyone tracking scaling trends in motion models. Readers who want concrete numbers on what larger motion corpora enable will get something from it.

It deserves peer review. The scale and the application area are substantial enough that referees should see the full methods and any bias diagnostics.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Humanoid-GPT, a GPT-style Transformer with causal attention pre-trained on a 2B-frame retargeted motion-capture corpus that unifies major public mocap datasets with large-scale in-house recordings. It claims that jointly scaling data volume and model capacity produces a single generative model capable of tracking highly dynamic whole-body behaviors while delivering unprecedented zero-shot generalization to unseen motions and control tasks, thereby overcoming the agility-generalization trade-off that constrained prior shallow MLP trackers. Extensive experiments and scaling analyses are presented to support a new performance frontier.

Significance. If the zero-shot generalization claims are substantiated, the work would constitute a notable advance in humanoid robotics by demonstrating that large-scale generative pre-training on unified motion data can yield a versatile controller without per-task retraining. The explicit focus on scaling both data and architecture, together with the unification of disparate mocap sources, supplies a concrete empirical test of whether language-model-style scaling laws extend to whole-body control.

major comments (2)

[Data corpus construction and unification] The central zero-shot generalization claim rests on the premise that retargeting produces a single coherent training distribution free of source-specific kinematic or dynamic artifacts. No quantitative validation of this premise—such as cross-source distribution overlap statistics, per-source bias metrics, or an ablation isolating retargeting artifacts—is supplied in the data-preparation or experimental sections. This is load-bearing: absent such checks the reported scaling gains could arise from memorization of retargeting regularities rather than genuine generalization to truly unseen motions.
[Scaling analyses] The scaling analyses are described as demonstrating performance gains from increased data and capacity, yet the manuscript does not report the precise model sizes, data-subset sizes, or controlled ablations that would isolate the contribution of each factor while holding other variables fixed. Without these details the attribution of the performance frontier specifically to joint scaling remains under-supported.

minor comments (2)

[Methods] Notation for retargeting parameters and the precise definition of 'zero-shot' (e.g., whether any motion statistics from the target task appear in any training source) should be stated explicitly in the methods section to avoid ambiguity.
[Figures] Figure captions for the scaling curves should include the exact number of parameters and frames used at each point to allow direct replication of the reported trends.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our data unification and scaling results. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

Referee: [Data corpus construction and unification] The central zero-shot generalization claim rests on the premise that retargeting produces a single coherent training distribution free of source-specific kinematic or dynamic artifacts. No quantitative validation of this premise—such as cross-source distribution overlap statistics, per-source bias metrics, or an ablation isolating retargeting artifacts—is supplied in the data-preparation or experimental sections. This is load-bearing: absent such checks the reported scaling gains could arise from memorization of retargeting regularities rather than genuine generalization to truly unseen motions.

Authors: We agree that explicit quantitative validation of the unified corpus is important to substantiate the zero-shot claims. Although our data-preparation pipeline included internal consistency checks across sources, these were not reported in the manuscript. In the revised version we will add cross-source distribution overlap statistics (e.g., Wasserstein distances on joint-angle and velocity histograms), per-source bias metrics, and a controlled ablation that isolates retargeting artifacts by comparing models trained on raw versus retargeted subsets. These additions will directly address whether the observed generalization stems from coherent scaling rather than source-specific regularities.

revision: yes

Referee: [Scaling analyses] The scaling analyses are described as demonstrating performance gains from increased data and capacity, yet the manuscript does not report the precise model sizes, data-subset sizes, or controlled ablations that would isolate the contribution of each factor while holding other variables fixed. Without these details the attribution of the performance frontier specifically to joint scaling remains under-supported.

Authors: We appreciate the request for greater precision. The original submission contained high-level descriptions of model scaling and data volume increases but omitted exact parameter counts, subset sizes, and fully controlled ablations. In the revision we will report the precise model sizes (parameter counts and layer dimensions), the exact frame counts for each data subset used in the scaling curves, and additional ablations that vary data volume and capacity independently while holding training steps, optimizer settings, and evaluation protocols fixed. These details will make the attribution to joint scaling explicit.

revision: yes
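
The first response mentions Wasserstein distances between per-source joint-angle and velocity distributions. As a rough sketch of what such a check could look like for a single scalar feature, the snippet below computes the 1-D Wasserstein-1 distance between two empirical samples by integrating the gap between their empirical CDFs. The function name `wasserstein_1d`, the choice of feature, and the sample values are assumptions for illustration. The rebuttal does not specify how the authors would compute the statistic.

```python
import bisect


def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Integrates |F_a(x) - F_b(x)| over x, where F is the empirical CDF.
    Hypothetical helper for illustration.
    """
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    points = sorted(set(a_sorted + b_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at `left`.
        cdf_a = bisect.bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, left) / len(b_sorted)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Made-up knee-angle samples (radians) from two hypothetical mocap sources.
source_a_angles = [0.10, 0.20, 0.30, 0.40]
source_b_angles = [0.15, 0.25, 0.35, 0.45]
gap = wasserstein_1d(source_a_angles, source_b_angles)
```

A distance near zero for every joint and every source pair would support the single-distribution premise. A large distance for one source would flag it for the per-source bias analysis the referee requests.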

Circularity Check

0 steps flagged

No circularity: empirical scaling results with no derivation chain

The paper presents an empirical ML result: a GPT-style Transformer trained on a 2B-frame retargeted mocap corpus yields zero-shot generalization. No equations, derivations, or first-principles claims appear in the abstract or described content. Claims rest on scaling experiments and performance metrics rather than any reduction of outputs to fitted inputs or self-citations. The central premise (scaling data + capacity produces generalization) is an empirical observation, not a self-definitional or fitted-prediction construct. This is the expected non-finding for a scaling paper without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no identifiable free parameters, axioms, or invented entities; all modeling choices remain opaque.

Reference graph

Works this paper leans on

49 extracted references · 1 canonical work pages

[1] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Micha...

[2] Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025

[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

[4] Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770, 2025

[5] Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots. arXiv preprint arXiv:2402.16796, 2024

[6] Antonin Dallard, Mehdi Benallegue, Fumio Kanehiro, and Abderrahmane Kheddar. Synchronized human-humanoid motion imitation. IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

[7] Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025

[8] Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. 270:2828–2844, 2024

[9] Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020

[10] Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858, 2024

[11] Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025

[12] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Connea...
[13] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondric... Openai o1 system card. CoRR, abs/2412.16720, 2024

[14] Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Advanced expressive humanoid whole-body control. arXiv preprint arXiv:2412.13196, 2024

[15] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3992–4003. IEEE, 2023

[16] Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, and Jaegul Choo. Phuma: Physically-grounded humanoid locomotion dataset. arXiv preprint arXiv:2510.26236, 2025

[17] Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023

[18] Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. arXiv preprint arXiv:2506.08931, 2025

[19] Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2025

[20] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36:25268–25280, 2023

[21] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023

[22] Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representations for physics-based control. arXiv preprint arXiv:2310.04582, 2023

[23] Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025

[24] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019
[25] Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, and Yue Wang. Learning from massive human videos for universal humanoid pose control. arXiv preprint arXiv:2412.14172, 2024

[26] Microsoft Corporation. ONNX Runtime, 2024. https://onnxruntime.ai. High-performance inference engine for ONNX models

[27] NVIDIA Corporation. NVIDIA TensorRT, 2024. https://developer.nvidia.com/tensorrt. Version 10.0

[28] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

[29] Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577, 2023

[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021

[31] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

[32] Konstantinos I. Roumeliotis and Nikolaos D. Tselikas. Chatgpt and open-ai models: A preliminary review. Future Internet, 15(6):192, 2023

[33] Sebastian Starke, Ian Mason, and Taku Komura. Deepphase: Periodic autoencoders for learning motion phase manifolds. ACM Transactions on Graphics (ToG), 41(4):1–13, 2022

[34] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

[35] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023

[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017
[37] Yuxuan Wang, Ming Yang, Weishuai Zeng, Yu Zhang, Xinrun Xu, Haobin Jiang, Ziluo Ding, and Zongqing Lu. From experts to a generalist: Toward general whole-body control for humanoid robots. CoRR, abs/2506.12779, 2025. doi: 10.48550/ARXIV.2506.12779. https://doi.org/10.48550/arXiv.2506.12779

[38] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

[39] Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, and Xuelong Li. Kungfubot: Physics-based humanoid whole-body control for learning highly-dynamic skills. arXiv preprint arXiv:2506.12851, 2025

[40] Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C Karen Liu, Rocky Duan, and Guanya Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint arXiv:2509.26633, 2025

[41] Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots. arXiv preprint arXiv:2507.07356, 2025

[42] Yanjie Ze, Zixuan Chen, João Pedro Araújo, Ziang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025

[43] Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x++: A large-scale multimodal 3d whole-body human motion dataset. arXiv preprint arXiv:2501.05098, 2025

[44] Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, and Li Yi. Freemotion: Mocap-free human motion synthesis with multimodal large language models. In European Conference on Computer Vision, pages 403–421. Springer, 2024

[45] Zhikai Zhang, Chao Chen, Han Xue, Jilong Wang, Sikai Liang, Yun Liu, Zongzhang Zhang, He Wang, and Li Yi. Unleashing humanoid reaching potential via real-world-ready skill space. arXiv preprint arXiv:2505.10918, 2025

[46] Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, et al. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833, 2025.

A Summary of Contributions

Science of Scale: We are the first motion tracker with zero-shot ability trained on 2B Frame data. Our training set is o...
[47] Initialize both the expert teacher and the student policy within the simulation environment.

Table 5 Hyperparameter settings for training motion experts.
Hyperparameter | Value
Env Numbers | 32768
Batch size | 1024
Discount factor γ | 0.97
GAE parameter λ | 0.95
Clipping parameter ϵ | 0.2
Policy network size | [512, 256, 128]
Critic network size | [512, 256, 128]
Learning rate...

[48] At iteration i, roll out the student policy and query the expert for the corresponding target action using the same state.

[49] Train the student to match the expert's action, and then update the environment using the student's executed action.

Table 8 Approximate compute breakdown.
Stage | Hardware | Total GPU hours | Fraction of total (%)
PPO experts (∼384 experts) | RTX 4090 | 12,000 | 75%
Distillation (Humanoid-GPT-S/B/L) | H100 | 3,000 | 25%
Total | — | 15,000 | 100%

Figure 11 T-SNE distribution Visu...
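
Entries [47] to [49] above read like three steps of a teacher-student distillation loop in the style of DAgger, the no-regret imitation-learning reduction cited as [31]: initialize both policies, roll out the student while querying the expert on the same states, then train the student on the expert's labels. The toy sketch below illustrates that loop under heavy assumptions. The 1-D state, the linear student, the expert gain of -0.5, and all function names are hypothetical and stand in for the simulator, the PPO experts, and the Transformer student described in the paper.

```python
def expert_action(state):
    """Hypothetical expert: a fixed proportional controller."""
    return -0.5 * state


def env_step(state, action):
    """Toy 1-D dynamics: the action is added to the state."""
    return state + action


def fit_linear(states, actions):
    """Least-squares weight w minimizing sum((w * s - a)**2)."""
    numerator = sum(s * a for s, a in zip(states, actions))
    denominator = sum(s * s for s in states)
    return numerator / denominator if denominator > 0 else 0.0


def distill(num_iterations=5, horizon=10, initial_state=1.0):
    """DAgger-style loop: the student acts, the expert labels, the student refits."""
    student_weight = 0.0  # step 1: initialize the student (the expert is fixed)
    data_states, data_actions = [], []
    for _ in range(num_iterations):
        state = initial_state
        for _ in range(horizon):
            # Step 2: query the expert on the state the student actually visits.
            data_states.append(state)
            data_actions.append(expert_action(state))
            # Step 3 (second half): the environment advances with the student's action.
            state = env_step(state, student_weight * state)
        # Step 3 (first half): train the student on all data aggregated so far.
        student_weight = fit_linear(data_states, data_actions)
    return student_weight


learned_weight = distill()
```

The design point is that the training states come from the student's own rollouts, so the student is supervised on the situations it actually reaches rather than only on expert trajectories.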

[1] [1]

Dai, Anja Hauth, Katie Mil- lican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Mil- lican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lilli- crap, Angeliki Lazaridou, Orhan Firat, James Molloy, Micha...

Pith/arXiv arXiv 2023

[2] [2]

Retargeting matters: General motion retargeting for humanoid motion tracking

Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C Karen Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025

arXiv 2025

[3] [3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

2020

[4] [4]

Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025

[5] [5]

Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

arXiv 2024

[6] [6]

Synchronized human-humanoid motion imitation.IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

Antonin Dallard, Mehdi Benallegue, Fumio Kane- hiro, and Abderrahmane Kheddar. Synchronized human-humanoid motion imitation.IEEE Robotics and Automation Letters, 8(7):4155–4162, 2023

2023

[7] [7]

Go to zero: Towards zero-shot motion generation with million-scale data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025

2025

[8] [8]

Humanplus: Humanoid shadowing and imitation from humans

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. 270:2828–2844, 2024

2024

[9] [9]

Robust motion in-betweening

Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020

2020

[10] [10]

Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858, 2024

arXiv 2024

[11] [11]

Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025

arXiv 2025

[12] [12]

Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Connea...

Pith/arXiv arXiv 2024

[13] [13]

Openai o1 system card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondric...

Pith/arXiv arXiv 2024

[14] [14]

Exbody2: Advanced expressive humanoid whole-body control

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Advanced expressive humanoid whole-body control. arXiv preprint arXiv:2412.13196, 2024

arXiv 2024

[15] [15]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3992–4003. IEEE, 2023

2023

[16] [16]

Phuma: Physically-grounded humanoid locomotion dataset

Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, and Jaegul Choo. Phuma: Physically-grounded humanoid locomotion dataset. arXiv preprint arXiv:2510.26236, 2025

Pith/arXiv arXiv 2025

[17] [17]

Object motion guided human motion synthesis

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023

2023

[18] [18]

Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks

Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks. arXiv preprint arXiv:2506.08931, 2025

arXiv 2025

[19] [19]

Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2025

Pith/arXiv arXiv 2025

[20] [20]

Motion-x: A large-scale 3d expressive whole-body human motion dataset

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36:25268–25280, 2023

2023

[21] [21]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023

2023

[22] [22]

Universal humanoid motion representations for physics-based control

Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representations for physics-based control. arXiv preprint arXiv:2310.04582, 2023

arXiv 2023

[23] [23]

Sonic: Supersizing motion tracking for natural humanoid whole-body control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025

[24] [24]

Amass: Archive of motion capture as surface shapes

Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019

2019

[25] [25]

Learning from massive human videos for universal humanoid pose control

Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, and Yue Wang. Learning from massive human videos for universal humanoid pose control. arXiv preprint arXiv:2412.14172, 2024

arXiv 2024

[26] [26]

ONNX Runtime

Microsoft Corporation. ONNX Runtime, 2024. https://onnxruntime.ai. High-performance inference engine for ONNX models

2024

[27] [27]

NVIDIA TensorRT

NVIDIA Corporation. NVIDIA TensorRT, 2024. https://developer.nvidia.com/tensorrt. Version 10.0

2024

[28] [28]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

2022

[29] [29]

Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577, 2023

arXiv 2023

[30] [30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021

2021

[31] [31]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011

[32] [32]

Chatgpt and open-ai models: A preliminary review

Konstantinos I. Roumeliotis and Nikolaos D. Tselikas. Chatgpt and open-ai models: A preliminary review. Future Internet, 15(6):192, 2023

2023

[33] [33]

Deepphase: Periodic autoencoders for learning motion phase manifolds

Sebastian Starke, Ian Mason, and Taku Komura. Deepphase: Periodic autoencoders for learning motion phase manifolds. ACM Transactions on Graphics (ToG), 41(4):1–13, 2022

2022

[34] [34]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

2012

[35] [35]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023

Pith/arXiv arXiv 2023

[36] [36]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017

2017

[37] [37]

From experts to a generalist: Toward general whole-body control for humanoid robots

Yuxuan Wang, Ming Yang, Weishuai Zeng, Yu Zhang, Xinrun Xu, Haobin Jiang, Ziluo Ding, and Zongqing Lu. From experts to a generalist: Toward general whole-body control for humanoid robots. CoRR, abs/2506.12779, 2025. doi: 10.48550/ARXIV.2506.12779. https://doi.org/10.48550/arXiv.2506.12779

work page doi:10.48550/arxiv.2506.12779 2025

[38] [38]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

Pith/arXiv arXiv 2022

[39] [39]

Kungfubot: Physics-based humanoid whole-body control for learning highly-dynamic skills

Weiji Xie, Jinrui Han, Jiakun Zheng, Huanyu Li, Xinzhe Liu, Jiyuan Shi, Weinan Zhang, Chenjia Bai, and Xuelong Li. Kungfubot: Physics-based humanoid whole-body control for learning highly-dynamic skills. arXiv preprint arXiv:2506.12851, 2025

arXiv 2025

[40] [40]

Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C Karen Liu, Rocky Duan, and Guanya Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint arXiv:2509.26633, 2025

Pith/arXiv arXiv 2025

[41] [41]

Unitracker: Learning universal whole-body motion tracker for humanoid robots

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole-body motion tracker for humanoid robots. arXiv preprint arXiv:2507.07356, 2025

arXiv 2025

[42] [42]

Twist: Teleoperated whole-body imitation system

Yanjie Ze, Zixuan Chen, João Pedro Araújo, Ziang Cao, Xue Bin Peng, Jiajun Wu, and C Karen Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025

arXiv 2025

[43] [43]

Motion-x++: A large-scale multimodal 3d whole-body human motion dataset

Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x++: A large-scale multimodal 3d whole-body human motion dataset. arXiv preprint arXiv:2501.05098, 2025

arXiv 2025

[44] [44]

Freemotion: Mocap-free human motion synthesis with multimodal large language models

Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, and Li Yi. Freemotion: Mocap-free human motion synthesis with multimodal large language models. In European Conference on Computer Vision, pages 403–421. Springer, 2024

2024

[45] [45]

Unleashing humanoid reaching potential via real-world-ready skill space

Zhikai Zhang, Chao Chen, Han Xue, Jilong Wang, Sikai Liang, Yun Liu, Zongzhang Zhang, He Wang, and Li Yi. Unleashing humanoid reaching potential via real-world-ready skill space. arXiv preprint arXiv:2505.10918, 2025

arXiv 2025

[46] [46]

Track any motions under any disturbances

Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, et al. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833, 2025.

arXiv 2025

A Summary of Contributions

Science of Scale: We are the first motion tracker with zero-shot ability trained on 2B Frame data. Our training set is o...

1. Initialize both the expert teacher and the student policy within the simulation environment.
2. At iteration i, roll out the student policy and query the expert for the corresponding target action using the same state.
3. Train the student to match the expert’s action, and then update the environment using the student’s executed action.

Table 5 Hyperparameter settings for training motion experts.
Hyperparameter | Value
Env Numbers | 32768
Batch size | 1024
Discount factor γ | 0.97
GAE parameter λ | 0.95
Clipping parameter ϵ | 0.2
Policy network size | [512, 256, 128]
Critic network size | [512, 256, 128]
Learning rate...

Table 8 Approximate compute breakdown.
Stage | Hardware | Total GPU hours | Fraction of total (%)
PPO experts (∼384 experts) | RTX 4090 | 12,000 | 75%
Distillation (Humanoid-GPT-S/B/L) | H100 | 3,000 | 25%
Total | - | 15,000 | 100%

Figure 11 T-SNE distribution Visu...
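
The three distillation steps above (initialize teacher and student, roll out the student while querying the expert on the same states, train the student on the expert's actions while the environment advances with the student's action) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the scalar environment, the linear `expert_action`, the one-parameter student, the least-squares fit, and the aggregation of data across iterations (borrowed from DAgger [31]) are all hypothetical stand-ins for the simulator, PPO experts, and Transformer student.

```python
import random


def expert_action(state):
    # Hypothetical expert teacher: a fixed linear feedback law.
    return -0.8 * state


def rollout_student(weight, rng, horizon=50):
    """Roll out the student and label each visited state with the expert's action."""
    state = rng.gauss(0.0, 1.0)
    pairs = []
    for _ in range(horizon):
        # Step 2: query the expert on the same state the student sees.
        pairs.append((state, expert_action(state)))
        # Step 3 (second half): the environment advances with the
        # student's executed action, not the expert's.
        action = weight * state
        state = state + action + rng.gauss(0.0, 0.01)
    return pairs


def distill(iterations=5, seed=0):
    rng = random.Random(seed)
    weight = 0.0   # Step 1: initialize the student (toy one-parameter policy).
    dataset = []   # Assumption: data is aggregated across iterations.
    for _ in range(iterations):
        dataset.extend(rollout_student(weight, rng))
        # Step 3 (first half): fit the student to the expert's actions
        # with a closed-form least-squares solution.
        numerator = sum(s * a for s, a in dataset)
        denominator = sum(s * s for s, _ in dataset)
        weight = numerator / denominator
    return weight


student_weight = distill()
```

Because the student visits states under its own policy, the expert's labels cover the state distribution the student will actually encounter at deployment, which is the usual motivation for this style of distillation.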