Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
Pith reviewed 2026-06-28 09:59 UTC · model grok-4.3
The pith
A GPT-style Transformer pre-trained on a 2B-frame unified motion corpus tracks dynamic humanoid behaviors and generalizes zero-shot to unseen motions and tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Humanoid-GPT is a GPT-style Transformer with causal attention trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings; scaling both data volume and model capacity produces a single generative model that simultaneously tracks highly dynamic and complex motions while achieving robust zero-shot generalization to unseen motions and control tasks.
What carries the argument
The GPT-style Transformer with causal attention, used as a generative model for whole-body motion tracking and conditioned on the large unified retargeted motion corpus.
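To make the "causal attention" ingredient concrete, here is a minimal single-head sketch in plain Python. It is an illustration only, not the paper's architecture: the function name `causal_attention` and the toy two-dimensional frames are assumptions, and a real model would add learned projections, multiple heads, and stacked layers. What it shows is the causal mask itself, where the output at frame t depends only on frames 0..t.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of T vectors (lists of floats).
    Position t may only attend to positions 0..t.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores against the current and past frames only (the causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out = [sum(w * values[s][d] for s, w in enumerate(weights))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy "motion frames" (hypothetical values, two features per frame).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tracked = causal_attention(frames, frames, frames)
```

Because of the mask, changing a later frame leaves the outputs for earlier frames unchanged, which is the property that lets such a model run frame by frame at control time.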
If this is right
- A single model replaces the previous collection of task-specific trackers.
- Scaling data volume and model size directly improves both tracking accuracy on dynamic motions and zero-shot performance on novel tasks.
- Extensive scaling analyses confirm that performance continues to rise with additional data and parameters.
- The same model supports both motion tracking and downstream control tasks without per-task retraining.
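One common way to test a claim like "performance continues to rise with additional data and parameters" is to fit a power law to error against scale in log-log space. The sketch below assumes that functional form, error = a * size^(-b); the paper's visible text does not state which form, if any, the authors fit. The data points are synthetic, generated from an assumed law, and are not results from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**(-b) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points from an assumed law; NOT results from the paper.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [5.0 * s ** -0.25 for s in sizes]
a, b = fit_power_law(sizes, errors)
```

A positive fitted exponent b that stays stable when the largest point is held out would support the "continues to rise" reading; a curve that flattens would not.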
Where Pith is reading between the lines
- If the corpus coherence assumption holds across more sensor modalities, the same scaling recipe could apply to vision-based or force-based humanoid control.
- The approach suggests that language-conditioned variants could allow instruction-driven motion generation without retraining the core tracker.
- Hardware deployment may become simpler if one model handles both locomotion and manipulation tasks at deployment time.
Load-bearing premise
The retargeted mocap data from many different sources forms one coherent distribution that does not introduce biases blocking generalization to motions never seen in training.
What would settle it
A controlled experiment in which the model is tested on a motion sequence whose kinematics cannot be expressed as a retargeting of any frame in the 2B-frame corpus and shows clear failure to track or generalize.
Original abstract
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Humanoid-GPT, a GPT-style Transformer with causal attention pre-trained on a 2B-frame retargeted motion-capture corpus that unifies major public mocap datasets with large-scale in-house recordings. It claims that jointly scaling data volume and model capacity produces a single generative model capable of tracking highly dynamic whole-body behaviors while delivering unprecedented zero-shot generalization to unseen motions and control tasks, thereby overcoming the agility-generalization trade-off that constrained prior shallow MLP trackers. Extensive experiments and scaling analyses are presented to support a new performance frontier.
Significance. If the zero-shot generalization claims are substantiated, the work would constitute a notable advance in humanoid robotics by demonstrating that large-scale generative pre-training on unified motion data can yield a versatile controller without per-task retraining. The explicit focus on scaling both data and architecture, together with the unification of disparate mocap sources, supplies a concrete empirical test of whether language-model-style scaling laws extend to whole-body control.
major comments (2)
- [Data corpus construction and unification] The central zero-shot generalization claim rests on the premise that retargeting produces a single coherent training distribution free of source-specific kinematic or dynamic artifacts. No quantitative validation of this premise—such as cross-source distribution overlap statistics, per-source bias metrics, or an ablation isolating retargeting artifacts—is supplied in the data-preparation or experimental sections. This is load-bearing: absent such checks the reported scaling gains could arise from memorization of retargeting regularities rather than genuine generalization to truly unseen motions.
- [Scaling analyses] The scaling analyses are described as demonstrating performance gains from increased data and capacity, yet the manuscript does not report the precise model sizes, data-subset sizes, or controlled ablations that would isolate the contribution of each factor while holding other variables fixed. Without these details the attribution of the performance frontier specifically to joint scaling remains under-supported.
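The controlled ablation requested in the second major comment can be pictured as a full grid over model size and data size with everything else held fixed. The sketch below is hypothetical throughout: the parameter counts, frame counts, step budget, optimizer name, and evaluation suite name are placeholders, since the source does not report the actual values.

```python
from itertools import product

# Hypothetical sweep values; the paper's actual sizes are not reported in the source.
MODEL_PARAMS = [10_000_000, 100_000_000, 1_000_000_000]
DATA_FRAMES = [20_000_000, 200_000_000, 2_000_000_000]
# Settings held fixed across every run (placeholder values).
FIXED = {"train_steps": 100_000, "optimizer": "adamw", "eval_suite": "held_out_motions"}

def build_grid(model_params, data_frames, fixed):
    """One run config per (model size, data size) pair, sharing the fixed settings."""
    return [dict(fixed, params=p, frames=f)
            for p, f in product(model_params, data_frames)]

def one_factor_slice(grid, key, held_key, held_value):
    """Runs where `held_key` equals `held_value`, sorted by `key`."""
    runs = [r for r in grid if r[held_key] == held_value]
    return sorted(runs, key=lambda r: r[key])

grid = build_grid(MODEL_PARAMS, DATA_FRAMES, FIXED)
data_sweep = one_factor_slice(grid, "frames", "params", 100_000_000)
```

Each slice varies exactly one factor, so a trend along `data_sweep` can be attributed to data volume alone rather than to joint scaling.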
minor comments (2)
- [Methods] Notation for retargeting parameters and the precise definition of 'zero-shot' (e.g., whether any motion statistics from the target task appear in any training source) should be stated explicitly in the methods section to avoid ambiguity.
- [Figures] Figure captions for the scaling curves should include the exact number of parameters and frames used at each point to allow direct replication of the reported trends.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our data unification and scaling results. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additional analyses.
Point-by-point responses
Referee: [Data corpus construction and unification] The central zero-shot generalization claim rests on the premise that retargeting produces a single coherent training distribution free of source-specific kinematic or dynamic artifacts. No quantitative validation of this premise—such as cross-source distribution overlap statistics, per-source bias metrics, or an ablation isolating retargeting artifacts—is supplied in the data-preparation or experimental sections. This is load-bearing: absent such checks the reported scaling gains could arise from memorization of retargeting regularities rather than genuine generalization to truly unseen motions.
Authors: We agree that explicit quantitative validation of the unified corpus is important to substantiate the zero-shot claims. Although our data-preparation pipeline included internal consistency checks across sources, these were not reported in the manuscript. In the revised version we will add cross-source distribution overlap statistics (e.g., Wasserstein distances on joint-angle and velocity histograms), per-source bias metrics, and a controlled ablation that isolates retargeting artifacts by comparing models trained on raw versus retargeted subsets. These additions will directly address whether the observed generalization stems from coherent scaling rather than source-specific regularities.
Revision: yes
Referee: [Scaling analyses] The scaling analyses are described as demonstrating performance gains from increased data and capacity, yet the manuscript does not report the precise model sizes, data-subset sizes, or controlled ablations that would isolate the contribution of each factor while holding other variables fixed. Without these details the attribution of the performance frontier specifically to joint scaling remains under-supported.
Authors: We appreciate the request for greater precision. The original submission contained high-level descriptions of model scaling and data volume increases but omitted exact parameter counts, subset sizes, and fully controlled ablations. In the revision we will report the precise model sizes (parameter counts and layer dimensions), the exact frame counts for each data subset used in the scaling curves, and additional ablations that vary data volume and capacity independently while holding training steps, optimizer settings, and evaluation protocols fixed. These details will make the attribution to joint scaling explicit.
Revision: yes
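The cross-source overlap statistic promised in the first response can be sketched for a single joint angle. This is a simplified stand-in, not the authors' procedure: it works on raw samples rather than histograms, assumes both sources contribute the same number of samples, and the function name and the sample values are hypothetical.

```python
def wasserstein_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For equal sample counts with uniform weights, this equals the mean
    absolute difference between the sorted samples.
    """
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    a = sorted(samples_a)
    b = sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical knee-angle samples (radians) from two mocap sources.
source_a = [0.0, 1.0, 2.0]
source_b = [0.5, 1.5, 2.5]
shift = wasserstein_1d(source_a, source_b)
```

A distance near zero for every joint and every pair of sources would support the coherent-distribution premise; a consistently large distance for one source would flag the kind of retargeting bias the referee is concerned about.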
Circularity Check
No circularity: empirical scaling results with no derivation chain
Full rationale
The paper presents an empirical ML result: a GPT-style Transformer trained on a 2B-frame retargeted mocap corpus yields zero-shot generalization. No equations, derivations, or first-principles claims appear in the abstract or described content. Claims rest on scaling experiments and performance metrics rather than any reduction of outputs to fitted inputs or self-citations. The central premise (scaling data + capacity produces generalization) is an empirical observation, not a self-definitional or fitted-prediction construct. This is the expected non-finding for a scaling paper without mathematical derivations.
A Summary of Contributions
Science of Scale: We are the first motion tracker with zero-shot ability trained on 2B Frame data. Our training set is o...
1. Initialize both the expert teacher and the student policy within the simulation environment.
2. At iteration i, roll out the student policy and query the expert for the corresponding target action using the same state.
3. Train the student to match the expert's action, and then update the environment using the student's executed action.
Table 5 Hyperparameter settings for training motion experts.
Hyperparameter | Value
Env Numbers | 32768
Batch size | 1024
Discount factor γ | 0.97
GAE parameter λ | 0.95
Clipping parameter ϵ | 0.2
Policy network size | [512, 256, 128]
Critic network size | [512, 256, 128]
Learning rate...
Table 8 Approximate compute breakdown.
Stage | Hardware | Total GPU hours | Fraction of total (%)
PPO experts (~384 experts) | RTX 4090 | 12,000 | 75%
Distillation (Humanoid-GPT-S/B/L) | H100 | 3,000 | 25%
Total | - | 15,000 | 100%
Figure 11 T-SNE distribution Visu...
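The three distillation steps quoted above (initialize expert and student, roll out the student while querying the expert on the same states, then train the student and step the environment with the student's action) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the one-dimensional state, the proportional-controller expert, the linear student, and the choice to aggregate data across iterations, which the visible text does not confirm.

```python
import random

def expert_action(state):
    # Hypothetical expert: a proportional controller driving the state to zero.
    return -0.5 * state

def rollout_and_label(weight, horizon, rng):
    """Roll out the student; label each visited state with the expert's action."""
    state = rng.uniform(-1.0, 1.0)
    pairs = []
    for _ in range(horizon):
        # Query the expert on the same state the student sees.
        pairs.append((state, expert_action(state)))
        # The environment advances with the student's action, not the expert's.
        state = state + weight * state
    return pairs

def fit_student(dataset):
    """Least-squares fit of the linear student: action = weight * state."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    return num / den if den > 0 else 0.0

def distill(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    weight = 0.0   # step 1: initial student (the expert is fixed above)
    dataset = []   # assumption: labelled data is aggregated across iterations
    for _ in range(iterations):
        dataset.extend(rollout_and_label(weight, horizon, rng))  # step 2
        weight = fit_student(dataset)                            # step 3
    return weight

student_weight = distill()
```

Rolling out the student rather than the expert means the training data covers the states the student actually reaches, which is the usual motivation for this style of distillation over plain behavior cloning.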