HoloMotion-1 Technical Report
Pith reviewed 2026-05-20 20:25 UTC · model grok-4.3
The pith
HoloMotion-1 shows that a Transformer policy trained on mixed video-reconstructed and motion-capture data can track diverse whole-body motions zero-shot and transfer directly to a real humanoid robot.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HoloMotion-1 is a humanoid motion foundation model trained on a large-scale hybrid motion corpus in which video-reconstructed motions from in-the-wild videos are the dominant source, complemented by curated motion-capture and in-house data. It pairs large-capacity temporal modeling with a sparsely activated Mixture-of-Experts Transformer that uses KV-cache inference for real-time control, and it applies sequence-level training to improve learning efficiency on extended sequences. Experiments on multiple unseen motion benchmarks demonstrate robust generalization across diverse motion types and capture conditions, higher tracking accuracy than prior methods, and direct zero-shot transfer to a real humanoid robot.
What carries the argument
A sparsely activated Mixture-of-Experts Transformer with KV-cache inference that performs real-time whole-body motion tracking while managing large behavioral variation from heterogeneous training data.
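To make "sparsely activated" concrete, here is a minimal sketch of top-k expert routing for a single token. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`moe_forward`, `make_scale_expert`), the linear router, the choice of `top_k=2`, and the toy experts are all hypothetical, and the report as summarized here does not specify its routing rule.

```python
import math

def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token through its top_k experts (hypothetical helper).

    token: list of floats; router_weights: one row of floats per expert;
    experts: callables that map a vector to a vector of the same size.
    """
    # Router logits: one dot product per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Only the top_k experts are evaluated; the rest are skipped entirely.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Gate weights are renormalised over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen

def make_scale_expert(scale):
    # Toy expert that just rescales its input (assumption for illustration).
    return lambda v: [scale * x for x in v]

experts = [make_scale_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_forward([2.0, 1.0], router_weights, experts, top_k=2)
```

In this toy setup only two of the four experts run for the token, which is the property that keeps per-step compute bounded as the total number of experts grows.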
If this is right
- Tracking accuracy improves over prior methods on multiple unseen motion benchmarks that vary in type and capture condition.
- The learned policy runs in real time on a real humanoid robot without any additional fine-tuning for the deployment task.
- Training on the hybrid corpus expands the set of motion styles and environmental conditions the policy can handle compared with MoCap-only baselines.
- Sequence-level training reduces the sample inefficiency that normally appears when learning from long, variable-length motion trajectories.
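The real-time claim in the list above rests on KV-cache inference. The following is a minimal single-head sketch of the idea, assuming queries, keys, and values are already computed per control step; the `KVCache` class and the toy numbers are hypothetical and are not taken from the report. It shows that appending one key/value pair per step gives the same output as recomputing attention over the whole causal prefix.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum((e / total) * v[d] for e, v in zip(exps, values)) for d in range(dim)]

class KVCache:
    """Hypothetical single-head cache that keeps past keys/values between steps."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the newest key/value is added; the stored prefix is reused as-is.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy 2-D stream of three control steps (made-up numbers).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(queries, keys, values)]
# Reference: rebuild the causal prefix from scratch at every step.
full_out = [attend(queries[t], keys[: t + 1], values[: t + 1]) for t in range(3)]
```

A deployed controller would normally also bound the cache length, since an ever-growing cache makes per-step cost grow with the horizon; how HoloMotion-1 handles this is not stated in the material summarized here.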
Where Pith is reading between the lines
- The same hybrid-data strategy could be tested on other whole-body control tasks such as locomotion or manipulation where video data is abundant but noisy.
- If the KV-cache mechanism maintains stability over very long horizons, the approach might extend to online adaptation during robot operation rather than pure offline training.
- Further growth of the video-reconstructed portion of the corpus could be used to probe the scaling limits of generalization in humanoid motion policies.
Load-bearing premise
That large temporal modeling capacity together with a Mixture-of-Experts Transformer and sequence-level training can overcome reconstruction noise, source-domain mismatch, and uneven motion quality in the hybrid corpus enough to support reliable zero-shot transfer to a physical robot.
What would settle it
Direct measurement showing that the model fails to track a motion sequence on the physical humanoid robot when the same sequence produces low tracking error in the simulation benchmark used during evaluation.
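One way such a measurement could be scored is sketched below, assuming mean per-joint position error (MPJPE) as the tracking metric. The metric choice, the function names `mpjpe` and `sim_to_real_gap`, and the toy trajectories are assumptions for illustration; the material summarized here does not state which error metric the paper reports.

```python
import math

def mpjpe(reference, tracked):
    """Mean Euclidean distance per joint per frame.

    Both inputs are lists of frames; each frame is a list of (x, y, z) joints.
    """
    if len(reference) != len(tracked):
        raise ValueError("trajectories must have the same number of frames")
    total, count = 0.0, 0
    for ref_frame, trk_frame in zip(reference, tracked):
        if len(ref_frame) != len(trk_frame):
            raise ValueError("frames must have the same number of joints")
        for ref_joint, trk_joint in zip(ref_frame, trk_frame):
            total += math.dist(ref_joint, trk_joint)
            count += 1
    return total / count

def sim_to_real_gap(reference, sim_tracked, real_tracked):
    """Compare tracking error for the same reference motion in sim and on hardware."""
    sim_err = mpjpe(reference, sim_tracked)
    real_err = mpjpe(reference, real_tracked)
    return {"sim": sim_err, "real": real_err, "gap": real_err - sim_err}

# Toy data: two frames, one joint each (made-up numbers, metres).
reference = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
sim_tracked = [[(0.01, 0.0, 0.0)], [(1.01, 0.0, 0.0)]]
real_tracked = [[(0.05, 0.0, 0.0)], [(1.05, 0.0, 0.0)]]
report = sim_to_real_gap(reference, sim_tracked, real_tracked)
```

A large positive `gap` on a sequence with low `sim` error would be the kind of evidence described above; the threshold for "large" is a judgment call that would need to be fixed before measuring.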
Original abstract
In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. It scales policy training on a large hybrid corpus dominated by video-reconstructed motions from in-the-wild videos, augmented by curated MoCap and in-house data. The architecture combines large-capacity temporal modeling with a sparsely activated Mixture-of-Experts Transformer using KV-cache for real-time inference and employs sequence-level training to handle extended sequences. The authors report that extensive experiments demonstrate robust generalization across diverse unseen motion benchmarks and capture conditions, improved tracking accuracy relative to prior methods, and direct zero-shot transfer to a real humanoid robot without task-specific fine-tuning.
Significance. If the experimental claims are substantiated with quantitative evidence, this approach could meaningfully advance scalable humanoid control by showing how video-derived data can expand behavioral coverage beyond conventional MoCap-only regimes while maintaining deployability.
major comments (2)
- [Abstract] The claims that HoloMotion-1 'generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning' are presented without any reported quantitative metrics, baselines, error bars, success rates, or ablation results. This absence prevents verification of whether the MoE Transformer and sequence-level training actually mitigate the reconstruction noise and domain mismatch explicitly flagged in the same paragraph.
- [Abstract (and implied Experiments section)] The description of the hybrid corpus and training strategy identifies reconstruction noise, source-domain mismatch, and uneven motion quality as central challenges, yet no ablation removing the video-reconstruction component, no distribution-shift statistics between video-recon and MoCap sources, and no real-robot performance conditioned on motion style or capture condition are supplied. These omissions are load-bearing for the zero-shot transfer claim.
minor comments (1)
- [Abstract] The proportion of video-reconstructed versus MoCap data is described only qualitatively ('dominant source'); a numerical breakdown would help readers assess the scale of the heterogeneity being addressed.
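The numerical breakdown and distribution-shift statistics requested in the comments above could take a form like the following sketch. Everything here is hypothetical: the source labels, the clip durations, the scalar feature, and the choice of a standardized mean difference as the shift measure are illustrative assumptions, not values or methods from the paper.

```python
from collections import defaultdict
import statistics

def source_shares(clips):
    """Share of total duration per data source.

    clips: iterable of (source_name, duration_seconds) pairs.
    """
    totals = defaultdict(float)
    for source, duration in clips:
        totals[source] += duration
    grand_total = sum(totals.values())
    return {source: d / grand_total for source, d in totals.items()}

def standardized_mean_difference(a, b):
    """(mean(a) - mean(b)) divided by the pooled standard deviation.

    A crude one-dimensional shift measure for a single scalar feature,
    for example mean joint speed per clip.
    """
    pooled = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Made-up clip list: durations in seconds.
clips = [("video", 30.0), ("video", 50.0), ("mocap", 15.0), ("inhouse", 5.0)]
shares = source_shares(clips)

# Made-up per-clip feature values for two sources.
video_feature = [1.0, 2.0, 3.0]
mocap_feature = [2.0, 3.0, 4.0]
shift = standardized_mean_difference(video_feature, mocap_feature)
```

Reporting shares by duration rather than by clip count avoids overstating sources that consist of many short clips; richer shift measures over full pose distributions would be needed for a serious analysis.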
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of how claims and supporting evidence are presented, particularly in the abstract and regarding the hybrid data regime. We address each major comment below and have revised the manuscript to improve clarity and substantiation of results.
- Referee: [Abstract] The claims that HoloMotion-1 'generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning' are presented without any reported quantitative metrics, baselines, error bars, success rates, or ablation results. This absence prevents verification of whether the MoE Transformer and sequence-level training actually mitigate the reconstruction noise and domain mismatch explicitly flagged in the same paragraph.
  Authors: We agree that the abstract would benefit from explicit quantitative anchors to support the high-level claims. The full manuscript reports these details in the Experiments section, including tracking error metrics, baseline comparisons, and success rates across unseen benchmarks. To directly address the concern, we have revised the abstract to include key quantitative results (e.g., average error reductions and generalization metrics) while preserving its concise nature. The MoE Transformer with KV-cache and sequence-level training are shown in the experiments to improve robustness to the noted noise and mismatch factors through comparative results against prior methods.
  Revision: yes
- Referee: [Abstract (and implied Experiments section)] The description of the hybrid corpus and training strategy identifies reconstruction noise, source-domain mismatch, and uneven motion quality as central challenges, yet no ablation removing the video-reconstruction component, no distribution-shift statistics between video-recon and MoCap sources, and no real-robot performance conditioned on motion style or capture condition are supplied. These omissions are load-bearing for the zero-shot transfer claim.
  Authors: This is a fair observation on the need for targeted evidence supporting the hybrid corpus benefits. The manuscript already includes overall comparisons demonstrating gains from the full hybrid data over MoCap-only training, along with real-robot zero-shot transfer results. However, we acknowledge the value of a dedicated ablation isolating the video-reconstruction component and explicit distribution-shift statistics. We have added these elements in the revised Experiments section, including a new ablation study and quantitative domain-difference measures. For real-robot performance, we have expanded the analysis to report results conditioned on motion style and capture condition categories to further substantiate the zero-shot claim.
  Revision: partial
Circularity Check
No circularity: claims rest on experimental outcomes from hybrid data training, not self-referential definitions or fitted predictions.
The paper describes HoloMotion-1 as a scaled policy trained on a hybrid motion corpus (video-reconstructions dominant plus MoCap) using MoE Transformer with KV-cache and sequence-level training. It reports generalization on unseen benchmarks and zero-shot real-robot transfer. No equations, derivations, or first-principles results appear in the abstract or described content. Claims are supported by empirical evaluation rather than quantities defined in terms of their own fitted parameters or self-citation chains that reduce to inputs. The architecture and data regime are presented as design choices, not as outputs derived from the target performance metrics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy
- IndisputableMonolith/Foundation/DimensionForcing.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.