
arxiv: 2605.15336 · v2 · pith:A32YYNXQnew · submitted 2026-05-14 · 💻 cs.RO · cs.AI

HoloMotion-1 Technical Report

Pith reviewed 2026-05-20 20:25 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords humanoid motion tracking · zero-shot transfer · mixture of experts · whole-body control · motion foundation model · video motion reconstruction · real-robot deployment

The pith

HoloMotion-1 shows a transformer policy trained on mixed video-reconstructed and motion-capture data can track diverse whole-body motions zero-shot and transfer directly to real humanoid robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HoloMotion-1 as a foundation model that learns humanoid motion tracking from a hybrid corpus in which video-reconstructed motions supply broad behavioral diversity and motion-capture data supplies higher-fidelity examples. This regime moves past the narrow coverage of traditional MoCap-only training, but it also forces the model to cope with reconstruction noise, domain shift, and uneven motion quality, which the paper addresses with large-capacity temporal modeling and an efficient architecture. If the approach holds, the resulting policy should generalize to previously unseen motion types and capture conditions while running in real time on physical robots without any task-specific retraining. The central mechanism is a sparsely activated Mixture-of-Experts Transformer that uses a KV-cache for efficient inference, paired with sequence-level training to process long motion trajectories effectively.

Core claim

HoloMotion-1 is a humanoid motion foundation model trained on a large-scale hybrid motion corpus in which video-reconstructed motions from in-the-wild videos are the dominant source, complemented by curated motion-capture and in-house data. It integrates large-capacity temporal modeling via a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and applies sequence-level training to improve learning efficiency on extended sequences. Experiments on multiple unseen motion benchmarks are reported to show robust generalization across diverse motion types and capture conditions, higher tracking accuracy than prior methods, and direct zero-shot transfer to a real humanoid robot.
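To make the KV-cache idea concrete, the following is a minimal single-head sketch in plain Python. It is not the paper's implementation: the `KVCache` class, the helper names, and the toy vectors are all hypothetical, and a real policy would use batched tensors and multiple heads. The point it illustrates is only that appending one key/value pair per control step gives the same causal-attention output as recomputing over the whole history.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCache:
    """Hypothetical helper: keeps keys and values of past control steps."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest key/value is added; earlier steps are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_causal_attention(queries, keys, values):
    """Reference path: recompute attention over the whole prefix each step."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]


# Toy per-step projections (assumed values, for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(queries, keys, values)]
reference_out = full_causal_attention(queries, keys, values)
```

In this sketch each cached step does work proportional to the current history length, whereas recomputing every prefix from scratch repeats all earlier projections; that saving is the usual motivation for caching in real-time inference.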

What carries the argument

A sparsely activated Mixture-of-Experts Transformer with KV-cache inference that performs real-time whole-body motion tracking while managing large behavioral variation from heterogeneous training data.
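As a rough illustration of what "sparsely activated" means, here is a small top-k routing sketch in plain Python. Everything in it is an assumption made for illustration: the paper summary does not state the number of experts, the value of k, or the gating function, and the scaling "experts" below stand in for real feed-forward sub-networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest gate logits."""
    ranked = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, gate_weights, experts, k=2):
    """Run only the k selected experts and mix their outputs."""
    # One gate row per expert; the logit is a dot product with the input.
    gate_logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for index, weight in routes:
        expert_out = experts[index](x)  # unselected experts are never called
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, routes


def make_scaling_expert(scale):
    """Toy expert (assumption): multiplies the input by a constant."""
    return lambda x: [scale * xi for xi in x]


experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [2.0, 1.0]

output, routes = moe_forward(x, gate_weights, experts, k=2)
```

With four experts and k=2, only half of the expert parameters are exercised for this input, which is how a sparse layer can carry large total capacity while keeping per-step compute bounded.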

If this is right

  • Tracking accuracy improves over prior methods on multiple unseen motion benchmarks that vary in type and capture condition.
  • The learned policy runs in real time on a real humanoid robot without any additional fine-tuning for the deployment task.
  • Training on the hybrid corpus expands the set of motion styles and environmental conditions the policy can handle compared with MoCap-only baselines.
  • Sequence-level training reduces the sample inefficiency that normally appears when learning from long, variable-length motion trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid-data strategy could be tested on other whole-body control tasks such as locomotion or manipulation where video data is abundant but noisy.
  • If the KV-cache mechanism maintains stability over very long horizons, the approach might extend to online adaptation during robot operation rather than pure offline training.
  • Further growth of the video-reconstructed portion of the corpus could be used to probe the scaling limits of generalization in humanoid motion policies.

Load-bearing premise

That large temporal modeling capacity together with a Mixture-of-Experts Transformer and sequence-level training can overcome reconstruction noise, source-domain mismatch, and uneven motion quality in the hybrid corpus enough to support reliable zero-shot transfer to a physical robot.

What would settle it

Direct measurement showing that the model fails to track a motion sequence on the physical humanoid robot when the same sequence produces low tracking error in the simulation benchmark used during evaluation.
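One way such a check could be set up is sketched below. The metric is an assumption: it uses mean per-joint position error, a common tracking measure, because the summary does not say which metric the paper reports. The function names and the toy trajectories are hypothetical.

```python
import math


def mean_joint_error(reference, tracked):
    """Mean Euclidean distance between matching joints over all frames."""
    total = 0.0
    count = 0
    for ref_frame, trk_frame in zip(reference, tracked):
        for ref_joint, trk_joint in zip(ref_frame, trk_frame):
            total += math.dist(ref_joint, trk_joint)
            count += 1
    return total / count


def sim_to_real_gap(reference, sim_tracked, real_tracked):
    """Return (simulation error, real-robot error, real minus simulation)."""
    sim_err = mean_joint_error(reference, sim_tracked)
    real_err = mean_joint_error(reference, real_tracked)
    return sim_err, real_err, real_err - sim_err


def shift_x(frames, dx):
    """Toy perturbation (assumption): offset every joint along x by dx."""
    return [[(x + dx, y, z) for (x, y, z) in frame] for frame in frames]


# Two frames with two joints each, positions in metres (made-up values).
reference = [
    [(0.0, 0.0, 1.0), (0.2, 0.0, 1.5)],
    [(0.1, 0.0, 1.0), (0.3, 0.0, 1.5)],
]
sim_tracked = shift_x(reference, 0.01)
real_tracked = shift_x(reference, 0.05)

sim_err, real_err, gap = sim_to_real_gap(reference, sim_tracked, real_tracked)
```

A large positive gap on a sequence that tracks well in simulation would be the kind of evidence the paragraph above describes; what counts as "large" would have to come from the paper's own thresholds.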

Figures

Figures reproduced from arXiv: 2605.15336 by Bo Zhang, Kaihui Wang, Maiyue Chen, Qijun Huang, Xihan Ma, Yi Ren, Yucheng Wang, Zhiyuan Yang, Zhizhong Su, Zihao Zhu.

Figure 1: (a) The MoE-Transformer policy network architecture; (b) HoloMotion achieves the lowest overall […]
Figure 2: Real-world zero-shot transfer of the HoloMotion policy. In the first row, the robot performs high […]
Figure 3: The HoloMotion system pipeline. The framework provides an end-to-end workflow covering […]
Figure 4: Roadmap of HoloMotion toward a foundation model for whole-body humanoid control.

Original abstract

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. It scales policy training on a large hybrid corpus dominated by video-reconstructed motions from in-the-wild videos, augmented by curated MoCap and in-house data. The architecture combines large-capacity temporal modeling with a sparsely activated Mixture-of-Experts Transformer using KV-cache for real-time inference and employs sequence-level training to handle extended sequences. The authors report that extensive experiments demonstrate robust generalization across diverse unseen motion benchmarks and capture conditions, improved tracking accuracy relative to prior methods, and direct zero-shot transfer to a real humanoid robot without task-specific fine-tuning.

Significance. If the experimental claims are substantiated with quantitative evidence, this approach could meaningfully advance scalable humanoid control by showing how video-derived data can expand behavioral coverage beyond conventional MoCap-only regimes while maintaining deployability.

major comments (2)
  1. [Abstract] The claims that HoloMotion-1 'generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning' are presented without any reported quantitative metrics, baselines, error bars, success rates, or ablation results. This absence prevents verification of whether the MoE Transformer and sequence-level training actually mitigate the reconstruction noise and domain mismatch explicitly flagged in the same paragraph.
  2. [Abstract (and implied Experiments section)] The description of the hybrid corpus and training strategy identifies reconstruction noise, source-domain mismatch, and uneven motion quality as central challenges, yet no ablation removing the video-reconstruction component, no distribution-shift statistics between video-recon and MoCap sources, and no real-robot performance conditioned on motion style or capture condition are supplied. These omissions are load-bearing for the zero-shot transfer claim.
minor comments (1)
  1. [Abstract] The proportion of video-reconstructed versus MoCap data is described only qualitatively ('dominant source'); a numerical breakdown would help readers assess the scale of the heterogeneity being addressed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of how claims and supporting evidence are presented, particularly in the abstract and regarding the hybrid data regime. We address each major comment below and have revised the manuscript to improve clarity and substantiation of results.

Point-by-point responses
  1. Referee: [Abstract] The claims that HoloMotion-1 'generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning' are presented without any reported quantitative metrics, baselines, error bars, success rates, or ablation results. This absence prevents verification of whether the MoE Transformer and sequence-level training actually mitigate the reconstruction noise and domain mismatch explicitly flagged in the same paragraph.

    Authors: We agree that the abstract would benefit from explicit quantitative anchors to support the high-level claims. The full manuscript reports these details in the Experiments section, including tracking error metrics, baseline comparisons, and success rates across unseen benchmarks. To directly address the concern, we have revised the abstract to include key quantitative results (e.g., average error reductions and generalization metrics) while preserving its concise nature. The MoE Transformer with KV-cache and sequence-level training are shown in the experiments to improve robustness to the noted noise and mismatch factors through comparative results against prior methods.

    Revision: yes

  2. Referee: [Abstract (and implied Experiments section)] The description of the hybrid corpus and training strategy identifies reconstruction noise, source-domain mismatch, and uneven motion quality as central challenges, yet no ablation removing the video-reconstruction component, no distribution-shift statistics between video-recon and MoCap sources, and no real-robot performance conditioned on motion style or capture condition are supplied. These omissions are load-bearing for the zero-shot transfer claim.

    Authors: This is a fair observation on the need for targeted evidence supporting the hybrid corpus benefits. The manuscript already includes overall comparisons demonstrating gains from the full hybrid data over MoCap-only training, along with real-robot zero-shot transfer results. However, we acknowledge the value of a dedicated ablation isolating the video-reconstruction component and explicit distribution-shift statistics. We have added these elements in the revised Experiments section, including a new ablation study and quantitative domain-difference measures. For real-robot performance, we have expanded the analysis to report results conditioned on motion style and capture condition categories to further substantiate the zero-shot claim.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on experimental outcomes from hybrid data training, not self-referential definitions or fitted predictions.

Full rationale

The paper describes HoloMotion-1 as a scaled policy trained on a hybrid motion corpus (video-reconstructions dominant plus MoCap) using MoE Transformer with KV-cache and sequence-level training. It reports generalization on unseen benchmarks and zero-shot real-robot transfer. No equations, derivations, or first-principles results appear in the abstract or described content. Claims are supported by empirical evaluation rather than quantities defined in terms of their own fitted parameters or self-citation chains that reduce to inputs. The architecture and data regime are presented as design choices, not as outputs derived from the target performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; full manuscript required for audit.



