HoloMotion-1 Technical Report

Bo Zhang; Kaihui Wang; Maiyue Chen; Qijun Huang; Xihan Ma; Yi Ren; Yucheng Wang; Zhiyuan Yang; Zhizhong Su; Zihao Zhu

arxiv: 2605.15336 · v2 · pith:A32YYNXQnew · submitted 2026-05-14 · 💻 cs.RO · cs.AI


Pith reviewed 2026-05-20 20:25 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords humanoid motion tracking · zero-shot transfer · mixture of experts · whole-body control · motion foundation model · video motion reconstruction · real-robot deployment


The pith

HoloMotion-1 shows a transformer policy trained on mixed video-reconstructed and motion-capture data can track diverse whole-body motions zero-shot and transfer directly to real humanoid robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HoloMotion-1 as a foundation model that learns humanoid motion tracking from a hybrid corpus where video-reconstructed motions supply broad behavioral diversity and motion-capture data supplies higher-fidelity examples. This regime moves past the narrow coverage of traditional MoCap-only training and forces the model to handle reconstruction noise, domain shifts, and uneven quality through large temporal capacity and efficient architecture. If the approach holds, the resulting policy should generalize to previously unseen motion types and capture conditions while running in real time on physical robots without any task-specific retraining. The central mechanism is a sparsely activated Mixture-of-Experts Transformer that uses KV-cache for inference efficiency and sequence-level training to process long motion trajectories effectively.

Core claim

HoloMotion-1 is a humanoid motion foundation model trained on a large-scale hybrid motion corpus that combines dominant video-reconstructed motions from in-the-wild videos with curated motion-capture and in-house data. It integrates large-capacity temporal modeling via a sparsely activated Mixture-of-Experts Transformer with KV-cache for real-time control and applies sequence-level training to improve efficiency on extended sequences. Experiments on multiple unseen motion benchmarks demonstrate robust generalization across diverse motion types and capture conditions, higher tracking accuracy than prior methods, and direct zero-shot transfer to a real humanoid robot.
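The role of the KV-cache in real-time control can be illustrated with a small sketch. It is not the paper's implementation: the single attention head, the toy vectors, and the names `attend`, `KVCache`, and `full_causal_attention` are assumptions made for this example, and the query/key/value projections of a real Transformer are omitted. The point is only that appending one key/value pair per control step yields the same outputs as recomputing causal attention over the whole prefix.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class KVCache:
    """Keeps keys/values of past steps so each new step appends a single pair."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

def full_causal_attention(queries, keys, values):
    """Reference path: rebuild the prefix and recompute attention at every step."""
    return [attend(queries[t], keys[: t + 1], values[: t + 1]) for t in range(len(queries))]

# Toy per-step vectors (hypothetical values, not data from the paper).
queries = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
keys = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(queries, keys, values)]
full_out = full_causal_attention(queries, keys, values)
```

In this sketch the per-step cost of the cached path grows with the number of stored steps but avoids re-deriving keys and values for the history, which is the usual motivation for caching at inference time.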

What carries the argument

A sparsely activated Mixture-of-Experts Transformer with KV-cache inference that performs real-time whole-body motion tracking while managing large behavioral variation from heterogeneous training data.
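The routing idea behind a sparsely activated Mixture-of-Experts layer can be sketched as follows. This is a minimal illustration, not the HoloMotion-1 architecture: the linear router, the top-2 selection, the toy scaling experts, and the names `moe_forward` and `make_scaling_expert` are all assumptions made for the example.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs.

    router_weights: one weight vector per expert (hypothetical linear router).
    experts: callables mapping a token vector to an output vector of equal size.
    Returns (output, indices of the experts that were actually evaluated).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalise over chosen experts only
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # unchosen experts are never run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

def make_scaling_expert(factor):
    """Toy expert that scales its input; stands in for a feed-forward block."""
    return lambda vec: [factor * v for v in vec]

experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_forward([2.0, 1.0], router_weights, experts, top_k=2)
```

Only two of the four toy experts are evaluated for this token, which is the sense in which such a layer adds capacity without a proportional increase in per-step compute.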

If this is right

Tracking accuracy improves over prior methods on multiple unseen motion benchmarks that vary in type and capture condition.
The learned policy runs in real time on a real humanoid robot without any additional fine-tuning for the deployment task.
Training on the hybrid corpus expands the set of motion styles and environmental conditions the policy can handle compared with MoCap-only baselines.
Sequence-level training reduces the sample inefficiency that normally appears when learning from long, variable-length motion trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid-data strategy could be tested on other whole-body control tasks such as locomotion or manipulation where video data is abundant but noisy.
If the KV-cache mechanism maintains stability over very long horizons, the approach might extend to online adaptation during robot operation rather than pure offline training.
Further growth of the video-reconstructed portion of the corpus could be used to probe the scaling limits of generalization in humanoid motion policies.

Load-bearing premise

That large temporal modeling capacity together with a Mixture-of-Experts Transformer and sequence-level training can overcome reconstruction noise, source-domain mismatch, and uneven motion quality in the hybrid corpus enough to support reliable zero-shot transfer to a physical robot.

What would settle it

Direct measurement showing that the model fails to track a motion sequence on the physical humanoid robot when the same sequence produces low tracking error in the simulation benchmark used during evaluation.
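One way such a measurement could be scored is sketched below. The metric (mean per-joint position error), the thresholds, and the names `mean_joint_position_error` and `is_counterexample` are assumptions chosen for illustration; the paper's own evaluation metrics are not given on this page.

```python
import math

def mean_joint_position_error(reference, rollout):
    """Mean Euclidean distance between matching joints, averaged over all frames.

    Both arguments are lists of frames; each frame is a list of (x, y, z) joints.
    """
    total, count = 0.0, 0
    for ref_frame, out_frame in zip(reference, rollout):
        for ref_joint, out_joint in zip(ref_frame, out_frame):
            total += math.dist(ref_joint, out_joint)
            count += 1
    return total / count

def is_counterexample(reference, sim_rollout, real_rollout, sim_ok=0.05, real_fail=0.20):
    """True when a sequence is tracked well in simulation but poorly on hardware.

    sim_ok and real_fail are placeholder thresholds in metres, not paper values.
    """
    sim_err = mean_joint_position_error(reference, sim_rollout)
    real_err = mean_joint_position_error(reference, real_rollout)
    return sim_err <= sim_ok and real_err >= real_fail

# Made-up two-frame, two-joint sequence used only to exercise the functions.
reference = [[(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)], [(0.0, 0.0, 1.1), (0.5, 0.0, 1.1)]]
sim_rollout = [[(x + 0.01, y, z) for x, y, z in frame] for frame in reference]
real_rollout = [[(x, y, z - 0.3) for x, y, z in frame] for frame in reference]
flagged = is_counterexample(reference, sim_rollout, real_rollout)
```

A sequence flagged this way would be the kind of evidence described above: low error in the simulation benchmark alongside a tracking failure on the physical robot.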

Figures

Figures reproduced from arXiv: 2605.15336 by Bo Zhang, Kaihui Wang, Maiyue Chen, Qijun Huang, Xihan Ma, Yi Ren, Yucheng Wang, Zhiyuan Yang, Zhizhong Su, Zihao Zhu.

**Figure 2.** Real-world zero-shot transfer of the HoloMotion policy. In the first row, the robot performs high …

**Figure 3.** The HoloMotion system pipeline. The framework provides an end-to-end workflow covering …

**Figure 4.** Roadmap of HoloMotion toward a foundation model for whole-body humanoid control.

Original abstract

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. It scales policy training on a large hybrid corpus dominated by video-reconstructed motions from in-the-wild videos, augmented by curated MoCap and in-house data. The architecture combines large-capacity temporal modeling with a sparsely activated Mixture-of-Experts Transformer using KV-cache for real-time inference and employs sequence-level training to handle extended sequences. The authors report that extensive experiments demonstrate robust generalization across diverse unseen motion benchmarks and capture conditions, improved tracking accuracy relative to prior methods, and direct zero-shot transfer to a real humanoid robot without task-specific fine-tuning.

Significance. If the experimental claims are substantiated with quantitative evidence, this approach could meaningfully advance scalable humanoid control by showing how video-derived data can expand behavioral coverage beyond conventional MoCap-only regimes while maintaining deployability.

major comments (2)

[Abstract] The claims that HoloMotion-1 'generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning' are presented without any reported quantitative metrics, baselines, error bars, success rates, or ablation results. This absence prevents verification of whether the MoE Transformer and sequence-level training actually mitigate the reconstruction noise and domain mismatch explicitly flagged in the same paragraph.
[Abstract (and implied Experiments section)] The description of the hybrid corpus and training strategy identifies reconstruction noise, source-domain mismatch, and uneven motion quality as central challenges, yet no ablation removing the video-reconstruction component, no distribution-shift statistics between video-recon and MoCap sources, and no real-robot performance conditioned on motion style or capture condition are supplied. These omissions are load-bearing for the zero-shot transfer claim.

minor comments (1)

[Abstract] The proportion of video-reconstructed versus MoCap data is described only qualitatively ('dominant source'); a numerical breakdown would help readers assess the scale of the heterogeneity being addressed.
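The kind of numerical breakdown requested here could be computed as in the sketch below. The source labels, the clip durations, and the name `source_breakdown` are hypothetical placeholders; none of the numbers come from the paper.

```python
from collections import Counter

def source_breakdown(clips):
    """Return each source's share of total motion duration as a fraction."""
    seconds = Counter()
    for source, duration_s in clips:
        seconds[source] += duration_s
    total = sum(seconds.values())
    return {source: value / total for source, value in seconds.items()}

# Made-up clip list: (source label, duration in seconds).
clips = [("video", 120.0), ("video", 60.0), ("mocap", 15.0), ("in_house", 5.0)]
shares = source_breakdown(clips)
```

Reporting shares by duration rather than by clip count is a design choice in this sketch; either could be appropriate depending on how the corpus is sampled during training.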

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of how claims and supporting evidence are presented, particularly in the abstract and regarding the hybrid data regime. We address each major comment below and have revised the manuscript to improve clarity and substantiation of results.


Referee: [Abstract] The claims that HoloMotion-1 'generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning' are presented without any reported quantitative metrics, baselines, error bars, success rates, or ablation results. This absence prevents verification of whether the MoE Transformer and sequence-level training actually mitigate the reconstruction noise and domain mismatch explicitly flagged in the same paragraph.

Authors: We agree that the abstract would benefit from explicit quantitative anchors to support the high-level claims. The full manuscript reports these details in the Experiments section, including tracking error metrics, baseline comparisons, and success rates across unseen benchmarks. To directly address the concern, we have revised the abstract to include key quantitative results (e.g., average error reductions and generalization metrics) while preserving its concise nature. The MoE Transformer with KV-cache and sequence-level training are shown in the experiments to improve robustness to the noted noise and mismatch factors through comparative results against prior methods.
revision: yes

Referee: [Abstract (and implied Experiments section)] The description of the hybrid corpus and training strategy identifies reconstruction noise, source-domain mismatch, and uneven motion quality as central challenges, yet no ablation removing the video-reconstruction component, no distribution-shift statistics between video-recon and MoCap sources, and no real-robot performance conditioned on motion style or capture condition are supplied. These omissions are load-bearing for the zero-shot transfer claim.

Authors: This is a fair observation on the need for targeted evidence supporting the hybrid corpus benefits. The manuscript already includes overall comparisons demonstrating gains from the full hybrid data over MoCap-only training, along with real-robot zero-shot transfer results. However, we acknowledge the value of a dedicated ablation isolating the video-reconstruction component and explicit distribution-shift statistics. We have added these elements in the revised Experiments section, including a new ablation study and quantitative domain-difference measures. For real-robot performance, we have expanded the analysis to report results conditioned on motion style and capture condition categories to further substantiate the zero-shot claim.
revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on experimental outcomes from hybrid data training, not self-referential definitions or fitted predictions.

full rationale

The paper describes HoloMotion-1 as a scaled policy trained on a hybrid motion corpus (video-reconstructions dominant plus MoCap) using MoE Transformer with KV-cache and sequence-level training. It reports generalization on unseen benchmarks and zero-shot real-robot transfer. No equations, derivations, or first-principles results appear in the abstract or described content. Claims are supported by empirical evaluation rather than quantities defined in terms of their own fitted parameters or self-citation chains that reduce to inputs. The architecture and data regime are presented as design choices, not as outputs derived from the target performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; full manuscript required for audit.

pith-pipeline@v0.9.0 · 5776 in / 1104 out tokens · 75377 ms · 2026-05-20T20:25:45.352993+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy"

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 3 internal anchors

[1] Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770, 2025.

[2] Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castaneda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025. (internal anchor)

[3] Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, et al. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833, 2025.

[4] Jinrui Han, Weiji Xie, Jiakun Zheng, Jiyuan Shi, Weinan Zhang, Ting Xiao, and Chenjia Bai. Kungfubot2: Learning versatile motion skills for humanoid whole-body control. arXiv preprint arXiv:2509.16638, 2025.

[5] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[6] Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025.

[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[9] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019.

[10] Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020.

[11] Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023.

[12] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM international conference on multimedia, pages 2021–2029, 2020.

[13] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023.

[14] Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858, 2024.

[15] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

[16] Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation. arXiv preprint arXiv:2507.23523, 2025.

[17] Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv preprint arXiv:2406.10454, 2024.

[18] Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based character control through masked motion inpainting. ACM Transactions On Graphics (TOG), 43(6):1–21, 2024.

[19] I Radosavovic, B Zhang, B Shi, J Rajasegaran, S Kamat, T Darrell, K Sreenath, and J Malik. Humanoid locomotion as next token prediction. arXiv preprint arXiv:2402.19469, 2024.

[20] Yuxuan Wang, Ming Yang, Ziluo Ding, Yu Zhang, Weishuai Zeng, Xinrun Xu, Haobin Jiang, and Zongqing Lu. From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779, 2025.

[21] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

[22] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[23] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023.

[24] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020.

[25] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025. (internal anchor)

[26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. (internal anchor)

[27] Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832, 2025.
