Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Jingya Wang; Jingyi Yu; Kaiyang Ji; Ke Yang; Ye Shi; Zhirui Liu

arxiv: 2511.22963 · v3 · submitted 2025-11-28 · 💻 cs.RO · cs.AI

Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Zhirui Liu , Kaiyang Ji , Ke Yang , Jingyi Yu , Ye Shi , Jingya Wang This is my paper

Pith reviewed 2026-05-17 05:07 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords humanoid robotlanguage-conditioned motionwhole-body controlmotion vocabularyreinforcement learningembodied AIcross-embodiment transfer

0 comments

The pith

A language model translates arbitrary natural language into stable whole-body motions for humanoid robots by learning a shared human-humanoid motion vocabulary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that free-form language commands can drive diverse, physically plausible whole-body actions on humanoids without being limited to scripted instructions or losing stability. It does this by first building a single motion vocabulary that aligns human demonstration data with robot control signals, then applying a two-stage training process: supervised learning of step-by-step motion reasoning followed by reinforcement learning that adds physical feedback. A sympathetic reader would care because this combination could let robots understand everyday spoken requests and act on them safely in the real world, moving embodied AI closer to general-purpose use.

Core claim

Humanoid-LLA translates unconstrained natural language directly into executable whole-body motions by learning a unified human-humanoid motion vocabulary that bridges high-level semantics with physically-grounded control, then applying a two-stage fine-tuning framework of supervised motion Chain-of-Thought learning followed by reinforcement learning refined with physical feedback; experiments in simulation and real-world cross-embodiment settings show superior generalization to novel commands, diverse motion generation, and high physical fidelity.

What carries the argument

The unified human-humanoid motion vocabulary, which aligns semantic language descriptions with physically executable control signals to overcome paired data scarcity.

If this is right

Novel language instructions outside the training distribution produce coherent and executable motions.
Motion variety increases without sacrificing balance or joint limits.
The same model transfers across different humanoid embodiments with minimal additional tuning.
Physical feedback during the second training stage reduces instability that pure imitation learning leaves behind.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The vocabulary approach might allow reuse of existing human motion-capture datasets for other robot morphologies.
Extending the physical feedback loop to include real-world sensor data could close the sim-to-real gap further.
If the two-stage process generalizes, similar pipelines could be applied to non-humanoid platforms such as mobile manipulators.

Load-bearing premise

That a shared motion vocabulary plus supervised reasoning followed by physical-feedback reinforcement learning is enough to produce both diverse and stable motions for any free-form language input.

What would settle it

A test set of previously unseen complex language commands where the generated motions either violate physical constraints in simulation or fail to match the intended action when executed on a real humanoid.

Figures

Figures reproduced from arXiv: 2511.22963 by Jingya Wang, Jingyi Yu, Kaiyang Ji, Ke Yang, Ye Shi, Zhirui Liu.

**Figure 2.** Figure 2: An overview of Humanoid-LLA. In stage one, we build a unified motion vocabulary leveraging a large-scale paired human [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Real-world demonstration of free-form language-conditioned humanoid whole-body control. The tested prompts contain unseen [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited, often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: paired language-humanoid motion data scarcity and physical instability. First, we bridge high-level language semantics with physically-grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and real-world cross-embodiment experiments demonstrates that Humanoid-LLA achieves superior generalization to novel language commands and diverse motion generation while maintaining high physical fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Humanoid-LLA pairs a shared motion vocabulary from human and robot data with a two-stage CoT-then-RL pipeline, but the physical transfer step still looks under-supported by the numbers shown.

read the letter

The main point is that this paper builds Humanoid-LLA around a unified motion vocabulary drawn from both human mocap and humanoid recordings, then applies supervised chain-of-thought training followed by reinforcement learning that incorporates physical feedback. That combination is the concrete new piece: it tries to expand the available motion space beyond what paired language-robot data alone can provide while using the RL stage to recover stability on out-of-distribution commands. The abstract reports better generalization to novel language and maintained physical fidelity in both simulation and real cross-embodiment tests, which is the claim that would matter most to people working on whole-body control. If the experiments include proper ablations on the vocabulary size and the effect of the RL stage, that would be useful evidence for the field. The approach is straightforward enough that a reader already familiar with motion tokenization and language-conditioned policies could pick up the architecture details quickly. The soft spot is exactly the one the stress-test note flags. Human and humanoid kinematics and dynamics differ enough that a learned shared vocabulary can easily encode motions that become marginal or unstable once retargeted. The paper would need to show retargeting error distributions or a feasibility rate for the vocabulary entries before the RL stage; without those, it is hard to know whether the RL step is preserving diversity or simply collapsing to safe behaviors. The abstract does not supply those numbers, so the central performance claims rest on the final sim and real results rather than on a verified intermediate link. This work is aimed at researchers who already run language-to-motion pipelines on humanoids and are looking for ways to loosen the constraints on command variety. A reader who needs a concrete baseline for free-form language control would get value from the training recipe and the reported cross-embodiment tests. The paper shows clear engagement with the data-scarcity and stability problems and does not hide behind vague claims, so it deserves a serious referee to check the quantitative support for the transfer step. I would send it to peer review rather than desk-reject it.

Referee Report

2 major / 2 minor

Summary. The paper introduces Humanoid-LLA, a Large Language Action model for mapping unconstrained natural language commands to whole-body motions on humanoid robots. It addresses paired data scarcity via a learned unified human-humanoid motion vocabulary and physical instability via a two-stage fine-tuning pipeline (supervised Chain-of-Thought motion learning followed by reinforcement learning with physical feedback). Simulation and real-world cross-embodiment experiments are presented to support claims of improved generalization to novel commands, motion diversity, and physical fidelity over prior methods.

Significance. If validated, the work could meaningfully advance general-purpose humanoid control by demonstrating a scalable route from free-form language to diverse, stable whole-body behaviors without heavy reliance on task-specific data collection. The unified vocabulary plus staged CoT-then-RL refinement is a coherent architectural choice that directly targets the diversity-stability trade-off common in prior humanoid language-to-motion systems; reproducible code or parameter-free derivations would further strengthen its contribution.

major comments (2)

[§3] §3 (Unified Motion Vocabulary): The central claim that the learned vocabulary bridges human mocap to humanoid dynamics without sacrificing feasibility rests on an implicit kinematic/dynamic compatibility assumption. No quantitative retargeting error, joint-limit violation rate, or post-retargeting feasibility statistics are reported; without these, it is impossible to determine whether the subsequent RL stage recovers stability or merely masks vocabulary-induced infeasibility for out-of-distribution language.
[§5] §5 (Experiments and Ablations): The abstract asserts superior generalization and physical fidelity, yet the evaluation lacks explicit metrics for motion diversity (e.g., trajectory variance or coverage), stability (e.g., fall rate or torque limits), and statistical comparison to baselines on held-out language commands. If these appear only in supplementary tables, they must be elevated to the main text with error bars and ablation controls to substantiate the two-stage framework's contribution.

minor comments (2)

[§3] Notation for the motion vocabulary (e.g., size, embedding dimension) should be introduced once in §3 and used consistently; currently the text alternates between descriptive phrases and symbols without a clear definition table.
[Figure 4] Figure 4 (real-world rollout examples) would benefit from overlaid joint-angle traces or CoM stability margins to visually corroborate the claimed physical fidelity rather than relying solely on qualitative video descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the changes made to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (Unified Motion Vocabulary): The central claim that the learned vocabulary bridges human mocap to humanoid dynamics without sacrificing feasibility rests on an implicit kinematic/dynamic compatibility assumption. No quantitative retargeting error, joint-limit violation rate, or post-retargeting feasibility statistics are reported; without these, it is impossible to determine whether the subsequent RL stage recovers stability or merely masks vocabulary-induced infeasibility for out-of-distribution language.

Authors: We agree that explicit quantitative metrics would strengthen the presentation of the unified motion vocabulary. In the revised manuscript we have added these statistics to Section 3, reporting retargeting error, joint-limit violation rates, and post-retargeting feasibility. The new data confirm that the vocabulary maintains high feasibility and that the RL stage improves stability rather than compensating for retargeting-induced issues. revision: yes
Referee: [§5] §5 (Experiments and Ablations): The abstract asserts superior generalization and physical fidelity, yet the evaluation lacks explicit metrics for motion diversity (e.g., trajectory variance or coverage), stability (e.g., fall rate or torque limits), and statistical comparison to baselines on held-out language commands. If these appear only in supplementary tables, they must be elevated to the main text with error bars and ablation controls to substantiate the two-stage framework's contribution.

Authors: We thank the referee for this observation. While supporting metrics existed in the supplementary material, we have now moved the key results on motion diversity (trajectory variance and coverage), stability (fall rates and torque limits), and statistical comparisons (with error bars) to the main text in Section 5, together with additional ablation controls for the two-stage pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external data and experiments

full rationale

The paper describes a data-driven pipeline that learns a unified human-humanoid motion vocabulary from mocap data and applies two-stage fine-tuning (supervised CoT followed by RL with physical feedback). All performance claims—generalization to novel language, motion diversity, and physical fidelity—are presented as outcomes of simulation and real-world cross-embodiment evaluations rather than as quantities derived by construction from fitted parameters or prior self-citations. No equations or steps reduce the target results to inputs by definition, and the central assumptions about vocabulary transfer and stability are treated as empirical hypotheses tested externally. The derivation chain therefore remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Review is based on abstract only; full details on parameters, axioms, and entities are unavailable. The central claim rests on the unstated assumption that language-motion alignment via vocabulary learning plus RL feedback will generalize without explicit derivation of stability guarantees.

free parameters (1)

fine-tuning hyperparameters and vocabulary size
The two-stage training process necessarily involves learned parameters whose specific values are not reported in the abstract.

axioms (1)

domain assumption A shared vocabulary can reliably map high-level language semantics onto physically grounded humanoid control signals
Invoked to address the language-to-motion translation challenge described in the abstract.

invented entities (1)

Humanoid-LLA model no independent evidence
purpose: Translates free-form language into executable whole-body motions
The proposed system itself is the central new entity introduced to solve the stated problem.

pith-pipeline@v0.9.0 · 5483 in / 1310 out tokens · 31230 ms · 2026-05-17T05:07:51.908923+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 5 internal anchors

[1]

Karen Liu

Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion re- targeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 1, 3

work page arXiv 2025
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 3

work page arXiv 2025
[4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 3

work page 2023
[6]

Cheng, Y

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body con- trol for humanoid robots.arXiv preprint arXiv:2402.16796,

work page arXiv
[7]

Anyskill: Learning open- vocabulary physical skill for interactive agents

Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, and Siyuan Huang. Anyskill: Learning open- vocabulary physical skill for interactive agents. InConfer- ence on Computer Vision and Pattern Recognition(CVPR),

work page
[8]

Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025

Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025. 1

work page arXiv 2025
[9]

Go to zero: Towards zero-shot motion generation with million-scale data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 13336– 13348, 2025. 7

work page 2025
[10]

Humanplus: Humanoid shadowing and imita- tion from humans

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imita- tion from humans. InConference on Robot Learning, pages 2828–2844. PMLR, 2025. 3

work page 2025
[11]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 1, 3, 5, 6

work page 2022
[12]

Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model

Gaoge Han, Mingjiang Liang, Jinglei Tang, Yongkang Cheng, Wei Liu, and Shaoli Huang. Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model. In2025 IEEE/CVF Winter Conference on Applica- tions of Computer Vision (WACV), pages 2218–2227. IEEE,

work page
[13]

Learning human- to-humanoid real-time whole-body teleoperation

Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human- to-humanoid real-time whole-body teleoperation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951. IEEE, 2024. 3

work page 2024
[14]

Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. 3

work page arXiv 2025
[15]

Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris M Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning. InCon- ference on Robot Learning, pages 1516–1540. PMLR, 2025. 1, 3, 5, 7

work page 2025
[16]

Snapmogen: Human motion generation from expressive texts

Inwoo Hwang, Jian Wang, Bing Zhou, et al. Snapmogen: Human motion generation from expressive texts. InThe Thirty-ninth Annual Conference on Neural Information Pro- cessing Systems, 2025. 5

work page 2025
[17]

Exbody2: Advanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Ad- vanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024. 3

work page arXiv 2024
[18]

Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024. 3

work page 2024
[19]

Padl: Language-directed physics-based character con- trol

Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Padl: Language-directed physics-based character con- trol. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 3

work page 2022
[20]

Superpadl: Scaling language-directed physics-based control with progressive supervised distillation

Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Superpadl: Scaling language-directed physics-based control with progressive supervised distillation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3

work page 2024
[21]

Guided motion diffusion for 9 controllable human motion synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for 9 controllable human motion synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2151–2162, 2023. 3

work page 2023
[22]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. 1

work page 2024
[23]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 5

work page internal anchor Pith review Pith/arXiv arXiv 2013
[24]

Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025

Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Rizhao Qiu, and Xiaolong Wang. Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025. 3

work page 2025
[25]

Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025

Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025. 3

work page 2025
[26]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA). IEEE, 2023. 1

work page 2023
[27]

Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025. 3, 5

work page 2025
[28]

Motion-x: A large-scale 3d expressive whole-body human motion dataset

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 7

work page 2023
[29]

Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 3

work page 2015
[30]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023. 3

work page 2023
[31]

Winkler, Kris Ki- tani, and Weipeng Xu

Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Ki- tani, and Weipeng Xu. Perpetual humanoid control for real- time simulated avatars. InInternational Conference on Com- puter Vision (ICCV), 2023. 1, 3

work page 2023
[32]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Ze- huan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A uni- fied tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025. 4

work page arXiv 2025
[33]

Troje, Ger- ard Pons-Moll, and Michael J

Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Ger- ard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Confer- ence on Computer Vision, pages 5442–5451, 2019. 6

work page 2019
[34]

Universal humanoid robot pose learning from internet human videos

Jiageng Mao, Siheng Zhao, Siqi Song, Chuye Hong, Tian- heng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Ji- tendra Malik, Vitor Guizilini, and Yue Wang. Universal humanoid robot pose learning from internet human videos. In2025 IEEE-RAS 24th International Conference on Hu- manoid Robots (Humanoids), pages 1–8, 2025. 1, 2, 3, 7

work page 2025
[35]

Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025

Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 3, 5, 6

work page arXiv 2025
[36]

Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018. 3

work page 2018
[37]

Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021. 3

work page 2021
[38]

Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022

Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022. 3

work page 2022
[39]

The KIT motion-language dataset.Big Data, 4(4):236–252,

Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset.Big Data, 4(4):236–252,

work page
[40]

Babel: Bodies, action and behavior with english la- bels

Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 1, 5

work page 2021
[41]

Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024

Unitree Robotics. Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024. 3

work page 2024
[42]

A re- duction of imitation learning and structured prediction to no- regret online learning

St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A re- duction of imitation learning and structured prediction to no- regret online learning. InProceedings of the fourteenth inter- national conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceed- ings, 2011. 5

work page 2011
[43]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

Robot motion diffusion model: Motion generation for robotic characters

Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz B ¨acher. Robot motion diffusion model: Motion generation for robotic characters. InSIGGRAPH asia 2024 conference papers, pages 1–9, 2024. 3

work page 2024
[45]

Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025

Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, and Koushil Sreenath. Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025. 1, 3, 7

work page arXiv 2025
[46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junx- iao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Adversarial locomotion and motion imitation for humanoid policy learning

Jiyuan Shi, Xinzhe Liu, Dewei Wang, Ouyang Lu, S ¨oren Schwertfeger, Fuchun Sun, Chenjia Bai, and Xuelong Li. Adversarial locomotion and motion imitation for humanoid policy learning. InNeural Information Processing Systems (NeurIPS), 2025. 1, 2, 3, 7 10

work page 2025
[48]

Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024

Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024. 3, 5

work page 2024
[49]

arXiv preprint arXiv:2505.19086 (2025) 2, 3

Chen Tessler, Yifeng Jiang, Erwin Coumans, Zhengyi Luo, Gal Chechik, and Xue Bin Peng. Maskedmanipulator: Versatile whole-body control for loco-manipulation.arXiv preprint arXiv:2505.19086, 2025. 3, 5

work page arXiv 2025
[50]

Human motion diffu- sion model

Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffu- sion model. InThe Eleventh International Conference on Learning Representations, 2023. 3, 7

work page 2023
[51]

Closd: Closing the loop between sim- ulation and diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit Haim Bermano, and Michiel van de Panne. Closd: Closing the loop between sim- ulation and diffusion for multi-task character control. InThe Thirteenth International Conference on Learning Represen- tations, 2024. 3

work page 2024
[52]

Pdp: Physics-based character animation via dif- fusion policy

Takara Everest Truong, Michael Piseno, Zhaoming Xie, and Karen Liu. Pdp: Physics-based character animation via dif- fusion policy. InSIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024. 3

work page 2024
[53]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 4

work page 2017
[54]

Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025

Yan Wu, Korrawe Karunratanakul, Zhengyi Luo, and Siyu Tang. Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025. 3

work page arXiv 2025
[55]

Lagoon: Language-guided motion control

Shusheng Xu, Huaijie Wang, Yutao Ouyang, Jiaxuan Gao, Zhiyu Mei, Chao Yu, and Yi Wu. Lagoon: Language-guided motion control. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9743–9750. IEEE,

work page
[56]

Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024

Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024. 1

work page 2024
[57]

Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025

Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025. 1

work page arXiv 2025
[58]

Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022

Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022. 5

work page 2022
[59]

Moconvq: Unified physics- based motion control via scalable discrete representations

Heyuan Yao, Zhenhua Song, Yuyang Zhou, Tenglong Ao, Baoquan Chen, and Libin Liu. Moconvq: Unified physics- based motion control via scalable discrete representations. ACM Transactions on Graphics (TOG), 43(4):1–21, 2024. 3

work page 2024
[60]

Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025

Kangning Yin, Weishuai Zeng, Ke Fan, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole- body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025. 3, 5

work page arXiv 2025
[61]

Visualmimic: Visual hu- manoid loco-manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C. Karen Liu, and Jiajun Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025. 5

work page arXiv 2025
[62]

Learning physically simulated tennis skills from broadcast videos.ACM Trans

Ye Yuan, Viktor Makoviychuk, Y Guo, S Fidler, X Peng, and K Fatahalian. Learning physically simulated tennis skills from broadcast videos.ACM Trans. Graph, 42(4), 2023. 3

work page 2023
[63]

Physdiff: Physics-guided human motion diffusion model

Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 16010–16021, 2023. 3

work page 2023
[64]

Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025

Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, and Zongqing Lu. Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025. 1, 3, 6, 7

work page arXiv 2025
[65]

Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,

Kevin Zakka. Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,

work page
[66]

Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

Yanjie Ze, Zixuan Chen, Jo ˜ao Pedro Ara ´ujo, Zi ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025. 3

work page arXiv 2025
[67]

Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holis- tic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025. 3

work page arXiv 2025
[68]

T2m-gpt: Generating human motion from textual de- scriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual de- scriptions with discrete representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

work page 2023
[69]

Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024. 3

work page 2024
[70]

Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025

Haoyu Zhao, Sixu Lin, Qingwei Ben, Minyue Dai, Hao Fei, Jingbo Wang, Hua Zou, and Junting Dong. Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025. 2, 4 11

work page arXiv 2025

[1] [1]

Karen Liu

Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion re- targeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 1, 3

work page arXiv 2025

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 3

work page arXiv 2025

[4] [4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 3

work page 2023

[6] [6]

Cheng, Y

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body con- trol for humanoid robots.arXiv preprint arXiv:2402.16796,

work page arXiv

[7] [7]

Anyskill: Learning open- vocabulary physical skill for interactive agents

Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, and Siyuan Huang. Anyskill: Learning open- vocabulary physical skill for interactive agents. InConfer- ence on Computer Vision and Pattern Recognition(CVPR),

work page

[8] [8]

Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025

Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025. 1

work page arXiv 2025

[9] [9]

Go to zero: Towards zero-shot motion generation with million-scale data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 13336– 13348, 2025. 7

work page 2025

[10] [10]

Humanplus: Humanoid shadowing and imita- tion from humans

Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imita- tion from humans. InConference on Robot Learning, pages 2828–2844. PMLR, 2025. 3

work page 2025

[11] [11]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 1, 3, 5, 6

work page 2022

[12] [12]

Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model

Gaoge Han, Mingjiang Liang, Jinglei Tang, Yongkang Cheng, Wei Liu, and Shaoli Huang. Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model. In2025 IEEE/CVF Winter Conference on Applica- tions of Computer Vision (WACV), pages 2218–2227. IEEE,

work page

[13] [13]

Learning human- to-humanoid real-time whole-body teleoperation

Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human- to-humanoid real-time whole-body teleoperation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951. IEEE, 2024. 3

work page 2024

[14] [14]

Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. 3

work page arXiv 2025

[15] [15]

Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning

Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris M Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning. InCon- ference on Robot Learning, pages 1516–1540. PMLR, 2025. 1, 3, 5, 7

work page 2025

[16] [16]

Snapmogen: Human motion generation from expressive texts

Inwoo Hwang, Jian Wang, Bing Zhou, et al. Snapmogen: Human motion generation from expressive texts. InThe Thirty-ninth Annual Conference on Neural Information Pro- cessing Systems, 2025. 5

work page 2025

[17] [17]

Exbody2: Advanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Ad- vanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024. 3

work page arXiv 2024

[18] [18]

Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024. 3

work page 2024

[19] [19]

Padl: Language-directed physics-based character con- trol

Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Padl: Language-directed physics-based character con- trol. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 3

work page 2022

[20] [20]

Superpadl: Scaling language-directed physics-based control with progressive supervised distillation

Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Superpadl: Scaling language-directed physics-based control with progressive supervised distillation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3

work page 2024

[21] [21]

Guided motion diffusion for 9 controllable human motion synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for 9 controllable human motion synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2151–2162, 2023. 3

work page 2023

[22] [22]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. 1

work page 2024

[23] [23]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 5

work page internal anchor Pith review Pith/arXiv arXiv 2013

[24] [24]

Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025

Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Rizhao Qiu, and Xiaolong Wang. Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025. 3

work page 2025

[25] [25]

Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025

Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025. 3

work page 2025

[26] [26]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA). IEEE, 2023. 1

work page 2023

[27] [27]

Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025

Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025. 3, 5

work page 2025

[28] [28]

Motion-x: A large-scale 3d expressive whole-body human motion dataset

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 7

work page 2023

[29] [29]

Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 3

work page 2015

[30] [30]

Perpetual humanoid control for real-time simulated avatars

Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023. 3

work page 2023

[31] [31]

Winkler, Kris Ki- tani, and Weipeng Xu

Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Ki- tani, and Weipeng Xu. Perpetual humanoid control for real- time simulated avatars. InInternational Conference on Com- puter Vision (ICCV), 2023. 1, 3

work page 2023

[32] [32]

Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Ze- huan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A uni- fied tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025. 4

work page arXiv 2025

[33] [33]

Troje, Ger- ard Pons-Moll, and Michael J

Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Ger- ard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Confer- ence on Computer Vision, pages 5442–5451, 2019. 6

work page 2019

[34] [34]

Universal humanoid robot pose learning from internet human videos

Jiageng Mao, Siheng Zhao, Siqi Song, Chuye Hong, Tian- heng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Ji- tendra Malik, Vitor Guizilini, and Yue Wang. Universal humanoid robot pose learning from internet human videos. In2025 IEEE-RAS 24th International Conference on Hu- manoid Robots (Humanoids), pages 1–8, 2025. 1, 2, 3, 7

work page 2025

[35] [35]

Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025

Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 3, 5, 6

work page arXiv 2025

[36] [36]

Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018. 3

work page 2018

[37] [37]

Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021. 3

work page 2021

[38] [38]

Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022

Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022. 3

work page 2022

[39] [39]

The KIT motion-language dataset.Big Data, 4(4):236–252,

Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset.Big Data, 4(4):236–252,

work page

[40] [40]

Babel: Bodies, action and behavior with english la- bels

Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 1, 5

work page 2021

[41] [41]

Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024

Unitree Robotics. Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024. 3

work page 2024

[42] [42]

A re- duction of imitation learning and structured prediction to no- regret online learning

St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A re- duction of imitation learning and structured prediction to no- regret online learning. InProceedings of the fourteenth inter- national conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceed- ings, 2011. 5

work page 2011

[43] [43]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2017

[44] [44]

Robot motion diffusion model: Motion generation for robotic characters

Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz B ¨acher. Robot motion diffusion model: Motion generation for robotic characters. InSIGGRAPH asia 2024 conference papers, pages 1–9, 2024. 3

work page 2024

[45] [45]

Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025

Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, and Koushil Sreenath. Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025. 1, 3, 7

work page arXiv 2025

[46] [46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junx- iao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Adversarial locomotion and motion imitation for humanoid policy learning

Jiyuan Shi, Xinzhe Liu, Dewei Wang, Ouyang Lu, S ¨oren Schwertfeger, Fuchun Sun, Chenjia Bai, and Xuelong Li. Adversarial locomotion and motion imitation for humanoid policy learning. InNeural Information Processing Systems (NeurIPS), 2025. 1, 2, 3, 7 10

work page 2025

[48] [48]

Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024

Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024. 3, 5

work page 2024

[49] [49]

arXiv preprint arXiv:2505.19086 (2025) 2, 3

Chen Tessler, Yifeng Jiang, Erwin Coumans, Zhengyi Luo, Gal Chechik, and Xue Bin Peng. Maskedmanipulator: Versatile whole-body control for loco-manipulation.arXiv preprint arXiv:2505.19086, 2025. 3, 5

work page arXiv 2025

[50] [50]

Human motion diffu- sion model

Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffu- sion model. InThe Eleventh International Conference on Learning Representations, 2023. 3, 7

work page 2023

[51] [51]

Closd: Closing the loop between sim- ulation and diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit Haim Bermano, and Michiel van de Panne. Closd: Closing the loop between sim- ulation and diffusion for multi-task character control. InThe Thirteenth International Conference on Learning Represen- tations, 2024. 3

work page 2024

[52] [52]

Pdp: Physics-based character animation via dif- fusion policy

Takara Everest Truong, Michael Piseno, Zhaoming Xie, and Karen Liu. Pdp: Physics-based character animation via dif- fusion policy. InSIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024. 3

work page 2024

[53] [53]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 4

work page 2017

[54] [54]

Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025

Yan Wu, Korrawe Karunratanakul, Zhengyi Luo, and Siyu Tang. Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025. 3

work page arXiv 2025

[55] [55]

Lagoon: Language-guided motion control

Shusheng Xu, Huaijie Wang, Yutao Ouyang, Jiaxuan Gao, Zhiyu Mei, Chao Yu, and Yi Wu. Lagoon: Language-guided motion control. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9743–9750. IEEE,

work page

[56] [56]

Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024

Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024. 1

work page 2024

[57] [57]

Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025

Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025. 1

work page arXiv 2025

[58] [58]

Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022

Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022. 5

work page 2022

[59] [59]

Moconvq: Unified physics- based motion control via scalable discrete representations

Heyuan Yao, Zhenhua Song, Yuyang Zhou, Tenglong Ao, Baoquan Chen, and Libin Liu. Moconvq: Unified physics- based motion control via scalable discrete representations. ACM Transactions on Graphics (TOG), 43(4):1–21, 2024. 3

work page 2024

[60] [60]

Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025

Kangning Yin, Weishuai Zeng, Ke Fan, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole- body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025. 3, 5

work page arXiv 2025

[61] [61]

Visualmimic: Visual hu- manoid loco-manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C. Karen Liu, and Jiajun Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025. 5

work page arXiv 2025

[62] [62]

Learning physically simulated tennis skills from broadcast videos.ACM Trans

Ye Yuan, Viktor Makoviychuk, Y Guo, S Fidler, X Peng, and K Fatahalian. Learning physically simulated tennis skills from broadcast videos.ACM Trans. Graph, 42(4), 2023. 3

work page 2023

[63] [63]

Physdiff: Physics-guided human motion diffusion model

Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 16010–16021, 2023. 3

work page 2023

[64] [64]

Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025

Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, and Zongqing Lu. Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025. 1, 3, 6, 7

work page arXiv 2025

[65] [65]

Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,

Kevin Zakka. Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,

work page

[66] [66]

Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

Yanjie Ze, Zixuan Chen, Jo ˜ao Pedro Ara ´ujo, Zi ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025. 3

work page arXiv 2025

[67] [67]

Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holis- tic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025. 3

work page arXiv 2025

[68] [68]

T2m-gpt: Generating human motion from textual de- scriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual de- scriptions with discrete representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

work page 2023

[69] [69]

Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024. 3

work page 2024

[70] [70]

Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025

Haoyu Zhao, Sixu Lin, Qingwei Ben, Minyue Dai, Hao Fei, Jingbo Wang, Hua Zou, and Junting Dong. Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025. 2, 4 11

work page arXiv 2025