pith. sign in

arxiv: 2511.22963 · v3 · submitted 2025-11-28 · 💻 cs.RO · cs.AI

Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Pith reviewed 2026-05-17 05:07 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords humanoid robotlanguage-conditioned motionwhole-body controlmotion vocabularyreinforcement learningembodied AIcross-embodiment transfer
0
0 comments X

The pith

A language model translates arbitrary natural language into stable whole-body motions for humanoid robots by learning a shared human-humanoid motion vocabulary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that free-form language commands can drive diverse, physically plausible whole-body actions on humanoids without being limited to scripted instructions or losing stability. It does this by first building a single motion vocabulary that aligns human demonstration data with robot control signals, then applying a two-stage training process: supervised learning of step-by-step motion reasoning followed by reinforcement learning that adds physical feedback. A sympathetic reader would care because this combination could let robots understand everyday spoken requests and act on them safely in the real world, moving embodied AI closer to general-purpose use.

Core claim

Humanoid-LLA translates unconstrained natural language directly into executable whole-body motions by learning a unified human-humanoid motion vocabulary that bridges high-level semantics with physically-grounded control, then applying a two-stage fine-tuning framework of supervised motion Chain-of-Thought learning followed by reinforcement learning refined with physical feedback; experiments in simulation and real-world cross-embodiment settings show superior generalization to novel commands, diverse motion generation, and high physical fidelity.

What carries the argument

The unified human-humanoid motion vocabulary, which aligns semantic language descriptions with physically executable control signals to overcome paired data scarcity.

If this is right

  • Novel language instructions outside the training distribution produce coherent and executable motions.
  • Motion variety increases without sacrificing balance or joint limits.
  • The same model transfers across different humanoid embodiments with minimal additional tuning.
  • Physical feedback during the second training stage reduces instability that pure imitation learning leaves behind.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The vocabulary approach might allow reuse of existing human motion-capture datasets for other robot morphologies.
  • Extending the physical feedback loop to include real-world sensor data could close the sim-to-real gap further.
  • If the two-stage process generalizes, similar pipelines could be applied to non-humanoid platforms such as mobile manipulators.

Load-bearing premise

That a shared motion vocabulary plus supervised reasoning followed by physical-feedback reinforcement learning is enough to produce both diverse and stable motions for any free-form language input.

What would settle it

A test set of previously unseen complex language commands where the generated motions either violate physical constraints in simulation or fail to match the intended action when executed on a real humanoid.

Figures

Figures reproduced from arXiv: 2511.22963 by Jingya Wang, Jingyi Yu, Kaiyang Ji, Ke Yang, Ye Shi, Zhirui Liu.

Figure 1
Figure 1. Figure 1: An illustration of Humanoid-LLA. Given a high-level [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An overview of Humanoid-LLA. In stage one, we build a unified motion vocabulary leveraging a large-scale paired human [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Real-world demonstration of free-form language-conditioned humanoid whole-body control. The tested prompts contain unseen [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited, often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: paired language-humanoid motion data scarcity and physical instability. First, we bridge high-level language semantics with physically-grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and real-world cross-embodiment experiments demonstrates that Humanoid-LLA achieves superior generalization to novel language commands and diverse motion generation while maintaining high physical fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Humanoid-LLA, a Large Language Action model for mapping unconstrained natural language commands to whole-body motions on humanoid robots. It addresses paired data scarcity via a learned unified human-humanoid motion vocabulary and physical instability via a two-stage fine-tuning pipeline (supervised Chain-of-Thought motion learning followed by reinforcement learning with physical feedback). Simulation and real-world cross-embodiment experiments are presented to support claims of improved generalization to novel commands, motion diversity, and physical fidelity over prior methods.

Significance. If validated, the work could meaningfully advance general-purpose humanoid control by demonstrating a scalable route from free-form language to diverse, stable whole-body behaviors without heavy reliance on task-specific data collection. The unified vocabulary plus staged CoT-then-RL refinement is a coherent architectural choice that directly targets the diversity-stability trade-off common in prior humanoid language-to-motion systems; reproducible code or parameter-free derivations would further strengthen its contribution.

major comments (2)
  1. [§3] §3 (Unified Motion Vocabulary): The central claim that the learned vocabulary bridges human mocap to humanoid dynamics without sacrificing feasibility rests on an implicit kinematic/dynamic compatibility assumption. No quantitative retargeting error, joint-limit violation rate, or post-retargeting feasibility statistics are reported; without these, it is impossible to determine whether the subsequent RL stage recovers stability or merely masks vocabulary-induced infeasibility for out-of-distribution language.
  2. [§5] §5 (Experiments and Ablations): The abstract asserts superior generalization and physical fidelity, yet the evaluation lacks explicit metrics for motion diversity (e.g., trajectory variance or coverage), stability (e.g., fall rate or torque limits), and statistical comparison to baselines on held-out language commands. If these appear only in supplementary tables, they must be elevated to the main text with error bars and ablation controls to substantiate the two-stage framework's contribution.
minor comments (2)
  1. [§3] Notation for the motion vocabulary (e.g., size, embedding dimension) should be introduced once in §3 and used consistently; currently the text alternates between descriptive phrases and symbols without a clear definition table.
  2. [Figure 4] Figure 4 (real-world rollout examples) would benefit from overlaid joint-angle traces or CoM stability margins to visually corroborate the claimed physical fidelity rather than relying solely on qualitative video descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the changes made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Unified Motion Vocabulary): The central claim that the learned vocabulary bridges human mocap to humanoid dynamics without sacrificing feasibility rests on an implicit kinematic/dynamic compatibility assumption. No quantitative retargeting error, joint-limit violation rate, or post-retargeting feasibility statistics are reported; without these, it is impossible to determine whether the subsequent RL stage recovers stability or merely masks vocabulary-induced infeasibility for out-of-distribution language.

    Authors: We agree that explicit quantitative metrics would strengthen the presentation of the unified motion vocabulary. In the revised manuscript we have added these statistics to Section 3, reporting retargeting error, joint-limit violation rates, and post-retargeting feasibility. The new data confirm that the vocabulary maintains high feasibility and that the RL stage improves stability rather than compensating for retargeting-induced issues. revision: yes

  2. Referee: [§5] §5 (Experiments and Ablations): The abstract asserts superior generalization and physical fidelity, yet the evaluation lacks explicit metrics for motion diversity (e.g., trajectory variance or coverage), stability (e.g., fall rate or torque limits), and statistical comparison to baselines on held-out language commands. If these appear only in supplementary tables, they must be elevated to the main text with error bars and ablation controls to substantiate the two-stage framework's contribution.

    Authors: We thank the referee for this observation. While supporting metrics existed in the supplementary material, we have now moved the key results on motion diversity (trajectory variance and coverage), stability (fall rates and torque limits), and statistical comparisons (with error bars) to the main text in Section 5, together with additional ablation controls for the two-stage pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external data and experiments

full rationale

The paper describes a data-driven pipeline that learns a unified human-humanoid motion vocabulary from mocap data and applies two-stage fine-tuning (supervised CoT followed by RL with physical feedback). All performance claims—generalization to novel language, motion diversity, and physical fidelity—are presented as outcomes of simulation and real-world cross-embodiment evaluations rather than as quantities derived by construction from fitted parameters or prior self-citations. No equations or steps reduce the target results to inputs by definition, and the central assumptions about vocabulary transfer and stability are treated as empirical hypotheses tested externally. The derivation chain therefore remains self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Review is based on abstract only; full details on parameters, axioms, and entities are unavailable. The central claim rests on the unstated assumption that language-motion alignment via vocabulary learning plus RL feedback will generalize without explicit derivation of stability guarantees.

free parameters (1)
  • fine-tuning hyperparameters and vocabulary size
    The two-stage training process necessarily involves learned parameters whose specific values are not reported in the abstract.
axioms (1)
  • domain assumption A shared vocabulary can reliably map high-level language semantics onto physically grounded humanoid control signals
    Invoked to address the language-to-motion translation challenge described in the abstract.
invented entities (1)
  • Humanoid-LLA model no independent evidence
    purpose: Translates free-form language into executable whole-body motions
    The proposed system itself is the central new entity introduced to solve the stated problem.

pith-pipeline@v0.9.0 · 5483 in / 1310 out tokens · 31230 ms · 2026-05-17T05:07:51.908923+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 5 internal anchors

  1. [1]

    Karen Liu

    Joao Pedro Araujo, Yanjie Ze, Pei Xu, Jiajun Wu, and C. Karen Liu. Retargeting matters: General motion re- targeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025. 1, 3

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

  3. [3]

    Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

    Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 3

  4. [4]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1

  5. [5]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 3

  6. [6]

    Cheng, Y

    Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body con- trol for humanoid robots.arXiv preprint arXiv:2402.16796,

  7. [7]

    Anyskill: Learning open- vocabulary physical skill for interactive agents

    Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, and Siyuan Huang. Anyskill: Learning open- vocabulary physical skill for interactive agents. InConfer- ence on Computer Vision and Pattern Recognition(CVPR),

  8. [8]

    Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025

    Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025. 1

  9. [9]

    Go to zero: Towards zero-shot motion generation with million-scale data

    Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 13336– 13348, 2025. 7

  10. [10]

    Humanplus: Humanoid shadowing and imita- tion from humans

    Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imita- tion from humans. InConference on Robot Learning, pages 2828–2844. PMLR, 2025. 3

  11. [11]

    Generating diverse and natural 3d human motions from text

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 1, 3, 5, 6

  12. [12]

    Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model

    Gaoge Han, Mingjiang Liang, Jinglei Tang, Yongkang Cheng, Wei Liu, and Shaoli Huang. Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model. In2025 IEEE/CVF Winter Conference on Applica- tions of Computer Vision (WACV), pages 2218–2227. IEEE,

  13. [13]

    Learning human- to-humanoid real-time whole-body teleoperation

    Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human- to-humanoid real-time whole-body teleoperation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951. IEEE, 2024. 3

  14. [14]

    Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025

    Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. 3

  15. [15]

    Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning

    Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris M Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning. InCon- ference on Robot Learning, pages 1516–1540. PMLR, 2025. 1, 3, 5, 7

  16. [16]

    Snapmogen: Human motion generation from expressive texts

    Inwoo Hwang, Jian Wang, Bing Zhou, et al. Snapmogen: Human motion generation from expressive texts. InThe Thirty-ninth Annual Conference on Neural Information Pro- cessing Systems, 2025. 5

  17. [17]

    Exbody2: Advanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024

    Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Ad- vanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024. 3

  18. [18]

    Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024. 3

  19. [19]

    Padl: Language-directed physics-based character con- trol

    Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Padl: Language-directed physics-based character con- trol. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 3

  20. [20]

    Superpadl: Scaling language-directed physics-based control with progressive supervised distillation

    Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Superpadl: Scaling language-directed physics-based control with progressive supervised distillation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3

  21. [21]

    Guided motion diffusion for 9 controllable human motion synthesis

    Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for 9 controllable human motion synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2151–2162, 2023. 3

  22. [22]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. 1

  23. [23]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 5

  24. [24]

    Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025

    Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Rizhao Qiu, and Xiaolong Wang. Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025. 3

  25. [25]

    Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025

    Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025. 3

  26. [26]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA). IEEE, 2023. 1

  27. [27]

    Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025

    Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025. 3, 5

  28. [28]

    Motion-x: A large-scale 3d expressive whole-body human motion dataset

    Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 7

  29. [29]

    Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 3

  30. [30]

    Perpetual humanoid control for real-time simulated avatars

    Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023. 3

  31. [31]

    Winkler, Kris Ki- tani, and Weipeng Xu

    Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Ki- tani, and Weipeng Xu. Perpetual humanoid control for real- time simulated avatars. InInternational Conference on Com- puter Vision (ICCV), 2023. 1, 3

  32. [32]

    Unitok: A unified tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025a

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Ze- huan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A uni- fied tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025. 4

  33. [33]

    Troje, Ger- ard Pons-Moll, and Michael J

    Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Ger- ard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Confer- ence on Computer Vision, pages 5442–5451, 2019. 6

  34. [34]

    Universal humanoid robot pose learning from internet human videos

    Jiageng Mao, Siheng Zhao, Siqi Song, Chuye Hong, Tian- heng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Ji- tendra Malik, Vitor Guizilini, and Yue Wang. Universal humanoid robot pose learning from internet human videos. In2025 IEEE-RAS 24th International Conference on Hu- manoid Robots (Humanoids), pages 1–8, 2025. 1, 2, 3, 7

  35. [35]

    Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025

    Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 3, 5, 6

  36. [36]

    Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018

    Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018. 3

  37. [37]

    Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

    Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021. 3

  38. [38]

    Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022

    Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022. 3

  39. [39]

    The KIT motion-language dataset.Big Data, 4(4):236–252,

    Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset.Big Data, 4(4):236–252,

  40. [40]

    Babel: Bodies, action and behavior with english la- bels

    Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 1, 5

  41. [41]

    Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024

    Unitree Robotics. Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024. 3

  42. [42]

    A re- duction of imitation learning and structured prediction to no- regret online learning

    St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A re- duction of imitation learning and structured prediction to no- regret online learning. InProceedings of the fourteenth inter- national conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceed- ings, 2011. 5

  43. [43]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 5, 6

  44. [44]

    Robot motion diffusion model: Motion generation for robotic characters

    Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz B ¨acher. Robot motion diffusion model: Motion generation for robotic characters. InSIGGRAPH asia 2024 conference papers, pages 1–9, 2024. 3

  45. [45]

    Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025

    Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, and Koushil Sreenath. Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025. 1, 3, 7

  46. [46]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junx- iao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1, 5, 6

  47. [47]

    Adversarial locomotion and motion imitation for humanoid policy learning

    Jiyuan Shi, Xinzhe Liu, Dewei Wang, Ouyang Lu, S ¨oren Schwertfeger, Fuchun Sun, Chenjia Bai, and Xuelong Li. Adversarial locomotion and motion imitation for humanoid policy learning. InNeural Information Processing Systems (NeurIPS), 2025. 1, 2, 3, 7 10

  48. [48]

    Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024

    Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024. 3, 5

  49. [49]

    arXiv preprint arXiv:2505.19086 (2025) 2, 3

    Chen Tessler, Yifeng Jiang, Erwin Coumans, Zhengyi Luo, Gal Chechik, and Xue Bin Peng. Maskedmanipulator: Versatile whole-body control for loco-manipulation.arXiv preprint arXiv:2505.19086, 2025. 3, 5

  50. [50]

    Human motion diffu- sion model

    Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffu- sion model. InThe Eleventh International Conference on Learning Representations, 2023. 3, 7

  51. [51]

    Closd: Closing the loop between sim- ulation and diffusion for multi-task character control

    Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit Haim Bermano, and Michiel van de Panne. Closd: Closing the loop between sim- ulation and diffusion for multi-task character control. InThe Thirteenth International Conference on Learning Represen- tations, 2024. 3

  52. [52]

    Pdp: Physics-based character animation via dif- fusion policy

    Takara Everest Truong, Michael Piseno, Zhaoming Xie, and Karen Liu. Pdp: Physics-based character animation via dif- fusion policy. InSIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024. 3

  53. [53]

    Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 4

  54. [54]

    Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025

    Yan Wu, Korrawe Karunratanakul, Zhengyi Luo, and Siyu Tang. Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025. 3

  55. [55]

    Lagoon: Language-guided motion control

    Shusheng Xu, Huaijie Wang, Yutao Ouyang, Jiaxuan Gao, Zhiyu Mei, Chao Yu, and Yi Wu. Lagoon: Language-guided motion control. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9743–9750. IEEE,

  56. [56]

    Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024

    Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024. 1

  57. [57]

    Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025

    Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025. 1

  58. [58]

    Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022

    Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022. 5

  59. [59]

    Moconvq: Unified physics- based motion control via scalable discrete representations

    Heyuan Yao, Zhenhua Song, Yuyang Zhou, Tenglong Ao, Baoquan Chen, and Libin Liu. Moconvq: Unified physics- based motion control via scalable discrete representations. ACM Transactions on Graphics (TOG), 43(4):1–21, 2024. 3

  60. [60]

    Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025

    Kangning Yin, Weishuai Zeng, Ke Fan, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole- body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025. 3, 5

  61. [61]

    Visualmimic: Visual hu- manoid loco-manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025

    Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C. Karen Liu, and Jiajun Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025. 5

  62. [62]

    Learning physically simulated tennis skills from broadcast videos.ACM Trans

    Ye Yuan, Viktor Makoviychuk, Y Guo, S Fidler, X Peng, and K Fatahalian. Learning physically simulated tennis skills from broadcast videos.ACM Trans. Graph, 42(4), 2023. 3

  63. [63]

    Physdiff: Physics-guided human motion diffusion model

    Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 16010–16021, 2023. 3

  64. [64]

    Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025

    Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, and Zongqing Lu. Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025. 1, 3, 6, 7

  65. [65]

    Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,

    Kevin Zakka. Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,

  66. [66]

    Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

    Yanjie Ze, Zixuan Chen, Jo ˜ao Pedro Ara ´ujo, Zi ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025. 3

  67. [67]

    Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

    Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holis- tic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025. 3

  68. [68]

    T2m-gpt: Generating human motion from textual de- scriptions with discrete representations

    Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual de- scriptions with discrete representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

  69. [69]

    Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024

    Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024. 3

  70. [70]

    Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025

    Haoyu Zhao, Sixu Lin, Qingwei Ben, Minyue Dai, Hao Fei, Jingbo Wang, Hua Zou, and Junting Dong. Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025. 2, 4 11