Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
Pith reviewed 2026-05-17 05:07 UTC · model grok-4.3
The pith
A language model translates arbitrary natural language into stable whole-body motions for humanoid robots by learning a shared human-humanoid motion vocabulary.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Humanoid-LLA translates unconstrained natural language directly into executable whole-body motions by learning a unified human-humanoid motion vocabulary that bridges high-level semantics with physically-grounded control, then applying a two-stage fine-tuning framework of supervised motion Chain-of-Thought learning followed by reinforcement learning refined with physical feedback; experiments in simulation and real-world cross-embodiment settings show superior generalization to novel commands, diverse motion generation, and high physical fidelity.
What carries the argument
The unified human-humanoid motion vocabulary, which aligns semantic language descriptions with physically executable control signals to overcome paired data scarcity.
If this is right
- Novel language instructions outside the training distribution produce coherent and executable motions.
- Motion variety increases without sacrificing balance or joint limits.
- The same model transfers across different humanoid embodiments with minimal additional tuning.
- Physical feedback during the second training stage reduces instability that pure imitation learning leaves behind.
Where Pith is reading between the lines
- The vocabulary approach might allow reuse of existing human motion-capture datasets for other robot morphologies.
- Extending the physical feedback loop to include real-world sensor data could close the sim-to-real gap further.
- If the two-stage process generalizes, similar pipelines could be applied to non-humanoid platforms such as mobile manipulators.
Load-bearing premise
That a shared motion vocabulary plus supervised reasoning followed by physical-feedback reinforcement learning is enough to produce both diverse and stable motions for any free-form language input.
What would settle it
A test set of previously unseen complex language commands where the generated motions either violate physical constraints in simulation or fail to match the intended action when executed on a real humanoid.
Figures
read the original abstract
Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited, often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: paired language-humanoid motion data scarcity and physical instability. First, we bridge high-level language semantics with physically-grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and real-world cross-embodiment experiments demonstrates that Humanoid-LLA achieves superior generalization to novel language commands and diverse motion generation while maintaining high physical fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Humanoid-LLA, a Large Language Action model for mapping unconstrained natural language commands to whole-body motions on humanoid robots. It addresses paired data scarcity via a learned unified human-humanoid motion vocabulary and physical instability via a two-stage fine-tuning pipeline (supervised Chain-of-Thought motion learning followed by reinforcement learning with physical feedback). Simulation and real-world cross-embodiment experiments are presented to support claims of improved generalization to novel commands, motion diversity, and physical fidelity over prior methods.
Significance. If validated, the work could meaningfully advance general-purpose humanoid control by demonstrating a scalable route from free-form language to diverse, stable whole-body behaviors without heavy reliance on task-specific data collection. The unified vocabulary plus staged CoT-then-RL refinement is a coherent architectural choice that directly targets the diversity-stability trade-off common in prior humanoid language-to-motion systems; reproducible code or parameter-free derivations would further strengthen its contribution.
major comments (2)
- [§3] §3 (Unified Motion Vocabulary): The central claim that the learned vocabulary bridges human mocap to humanoid dynamics without sacrificing feasibility rests on an implicit kinematic/dynamic compatibility assumption. No quantitative retargeting error, joint-limit violation rate, or post-retargeting feasibility statistics are reported; without these, it is impossible to determine whether the subsequent RL stage recovers stability or merely masks vocabulary-induced infeasibility for out-of-distribution language.
- [§5] §5 (Experiments and Ablations): The abstract asserts superior generalization and physical fidelity, yet the evaluation lacks explicit metrics for motion diversity (e.g., trajectory variance or coverage), stability (e.g., fall rate or torque limits), and statistical comparison to baselines on held-out language commands. If these appear only in supplementary tables, they must be elevated to the main text with error bars and ablation controls to substantiate the two-stage framework's contribution.
minor comments (2)
- [§3] Notation for the motion vocabulary (e.g., size, embedding dimension) should be introduced once in §3 and used consistently; currently the text alternates between descriptive phrases and symbols without a clear definition table.
- [Figure 4] Figure 4 (real-world rollout examples) would benefit from overlaid joint-angle traces or CoM stability margins to visually corroborate the claimed physical fidelity rather than relying solely on qualitative video descriptions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate the changes made to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Unified Motion Vocabulary): The central claim that the learned vocabulary bridges human mocap to humanoid dynamics without sacrificing feasibility rests on an implicit kinematic/dynamic compatibility assumption. No quantitative retargeting error, joint-limit violation rate, or post-retargeting feasibility statistics are reported; without these, it is impossible to determine whether the subsequent RL stage recovers stability or merely masks vocabulary-induced infeasibility for out-of-distribution language.
Authors: We agree that explicit quantitative metrics would strengthen the presentation of the unified motion vocabulary. In the revised manuscript we have added these statistics to Section 3, reporting retargeting error, joint-limit violation rates, and post-retargeting feasibility. The new data confirm that the vocabulary maintains high feasibility and that the RL stage improves stability rather than compensating for retargeting-induced issues. revision: yes
-
Referee: [§5] §5 (Experiments and Ablations): The abstract asserts superior generalization and physical fidelity, yet the evaluation lacks explicit metrics for motion diversity (e.g., trajectory variance or coverage), stability (e.g., fall rate or torque limits), and statistical comparison to baselines on held-out language commands. If these appear only in supplementary tables, they must be elevated to the main text with error bars and ablation controls to substantiate the two-stage framework's contribution.
Authors: We thank the referee for this observation. While supporting metrics existed in the supplementary material, we have now moved the key results on motion diversity (trajectory variance and coverage), stability (fall rates and torque limits), and statistical comparisons (with error bars) to the main text in Section 5, together with additional ablation controls for the two-stage pipeline. revision: yes
Circularity Check
No circularity: empirical claims rest on external data and experiments
full rationale
The paper describes a data-driven pipeline that learns a unified human-humanoid motion vocabulary from mocap data and applies two-stage fine-tuning (supervised CoT followed by RL with physical feedback). All performance claims—generalization to novel language, motion diversity, and physical fidelity—are presented as outcomes of simulation and real-world cross-embodiment evaluations rather than as quantities derived by construction from fitted parameters or prior self-citations. No equations or steps reduce the target results to inputs by definition, and the central assumptions about vocabulary transfer and stability are treated as empirical hypotheses tested externally. The derivation chain therefore remains self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters and vocabulary size
axioms (1)
- domain assumption A shared vocabulary can reliably map high-level language semantics onto physically grounded humanoid control signals
invented entities (1)
-
Humanoid-LLA model
no independent evidence
Reference graph
Works this paper leans on
- [1]
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco- manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 3
-
[4]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Executing your commands via motion diffusion in latent space
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 3
work page 2023
- [6]
-
[7]
Anyskill: Learning open- vocabulary physical skill for interactive agents
Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, and Siyuan Huang. Anyskill: Learning open- vocabulary physical skill for interactive agents. InConfer- ence on Computer Vision and Pattern Recognition(CVPR),
-
[8]
Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025. 1
-
[9]
Go to zero: Towards zero-shot motion generation with million-scale data
Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 13336– 13348, 2025. 7
work page 2025
-
[10]
Humanplus: Humanoid shadowing and imita- tion from humans
Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imita- tion from humans. InConference on Robot Learning, pages 2828–2844. PMLR, 2025. 3
work page 2025
-
[11]
Generating diverse and natural 3d human motions from text
Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022. 1, 3, 5, 6
work page 2022
-
[12]
Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model
Gaoge Han, Mingjiang Liang, Jinglei Tang, Yongkang Cheng, Wei Liu, and Shaoli Huang. Reindiffuse: Craft- ing physically plausible motions with reinforced diffusion model. In2025 IEEE/CVF Winter Conference on Applica- tions of Computer Vision (WACV), pages 2218–2227. IEEE,
-
[13]
Learning human- to-humanoid real-time whole-body teleoperation
Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human- to-humanoid real-time whole-body teleoperation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951. IEEE, 2024. 3
work page 2024
-
[14]
Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills.arXiv preprint arXiv:2502.01143, 2025. 3
-
[15]
Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning
Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris M Kitani, Changliu Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning. InCon- ference on Robot Learning, pages 1516–1540. PMLR, 2025. 1, 3, 5, 7
work page 2025
-
[16]
Snapmogen: Human motion generation from expressive texts
Inwoo Hwang, Jian Wang, Bing Zhou, et al. Snapmogen: Human motion generation from expressive texts. InThe Thirty-ninth Annual Conference on Neural Information Pro- cessing Systems, 2025. 5
work page 2025
-
[17]
Exbody2: Advanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024
Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. Exbody2: Ad- vanced expressive humanoid whole-body control.arXiv preprint arXiv:2412.13196, 2024. 3
-
[18]
Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Ad- vances in Neural Information Processing Systems, 36, 2024. 3
work page 2024
-
[19]
Padl: Language-directed physics-based character con- trol
Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Padl: Language-directed physics-based character con- trol. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 3
work page 2022
-
[20]
Superpadl: Scaling language-directed physics-based control with progressive supervised distillation
Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. Superpadl: Scaling language-directed physics-based control with progressive supervised distillation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3
work page 2024
-
[21]
Guided motion diffusion for 9 controllable human motion synthesis
Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for 9 controllable human motion synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2151–2162, 2023. 3
work page 2023
-
[22]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. 1
work page 2024
-
[23]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 5
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[24]
Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Rizhao Qiu, and Xiaolong Wang. Amo: Adaptive motion op- timization for hyper-dexterous humanoid whole-body con- trol.Robotics: Science and Systems 2025, 2025. 3
work page 2025
-
[25]
Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025
Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, and Siyuan Huang. Clone: Closed-loop whole- body humanoid teleoperation for long-horizon tasks, 2025. 3
work page 2025
-
[26]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA). IEEE, 2023. 1
work page 2023
-
[27]
Qiayuan Liao, Takara E Truong, Xiaoyu Huang, Guy Tevet, Koushil Sreenath, and C Karen Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion.arXiv e-prints, pages arXiv–2508, 2025. 3, 5
work page 2025
-
[28]
Motion-x: A large-scale 3d expressive whole-body human motion dataset
Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36: 25268–25280, 2023. 7
work page 2023
-
[29]
Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: a skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 3
work page 2015
-
[30]
Perpetual humanoid control for real-time simulated avatars
Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023. 3
work page 2023
-
[31]
Winkler, Kris Ki- tani, and Weipeng Xu
Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Ki- tani, and Weipeng Xu. Perpetual humanoid control for real- time simulated avatars. InInternational Conference on Com- puter Vision (ICCV), 2023. 1, 3
work page 2023
-
[32]
Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Ze- huan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A uni- fied tokenizer for visual generation and understanding.arXiv preprint arXiv:2502.20321, 2025. 4
-
[33]
Troje, Ger- ard Pons-Moll, and Michael J
Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Ger- ard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Confer- ence on Computer Vision, pages 5442–5451, 2019. 6
work page 2019
-
[34]
Universal humanoid robot pose learning from internet human videos
Jiageng Mao, Siheng Zhao, Siqi Song, Chuye Hong, Tian- heng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Ji- tendra Malik, Vitor Guizilini, and Yue Wang. Universal humanoid robot pose learning from internet human videos. In2025 IEEE-RAS 24th International Conference on Hu- manoid Robots (Humanoids), pages 1–8, 2025. 1, 2, 3, 7
work page 2025
-
[35]
Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, and Xingang Wang. Motion-r1: Chain-of-thought reasoning and reinforcement learning for human motion generation.arXiv preprint arXiv:2506.10353, 2025. 3, 5, 6
-
[36]
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforce- ment learning of physics-based character skills.ACM Trans- actions On Graphics (TOG), 37(4):1–14, 2018. 3
work page 2018
-
[37]
Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for styl- ized physics-based character control.ACM Transactions on Graphics (ToG), 40(4):1–20, 2021. 3
work page 2021
-
[38]
Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters.ACM Transactions On Graphics (TOG), 41(4):1–17, 2022. 3
work page 2022
-
[39]
The KIT motion-language dataset.Big Data, 4(4):236–252,
Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset.Big Data, 4(4):236–252,
-
[40]
Babel: Bodies, action and behavior with english la- bels
Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. Babel: Bodies, action and behavior with english la- bels. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 722–731, 2021. 1, 5
work page 2021
-
[41]
Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024
Unitree Robotics. Unitree g1 humanoid robot.https:// www.unitree.com/g1, 2024. 3
work page 2024
-
[42]
A re- duction of imitation learning and structured prediction to no- regret online learning
St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A re- duction of imitation learning and structured prediction to no- regret online learning. InProceedings of the fourteenth inter- national conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceed- ings, 2011. 5
work page 2011
-
[43]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[44]
Robot motion diffusion model: Motion generation for robotic characters
Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz B ¨acher. Robot motion diffusion model: Motion generation for robotic characters. InSIGGRAPH asia 2024 conference papers, pages 1–9, 2024. 3
work page 2024
-
[45]
Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, and Koushil Sreenath. Langwbc: Language-directed humanoid whole-body control via end-to-end learning.arXiv preprint arXiv:2504.21738, 2025. 1, 3, 7
-
[46]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junx- iao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Adversarial locomotion and motion imitation for humanoid policy learning
Jiyuan Shi, Xinzhe Liu, Dewei Wang, Ouyang Lu, S ¨oren Schwertfeger, Fuchun Sun, Chenjia Bai, and Xuelong Li. Adversarial locomotion and motion imitation for humanoid policy learning. InNeural Information Processing Systems (NeurIPS), 2025. 1, 2, 3, 7 10
work page 2025
-
[48]
Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based char- acter control through masked motion inpainting.ACM Trans- actions on Graphics (TOG), 43(6):1–21, 2024. 3, 5
work page 2024
-
[49]
arXiv preprint arXiv:2505.19086 (2025) 2, 3
Chen Tessler, Yifeng Jiang, Erwin Coumans, Zhengyi Luo, Gal Chechik, and Xue Bin Peng. Maskedmanipulator: Versatile whole-body control for loco-manipulation.arXiv preprint arXiv:2505.19086, 2025. 3, 5
-
[50]
Human motion diffu- sion model
Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffu- sion model. InThe Eleventh International Conference on Learning Representations, 2023. 3, 7
work page 2023
-
[51]
Closd: Closing the loop between sim- ulation and diffusion for multi-task character control
Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit Haim Bermano, and Michiel van de Panne. Closd: Closing the loop between sim- ulation and diffusion for multi-task character control. InThe Thirteenth International Conference on Learning Represen- tations, 2024. 3
work page 2024
-
[52]
Pdp: Physics-based character animation via dif- fusion policy
Takara Everest Truong, Michael Piseno, Zhaoming Xie, and Karen Liu. Pdp: Physics-based character animation via dif- fusion policy. InSIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024. 3
work page 2024
-
[53]
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 4
work page 2017
-
[54]
Yan Wu, Korrawe Karunratanakul, Zhengyi Luo, and Siyu Tang. Uniphys: Unified planner and controller with diffusion for flexible physics-based character control.arXiv preprint arXiv:2504.12540, 2025. 3
-
[55]
Lagoon: Language-guided motion control
Shusheng Xu, Huaijie Wang, Yutao Ouyang, Jiaxuan Gao, Zhiyu Mei, Chao Yu, and Yi Wu. Lagoon: Language-guided motion control. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9743–9750. IEEE,
-
[56]
Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. Humanvla: Towards vision-language directed object re- arrangement by physical humanoid.Advances in Neural In- formation Processing Systems, 37:18633–18659, 2024. 1
work page 2024
-
[57]
Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, et al. Leverb: Humanoid whole-body control with latent vision-language instruction.arXiv preprint arXiv:2506.13751, 2025. 1
-
[58]
Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of generative controllers for physics-based characters.ACM Transactions on Graph- ics (TOG), 41(6):1–16, 2022. 5
work page 2022
-
[59]
Moconvq: Unified physics- based motion control via scalable discrete representations
Heyuan Yao, Zhenhua Song, Yuyang Zhou, Tenglong Ao, Baoquan Chen, and Libin Liu. Moconvq: Unified physics- based motion control via scalable discrete representations. ACM Transactions on Graphics (TOG), 43(4):1–21, 2024. 3
work page 2024
-
[60]
Unitracker: Learning universal whole-body motion tracker for humanoid robots, 2025
Kangning Yin, Weishuai Zeng, Ke Fan, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, and Weinan Zhang. Unitracker: Learning universal whole- body motion tracker for humanoid robots.arXiv preprint arXiv:2507.07356, 2025. 3, 5
-
[61]
Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C. Karen Liu, and Jiajun Wu. Visualmimic: Visual humanoid loco- manipulation via motion tracking and generation.arXiv preprint arXiv:2509.20322, 2025. 5
-
[62]
Learning physically simulated tennis skills from broadcast videos.ACM Trans
Ye Yuan, Viktor Makoviychuk, Y Guo, S Fidler, X Peng, and K Fatahalian. Learning physically simulated tennis skills from broadcast videos.ACM Trans. Graph, 42(4), 2023. 3
work page 2023
-
[63]
Physdiff: Physics-guided human motion diffusion model
Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 16010–16021, 2023. 3
work page 2023
-
[64]
Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, and Zongqing Lu. Rl from physical feedback: Align- ing large motion models with humanoid control.arXiv preprint arXiv:2506.12769, 2025. 1, 3, 6, 7
-
[65]
Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,
Kevin Zakka. Mink: Python inverse kinematics based on mu- joco.https://github.com/kevinzakka/mink,
-
[66]
Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025
Yanjie Ze, Zixuan Chen, Jo ˜ao Pedro Ara ´ujo, Zi ang Cao, Xue Bin Peng, Jiajun Wu, and C. Karen Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025. 3
-
[67]
Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holis- tic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025. 3
-
[68]
T2m-gpt: Generating human motion from textual de- scriptions with discrete representations
Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual de- scriptions with discrete representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
work page 2023
-
[69]
Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondif- fuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024. 3
work page 2024
-
[70]
Haoyu Zhao, Sixu Lin, Qingwei Ben, Minyue Dai, Hao Fei, Jingbo Wang, Hua Zou, and Junting Dong. Smap: Self-supervised motion adaptation for physically plausible humanoid whole-body control.arXiv preprint arXiv:2505.19463, 2025. 2, 4 11
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.