Bionic Human-Motion Style Transfer for Physically Executable Whole-Body Control of Humanoid Robots
Pith reviewed 2026-06-28 10:07 UTC · model grok-4.3
The pith
A physics-aware diffusion model enables style transfer from short human motion examples to executable humanoid robot movements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed bionic generation-to-control framework uses a physics-aware multi-condition latent diffusion model to fuse style, content, and trajectory conditions for generating stylized whole-body references, applies classifier-free guidance to control style intensity, and imposes contact-consistency and temporal-smoothness regularization during training, allowing the references to be converted and tracked successfully by a whole-body policy on physical robots.
What carries the argument
Physics-aware multi-condition latent diffusion model fusing style, content, and trajectory conditions with contact-consistency and temporal-smoothness regularization to ensure hardware executability.
If this is right
- Short human style exemplars can be transferred to a variety of robot motion contents.
- Contact and jitter artifacts are reduced compared to animation-oriented style-transfer methods.
- A 96.0% success rate is achieved across 125 reported real-robot trials.
- Style intensity can be adjusted using classifier-free guidance without retraining the model.
Where Pith is reading between the lines
- Robots could exhibit more expressive behaviors in human environments by reusing limited human motion examples.
- The approach may generalize to other types of robots or motion tasks beyond the tested platform.
- Reducing reliance on fixed demonstrations or manual design could accelerate development of natural robot motions.
Load-bearing premise
The diffusion model can effectively combine the different conditions and the regularizations can ensure the motions are executable by the tracking policy on the robot hardware.
What would settle it
A series of real-robot experiments where the generated motions lead to tracking failures or introduce new instabilities at a rate much higher than 4%.
read the original abstract
Expressive whole-body motion is important for humanoid robots operating in human environments, where robots are expected to move stably while presenting readable and adjustable body behaviors. However, most expressive motions are still obtained from fixed demonstrations or manually designed scripts, making it difficult to reuse a demonstrated style across different motion contents. Inspired by the way human motion styles convey affective and intentional cues through gait rhythm, posture, arm swing and body sway, this paper proposes a bionic generation-to-control framework for exemplar-driven style transfer on humanoid robots. Given a short human style exemplar and a target content motion, the proposed framework generates a stylized whole-body reference that preserves the intended motion content while transferring the demonstrated style. A physics-aware multi-condition latent diffusion model is developed to fuse style, content and trajectory conditions, and classifier-free guidance is used to adjust the style intensity without retraining. To improve hardware executability, contact-consistency and temporal-smoothness regularization are imposed on decoded motions during training. The generated references are then converted into G1-compatible robot references and executed by a preview-based whole-body tracking policy trained with a cluster-and-distill strategy. Simulation and Unitree G1 experiments show that the proposed method can transfer short human style exemplars to diverse robot motion contents, reduce contact and jitter artifacts compared with animation-oriented style-transfer baselines, and achieve a 96.0% success rate over 125 reported real-robot trials. The results demonstrate the feasibility of using short human motion exemplars as reusable bionic sources for physically executable expressive humanoid motion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a bionic generation-to-control framework that transfers styles from short human motion exemplars to diverse robot motion contents for humanoid robots. It develops a physics-aware multi-condition latent diffusion model to fuse style, content, and trajectory conditions, employs classifier-free guidance for adjustable style intensity, and adds contact-consistency and temporal-smoothness regularizations during training to promote hardware executability. Stylized references are tracked via a preview-based whole-body policy; simulation and Unitree G1 experiments report reduced artifacts versus baselines and a 96.0% success rate across 125 real-robot trials.
Significance. If the central claims hold, the framework would enable reusable, exemplar-driven expressive motions on humanoids without fixed demonstrations or manual scripting, supporting more natural human-robot interaction. The real-robot validation on 125 trials and the integration of diffusion models with domain-specific regularizations constitute concrete strengths that could be built upon for practical deployment.
major comments (2)
- [Abstract] Abstract: the 96.0% success rate over 125 trials is presented as evidence that the regularized diffusion outputs remain hardware-executable, yet no details on trial diversity, data splits, error bars, or ablation of the regularizers are supplied, leaving open whether the result generalizes or depends on particular style/content pairs.
- [Abstract] Abstract: physics awareness is realized exclusively through two regularizers applied to decoded motions at training time; the text does not describe any explicit physics simulation or constraint projection inside the latent diffusion loop, so it is unclear whether classifier-free guidance at inference can still produce contact or smoothness violations that the downstream tracker cannot handle.
minor comments (1)
- The abstract would be clearer if it briefly characterized the 125 trials (e.g., number of distinct styles, motion contents, and failure modes observed).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the two major comments point by point below, providing clarifications and indicating planned revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the 96.0% success rate over 125 trials is presented as evidence that the regularized diffusion outputs remain hardware-executable, yet no details on trial diversity, data splits, error bars, or ablation of the regularizers are supplied, leaving open whether the result generalizes or depends on particular style/content pairs.
Authors: The abstract is intended as a high-level summary, with full details provided in the body of the paper. Specifically, the 125 trials involve 5 different human motion style exemplars applied to 25 varied content motions, as described in Section 4.3. The data splits for training the diffusion model are outlined in Section 4.1. Ablation studies on the regularizers are reported in Table 3, demonstrating their impact on success rate. Since the primary metric is binary success, error bars were not computed, but we will include standard deviations for secondary metrics such as average contact force violation in the revised version. We will revise the abstract to include a short clause on trial diversity to address this concern. revision: yes
-
Referee: [Abstract] Abstract: physics awareness is realized exclusively through two regularizers applied to decoded motions at training time; the text does not describe any explicit physics simulation or constraint projection inside the latent diffusion loop, so it is unclear whether classifier-free guidance at inference can still produce contact or smoothness violations that the downstream tracker cannot handle.
Authors: We clarify that the multi-condition latent diffusion model is made physics-aware precisely by incorporating the contact-consistency and temporal-smoothness regularizers into the training objective applied to the decoded motions. This trains the model to generate latent representations that decode to physically plausible motions. No explicit physics simulation is performed within the diffusion sampling loop, as this would be computationally prohibitive; instead, the constraints are learned during training. Our real-robot results indicate that the generated motions are successfully tracked without violations that the policy cannot handle. We will add an explicit statement in the methods section (Section 3.2) to describe this design choice and its implications for inference-time guidance. revision: yes
Circularity Check
No circularity: method uses standard diffusion plus regularizers; success rate is empirical hardware result
full rationale
The paper presents a latent diffusion model conditioned on style/content/trajectory, trained with contact-consistency and temporal-smoothness losses, followed by a separate tracking policy. The 96% success rate is reported from 125 real-robot trials, not derived from any fitted quantity defined by the same model. No equations reduce predictions to inputs by construction, no self-citation chains support core claims, and no ansatz or uniqueness theorem is invoked from prior author work. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Philosophical Transactions of the Royal Society B: Biological Sciences364(1535), 3475–3484 (2009)
De Gelder, B.: Why bodies? twelve reasons for including bodily expressions in affective neuroscience. Philosophical Transactions of the Royal Society B: Biological Sciences364(1535), 3475–3484 (2009)
2009
-
[2]
Scientific data7(1), 292 (2020)
Zhang, M., Yu, L., Zhang, K., Du, B., Zhan, B., Chen, S., Jiang, X., Guo, S., Zhao, J., Wang, Y.,et al.: Kinematic dataset of actors expressing emotions. Scientific data7(1), 292 (2020)
2020
-
[3]
Scientific Reports12, 14165 (2022) https://doi.org/10.1038/ s41598-022-17866-w
Lott, L.L., Spengler, F.B., St¨ achele, T., Schiller, B., Heinrichs, M.: Embody/em- face as a new open tool to assess emotion recognition from body and face expressions. Scientific Reports12, 14165 (2022) https://doi.org/10.1038/ s41598-022-17866-w
2022
-
[4]
PLOS ONE18(9), 0290564 (2023) https://doi.org/10.1371/journal.pone.0290564
Riemer, H., Joseph, J.V., Lee, A.Y., Riemer, R.: Emotion and motion: Toward emotion recognition based on standing and walking. PLOS ONE18(9), 0290564 (2023) https://doi.org/10.1371/journal.pone.0290564
-
[5]
Sensors22(12), 4587 (2022) https://doi.org/10.3390/s22124587
Matsumaru, T.: Methods of generating emotional movements and methods of transmitting behavioral intentions: A perspective on human-coexistence robots. Sensors22(12), 4587 (2022) https://doi.org/10.3390/s22124587
-
[6]
PLOS ONE17(8), 0271789 (2022) https://doi.org/10.1371/ journal.pone.0271789
Mahzoon, H., Ueda, A., Yoshikawa, Y., Ishiguro, H.: Effect of robot’s vertical body movement on its perceived emotion: A preliminary study on vertical oscilla- tion and transition. PLOS ONE17(8), 0271789 (2022) https://doi.org/10.1371/ journal.pone.0271789
2022
-
[7]
Skeleton-aware networks for deep motion retargeting,
Aberman, K., Weng, Y., Lischinski, D., Cohen-Or, D., Chen, B.: Unpaired motion style transfer from video to animation. ACM Transactions on Graphics39(4) (2020) https://doi.org/10.1145/3386569.3392469
-
[8]
Holden, D., Habibie, I., Kusajima, I., Komura, T.: Fast neural style transfer for motion data. IEEE Comput. Graph. Appl.37(4), 42–49 (2017) https://doi.org/ 10.1109/MCG.2017.3271464 20
-
[9]
ACM Transactions on Graphics41(3), 1–16 (2022) https://doi
Jang, D.-K., Park, S., Lee, S.-H.: Motion puzzle: Arbitrary motion style transfer by body part. ACM Transactions on Graphics41(3), 1–16 (2022) https://doi. org/10.1145/3516429
-
[10]
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 821–830 (2024)
Song, W., Jin, X., Li, S., Chen, C., Hao, A., Hou, X., Li, N., Qin, H.: Arbitrary motion style transfer with multi-condition motion latent diffusion model. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 821–830 (2024)
2024
-
[11]
Journal of Bionic Engineering21(6), 2759–2778 (2024) https://doi.org/ 10.1007/s42235-024-00586-4
Wang, X., Guo, W., He, Z., Li, R., Zha, F., Sun, L.: Bionic jumping of humanoid robot via online centroid trajectory optimization and high dynamic motion con- troller. Journal of Bionic Engineering21(6), 2759–2778 (2024) https://doi.org/ 10.1007/s42235-024-00586-4
-
[12]
Journal of Bionic Engineering20, 1449–1466 (2023) https://doi.org/10.1007/s42235-023-00347-9
Li, J., Gao, H., Wan, Y., Yu, H., Zhou, C.: A real-time planning and control framework for robust and dynamic quadrupedal locomotion. Journal of Bionic Engineering20, 1449–1466 (2023) https://doi.org/10.1007/s42235-023-00347-9
-
[13]
Biomimetics10(9), 637 (2025) https://doi.org/10.3390/biomimetics10090637
Fang, J., Jin, Y., Wang, B., Liu, Z.: Bio-inspired central pattern generator for adaptive gait generation and stability in humanoid robots on sloped surfaces. Biomimetics10(9), 637 (2025) https://doi.org/10.3390/biomimetics10090637
-
[14]
In: 2003 IEEE International Conference on Robotics and Automation (Cat
Kajita, S., Kanehiro, F., Kaneko, K., Fujiwara, K., Harada, K., Yokoi, K., Hirukawa, H.: Biped walking pattern generation by using preview control of zero-moment point. In: 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), vol. 2, pp. 1620–16262 (2003). https://doi. org/10.1109/ROBOT.2003.1241826
-
[15]
In: 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp
Pratt, J., Carff, J., Drakunov, S., Goswami, A.: Capture point: A step toward humanoid push recovery. In: 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp. 200–207 (2006). https://doi.org/10.1109/ICHR.2006. 321385
-
[16]
The International Journal of Robotics Research31(9), 1094–1113 (2012)
Koolen, T., Boer, T.D., Rebula, J., Goswami, A., Pratt, J.: Capturability-based analysis and control of legged locomotion, part 1: Theory and application to three simple gait models. The International Journal of Robotics Research31(9), 1094–1113 (2012)
2012
-
[17]
In: 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp
Wieber, P.-b.: Trajectory free linear model predictive control for stable walking in the presence of strong perturbations. In: 2006 6th IEEE-RAS International Conference on Humanoid Robots, pp. 137–142 (2006). https://doi.org/10.1109/ ICHR.2006.321375
arXiv 2006
-
[18]
Advanced Robotics24, 719–737 (2010) https://doi.org/10.1163/016918610X493552 21
Herdt, A., Diedam, H., Wieber, P.-B., Dimitrov, D., Mombaur, K., Diehl, M.: Online walking motion generation with automatic foot step placement. Advanced Robotics24, 719–737 (2010) https://doi.org/10.1163/016918610X493552 21
-
[19]
Autonomous Robots 40(2015) https://doi.org/10.1007/s10514-015-9479-3
Kuindersma, S., Deits, R., Fallon, M., Valenzuela, A., Dai, H., Permenter, F., Koolen, T., Marion, P., Tedrake, R.: Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Autonomous Robots 40(2015) https://doi.org/10.1007/s10514-015-9479-3
-
[20]
ACM Transactions on Graphics (TOG) (2020)
Aberman, K., Weng, Y., Lischinski, D., Cohen-Or, D., Chen, B.: Unpaired motion style transfer from video to animation. ACM Transactions on Graphics (TOG) (2020)
2020
-
[21]
IEEE (2024)
Kim, B., Kim, J., Chang, H.J., Choi, J.Y.: Most: Motion style transformer between diverse action contents. IEEE (2024)
2024
-
[22]
Guo, C., Mu, Y., Zuo, X., Dai, P., Yan, Y., Lu, J., Cheng, L.: Generative human motion stylization in latent space (2024)
2024
-
[23]
https://arxiv.org/abs/2407.12783
Zhong, L., Xie, Y., Jampani, V., Sun, D., Jiang, H.: SMooDi: Stylized Motion Diffusion Model (2024). https://arxiv.org/abs/2407.12783
arXiv 2024
-
[24]
https://arxiv.org/abs/2405.06646
Hu, L., Zhang, Z., Ye, Y., Xu, Y., Xia, S.: Diffusion-based Human Motion Style Transfer with Semantic Guidance (2024). https://arxiv.org/abs/2405.06646
arXiv 2024
-
[25]
https://arxiv.org/abs/2209.14916
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human Motion Diffusion Model (2022). https://arxiv.org/abs/2209.14916
Pith/arXiv arXiv 2022
-
[26]
https:// arxiv.org/abs/2208.15001
Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model (2022). https:// arxiv.org/abs/2208.15001
arXiv 2022
-
[27]
In: Computer Vision and Pattern Recognition (CVPR) (2023)
Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: Mofusion: A framework for denoising-diffusion-based motion synthesis. In: Computer Vision and Pattern Recognition (CVPR) (2023)
2023
-
[28]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp
Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: Physdiff: Physics-guided human motion diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16010–16021 (2023)
2023
-
[29]
Available: http://dx.doi.org/10.1145/3197517.3201311
Peng, X.B., Abbeel, P., Levine, S., Panne, M.: Deepmimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics37(4), 1–14 (2018) https://doi.org/10.1145/3197517.3201311
-
[30]
Amp: adversarial motion priors for stylized physics-based character control,
Peng, X.B., Ma, Z., Abbeel, P., Levine, S., Kanazawa, A.: Amp: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics40(4), 1–20 (2021) https://doi.org/10.1145/3450626.3459670
-
[31]
https://arxiv.org/abs/2305.06456
Luo, Z., Cao, J., Winkler, A., Kitani, K., Xu, W.: Perpetual Humanoid Control for Real-time Simulated Avatars (2023). https://arxiv.org/abs/2305.06456
arXiv 2023
-
[32]
https:// arxiv.org/abs/2310.04582
Luo, Z., Cao, J., Merel, J., Winkler, A., Huang, J., Kitani, K., Xu, W.: Universal 22 Humanoid Motion Representations for Physics-Based Control (2024). https:// arxiv.org/abs/2310.04582
arXiv 2024
-
[33]
https://arxiv.org/abs/2402.16796
Cheng, X., Ji, Y., Chen, J., Yang, R., Yang, G., Wang, X.: Expressive Whole-Body Control for Humanoid Robots (2024). https://arxiv.org/abs/2402.16796
arXiv 2024
-
[34]
https://arxiv.org/ abs/2412.13196
Ji, M., Peng, X., Liu, F., Li, J., Yang, G., Cheng, X., Wang, X.: ExBody2: Advanced Expressive Humanoid Whole-Body Control (2025). https://arxiv.org/ abs/2412.13196
arXiv 2025
-
[35]
Chen, Z., Ji, M., Cheng, X., Peng, X., Peng, X.B., Wang, X.: Gmt: General motion tracking for humanoid whole-body control. arXiv:2506.14770 (2025)
arXiv 2025
-
[36]
https://arxiv.org/abs/2509.13833
Zhang, Z., Guo, J., Chen, C., Wang, J., Lin, C., Lian, Y., Xue, H., Wang, Z., Liu, M., Lyu, J., Liu, H., Wang, H., Yi, L.: Track Any Motions under Any Disturbances (2025). https://arxiv.org/abs/2509.13833
arXiv 2025
-
[37]
https://arxiv.org/abs/2502.01143
He, T., Gao, J., Xiao, W., Zhang, Y., Wang, Z., Wang, J., Luo, Z., He, G., Sobanbab, N., Pan, C., Yi, Z., Qu, G., Kitani, K., Hodgins, J., Fan, L.J., Zhu, Y., Liu, C., Shi, G.: ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills (2025). https://arxiv.org/abs/2502.01143
arXiv 2025
-
[38]
https://arxiv.org/abs/2508.08241
Liao, Q., Truong, T.E., Huang, X., Gao, Y., Tevet, G., Sreenath, K., Liu, C.K.: BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion (2025). https://arxiv.org/abs/2508.08241
Pith/arXiv arXiv 2025
-
[39]
https://arxiv.org/abs/2506.12851
Xie, W., Han, J., Zheng, J., Li, H., Liu, X., Shi, J., Zhang, W., Bai, C., Li, X.: KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly- Dynamic Skills (2025). https://arxiv.org/abs/2506.12851
arXiv 2025
-
[40]
https://arxiv.org/abs/2509.16638
Han, J., Xie, W., Zheng, J., Shi, J., Zhang, W., Xiao, T., Bai, C.: KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control (2025). https://arxiv.org/abs/2509.16638
arXiv 2025
-
[41]
arXiv preprint arXiv:2203.08063 (2022)
Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: Motion- clip: Exposing human motion generation to clip space. arXiv preprint arXiv:2203.08063 (2022)
arXiv 2022
-
[42]
Park, S., Jang, D.-K., Lee, S.-H.: Diverse motion stylization for multiple style domains via spatial-temporal graph-based generative model. Proc. ACM Comput. Graph. Interact. Tech.4(3) (2021) https://doi.org/10.1145/3480145
-
[43]
Dudley, and Per Ola Kristensson
Song, W., Jin, X., Li, S., Chen, C., Hao, A., Hou, X.: Finestyle: Semantic- aware fine-grained motion style transfer with dual interactive-flow fusion. IEEE Transactions on Visualization and Computer Graphics29(11), 4361–4371 (2023) https://doi.org/10.1109/TVCG.2023.3320216 23
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.