pith. sign in

arxiv: 2606.26981 · v1 · pith:HUTRXO6Anew · submitted 2026-06-25 · 💻 cs.RO · cs.AI

In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics

Pith reviewed 2026-06-26 05:13 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords text-to-motion synthesislanguage model planningphysics simulationmodel predictive controlopen-vocabulary generationhuman motionclosed-loop refinementsemantic alignment
0
0 comments X

The pith

ICMPG turns open-vocabulary text commands into human motions by running LLM candidate generation inside a closed loop that scores sequences with physical simulation and semantic alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing text-to-motion methods split into two camps: language models interpret varied instructions but often ignore physics, while simulation-based controllers produce realistic trajectories yet falter on novel or fine-grained language. The paper introduces In-Context Model Predictive Generation, which reframes synthesis as an iterative model-predictive process. An LLM first decomposes the command and emits candidate motion-token sequences; a second module then simulates each candidate, measures physical violations and text alignment, computes a composite reward, and feeds the best sequence back to steer the next LLM step. Because this refinement occurs at inference time without retraining, the method claims both broader command coverage and higher physical-semantic consistency on standard and zero-shot benchmarks.

Core claim

By treating motion synthesis as a Model Predictive Control process, the Context-Aware Motion Generation module uses an LLM to produce candidate sequences from motion tokens while the Model Predictive Generation module evaluates those candidates via physical simulation and semantic alignment, selects the highest-reward sequence, and uses it to condition subsequent generation steps, yielding motions that are more physically plausible and semantically faithful than open-loop baselines without task-specific policy retraining.

What carries the argument

The Model Predictive Generation (MPG) module, which runs physical simulation and semantic scoring on LLM-generated candidate sequences to compute a composite reward and select the sequence that guides the next generation iteration.

If this is right

  • The closed-loop selection improves physical plausibility while preserving the LLM's ability to handle open-vocabulary and zero-shot instructions.
  • No task-specific retraining or hand-tuned weights are required for the refinement process to function across evaluated settings.
  • Different LLM backbones can be swapped into the planner module while retaining the same physical-semantic feedback loop.
  • Motions adapt to both input semantics and simulated environment constraints at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same candidate-evaluation loop could be tested on longer-horizon or multi-person motion tasks where error accumulation is more severe.
  • If the composite reward proves stable, the framework might reduce reliance on curated motion-capture datasets by letting simulation provide the primary learning signal.
  • Extending the physical simulator to include contact-rich or object-interaction scenarios would directly test whether the current reward formulation generalizes beyond locomotion.

Load-bearing premise

A single composite reward derived from simulation errors and text alignment scores can reliably rank motion candidates and steer the LLM planner across diverse commands without any per-task weight tuning.

What would settle it

On the paper's reported benchmarks, the method shows no statistically significant gains over baselines in combined physical metrics (foot sliding, penetration) and semantic metrics (text-motion similarity) when the same LLM backbone is used without the closed-loop selection step.

Figures

Figures reproduced from arXiv: 2606.26981 by Guanbin Li, Junfan Lin, Liang Lin, Xiaomeng Fu, Yang Liu, Yaowei Wang, Ziliang Chen.

Figure 1
Figure 1. Figure 1: Conceptual Analogy Between Classical Model Predictive Control [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Complete Workflow of the In-Context Model Predictive Generation (ICMPG) Framework. The pipeline begins with a one-time offline preparation phase, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison Under Open-Vocabulary Textual Descriptions. We compare ICMPG with MotionCLIP and AnySkill using the input texts “Ballroom dance [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of Motions Under Physical Simulation. We compare ICMPG with MotionCLIP and AnySkill using the input texts “Perform a squat” and [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of Intermediate Motions Across MPG Iterations. We [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Long-Sequence Motion Generation Examples Using the BABEL Annotation Protocol. The generated motions exhibit semantic alignment with the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of Codebook Size. We evaluate various codebook sizes and [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Evaluation of the Reward Function. We conduct an ablation study to [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: t-SNE visualization of motion codebook utilization. Blue points [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
read the original abstract

Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic complexity, fine-grained instructions, and novel concepts. To address this gap, we propose In-Context Model Predictive Generation (ICMPG), a framework that integrates language-model planning with inference-time physical feedback. ICMPG reformulates motion synthesis as a Model Predictive Control (MPC)-like process with two modules. The Context-Aware Motion Generation (CAMG) module uses an LLM as a planner to decompose textual commands and generate candidate motion sequences from motion tokens. The Model Predictive Generation (MPG) module evaluates these candidates through physical simulation and semantic alignment, estimates a composite reward, and selects the best sequence to guide subsequent generation steps. Unlike open-loop generation, this closed-loop refinement enables ICMPG to adapt motions to both the input semantics and the simulated physical environment without task-specific policy retraining. Extensive experiments across standard and zero-shot open-vocabulary settings show that ICMPG generalizes robustly to diverse commands and produces motions that are more physically plausible and semantically faithful than representative baselines on the evaluated benchmarks. The framework bridges semantic interpretation and physical simulation while remaining flexible enough to incorporate different LLM backbones, enabling more versatile and controllable text-driven motion synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces In-Context Model Predictive Generation (ICMPG) for open-vocabulary text-to-motion synthesis. It consists of a Context-Aware Motion Generation (CAMG) module that uses an LLM to decompose commands into motion-token sequences and a Model Predictive Generation (MPG) module that evaluates candidate sequences via physical simulation and semantic alignment, computes a composite reward, and feeds the best sequence back to guide the next LLM generation step. The central claim is that this closed-loop, inference-time process yields motions that are both more physically plausible and semantically faithful than baselines on standard and zero-shot benchmarks, without task-specific retraining or hand-tuned weights.

Significance. If the central claims hold, the work would meaningfully advance text-driven motion synthesis by demonstrating a practical way to close the loop between LLM planning and physics simulation at inference time. The avoidance of task-specific retraining and the ability to swap LLM backbones are explicit strengths. The closed-loop refinement directly targets the semantic-vs-physical trade-off identified in the introduction.

major comments (2)
  1. [MPG module description] MPG module (composite reward definition): The claim that the composite reward requires 'no hand-tuned weights' is load-bearing for the no-retraining guarantee. The manuscript does not supply the explicit combination rule (scaling, normalization, or selection threshold) between the physical simulation score and the semantic alignment score. Any fixed but author-chosen weighting would constitute hand-tuning, directly contradicting the central assertion; an ablation showing invariance to the combination method is therefore required.
  2. [Experiments] Experiments section: The abstract asserts robust generalization and superiority on 'evaluated benchmarks,' yet no quantitative tables, ablation studies, or error breakdowns are referenced. Without these, it is impossible to verify whether the reported gains in physical plausibility and semantic fidelity are supported by the data or sensitive to post-hoc choices in candidate selection.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'representative baselines' is vague; naming the specific methods compared would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments correctly identify areas where additional clarity is needed on the MPG module and experimental reporting. We will revise the manuscript to address both points directly.

read point-by-point responses
  1. Referee: [MPG module description] MPG module (composite reward definition): The claim that the composite reward requires 'no hand-tuned weights' is load-bearing for the no-retraining guarantee. The manuscript does not supply the explicit combination rule (scaling, normalization, or selection threshold) between the physical simulation score and the semantic alignment score. Any fixed but author-chosen weighting would constitute hand-tuning, directly contradicting the central assertion; an ablation showing invariance to the combination method is therefore required.

    Authors: We agree that the explicit combination rule for the composite reward was not provided in sufficient detail. In the revised manuscript we will add the precise formulation, including the scaling, normalization procedure, and selection threshold used to combine the physical simulation score and semantic alignment score. We will also include an ablation study that varies the combination method to demonstrate performance invariance, thereby supporting the no-task-specific-tuning claim. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts robust generalization and superiority on 'evaluated benchmarks,' yet no quantitative tables, ablation studies, or error breakdowns are referenced. Without these, it is impossible to verify whether the reported gains in physical plausibility and semantic fidelity are supported by the data or sensitive to post-hoc choices in candidate selection.

    Authors: We acknowledge that the abstract and main text should more explicitly reference the quantitative results. In the revision we will add clear pointers to the existing tables, ablation studies, and error breakdowns already present in the Experiments section. If additional breakdowns on candidate selection sensitivity are needed to fully address the concern, they will be incorporated as well. revision: yes

Circularity Check

0 steps flagged

No circularity; framework relies on external simulator and benchmark evaluation

full rationale

The abstract describes ICMPG as using an external physical simulator plus semantic alignment to form a composite reward that guides LLM steps, with claims evaluated on standard and zero-shot benchmarks. No equations, fitted parameters, self-citations, or ansatzes are shown that reduce any prediction or uniqueness claim to the inputs by construction. The method is presented as self-contained against external benchmarks without task-specific retraining, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions about LLM planning quality and simulator fidelity that are not independently validated in the provided abstract.

axioms (2)
  • domain assumption An LLM can reliably decompose open-vocabulary text commands into usable motion token sequences
    This is the core function of the CAMG module.
  • domain assumption Physical simulation plus semantic alignment produces a reward signal that improves subsequent generation steps
    This underpins the closed-loop refinement in the MPG module.

pith-pipeline@v0.9.1-grok · 5831 in / 1265 out tokens · 23344 ms · 2026-06-26T05:13:28.355621+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 9 linked inside Pith

  1. [1]

    Human motion diffusion model,

    G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,”arXiv preprint arXiv:2209.14916, 2022

  2. [2]

    Generating diverse and natural 3D human motions from text,

    C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3D human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161

  3. [3]

    Generating human motion from textual descriptions with discrete rep- resentations,

    J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, and Y . Shan, “Generating human motion from textual descriptions with discrete rep- resentations,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 730–14 740

  4. [4]

    MotionDiffuse: Text-driven human motion generation with diffusion models,

    M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “MotionDiffuse: Text-driven human motion generation with diffusion models,”arXiv preprint arXiv:2208.15001, 2022

  5. [5]

    Motiongpt: Human motion as a foreign language,

    B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,”Advances in Neural Information Processing Systems, vol. 36, pp. 20 067–20 079, Dec. 2023

  6. [6]

    AttT2M: Text-driven human motion generation with a multi-perspective attention mechanism,

    C. Zhong, L. Hu, Z. Zhang, and S. Xia, “AttT2M: Text-driven human motion generation with a multi-perspective attention mechanism,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 509–519

  7. [7]

    InstructAvatar: Text-guided emotion and motion control for avatar generation,

    Y . Wang, J. Guo, J. Bai, R. Yu, T. He, X. Tan, X. Sun, and J. Bian, “InstructAvatar: Text-guided emotion and motion control for avatar generation,” inProceedings of the AAAI Conference on Artificial In- telligence, vol. 39, no. 8, 2025, pp. 8132–8140

  8. [8]

    MotionCraft: Crafting whole-body motion with plug-and-play multi- modal controls,

    Y . Bian, A. Zeng, X. Ju, X. Liu, Z. Zhang, W. Liu, and Q. Xu, “MotionCraft: Crafting whole-body motion with plug-and-play multi- modal controls,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 2, 2025, pp. 1880–1888

  9. [9]

    HOI-Diff: Text-driven synthesis of 3D human-object interactions using diffusion models,

    X. Peng, Y . Xie, Z. Wu, V . Jampani, D. Sun, and H. Jiang, “HOI-Diff: Text-driven synthesis of 3D human-object interactions using diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 2878–2888

  10. [10]

    MotionCLIP: Exposing human motion generation to CLIP space,

    G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “MotionCLIP: Exposing human motion generation to CLIP space,” in European Conference on Computer Vision, 2022, pp. 358–374

  11. [11]

    AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars,

    F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars,”arXiv preprint arXiv:2205.08535, 2022

  12. [12]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning, 2021, pp. 8748–8763

  13. [13]

    LaMDA: Language models for dialog applications,

    R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, H. Cheng, Y . Li, H. W. Lee, A. Ghafouri, J. Natarajan, B. Cohenet al., “LaMDA: Language models for dialog applications,”arXiv preprint arXiv:2201.08239, 2022

  14. [14]

    GPT-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “GPT-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  15. [15]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  16. [16]

    Prompt, plan, perform: LLM-based humanoid control via quantized imitation learning,

    J. Sun, Q. Zhang, Y . Duan, X. Jiang, C. Cheng, and R. Xu, “Prompt, plan, perform: LLM-based humanoid control via quantized imitation learning,” inProceedings of the IEEE International Conference on Robotics and Automation, 2024, pp. 16 236–16 242

  17. [17]

    Plan, posture and go: Towards open-world text-to-motion generation,

    J. Liu, W. Dai, C. Wang, Y . Cheng, Y . Tang, and X. Tong, “Plan, posture and go: Towards open-world text-to-motion generation,”arXiv preprint arXiv:2312.14828, 2023

  18. [18]

    PhysDiff: Physics- guided human motion diffusion model,

    Y . Yuan, J. Song, U. Iqbal, A. Vahdat, and J. Kautz, “PhysDiff: Physics- guided human motion diffusion model,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 010–16 021

  19. [19]

    AnySkill: Learning open-vocabulary physical skill for interactive agents,

    J. Cui, T. Liu, N. Liu, Y . Yang, Y . Zhu, and S. Huang, “AnySkill: Learning open-vocabulary physical skill for interactive agents,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 852–862

  20. [20]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,”Machine Learning, vol. 8, pp. 279–292, 1992

  21. [21]

    Learning to predict by the methods of temporal differ- ences,

    R. S. Sutton, “Learning to predict by the methods of temporal differ- ences,”Machine Learning, vol. 3, pp. 9–44, 1988

  22. [22]

    Model predictive heuristic control: Applications to industrial processes,

    J. Richalet, A. Rault, J. L. Testud, and J. Papon, “Model predictive heuristic control: Applications to industrial processes,”Automatica, vol. 14, no. 5, pp. 413–428, 1978

  23. [23]

    Kouvaritakis and M

    B. Kouvaritakis and M. Cannon,Model Predictive Control: Classical, Robust and Stochastic. Cham, Switzerland: Springer International Publishing, 2016

  24. [24]

    Neural discrete representation learning,

    A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

  25. [25]

    Light-T2M: A lightweight and fast model for text-to-motion generation,

    L.-A. Zeng, G. Huang, G. Wu, and W.-S. Zheng, “Light-T2M: A lightweight and fast model for text-to-motion generation,” inProceed- ings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9797–9805

  26. [26]

    Human motion diffusion as a generative prior,

    Y . Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Human motion diffusion as a generative prior,”arXiv preprint arXiv:2303.01418, 2023. 15

  27. [27]

    EMDM: Efficient motion diffusion model for fast and high-quality motion generation,

    W. Zhou, Z. Dou, Z. Cao, Z. Liao, J. Wang, W. Wang, Y . Liu, T. Komura, W. Wang, and L. Liu, “EMDM: Efficient motion diffusion model for fast and high-quality motion generation,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 18–38

  28. [28]

    Executing your commands via motion diffusion in latent space,

    X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 000–18 010

  29. [29]

    The KIT motion-language dataset,

    M. Plappert, C. Mandery, and T. Asfour, “The KIT motion-language dataset,”Big Data, vol. 4, no. 4, pp. 236–252, 2016

  30. [30]

    Flag3D: A 3D fitness activity dataset with language instruction,

    Y . Tang, J. Liu, A. Liu, B. Yang, W. Dai, Y . Rao, J. Lu, J. Zhou, and X. Li, “Flag3D: A 3D fitness activity dataset with language instruction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 106–22 117

  31. [31]

    AMASS: Archive of motion capture as surface shapes,

    N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “AMASS: Archive of motion capture as surface shapes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5442–5451

  32. [32]

    Motion-x: A large-scale 3d expressive whole-body human motion dataset,

    J. Lin, A. Zeng, S. Lu, Y . Cai, R. Zhang, H. Wang, and L. Zhang, “Motion-x: A large-scale 3d expressive whole-body human motion dataset,”Advances in Neural Information Processing Systems, vol. 36, pp. 25 268–25 280, 2023

  33. [33]

    Motion-X++: A large-scale multimodal 3D whole-body human motion dataset,

    Y . Zhang, J. Lin, A. Zeng, G. Wu, S. Lu, Y . Fu, Y . Cai, R. Zhang, H. Wang, and L. Zhang, “Motion-X++: A large-scale multimodal 3D whole-body human motion dataset,”arXiv preprint arXiv:2501.05098, 2025

  34. [34]

    Being comes from not-being: Open-vocabulary text-to-motion generation with wordless training,

    J. Lin, J. Chang, L. Liu, G. Li, L. Lin, Q. Tian, and C.-W. Chen, “Being comes from not-being: Open-vocabulary text-to-motion generation with wordless training,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 222–23 231

  35. [35]

    Action- GPT: Leveraging large-scale language models for improved and gen- eralized action generation,

    S. S. Kalakonda, S. Maheshwari, and R. K. Sarvadevabhatla, “Action- GPT: Leveraging large-scale language models for improved and gen- eralized action generation,” inProceedings of the IEEE International Conference on Multimedia and Expo, 2023, pp. 31–36

  36. [36]

    MoConVQ: Uni- fied physics-based motion control via scalable discrete representations,

    H. Yao, Z. Song, Y . Zhou, T. Ao, B. Chen, and L. Liu, “MoConVQ: Uni- fied physics-based motion control via scalable discrete representations,” ACM Transactions on Graphics, vol. 43, no. 4, pp. 1–21, 2024

  37. [37]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  38. [38]

    UniPhys: Unified planner and controller with diffusion for flexible physics-based character control,

    Y . Wu, K. Karunratanakul, Z. Luo, and S. Tang, “UniPhys: Unified planner and controller with diffusion for flexible physics-based character control,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13 214–13 224

  39. [39]

    Physics-based character controllers with reinforcement learn- ing,

    D. Reda, “Physics-based character controllers with reinforcement learn- ing,” Ph.D. dissertation, University of British Columbia, 2025

  40. [40]

    Policy-space diffusion for physics-based character animation,

    M. Rocca, S. Darkner, K. Erleben, and S. Andrews, “Policy-space diffusion for physics-based character animation,”ACM Transactions on Graphics, vol. 44, no. 3, pp. 1–18, 2025

  41. [41]

    PhysicsFC: Learning user-controlled skills for a physics-based football player controller,

    M. Kim, E. Jung, and Y . Lee, “PhysicsFC: Learning user-controlled skills for a physics-based football player controller,”arXiv preprint arXiv:2504.21216, 2025

  42. [42]

    EDGE: Editable dance generation from music,

    J. Tseng, R. Castellon, and K. Liu, “EDGE: Editable dance generation from music,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 448–458

  43. [43]

    Full-body articulated human-object interaction,

    N. Jiang, T. Liu, Z. Cao, J. Cui, Z. Zhang, Y . Chen, H. Wang, Y . Zhu, and S. Huang, “Full-body articulated human-object interaction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9365–9376

  44. [44]

    PADL: Language- directed physics-based character control,

    J. Juravsky, Y . Guo, S. Fidler, and X. B. Peng, “PADL: Language- directed physics-based character control,” inSIGGRAPH Asia 2022 Conference Papers, 2022, pp. 1–9

  45. [45]

    PDP: Physics-based character animation via diffusion policy,

    T. E. Truong, M. Piseno, Z. Xie, and K. Liu, “PDP: Physics-based character animation via diffusion policy,” inSIGGRAPH Asia 2024 Conference Papers, Dec. 2024, pp. 1–10, art. no. 86

  46. [46]

    CLoSD: Closing the loop between simulation and diffusion for multi-task character control,

    G. Tevet, S. Raab, S. Cohan, D. Reda, Z. Luo, X. B. Peng, A. H. Bermano, and M. van de Panne, “CLoSD: Closing the loop between simulation and diffusion for multi-task character control,”arXiv preprint arXiv:2410.03441, 2024

  47. [47]

    Generative adversarial imitation learning,

    J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in Neural Information Processing Systems, vol. 29, 2016

  48. [48]

    Catch & carry: Reusable neural controllers for vision-guided whole-body tasks,

    J. Merel, S. Tunyasuvunakool, A. Ahuja, Y . Tassa, L. Hasenclever, V . Pham, T. Erez, G. Wayne, and N. Heess, “Catch & carry: Reusable neural controllers for vision-guided whole-body tasks,”ACM Transac- tions on Graphics, vol. 39, no. 4, pp. 39:1–39:12, 2020

  49. [49]

    Perpetual humanoid control for real-time simulated avatars,

    Z. Luo, J. Cao, K. Kitani, W. Xuet al., “Perpetual humanoid control for real-time simulated avatars,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 895–10 904

  50. [50]

    Isaac Gym: High- performance GPU-based physics simulation for robot learning,

    V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handaet al., “Isaac Gym: High- performance GPU-based physics simulation for robot learning,”arXiv preprint arXiv:2108.10470, 2021

  51. [51]

    SMPL: A skinned multi-person linear model,

    M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,” inSeminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023, pp. 851–866

  52. [52]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations, 2022

  53. [53]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  54. [54]

    The Llama 3 herd of models,

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The Llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  55. [55]

    CMU Graphics Lab Motion Capture Database,

    J. Hodgins, “CMU Graphics Lab Motion Capture Database,” [Online]. Available: http://mocap.cs.cmu.edu, 2015, accessed: Jun. 27, 2020

  56. [56]

    ActionCLIP: A new paradigm for video action recognition,

    M. Wang, J. Xing, and Y . Liu, “ActionCLIP: A new paradigm for video action recognition,”arXiv preprint arXiv:2109.08472, 2021

  57. [57]

    Go to zero: Towards zero-shot motion generation with million- scale data,

    K. Fan, S. Lu, M. Dai, R. Yu, L. Xiao, Z. Dou, J. Dong, L. Ma, and J. Wang, “Go to zero: Towards zero-shot motion generation with million- scale data,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13 336–13 348

  58. [58]

    PP-Motion: Physical-perceptual fidelity evaluation for human motion generation,

    S. Zhao, Z. Wang, T. Luan, J. Jia, W. Zhu, J. Luo, J. Yuan, and N. Xi, “PP-Motion: Physical-perceptual fidelity evaluation for human motion generation,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 6840–6849

  59. [59]

    Llama 2: Open foundation and fine-tuned chat models,

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  60. [60]

    Phi- 4 technical report,

    M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmannet al., “Phi- 4 technical report,”arXiv preprint arXiv:2412.08905, 2024

  61. [61]

    Gemini 3 Pro Model Card,

    Google DeepMind, “Gemini 3 Pro Model Card,” [Online]. Available: https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf, 2026, model released Nov. 2025; last updated May 2026; accessed Jun. 24, 2026

  62. [62]

    MotionLLM: Understanding human behaviors from human motions and videos,

    L.-H. Chen, S. Lu, A. Zeng, H. Zhang, B. Wang, R. Zhang, and L. Zhang, “MotionLLM: Understanding human behaviors from human motions and videos,”IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 2025. Xiaomeng Fureceived the B.E. degree from Sun Yat-sen University in 2021. He is currently a first- year Ph.D. student at Peng Cheng Laboratory...