arxiv: 2605.20209 · v1 · pith:LYEEZBC6new · submitted 2026-04-15 · 💻 cs.GR · cs.LG · cs.RO

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

Pith reviewed 2026-05-21 09:37 UTC · model grok-4.3

classification 💻 cs.GR · cs.LG · cs.RO
keywords character control · diffusion models · reinforcement learning · physics-based animation · motion generation · latent noise manipulation · whole-body control

The pith

Reinforcement learning manipulates latent noise in a diffusion motion prior to deliver fast, task-specific character control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NaP-Control, a technique that trains a reinforcement learning agent to adjust the noise inputs to a pre-trained, task-agnostic diffusion policy. This steering produces motions that satisfy specific control objectives without requiring slow gradient guidance at every denoising step. Because the agent interacts with the physics environment during training, it learns to correct motions on the fly and optimize task rewards. The result is higher success rates, quicker inference, and preserved natural movement across varied animation tasks.

Core claim

NaP-Control uses reinforcement learning to directly predict task-optimized diffusion noise from a task-agnostic prior, eliminating iterative test-time guidance while still achieving robust whole-body control and high motion fidelity through online correction of motions.

What carries the argument

Reinforcement learning policy that outputs adjustments to the latent noise of a pre-trained diffusion model to steer generated character motions toward task goals.
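
A minimal sketch of this mechanism, under loudly stated assumptions: `frozen_prior_denoise` is a hypothetical stand-in for the pre-trained diffusion prior (here a fixed deterministic map from noise to action), and `NoisePolicy` is a toy linear policy. None of these names or forms come from the paper; the sketch only illustrates that the policy acts in noise space while the prior stays fixed.

```python
import math

def frozen_prior_denoise(noise, steps=4):
    """Hypothetical stand-in for a frozen diffusion prior.

    Maps a latent noise vector to an action with a fixed, deterministic
    update. A real prior would be a trained network; nothing here is learned.
    """
    x = list(noise)
    for _ in range(steps):
        # Each toy "denoising" step shrinks the sample toward a bounded range.
        x = [0.5 * xi + 0.5 * math.tanh(xi) for xi in x]
    return x

class NoisePolicy:
    """Toy linear policy: observation -> latent noise (assumed form)."""

    def __init__(self, weights):
        self.weights = weights  # one row of weights per noise dimension

    def predict_noise(self, obs):
        return [sum(w * o for w, o in zip(row, obs)) for row in self.weights]

def act(policy, obs):
    # The policy never outputs actions directly; it only chooses the noise
    # that the frozen prior then turns into an action.
    noise = policy.predict_noise(obs)
    return frozen_prior_denoise(noise)

policy = NoisePolicy(weights=[[1.0, 0.0], [0.0, -1.0]])
action = act(policy, obs=[0.3, 0.8])
```

Because the prior is never updated in this sketch, any change in behavior has to come from the policy's choice of noise, which is the property the argument relies on.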

If this is right

  • Inference becomes substantially faster because no per-step gradient computations are needed during denoising.
  • Success rates rise on diverse control tasks because the method corrects motions through direct environment interaction.
  • Natural motion quality is retained by keeping the generation process anchored to the original diffusion prior.
  • The approach supports adaptation to challenging scenarios that offline training alone cannot handle.
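
To make the first bullet concrete, the following back-of-envelope sketch counts network passes per control step. The cost model (one forward pass per denoising step, plus one backward pass per guidance iteration when guidance is used) and the step count are illustrative assumptions, not figures from the paper.

```python
def passes_with_guidance(denoise_steps, guidance_iters_per_step=1):
    """Network passes when each denoising step also needs a gradient.

    Assumed cost model: 1 forward pass per step, plus one backward pass
    per guidance iteration at that step.
    """
    forward = denoise_steps
    backward = denoise_steps * guidance_iters_per_step
    return forward + backward

def passes_with_noise_policy(denoise_steps):
    """Network passes when a policy predicts the noise once up front.

    Assumed cost model: 1 policy forward pass, then plain denoising
    with no gradient computation.
    """
    return 1 + denoise_steps

steps = 10  # illustrative value only
guided = passes_with_guidance(steps)
steered = passes_with_noise_policy(steps)
print(f"guided: {guided} passes, noise-steered: {steered} passes")
```

Under this cost model the saving grows with the number of guidance iterations per step; actual wall-clock gains would depend on model size and hardware.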

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same noise-manipulation idea might transfer to other diffusion-based generation domains where test-time optimization currently dominates runtime cost.
  • Pre-trained motion priors may contain more task-flexible knowledge than is typically accessed through fixed guidance schemes.
  • Combining the method with newer reinforcement learning algorithms could further improve sample efficiency during the noise-steering training phase.

Load-bearing premise

The motions encoded in the task-agnostic diffusion prior are rich enough that noise manipulation can reliably reach new task objectives without creating artifacts or unstable behavior.

What would settle it

If side-by-side tests on standard character control benchmarks show that NaP-Control produces lower success rates or slower inference than gradient-guided diffusion baselines, the central claim would be falsified.
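
The test described above can be written as a simple decision rule. The sketch below assumes per-episode success flags and per-step latencies have already been collected for both methods; the function name, the data, and the choice to compare plain means without significance testing are all illustrative assumptions.

```python
from statistics import mean

def central_claim_survives(nap_success, base_success,
                           nap_latency_ms, base_latency_ms):
    """Return True if the candidate is at least as successful AND as fast.

    Inputs are lists of per-episode success flags (0 or 1) and per-step
    latencies in milliseconds. A real comparison should add confidence
    intervals; this sketch compares means only.
    """
    not_lower_success = mean(nap_success) >= mean(base_success)
    not_slower = mean(nap_latency_ms) <= mean(base_latency_ms)
    return not_lower_success and not_slower

# Made-up numbers purely to exercise the function, not results from the paper.
example = central_claim_survives(
    nap_success=[1, 1, 0, 1],
    base_success=[1, 0, 0, 1],
    nap_latency_ms=[4.0, 5.0],
    base_latency_ms=[40.0, 38.0],
)
```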

Figures

Figures reproduced from arXiv: 2605.20209 by Chia-Wen Chen, Korrawe Karunratanakul, Siyu Tang, Yan Wu.

Figure 1: NaP-Control is a latent noise optimization framework combining reinforcement learning and diffusion-based prior for physics-based character control. We showcase its effectiveness in (a) far goal reaching, (b) agile hand reaching, (c) velocity control, (d) object interaction tasks, as well as its adaptation on uneven terrains.
Figure 2: Framework overview. (a) The RL policy πθ receives environment and proprioceptive states from the physics simulator. (b) The actor learns to predict optimal noise ω ∈ W aligned with task goals. (c) These predicted noises are denoised and decoded into executable actions a via a pretrained diffusion prior and a latent action decoder. Resulting transitions are then used to iteratively optimize the noise navig…
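
The loop in Figure 2 can be sketched on a toy problem. Everything below is an assumption made for illustration: a one-dimensional "simulator" whose reward is the negative squared distance between action and goal, a frozen prior reduced to `tanh`, a single-parameter policy, and a random-search update standing in for whatever RL algorithm the authors use.

```python
import math
import random

def frozen_prior(noise):
    # Stand-in for denoising + decoding; fixed throughout training.
    return math.tanh(noise)

def average_reward(gain, goals):
    """Average reward of the policy noise = gain * goal over the goals."""
    total = 0.0
    for goal in goals:
        noise = gain * goal             # (b) policy predicts noise from the goal
        action = frozen_prior(noise)    # (c) frozen prior maps noise to action
        total += -(action - goal) ** 2  # toy task reward from the "simulator"
    return total / len(goals)

def train(iterations=200, step=0.1, seed=0):
    rng = random.Random(seed)
    goals = [rng.uniform(-0.5, 0.5) for _ in range(32)]
    gain = 0.0
    initial = best = average_reward(gain, goals)
    for _ in range(iterations):
        # Random-search update: perturb the policy parameter and keep the
        # change only if the measured reward improves.
        candidate = gain + rng.gauss(0.0, step)
        reward = average_reward(candidate, goals)
        if reward > best:
            gain, best = candidate, reward
    return gain, initial, best

gain, initial_reward, final_reward = train()
```

Only the policy parameter `gain` changes during training; the prior is untouched, mirroring the division of labor the caption describes.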
Figure 3: Qualitative Comparison of Object Interaction Task.
Figure 4: Ablation studies. (a) Effect of state representation on flat-ground far goal reaching. (b–c) Effect of action chunk size k for agile hand reaching on flat ground (b) and uneven terrain (c). (d) Comparison of joint state-action noise optimizing versus action-only noise optimizing for agile hand reaching.
Original abstract

Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces NaP-Control (NaP), a method for whole-body character control in physics-based animation. It trains a reinforcement learning policy to directly predict task-optimized latent noise for a fixed, task-agnostic diffusion prior, thereby steering generated motions toward task objectives without test-time gradient guidance. The approach claims to improve success rates and inference speed while maintaining motion naturalness by allowing environment interaction during training to correct motions.

Significance. If the experimental claims are substantiated, the method would offer a practical way to combine the expressiveness of diffusion priors with the adaptability of RL, potentially reducing the computational cost of guidance-based diffusion control while improving robustness across diverse tasks.

major comments (2)
  1. [Abstract] Abstract: the claims of 'higher success rates and faster inference' are presented without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence prevents verification of the central performance assertions and leaves the strength of the contribution unclear.
  2. [Method] Method description (implicit in the abstract and skeptic note): the assumption that a fixed task-agnostic diffusion prior already encodes sufficiently rich and locally correctable motion distributions for RL-based latent noise manipulation to reach high task rewards without artifacts or instability is load-bearing but not yet supported by concrete evidence of stability or out-of-distribution behavior.
minor comments (1)
  1. [Abstract] Abstract: consider including at least one concrete performance number or reference to a results table/figure to ground the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying our performance claims and methodological assumptions. We address each major comment below and have revised the manuscript to strengthen the presentation of results and supporting evidence.

  1. Referee: [Abstract] Abstract: the claims of 'higher success rates and faster inference' are presented without any quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence prevents verification of the central performance assertions and leaves the strength of the contribution unclear.

    Authors: We agree that the abstract would benefit from explicit quantitative highlights to immediately substantiate the claims. The full manuscript already contains detailed metrics, baseline comparisons, ablations, and analysis in the Experiments section. In the revised version, we have updated the abstract to include specific results such as success rate improvements and inference speedups relative to guidance-based baselines, with pointers to the supporting tables and figures. This addresses the concern without altering the abstract's brevity.
    revision: yes

  2. Referee: [Method] Method description (implicit in the abstract and skeptic note): the assumption that a fixed task-agnostic diffusion prior already encodes sufficiently rich and locally correctable motion distributions for RL-based latent noise manipulation to reach high task rewards without artifacts or instability is load-bearing but not yet supported by concrete evidence of stability or out-of-distribution behavior.

    Authors: We acknowledge the importance of evidencing this core assumption. The RL policy is trained with direct environment interaction to correct motions toward task rewards, and our experiments demonstrate stable, high-fidelity outputs without artifacts across diverse tasks. To provide more concrete support, the revised manuscript adds a dedicated discussion subsection on the prior's motion distribution coverage, including qualitative visualizations and analysis of out-of-distribution handling via noise prediction. Stability is further supported by the reported success rates and motion quality metrics.
    revision: partial

Circularity Check

0 steps flagged

No circularity: method and claims remain independent of reported outcomes

The abstract and method description present NaP-Control as using RL to manipulate latent noise from a pre-existing task-agnostic diffusion policy prior, with environment interaction during training to correct motions and optimize rewards. No equations, derivations, or self-referential definitions are shown that reduce the claimed success rates or inference speed to fitted parameters or prior outputs by construction. The performance claims are positioned as experimental results rather than tautological consequences of the method definition itself. The central premise about the prior's richness is treated as an assumption to be validated externally, not derived internally from the paper's own fitted values or self-citations in a load-bearing way.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method depends on the existence of a high-quality task-agnostic diffusion prior and on RL being able to optimize noise inputs without destabilizing the generative process; no new physical entities are introduced.

free parameters (1)
  • RL reward function weights and noise manipulation hyperparameters
    These are tuned during training to balance task success against motion naturalness.
axioms (1)
  • domain assumption: A pre-trained task-agnostic diffusion policy prior captures sufficiently diverse and physically plausible whole-body motions that can be steered by latent noise changes.
    Invoked when the paper states that the prior is manipulated toward task-specific behaviors.
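
One way to picture the free-parameter entry in the ledger above is as a weighted sum. The sketch below is hypothetical: the two reward terms, their names, and the weight values are assumptions chosen for illustration and do not come from the paper.

```python
def combined_reward(task_reward, naturalness_reward, w_task=1.0, w_natural=0.5):
    """Weighted sum of a task term and a motion-naturalness term.

    Both weights are free parameters that would be tuned during training;
    the defaults here are placeholders, not values from the paper.
    """
    return w_task * task_reward + w_natural * naturalness_reward

# Raising w_natural shifts the balance toward natural motion.
task_heavy = combined_reward(0.9, 0.2, w_task=1.0, w_natural=0.1)
natural_heavy = combined_reward(0.9, 0.2, w_task=1.0, w_natural=2.0)
```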

pith-pipeline@v0.9.0 · 5701 in / 1315 out tokens · 47308 ms · 2026-05-21T09:37:47.717061+00:00 · methodology

