arxiv: 2605.14417 · v2 · pith:2IFRYFVVnew · submitted 2026-05-14 · 💻 cs.RO · cs.CV

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Pith reviewed 2026-05-21 09:03 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords humanoid control · language-conditioned control · anticipatory intent · diffusion policy · joint intent · motion generation · whole-body control · streaming control

The pith

DAJI learns anticipatory joint intent to enable proactive language-conditioned control of humanoid robots before physical movements begin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DAJI as a way to make language-conditioned humanoid control anticipatory rather than reactive. Current approaches either generate kinematic references that a low-level tracker must repair on the fly, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. DAJI is a two-part hierarchy: DAJI-Act distills a future-aware teacher into a deployable diffusion action policy, and DAJI-Flow generates future intent chunks from language and intent history. The aim is smoother whole-body motion that prepares for physical transitions instead of recovering from them. The reported experiments reach 94.42% rollout success on HumanML3D-style generation and a subsequence FID of 0.152 on BABEL.

Core claim

DAJI is a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. This setup allows control representations that are executable in the present while anticipating physical transitions such as contact changes, support transfers, and balance preparation.
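
A minimal sketch of the student-driven distillation idea, in the spirit of DAgger-style imitation: the student's own actions drive the rollout, while the teacher labels every state the student visits. Everything in it is a toy assumption for illustration. The 1-D dynamics, the linear student, `teacher_action`, and the least-squares refit stand in for the paper's simulator, diffusion policy, privileged teacher, and training objective, none of which are specified in this summary. The toy student is also handed the target directly, as a crude stand-in for an intent latent.

```python
import numpy as np

def teacher_action(state, future_target):
    # Hypothetical privileged teacher: it reads the future target directly.
    return 0.5 * (future_target - state)

def student_action(weights, obs):
    # Hypothetical deployable student: linear in its observation vector.
    return float(weights @ obs)

def collect_student_rollout(weights, future_target, steps=20):
    """Let the student drive the rollout; label every visited state with the teacher's action."""
    state, observations, labels = 0.0, [], []
    for _ in range(steps):
        obs = np.array([state, future_target, 1.0])
        observations.append(obs)
        labels.append(teacher_action(state, future_target))
        state += student_action(weights, obs)  # the student's action, not the teacher's, is executed
    return np.array(observations), np.array(labels)

def distill(rounds=3, rollouts_per_round=8, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.zeros(3)
    data_obs, data_labels = [], []
    for _ in range(rounds):
        for _ in range(rollouts_per_round):
            obs, labels = collect_student_rollout(weights, rng.uniform(-1.0, 1.0))
            data_obs.append(obs)
            data_labels.append(labels)
        # Refit the student on all teacher-labelled states gathered so far.
        weights, *_ = np.linalg.lstsq(
            np.concatenate(data_obs), np.concatenate(data_labels), rcond=None
        )
    return weights

student_weights = distill()
```

In this toy, the first round only visits the start state because the untrained student does not move, so the state coefficient cannot be identified until later rounds supply states that the student itself reached. That coverage effect is the usual motivation for rolling out the student rather than the teacher.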

What carries the argument

Dynamics-Aligned Joint Intent (DAJI), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control.

If this is right

  • Streaming language instructions can be followed with proactive preparation for balance and contacts instead of post-hoc repairs by low-level trackers.
  • Single-instruction motion generation reaches 94.42 percent rollout success on HumanML3D-style tasks.
  • Streaming instruction following achieves a subsequence FID of 0.152 on the BABEL dataset.
  • Anticipatory latent representations improve handling of upcoming physical transitions in language-conditioned policies.
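
For readers unfamiliar with the FID figure quoted in the list above: a Fréchet distance compares the mean and covariance of two feature sets under a Gaussian assumption. The sketch below computes that generic distance with NumPy. The feature extractor, the subsequence windowing, and the evaluation protocol behind the paper's 0.152 are not given in this summary, so any features passed in are placeholders.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Placeholder features standing in for embedded motion subsequences.
rng = np.random.default_rng(0)
reference_feats = rng.standard_normal((200, 4))
generated_feats = reference_feats + 3.0  # same spread, shifted mean
distance = frechet_distance(reference_feats, generated_feats)
```

Lower is better: identical feature distributions give a distance of zero, and a pure mean shift contributes only the squared-mean term.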

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce reliance on separate reactive trackers if the intent interface can be transferred directly to hardware.
  • Similar distillation of future-aware knowledge could apply to other sequential control problems where language must drive continuous physical actions.
  • Testing the policy on longer horizons or with noisy language input would reveal how well the anticipatory chunks hold up outside the reported benchmarks.

Load-bearing premise

The distillation process from a future-aware teacher into a deployable diffusion policy via student-driven rollouts preserves the anticipatory properties without significant degradation in handling contact changes and balance preparation.

What would settle it

A controlled rollout in which the deployed diffusion policy shows delayed weight shifts or no preparation for an upcoming contact change that the teacher model anticipates would show the claim does not hold.
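
One way such a test could be scored, sketched under stated assumptions: measure how many control steps before a scheduled contact change each policy begins shifting weight, then compare teacher and student. The lateral centre-of-mass traces, the `threshold` value, and the function names are hypothetical; the paper's actual diagnostics are not described in this summary.

```python
import numpy as np

def onset_step(signal, threshold):
    """First step at which |signal| exceeds threshold, or None if it never does."""
    above = np.flatnonzero(np.abs(signal) > threshold)
    return int(above[0]) if above.size else None

def anticipation_lead(signal, contact_step, threshold=0.015):
    """Steps between weight-shift onset and the contact change (positive = prepared early)."""
    onset = onset_step(np.asarray(signal, dtype=float), threshold)
    if onset is None:
        return None  # no preparation detected at all
    return contact_step - onset

# Toy traces: the teacher begins shifting earlier than the student.
steps = np.arange(30)
teacher_com = np.clip((steps - 10) * 0.01, 0.0, None)
student_com = np.clip((steps - 16) * 0.01, 0.0, None)
contact_step = 20

teacher_lead = anticipation_lead(teacher_com, contact_step)
student_lead = anticipation_lead(student_com, contact_step)
lead_gap = teacher_lead - student_lead  # steps of anticipation lost in distillation
```

A consistently large `lead_gap`, or a student lead of `None`, would be the kind of evidence the paragraph above describes.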

Figures

Figures reproduced from arXiv: 2605.14417 by Haozhe Jia, Honglei Jin, Jianfei Song, Kuimou Yu, Lei Wang, Shaofeng Liang, Shuxu Jin, Wenshuo Chen, Youcheng Fan, Yuan Zhang, Yutao Yue, Zinuo Zhang.

Figure 1: Teaser of DAJI. Instead of using reference trajectories as the deployment interface, DAJI predicts executable and anticipatory joint-intent latents that improve future prediction and long-horizon humanoid control.
Figure 2: DAJI framework. DAJI separates online deployment and offline training. DAJI-Flow predicts joint-intent latents from language and latent history, while DAJI-Act decodes each intent with live proprioception. DAJI-Act learns the executable joint-intent interface through student-driven in-the-loop distillation from a future-aware privileged teacher.
Figure 3: Tracker-level rollout visualization in simulation. DAJI decodes generated joint-intent latents into continuous whole-body humanoid motions, including dynamic and highly articulated behaviors.
where $\bar{\alpha}_\tau$ is the cumulative noise schedule. Conditioned on $\mathbf{c}_t$, a lightweight denoiser $\mathcal{D}_\phi$ predicts the clean action. The DAJI-Act objective is $\mathcal{L}_{\mathrm{Act}} = \mathbb{E}_{(\mathbf{o}_t, \mathbf{a}^{\mathrm{tea}}_t) \sim \mathcal{D}_{\mathrm{student}},\, \tau,\, \boldsymbol{\epsilon}_a,\, \boldsymbol{\epsilon}_z} \big[ \big\| \mathbf{a}^{\mathrm{tea}}_t - \mathcal{D}_\phi(\mathbf{x}_\tau, \tau, \mathbf{c}_t$ …
Figure 4: Qualitative deployment results on physical humanoid hardware. DAJI produces executable motions under both streaming instruction switches and single-instruction generation. Any object-related phrases shown in qualitative prompts are interpreted only as body-motion descriptions; no object state or manipulation outcome is modeled or evaluated.
Figure 5: Sim (MuJoCo) validation results. Robust tracking performance from simple gestures to complex maneuvers. Any prompt phrase is interpreted only as a body-motion description; no object state or manipulation outcome is modeled or evaluated.
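
The DAJI-Act objective fragment attached to the Figure 3 caption is cut off in the source, so its full form cannot be recovered here. As a rough illustration of the pattern it appears to follow (noise the teacher action with a cumulative schedule, then train a denoiser to predict the clean action), here is a NumPy sketch. The standard DDPM forward-noising form, the linear beta schedule, the squared-error loss, and `toy_denoiser` are all assumptions, and the second noise term that appears in the fragment is omitted.

```python
import numpy as np

def cumulative_alpha(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[tau]: running product of (1 - beta), using an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_action(action, tau, alpha_bar, rng):
    """Assumed DDPM-style forward noising of a clean action."""
    eps = rng.standard_normal(action.shape)
    return np.sqrt(alpha_bar[tau]) * action + np.sqrt(1.0 - alpha_bar[tau]) * eps

def clean_action_loss(denoiser, teacher_action, context, alpha_bar, rng):
    """Squared error between the teacher action and the denoiser's clean-action prediction."""
    tau = int(rng.integers(len(alpha_bar)))
    noisy = noise_action(teacher_action, tau, alpha_bar, rng)
    predicted = denoiser(noisy, tau, context)
    return float(np.mean((teacher_action - predicted) ** 2))

def toy_denoiser(noisy, tau, context):
    # Placeholder: ignores the noisy input and returns the context as the action.
    return context

rng = np.random.default_rng(0)
alpha_bar = cumulative_alpha()
teacher_action = np.array([0.1, -0.2, 0.3])
loss_perfect = clean_action_loss(toy_denoiser, teacher_action, teacher_action, alpha_bar, rng)
loss_wrong = clean_action_loss(toy_denoiser, teacher_action, np.zeros(3), alpha_bar, rng)
```

Predicting the clean action rather than the noise is what the fragment's wording suggests; a real denoiser would be a learned network conditioned on the noisy action, the noise level, and the context.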
Original abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
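
The autoregressive chunking described for DAJI-Flow can be pictured with a small loop: each new intent chunk is predicted from the language embedding plus a window of previously generated chunks, then appended to the history. This is an illustrative sketch only. The `predict_chunk` stand-in, the chunk shape, and the history window length are hypothetical, and the paper's actual generator is not specified in this summary.

```python
import numpy as np

def predict_chunk(text_embedding, history, chunk_len, latent_dim):
    # Hypothetical predictor: continues from the last latent toward the text embedding.
    last = history[-1][-1] if history else np.zeros(latent_dim)
    weights = np.linspace(0.1, 0.5, chunk_len)[:, None]
    return last + weights * (text_embedding - last)

def generate_intents(text_embedding, num_chunks, chunk_len=4, window=2):
    """Autoregressively generate intent chunks, conditioning on a bounded history window."""
    latent_dim = text_embedding.shape[0]
    history = []
    for _ in range(num_chunks):
        context = history[-window:]  # only recent chunks are fed back in
        chunk = predict_chunk(text_embedding, context, chunk_len, latent_dim)
        history.append(chunk)
    return np.concatenate(history, axis=0)

intents = generate_intents(np.array([1.0, -1.0]), num_chunks=3)
```

In a streaming setting, an instruction switch would correspond to changing `text_embedding` between chunks while keeping the accumulated history, so the next chunk starts from where the previous one ended.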

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework for language-conditioned humanoid control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy via student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. The central claim is that this enables streaming whole-body control that is executable now yet anticipatory of future physical transitions such as contact changes and balance preparation. Experiments are reported to achieve 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL, outperforming kinematic and latent baselines.

Significance. If the performance claims and preservation of anticipatory properties hold under detailed scrutiny, the work would offer a useful advance for language-conditioned humanoid systems by moving beyond purely reactive kinematic references. The teacher-to-student distillation via rollouts is a concrete mechanism for transferring future-aware knowledge into a deployable policy, and the reported metrics suggest measurable gains in streaming instruction following. These elements could influence subsequent research on anticipatory interfaces in robotics, provided the experimental evidence is fully documented.

major comments (2)
  1. [Abstract] Abstract: quantitative results (94.42% rollout success, 0.152 subsequence FID) are presented without any description of experimental setup, baselines, data splits, error analysis, or statistical tests. This information is load-bearing for assessing whether the numbers support the claims of superior anticipatory latent learning and streaming control.
  2. [DAJI-Act] DAJI-Act section: the distillation of the future-aware teacher into the student diffusion policy is described only at a high level, with no metrics, ablation, or analysis demonstrating that anticipatory encoding of contact changes, support transfers, and balance preparation survives the student-driven rollouts without measurable degradation. This directly affects the core distinction from reactive baselines.
minor comments (1)
  1. [Abstract] The abstract uses the acronym DAJI before its parenthetical expansion; while expanded later, this reduces immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around experimental reporting and the distillation process. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results and methods.

Point-by-point responses
  1. Referee: [Abstract] Abstract: quantitative results (94.42% rollout success, 0.152 subsequence FID) are presented without any description of experimental setup, baselines, data splits, error analysis, or statistical tests. This information is load-bearing for assessing whether the numbers support the claims of superior anticipatory latent learning and streaming control.

    Authors: We agree that the abstract would benefit from additional context to make the quantitative claims more interpretable on first reading. In the revised version, we have expanded the abstract to briefly reference the HumanML3D-style and BABEL datasets, the kinematic and latent baselines, and the primary evaluation metrics. Full details on experimental setup, data splits, error analysis, and statistical significance testing remain in the Experiments section, and we have added explicit cross-references from the abstract to those sections. This change improves accessibility while staying within the abstract's length limit. revision: yes

  2. Referee: [DAJI-Act] DAJI-Act section: the distillation of the future-aware teacher into the student diffusion policy is described only at a high level, with no metrics, ablation, or analysis demonstrating that anticipatory encoding of contact changes, support transfers, and balance preparation survives the student-driven rollouts without measurable degradation. This directly affects the core distinction from reactive baselines.

    Authors: We acknowledge that the original DAJI-Act description focused on the overall procedure rather than quantitative verification of property preservation. To address this, the revised manuscript now includes dedicated ablation studies and metrics in the DAJI-Act section. These additions report the retention of contact-change and balance-preparation signals across teacher-to-student transfer, measured via rollout comparisons and intent-trajectory analysis. The new results indicate limited degradation relative to the teacher and provide a clearer contrast with reactive baselines. Visualizations of selected intent sequences have also been added to illustrate the surviving anticipatory features. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes DAJI as a hierarchical framework with DAJI-Act distilling a future-aware teacher into a diffusion policy via student-driven rollouts and DAJI-Flow for autoregressive intent generation from language and history. No equations, derivations, or mathematical reductions are shown that would make any prediction or result equivalent to its inputs by construction. Performance metrics (94.42% success, 0.152 FID) are reported as experimental outcomes from rollouts and evaluations rather than self-referential fits. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to load-bear the central claims. The approach is self-contained with independent experimental validation against baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Framework rests on standard assumptions in imitation learning and diffusion policies; no explicit free parameters or invented entities detailed in available text.

axioms (1)
  • domain assumption: A future-aware teacher model exists that can be distilled without loss of anticipatory information.
    Invoked in the description of the DAJI-Act distillation process.

pith-pipeline@v0.9.0 · 5750 in / 1157 out tokens · 46640 ms · 2026-05-21T09:03:54.794355+00:00 · methodology
