Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Haozhe Jia; Honglei Jin; Jianfei Song; Kuimou Yu; Lei Wang; Shaofeng Liang; Shuxu Jin; Wenshuo Chen; Youcheng Fan; Yuan Zhang

arxiv: 2605.14417 · v2 · pith:2IFRYFVVnew · submitted 2026-05-14 · 💻 cs.RO · cs.CV

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Pith reviewed 2026-05-21 09:03 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords humanoid control · language-conditioned control · anticipatory intent · diffusion policy · joint intent · motion generation · whole-body control · streaming control

The pith

DAJI learns anticipatory joint intent to enable proactive language-conditioned control of humanoid robots before physical movements begin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DAJI as a way to make language instructions for humanoid robots anticipatory rather than reactive. Current approaches generate references that trackers must fix on the fly or produce outputs that ignore upcoming contacts and balance needs during streaming control. DAJI uses a hierarchical setup where one part distills future knowledge into a diffusion policy and another generates intent chunks ahead of time from language history. This matters for creating smoother whole-body motions that prepare for transitions instead of recovering from them. The reported experiments show high success rates in motion generation and low error in sequence following tasks.

Core claim

DAJI is a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. This setup allows control representations that are executable in the present while anticipating physical transitions such as contact changes, support transfers, and balance preparation.
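The two-level loop described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the functions `flow_generate_chunk` and `act_decode`, all dimensions, and the toy proprioception update are hypothetical stand-ins, and the real DAJI-Flow and DAJI-Act are learned models whose architectures are not given on this page.

```python
import numpy as np

LATENT_DIM = 8      # hypothetical size of one joint-intent latent
CHUNK_LEN = 4       # hypothetical number of intents generated per chunk
PROPRIO_DIM = 6     # hypothetical proprioception size
ACTION_DIM = 3      # hypothetical action size


def flow_generate_chunk(text_embedding, intent_history, rng):
    """Stand-in for DAJI-Flow: next intent chunk from language and intent history."""
    context = text_embedding + intent_history[-1]  # last intent summarises the history
    noise = 0.01 * rng.standard_normal((CHUNK_LEN, LATENT_DIM))
    return np.tanh(context)[None, :] + noise


def act_decode(intent, proprioception, weights):
    """Stand-in for DAJI-Act: one action from one intent plus live proprioception."""
    return np.tanh(weights @ np.concatenate([intent, proprioception]))


def stream(text_embeddings, steps_per_instruction, seed=0):
    """Run the hierarchy over a sequence of instructions and return all actions."""
    rng = np.random.default_rng(seed)
    weights = 0.1 * rng.standard_normal((ACTION_DIM, LATENT_DIM + PROPRIO_DIM))
    intent_history = [np.zeros(LATENT_DIM)]
    proprioception = np.zeros(PROPRIO_DIM)
    actions = []
    for text_embedding in text_embeddings:            # instructions arrive in sequence
        for _ in range(steps_per_instruction // CHUNK_LEN):
            chunk = flow_generate_chunk(text_embedding, intent_history, rng)
            for intent in chunk:                       # closed loop: decode with fresh state
                action = act_decode(intent, proprioception, weights)
                # Toy state update so that each decode sees new proprioception.
                proprioception = 0.9 * proprioception + 0.1 * np.resize(action, PROPRIO_DIM)
                intent_history.append(intent)
                actions.append(action)
    return np.stack(actions)
```

The point of the structure is that intents are produced a chunk ahead of execution, while each action is still decoded against the most recent proprioceptive state.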

What carries the argument

Dynamics-Aligned Joint Intent (DAJI), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control.

If this is right

Streaming language instructions can be followed with proactive preparation for balance and contacts instead of post-hoc repairs by low-level trackers.
Single-instruction motion generation reaches 94.42 percent rollout success on HumanML3D-style tasks.
Streaming instruction following achieves a subsequence FID of 0.152 on the BABEL dataset.
Anticipatory latent representations improve handling of upcoming physical transitions in language-conditioned policies.
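For readers unfamiliar with the FID figure above, the underlying quantity is a Fréchet distance between Gaussians fitted to two sets of feature vectors. The sketch below computes only that distance; how the paper extracts motion features and forms subsequences on BABEL is not described on this page, so the function name and inputs here are assumptions.

```python
import numpy as np


def frechet_distance(features_a, features_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    cov_a = np.cov(features_a, rowvar=False)
    cov_b = np.cov(features_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigenvalues = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Lower is better: identical feature sets give a distance of zero, and a pure shift of the mean adds the squared length of that shift.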

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce reliance on separate reactive trackers if the intent interface can be transferred directly to hardware.
Similar distillation of future-aware knowledge could apply to other sequential control problems where language must drive continuous physical actions.
Testing the policy on longer horizons or with noisy language input would reveal how well the anticipatory chunks hold up outside the reported benchmarks.

Load-bearing premise

The distillation process from a future-aware teacher into a deployable diffusion policy via student-driven rollouts preserves the anticipatory properties without significant degradation in handling contact changes and balance preparation.
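A toy version of student-driven distillation helps make the premise concrete. The sketch below is an assumption-laden illustration in the style of dataset aggregation: a one-dimensional system, a hypothetical privileged teacher that sees the future target, and a linear student fitted by least squares. It is not the paper's training procedure, which uses a diffusion policy.

```python
import numpy as np


def teacher_action(x, future_target, gain=0.5):
    """Privileged teacher: uses the future target directly (hypothetical toy rule)."""
    return gain * (future_target - x)


def rollout_student(weights, targets, rng, noise=0.05):
    """Roll out the student and label every visited state with the teacher's action."""
    x, observations, labels = 0.0, [], []
    for t in range(len(targets) - 1):
        obs = np.array([x, targets[t + 1]])        # second entry plays the intent's role
        observations.append(obs)
        labels.append(teacher_action(x, targets[t + 1]))
        # The student's own action drives the state, so labels cover its own mistakes.
        x = x + float(weights @ obs) + noise * rng.standard_normal()
    return np.array(observations), np.array(labels)


def distill(iterations=5, horizon=50, seed=0):
    """Alternate student rollouts with refitting on all teacher-labelled data so far."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(2)                           # untrained student
    all_obs, all_labels = [], []
    for _ in range(iterations):
        targets = np.cumsum(rng.standard_normal(horizon))
        obs, labels = rollout_student(weights, targets, rng)
        all_obs.append(obs)
        all_labels.append(labels)
        weights, *_ = np.linalg.lstsq(np.concatenate(all_obs),
                                      np.concatenate(all_labels), rcond=None)
    return weights
```

In this toy the teacher is exactly linear in the student's observation, so the fit recovers it without loss. The premise under discussion is whether something comparable holds when the student is a diffusion policy and the anticipatory signal must pass through a learned intent latent.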

What would settle it

A controlled rollout in which the deployed diffusion policy shows delayed weight shifts or no preparation for an upcoming contact change that the teacher model anticipates would show the claim does not hold.
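One way such a check could be scored is by measuring how many steps before a contact change the weight shift begins, for teacher and student separately. The sketch below is a hypothetical measurement, not a metric from the paper: the weight-shift signal, the threshold, and the function names are all assumptions.

```python
import numpy as np


def onset_step(signal, threshold):
    """First step at which |signal| exceeds threshold, or None if it never does."""
    above = np.flatnonzero(np.abs(np.asarray(signal)) > threshold)
    return int(above[0]) if above.size else None


def preparation_lead(weight_shift, contact_step, threshold=0.05):
    """Steps between the onset of the weight shift and the contact change.

    Positive: the shift starts before the contact (anticipatory).
    Zero or negative: the shift starts at or after the contact (reactive).
    None: no shift was detected at all.
    """
    onset = onset_step(weight_shift, threshold)
    return None if onset is None else contact_step - onset
```

A student whose lead is consistently shorter than the teacher's on the same contact events, or negative, would be the kind of evidence described above.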

Figures

Figures reproduced from arXiv: 2605.14417 by Haozhe Jia, Honglei Jin, Jianfei Song, Kuimou Yu, Lei Wang, Shaofeng Liang, Shuxu Jin, Wenshuo Chen, Youcheng Fan, Yuan Zhang, Yutao Yue, Zinuo Zhang.

**Figure 1.** Teaser of DAJI. Instead of using reference trajectories as the deployment interface, DAJI predicts executable and anticipatory joint-intent latents that improve future prediction and long-horizon humanoid control.

**Figure 2.** DAJI framework. DAJI separates online deployment and offline training. DAJI-Flow predicts joint-intent latents from language and latent history, while DAJI-Act decodes each intent with live proprioception. DAJI-Act learns the executable joint-intent interface through student-driven in-the-loop distillation from a future-aware privileged teacher.

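**Figure 3.** Tracker-level rollout visualization in simulation. DAJI decodes generated joint-intent latents into continuous whole-body humanoid motions, including dynamic and highly articulated behaviors.

Adjacent excerpt: where $\bar{\alpha}_\tau$ is the cumulative noise schedule. Conditioned on $\mathbf{c}_t$, a lightweight denoiser $\mathcal{D}_\phi$ predicts the clean action. The DAJI-Act objective is $\mathcal{L}_{\mathrm{Act}} = \mathbb{E}_{(\mathbf{o}_t, \mathbf{a}^{\mathrm{tea}}_t) \sim \mathcal{D}_{\mathrm{student}},\, \tau,\, \boldsymbol{\epsilon}_a,\, \boldsymbol{\epsilon}_z} \big[ \big\| \mathbf{a}^{\mathrm{tea}}_t - \mathcal{D}_\phi(\mathbf{x}_\tau, \tau, \mathbf{c}_t$ …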
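The excerpt above refers to a cumulative noise schedule and a denoiser that predicts the clean action. The sketch below shows the standard form of those two pieces under stated assumptions: a linear beta schedule, the usual forward-noising rule, and a squared L2 loss on the teacher action. The excerpt is truncated, so the exact norm and the role of the second noise term $\boldsymbol{\epsilon}_z$ are unknown and are left out here.

```python
import numpy as np


def cumulative_alpha(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative noise schedule (alpha-bar) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noise_action(clean_action, alpha_bar_tau, epsilon):
    """Forward diffusion: x_tau = sqrt(alpha_bar) * a + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar_tau) * clean_action + np.sqrt(1.0 - alpha_bar_tau) * epsilon


def clean_prediction_loss(denoiser, teacher_actions, conditions, alpha_bar, rng):
    """Monte Carlo estimate of E || a_tea - D(x_tau, tau, c) ||^2 over a batch."""
    batch = teacher_actions.shape[0]
    tau = rng.integers(0, len(alpha_bar), size=batch)       # random diffusion step per sample
    epsilon = rng.standard_normal(teacher_actions.shape)     # action noise
    x_tau = noise_action(teacher_actions, alpha_bar[tau][:, None], epsilon)
    predictions = denoiser(x_tau, tau, conditions)
    return float(np.mean(np.sum((teacher_actions - predictions) ** 2, axis=1)))
```

Here `denoiser` is any callable taking `(x_tau, tau, conditions)`. Regressing directly onto the clean teacher action, rather than onto the noise, is what the phrase "predicts the clean action" suggests.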

**Figure 4.** Qualitative deployment results on physical humanoid hardware. DAJI produces executable motions under both streaming instruction switches and single-instruction generation. Any object-related phrases shown in qualitative prompts are interpreted only as body-motion descriptions; no object state or manipulation outcome is modeled or evaluated.

**Figure 5.** Sim (MuJoCo) Validation Results: Robust tracking performance on simple gestures to complex maneuvers. Any prompt phrase is interpreted only as a body-motion description; no object state or manipulation outcome is modeled or evaluated.

Original abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DAJI distills a future-aware teacher into a diffusion student policy plus autoregressive intent chunks for language-driven humanoid control, but the abstract gives no direct check that anticipation of contacts and balance survives the student rollouts.

The paper's main move is a two-part hierarchy: DAJI-Act distills a future-aware teacher into a deployable diffusion action policy using student-driven rollouts, while DAJI-Flow generates future intent chunks autoregressively from language and history. This targets the gap where most language-conditioned humanoid systems either output kinematic references that need reactive fixing or use latent policies that ignore upcoming support changes and balance prep. The reported 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL are the concrete numbers they lead with.

The framing around explicit joint intent that encodes physical transitions is a useful way to think about streaming whole-body control, and the distillation route is a practical way to avoid running the teacher at deployment time. The autoregressive chunk generation also fits naturally with instruction following over time.

The soft spot is exactly the one the stress-test flags. The abstract states the teacher is future-aware but does not show any metric confirming that contact events, support transfers, or balance preparation carry through the student rollouts without measurable loss. If the final policy becomes more reactive than the teacher, the claimed distinction from existing baselines weakens. The abstract is also thin on experimental setup, baselines, data splits, and error analysis, so the performance claims are hard to assess from what's given.

This is for robotics groups working on language interfaces for humanoids and hierarchical control policies. Someone already running similar distillation or diffusion setups on locomotion data would get practical ideas from the architecture. It deserves peer review so the full methods and any additional checks on preserved anticipation can be examined properly.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework for language-conditioned humanoid control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy via student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. The central claim is that this enables streaming whole-body control that is executable now yet anticipatory of future physical transitions such as contact changes and balance preparation. Experiments are reported to achieve 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL, outperforming kinematic and latent baselines.

Significance. If the performance claims and preservation of anticipatory properties hold under detailed scrutiny, the work would offer a useful advance for language-conditioned humanoid systems by moving beyond purely reactive kinematic references. The teacher-to-student distillation via rollouts is a concrete mechanism for transferring future-aware knowledge into a deployable policy, and the reported metrics suggest measurable gains in streaming instruction following. These elements could influence subsequent research on anticipatory interfaces in robotics, provided the experimental evidence is fully documented.

major comments (2)

[Abstract] Abstract: quantitative results (94.42% rollout success, 0.152 subsequence FID) are presented without any description of experimental setup, baselines, data splits, error analysis, or statistical tests. This information is load-bearing for assessing whether the numbers support the claims of superior anticipatory latent learning and streaming control.
[DAJI-Act] DAJI-Act section: the distillation of the future-aware teacher into the student diffusion policy is described only at a high level, with no metrics, ablation, or analysis demonstrating that anticipatory encoding of contact changes, support transfers, and balance preparation survives the student-driven rollouts without measurable degradation. This directly affects the core distinction from reactive baselines.

minor comments (1)

[Abstract] The abstract uses the acronym DAJI before its parenthetical expansion; while expanded later, this reduces immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around experimental reporting and the distillation process. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results and methods.

Referee: [Abstract] Abstract: quantitative results (94.42% rollout success, 0.152 subsequence FID) are presented without any description of experimental setup, baselines, data splits, error analysis, or statistical tests. This information is load-bearing for assessing whether the numbers support the claims of superior anticipatory latent learning and streaming control.

Authors: We agree that the abstract would benefit from additional context to make the quantitative claims more interpretable on first reading. In the revised version, we have expanded the abstract to briefly reference the HumanML3D-style and BABEL datasets, the kinematic and latent baselines, and the primary evaluation metrics. Full details on experimental setup, data splits, error analysis, and statistical significance testing remain in the Experiments section, and we have added explicit cross-references from the abstract to those sections. This change improves accessibility without altering the abstract's length constraints.
Revision: yes
Referee: [DAJI-Act] DAJI-Act section: the distillation of the future-aware teacher into the student diffusion policy is described only at a high level, with no metrics, ablation, or analysis demonstrating that anticipatory encoding of contact changes, support transfers, and balance preparation survives the student-driven rollouts without measurable degradation. This directly affects the core distinction from reactive baselines.

Authors: We acknowledge that the original DAJI-Act description focused on the overall procedure rather than quantitative verification of property preservation. To address this, the revised manuscript now includes dedicated ablation studies and metrics in the DAJI-Act section. These additions report the retention of contact-change and balance-preparation signals across teacher-to-student transfer, measured via rollout comparisons and intent-trajectory analysis. The new results indicate limited degradation relative to the teacher and provide a clearer contrast with reactive baselines. Visualizations of selected intent sequences have also been added to illustrate the surviving anticipatory features.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes DAJI as a hierarchical framework with DAJI-Act distilling a future-aware teacher into a diffusion policy via student-driven rollouts and DAJI-Flow for autoregressive intent generation from language and history. No equations, derivations, or mathematical reductions are shown that would make any prediction or result equivalent to its inputs by construction. Performance metrics (94.42% success, 0.152 FID) are reported as experimental outcomes from rollouts and evaluations rather than self-referential fits. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to load-bear the central claims. The approach is self-contained with independent experimental validation against baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Framework rests on standard assumptions in imitation learning and diffusion policies; no explicit free parameters or invented entities detailed in available text.

axioms (1)

domain assumption: A future-aware teacher model exists that can be distilled without loss of anticipatory information. Invoked in the description of the DAJI-Act distillation process.

pith-pipeline@v0.9.0 · 5750 in / 1157 out tokens · 46640 ms · 2026-05-21T09:03:54.794355+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts"

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "anticipatory joint-intent latent... upcoming contact changes, support transfers, and balance preparation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Movement,postureandequilibrium:Interaction and coordination

J.Massion,“Movement,postureandequilibrium:Interaction and coordination”,Progress in neurobiology, vol. 38, no. 1, pp.35–56,1992

work page 1992

[2] [2]

Posture,dynamicstability,andvol- untarymovement

S.BouissetandM.-C.Do,“Posture,dynamicstability,andvol- untarymovement”,NeurophysiologieClinique/ClinicalNeuro- physiology,vol.38,no.6,pp.345–362,2008

work page 2008

[3] [3]

Mujoco: A physics en- gine for model-based control, in: 2012 IEEE/RSJ International Con- ference on Intelligent Robots and Systems, IEEE

E.Todorov,T.Erez,andY.Tassa,“Mujoco:Aphysicsengine formodel-basedcontrol”,in2012IEEE/RSJInternationalCon- ferenceonIntelligentRobotsandSystems,IEEE,2012,pp.5026– 5033.doi:10.1109/IROS.2012.6386109

work page doi:10.1109/iros.2012.6386109 2012

[4] [4]

The kit motion- languagedataset

M. Plappert, C. Mandery, and T. Asfour, “The kit motion- languagedataset”,BigData,vol.4,no.4,pp.236–252,2016

work page 2016

[5] [5]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms”,CoRR, vol.abs/1707.06347,2017.arXiv:1707.06347.[Online].Avail- able:http://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills”, ACM Transactions On Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018.

[7] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models”, in Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 6840–6851. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

[8] A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-Ramirez, and M. J. Black, “Babel: Bodies, action and behavior with English labels”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, pp. 722–731.

[9] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models”, in 9th International Conference on Learning Representations, ICLR 2021, OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?id=St1giarCHLP

[10] M. Ahn et al., “Do as I can, not as I say: Grounding language in robotic affordances”, arXiv preprint arXiv:2204.01691, 2022.

[11] C. Guo et al., “Generating diverse and natural 3d human motions from text”, in CVPR, 2022.

[12] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model”, arXiv preprint arXiv:2209.14916, 2022.

[13] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions”, arXiv preprint arXiv:2312.00063, 2023.

[14] V. T. Hu et al., “Motion flow matching for human motion synthesis and editing”, arXiv preprint arXiv:2312.08895, 2023.

[15] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models”, arXiv preprint arXiv:2307.05973, 2023.

[16] J. Liang et al., “Code as policies: Language model programs for embodied control”, in 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 9493–9500.

[17] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling”, in The Eleventh International Conference on Learning Representations, ICLR 2023, OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=PqvMRDCJT9t

[18] Z. Luo, J. Cao, K. Kitani, W. Xu, et al., “Perpetual humanoid control for real-time simulated avatars”, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10895–10904.

[19] W. Peebles and S. Xie, “Scalable diffusion models with transformers”, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 4195–4205.

[20] C. Tessler, Y. Kasten, Y. Guo, S. Mannor, G. Chechik, and X. B. Peng, “Calm: Conditional adversarial latent models for directable virtual characters”, in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–9.

[21] J. Zhang et al., “T2m-gpt: Generating human motion from textual descriptions with discrete representations”, in CVPR, 2023.

[22] G. Barquero, S. Escalera, and C. Palmero, “Seamless human motion composition with blended positional encodings”, in CVPR, 2024.

[23] W. Chen et al., “Sato: Stable text-to-motion framework”, in Proceedings of the 32nd ACM International Conference on Multimedia, ser. MM ’24, Melbourne VIC, Australia: Association for Computing Machinery, 2024, pp. 6989–6997, isbn: 9798400706868. doi: 10.1145/3664647.3681034. [Online]. Available: https://doi.org/10.1145/3664647.3681034

[24] Z. Jiang, Y. Xie, J. Li, Y. Yuan, Y. Zhu, and Y. Zhu, “Harmon: Whole-body motion generation of humanoid robots from language descriptions”, in Conference on Robot Learning, 2024. arXiv: 2410.12773 [cs.RO]

[25] S. Bai, Y. Cai, R. Chen, K. Chen, et al., “Qwen3-vl technical report”, arXiv preprint arXiv:2511.21631, 2025. [Online]. Available: https://arxiv.org/abs/2511.21631

[26] W. Chen et al., “Ant: Adaptive neural temporal-aware text-to-motion model”, in Proceedings of the 33rd ACM International Conference on Multimedia, ser. MM ’25, ACM, Oct. 2025, pp. 9852–9861. doi: 10.1145/3746027.3755168. [Online]. Available: http://dx.doi.org/10.1145/3746027.3755168

[27] W. Chen et al., Free-t2m: Robust text-to-motion generation for humanoid robots via frequency-domain, 2025. arXiv: 2501.18232 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2501.18232

[28] W. Chen et al., Polaris: Projection-orthogonal least squares for robust and adaptive inversion in diffusion models, 2025. arXiv: 2512.00369 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2512.00369

[29] H. Jia et al., Luma: Low-dimension unified motion alignment with dual-path anchoring for text-to-motion diffusion model, arXiv: 2509.25304 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2509.25304

[31] H. Jia et al., Physics-informed representation alignment for sparse radio-map reconstruction, 2025. arXiv: 2501.19160 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2501.19160

[32] H. Jia et al., “Physics-informed representation alignment for sparse radio-map reconstruction”, in Proceedings of the 33rd ACM International Conference on Multimedia, ser. MM ’25, Dublin, Ireland: Association for Computing Machinery, 2025, pp. 12352–12360, isbn: 9798400720352. doi: 10.1145/3746027.3758161. [Online]. Available: https://doi.org/10.1145/3746027.3758161

[33] N. Jiang et al., UniAct: Unified motion generation and action streaming for humanoid robots, 2025. arXiv: 2512.24321 [cs.CV]

[34] Z. Li et al., From language to locomotion: Retargeting-free humanoid control via motion latent guidance, 2025. arXiv: 2510.14952 [cs.RO]

[35] Q. Lu, Y. Feng, B. Shi, M. Piseno, Z. Bao, and C. K. Liu, Gentlehumanoid: Learning upper-body compliance for contact-rich human and object interaction, 2025. arXiv: 2511.04679 [cs.RO]. [Online]. Available: https://arxiv.org/abs/2511.04679

[36] M. Ning et al., Dctdiff: Intriguing properties of image generative modeling in the dct space, 2025. arXiv: 2412.15032 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2412.15032

[37] Y. Shao et al., LangWBC: Language-directed humanoid whole-body control via end-to-end learning, 2025. arXiv: 2504.21738 [cs.RO]

[38] B. Tian, W. Chen, Z. Li, S. Lai, J. Wu, and Y. Yue, “Text2weight: Bridging natural language and neural network weight spaces”, in Proceedings of the 33rd ACM International Conference on Multimedia, ser. MM ’25, Dublin, Ireland: Association for Computing Machinery, 2025, pp. 10152–10160, isbn: 9798400720352. doi: 10.1145/3746027.3755441. [Online]. Available: https://doi.org/10.1145/3746027.3755441

[39] Y. Wang, H. Jiang, S. Yao, Z. Ding, and Z. Lu, SENTINEL: A fully end-to-end language-action model for humanoid whole body control, 2025. arXiv: 2511.19236 [cs.RO]

[40] L. Xiao et al., “MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space”, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. arXiv: 2503.15451 [cs.CV]

[41] H. Jia et al., ECHO: Edge-cloud humanoid orchestration for language-to-motion control, 2026. arXiv: 2603.16188 [cs.CV]

[42] H. Li, W. Chen, S. Liang, L. Wang, K. Yuan, and Y. Yue, 𝑍2-sampling: Zero-cost zigzag trajectories for semantic alignment in diffusion models, 2026. arXiv: 2604.23536 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2604.23536

[43] H. Li, W. Chen, L. Wang, S. Liang, H. Jia, and Y. Yue, Oracle noise: Faster semantic spherical alignment for interpretable latent optimization, 2026. arXiv: 2604.23540 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2604.23540

[44] H. Li et al., Delta score matters! spatial adaptive multi guidance in diffusion models, 2026. arXiv: 2604.26503 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2604.26503

[45] P. Li et al., FRoM-W1: Towards general humanoid whole-body control with language instructions, 2026. arXiv: 2601.12799 [cs.RO]

[46] W. Xie et al., TextOp: Real-time interactive text-driven humanoid robot motion generation and control, 2026. arXiv: 2602.07439 [cs.RO]

[47] X. Yuan et al., RoboForge: Physically optimized text-guided whole-body locomotion for humanoids, 2026. arXiv: 2603.17927 [cs.RO]

[48] W. Chen, H. Jia, K. Yu, S. Lai, L. Wang, and Y. Yue, “Towards better evaluation metrics for text-to-motion generation”, in The Second International Workshop on Transformative Insights in Multifaceted Evaluation at The Web Conference 2026.