Coordinate-Based Dual-Constrained Autoregressive Motion Generation

Hongsong Wang; Jie Gui; Kang Ding; Liang Wang

arxiv: 2604.08088 · v1 · submitted 2026-04-09 · 💻 cs.CV

Kang Ding, Hongsong Wang, Jie Gui, Liang Wang

Pith reviewed 2026-05-10 17:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-motion generation · autoregressive models · coordinate-based motion · dual-constrained causal mask · motion editing · semantic consistency · motion fidelity

The pith

A coordinate-based autoregressive model with dual constraints generates text-to-motion sequences with higher fidelity and semantic consistency than prior approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-motion generation creates human movements from text descriptions for animation, virtual reality, and robotics. Diffusion models tend to amplify errors during noise prediction, while autoregressive models tend toward mode collapse once motions are discretized. The proposed method feeds continuous motion coordinates into an autoregressive generator augmented with diffusion-inspired multi-layer perceptrons. A Dual-Constrained Causal Mask treats motion tokens as priors and concatenates them with text encodings to enforce alignment. The authors build new benchmarks for coordinate-based synthesis and report leading scores on fidelity and semantic-consistency measures.

Core claim

Feeding motion coordinates directly into an autoregressive model, boosted by diffusion-inspired MLPs and controlled by a Dual-Constrained Causal Mask that concatenates motion tokens as priors with textual encodings, yields motions that better match natural dynamics and input semantics than earlier diffusion or autoregressive techniques on the introduced benchmarks.

What carries the argument

The Dual-Constrained Causal Mask, which incorporates motion tokens as priors concatenated with textual encodings to guide autoregressive prediction of continuous coordinate sequences.
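The exact formulation of the mask is not given on this page, so the following is only a minimal sketch of one plausible layout: a prefix made of text encodings followed by motion-prior tokens, which every position may attend to, and a causal (lower-triangular) pattern over the generated motion positions. The function name `prefix_causal_mask`, the token ordering, and the bidirectional attention inside the prefix are all assumptions for illustration, not the authors' design.

```python
import numpy as np

def prefix_causal_mask(n_text: int, n_prior: int, n_gen: int) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if query i may attend to key j.

    Assumed (hypothetical) sequence layout: [text tokens | motion-prior tokens | generated tokens].
    """
    n_prefix = n_text + n_prior
    total = n_prefix + n_gen
    # Standard causal mask: each position sees itself and everything before it.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Assumption: prefix positions (text + motion priors) see the whole prefix.
    mask[:n_prefix, :n_prefix] = True
    return mask

# Two text tokens, one motion-prior token, three generated positions.
mask = prefix_causal_mask(n_text=2, n_prior=1, n_gen=3)
print(mask.astype(int))
```

In this layout the generated positions always see both conditioning sources, while the prefix never sees generated content; whether the paper applies the same pattern at training and at inference is not stated here.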

Load-bearing premise

That coordinate-based continuous inputs plus the dual-constrained mask avoid mode collapse and error amplification on unbiased new benchmarks without hidden post-processing.

What would settle it

Reproducing the experiments on the paper's benchmarks and finding worse fidelity (for example, a higher FID) or weaker semantic alignment (for example, a lower R-precision) than competing methods would undercut the superiority claim.
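For readers less familiar with the two metrics named above, here is a minimal sketch of how they are commonly computed from feature embeddings. It assumes features have already been extracted by some evaluator network; the function names, the use of Euclidean distance, and the convention that row i of the text and motion arrays form a matched pair are assumptions for illustration. The paper's own evaluation protocol is not described on this page.

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets (lower is better)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the product's eigenvalues,
    # which are real and non-negative for covariance matrices (clipped for numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(((mu_a - mu_b) ** 2).sum())
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def r_precision(text_emb: np.ndarray, motion_emb: np.ndarray, k: int = 3) -> float:
    """Fraction of texts whose matched motion (same row index) ranks in the top k (higher is better)."""
    # Pairwise Euclidean distances: rows are texts, columns are candidate motions.
    dists = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    matched = np.diag(dists)
    # Rank of the matched motion = number of candidates strictly closer to the text.
    rank = (dists < matched[:, None]).sum(axis=1)
    return float((rank < k).mean())
```

A lower `frechet_distance` between generated and real motion features indicates higher fidelity, while a higher `r_precision` indicates better text-motion alignment, which is why the two move in opposite directions when a method improves.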

Figures

Figures reproduced from arXiv: 2604.08088 by Hongsong Wang, Jie Gui, Kang Ding, Liang Wang.

**Figure 2.** Architecture illustration of CDAMD. (a) The Hybrid Motion Encoders encode the raw motion sequence into a compact, fine-grained latent space. (b) …

**Figure 3.** Visualization of temporal editing tasks (inpainting, outpainting, prefix, and suffix), where orange indicates conditioned motion and blue refers to …
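As a rough illustration of the four temporal editing tasks named in the Figure 3 caption, the sketch below builds a per-frame boolean mask marking which frames are given as conditions and which are left to generate. The task definitions follow common usage in earlier motion-editing work (inpainting fills the middle, outpainting fills both ends); the paper's exact definitions, the `editing_mask` helper, and the `ratio` parameter are assumptions.

```python
import numpy as np

def editing_mask(n_frames: int, task: str, ratio: float = 0.5) -> np.ndarray:
    """Return a boolean array: True = conditioned (given) frame, False = frame to generate.

    `ratio` sets the length of one contiguous block. The block is the conditioned part
    for prefix, suffix, and outpainting, and the generated part for inpainting.
    """
    span = int(round(n_frames * ratio))
    start = (n_frames - span) // 2  # centre the block for the middle-based tasks
    keep = np.zeros(n_frames, dtype=bool)
    if task == "prefix":         # first frames given, the rest generated
        keep[:span] = True
    elif task == "suffix":       # last frames given, the rest generated
        keep[n_frames - span:] = True
    elif task == "inpainting":   # both ends given, middle generated
        keep[:] = True
        keep[start:start + span] = False
    elif task == "outpainting":  # middle given, both ends generated
        keep[start:start + span] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return keep

for name in ("inpainting", "outpainting", "prefix", "suffix"):
    print(name, editing_mask(8, name).astype(int))
```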

**Figure 4.** Qualitative results of text-to-motion generation on HumanML3D. Our approach is compared with BAMM [11] and MoMask [17], which are …

**Figure 5.** Failure cases of text-to-motion generation on the HumanML3D test set.

**Figure 6.** Visual comparison of text-to-motion generation with state-of-the-art …

Original abstract

Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper pairs continuous coordinate inputs with a dual-constrained causal mask and diffusion-style MLPs inside an autoregressive loop for text-to-motion, plus new benchmarks, but the SOTA claims rest on unreported metrics and author-defined test sets.

The core idea is to run autoregressive motion generation directly on raw coordinates rather than tokens, using a dual-constrained causal mask that mixes motion priors with text encodings and swapping in diffusion-inspired MLPs for the prediction step. This is meant to cut error buildup from diffusion noise schedules and mode collapse from discretization. They also create fresh benchmarks for text-to-motion and motion editing because prior coordinate-based work is thin. That combination is the actual novelty; it is a practical tweak on existing autoregressive and diffusion lines rather than a new framework.

Referee Report

2 major / 2 minor

Summary. The paper proposes Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD), a framework for text-to-motion generation and editing. It operates on continuous motion coordinates using an autoregressive paradigm augmented with diffusion-inspired MLPs for improved fidelity, and introduces a Dual-Constrained Causal Mask that concatenates motion tokens as priors with textual encodings to enhance semantic consistency. Due to limited prior coordinate-based work, the authors establish new benchmarks for text-to-motion generation and motion editing, on which they claim state-of-the-art performance in both fidelity and semantic consistency.

Significance. If the experimental results and benchmark fairness can be verified, the work offers a hybrid approach that may mitigate error amplification in diffusion models and mode collapse in discrete autoregressive models by staying in continuous coordinate space. The dual-constrained mask provides a concrete mechanism for incorporating motion priors, which could influence future autoregressive motion synthesis designs. The new benchmarks, if shown to be unbiased and reproducible, would also provide a useful evaluation resource for coordinate-based methods.

major comments (2)

Abstract: the assertion that the approach 'achieves state-of-the-art performance in terms of both fidelity and semantic consistency' is presented without any quantitative metrics, baseline comparisons, tables, or error analysis, rendering the central empirical claim unverifiable from the provided information.

Benchmark establishment section: the construction of the new text-to-motion and motion-editing benchmarks must explicitly detail data sources, train/test splits, metric definitions, and reimplementation protocols for baselines to demonstrate that they do not inadvertently favor coordinate inputs or the dual causal mask; without this, the SOTA claim rests on potentially circular evaluation design.

minor comments (2)

Abstract: the phrase 'diffusion-inspired multi-layer perceptrons' is used without specifying architectural differences from standard MLPs or the precise integration point within the autoregressive pipeline.

Notation: clarify whether the Dual-Constrained Causal Mask is applied only during training or also at inference, and provide the exact formulation of how motion tokens are concatenated with textual encodings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results and benchmarks.

Referee: Abstract: the assertion that the approach 'achieves state-of-the-art performance in terms of both fidelity and semantic consistency' is presented without any quantitative metrics, baseline comparisons, tables, or error analysis, rendering the central empirical claim unverifiable from the provided information.

Authors: We acknowledge that the abstract provides a high-level summary of the empirical claims without embedding specific numerical values or tables, which is a common practice due to strict length constraints in abstracts. The full quantitative support, including FID, R-Precision, and other fidelity/semantic metrics, baseline comparisons, and error analyses, is presented in Section 4 with Tables 1–4. To improve direct verifiability, we will partially revise the abstract to incorporate a concise reference to key performance highlights (e.g., specific FID improvements and consistency scores) while preserving its brevity.

revision: partial

Referee: Benchmark establishment section: the construction of the new text-to-motion and motion-editing benchmarks must explicitly detail data sources, train/test splits, metric definitions, and reimplementation protocols for baselines to demonstrate that they do not inadvertently favor coordinate inputs or the dual causal mask; without this, the SOTA claim rests on potentially circular evaluation design.

Authors: We agree that explicit documentation is essential for reproducibility and to confirm evaluation fairness. Section 3.2 describes the new benchmarks, which were created due to the limited existing coordinate-based methods; they are derived from the standard HumanML3D dataset using its conventional splits, with metrics defined consistently with prior text-to-motion literature and baselines reimplemented from their original public implementations (adapted only for continuous coordinate inputs, without applying our dual-constrained mask). To fully address concerns about potential bias or circularity, we will expand this section with precise data source citations, exact train/test split ratios, complete metric definitions and computation details, and a summary of baseline reimplementation protocols.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture + new benchmarks with no self-referential derivations

The paper describes a coordinate-based autoregressive model (CDAMD) with diffusion-inspired MLPs and a Dual-Constrained Causal Mask, then reports experimental results on newly established text-to-motion and motion-editing benchmarks. No equations, parameter fits, or predictions are presented that reduce by construction to the inputs or to self-citations. The justification for new benchmarks is the scarcity of prior coordinate-based work, which is an external observation rather than a self-definition. All load-bearing claims rest on empirical fidelity and consistency metrics rather than any fitted-input-renamed-as-prediction or ansatz-smuggled-via-citation pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical performance on newly created benchmarks rather than theoretical derivation; standard deep-learning training assumptions are used but not load-bearing for the novelty claim.

axioms (1)

standard math: Standard deep-learning assumptions, including convergence of gradient-based optimization and sufficient model capacity for sequence modeling. Implicit in any neural-network training for motion prediction.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

[1] W. Zhu, X. Ma, D. Ro, H. Ci, J. Zhang, J. Shi, F. Gao, Q. Tian, and Y. Wang, “Human motion generation: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 4, pp. 2430–2449, 2023.

[2] A. Khani, A. Rampini, B. Roy, L. Nadela, N. Kaplan, E. Atherton, D. Cheung, and J. Bibliowicz, “Motion generation: A survey of generative approaches and benchmarks,” arXiv preprint arXiv:2507.05419, 2025.

[3] A. R. Sahili, N. Neji, and H. Tabia, “Text-driven motion generation: Overview, challenges and directions,” arXiv preprint arXiv:2505.09379, 2025.

[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.

[5] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3813–3824.

[6] X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18000–18010.

[7] X. Tan, H. Wang, X. Geng, and P. Zhou, “SoPo: Text-to-motion generation using semi-online preference optimization,” in Annual Conference on Neural Information Processing Systems, 2025.

[8] W. Weng, X. Tan, J. Wang, G.-S. Xie, P. Zhou, and H. Wang, “Realign: text-to-motion generation via step-aware reward-guided alignment,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 13, 2026, pp. 10621–10629.

[9] H. Wang, W. Yan, Q. Lai, and X. Geng, “Temporal consistency-aware text-to-motion generation,” Visual Intelligence, vol. 4, no. 1, p. 7, 2026.

[10] Z. Meng, Y. Xie, X. Peng, Z. Han, and H. Jiang, “Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27859–27871.

[11] E. Pinyoanuntapong, M. U. Saleem, P. Wang, M. Lee, S. Das, and C. Chen, “BAMM: Bidirectional autoregressive motion model,” in European Conference on Computer Vision. Springer, 2024, pp. 172–190.

[12] J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, and Y. Shan, “Generating human motion from textual descriptions with discrete representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14730–14740.

[13] Y. Zhang, D. Huang, B. Liu, S. Tang, Y. Lu, L. Chen, L. Bai, Q. Chu, N. Yu, and W. Ouyang, “Motiongpt: Finetuned llms are general-purpose motion generators,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7368–7376.

[14] I. Hwang, J. Wang, and B. Zhou, “SnapMoGen: Human motion generation from expressive texts,” in Annual Conference on Neural Information Processing Systems, 2025.

[15] L. Tu, L. Meng, Z. Li, H. Ling, and S. Huang, “Autoregressive motion generation with gaussian mixture-guided latent sampling,” in Annual Conference on Neural Information Processing Systems, 2025.

[16] Z. Meng, Z. Han, X. Peng, Y. Xie, and H. Jiang, “Absolute coordinates make motion generation easy,” arXiv preprint arXiv:2505.19377, 2025.

[17] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1900–1910.

[18] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11513–11522.

[19] X. Tan, W. Weng, H. Lei, and H. Wang, “EasyTune: Efficient step-aware fine-tuning for diffusion-based motion generation,” in International Conference on Learning Representations, 2026.

[20] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” in International Conference on Learning Representations, 2023.

[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[22] Y. Wang, Z. Leng, F. W. Li, S.-C. Wu, and X. Liang, “Fg-T2M: Fine-grained text-driven human motion generation via diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22035–22044.

[23] M. Zhang, X. Guo, L. Pan, Z. Cai, F. Hong, H. Li, L. Yang, and Z. Liu, “Remodiffuse: Retrieval-augmented motion diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 364–373.

[24] R. Dabral, M. H. Mughal, V. Golyanik, and C. Theobalt, “Mofusion: A framework for denoising-diffusion-based motion synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9760–9770.

[25] J. Bae, I. Hwang, Y.-Y. Lee, Z. Guo, J. Liu, Y. Ben-Shabat, Y. M. Kim, and M. Kapadia, “Less is more: Improving motion diffusion models with sparse keyframes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 11069–11078.

[26] C. Zhong, L. Hu, Z. Zhang, and S. Xia, “AttT2M: Text-driven human motion generation with multi-perspective attention mechanism,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 509–519.

[27] B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,” Advances in Neural Information Processing Systems, vol. 36, pp. 20067–20079, 2023.

[28] B. Han, H. Peng, M. Dong, Y. Ren, Y. Shen, and C. Xu, “AMD: Autoregressive motion diffusion,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2022–2030.

[29] L. Xiao, S. Lu, H. Pi, K. Fan, L. Pan, Y. Zhou, Z. Feng, X. Zhou, S. Peng, and J. Wang, “Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 10086–10096.

[30] J. Cho, J. Kim, J. Kim, M. Kim, M. Kang, S. Hong, T.-H. Oh, and Y. Yu, “Discord: Discrete tokens to continuous motion via rectified flow decoding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14602–14612.

[31] K. Karunratanakul, K. Preechakul, S. Suwajanakorn, and S. Tang, “Guided motion diffusion for controllable human motion synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2151–2162.

[32] P. Jin, Y. Wu, Y. Fan, Z. Sun, W. Yang, and L. Yuan, “Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs,” Advances in Neural Information Processing Systems, vol. 36, pp. 15497–15518, 2023.

[33] K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1334–1345.

[34] Y. Xie, V. Jampani, L. Zhong, D. Sun, and H. Jiang, “Omnicontrol: Control any joint at any time for human motion generation,” in International Conference on Learning Representations, 2024.

[35] W. Dai, L.-H. Chen, J. Wang, J. Liu, B. Dai, and Y. Tang, “MotionLCM: Real-time controllable motion generation via latent consistency model,” in European Conference on Computer Vision. Springer, 2024, pp. 390–408.

[36] Z. Li, K. Cheng, A. Ghosh, U. Bhattacharya, L. Gui, and A. Bera, “SimMotionEdit: Text-based human motion editing with motion similarity prediction,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27827–27837.

[37] K. Zhao, G. Li, and S. Tang, “Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control,” in International Conference on Learning Representations, 2025.

[38] N. Jiang, H. Li, Z. Yuan, Z. He, Y. Chen, T. Liu, Y. Zhu, and S. Huang, “Dynamic motion blending for versatile motion editing,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22735–22745.

[39] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

[40] N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie, “Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers,” in European Conference on Computer Vision. Springer, 2024, pp. 23–40.

[41] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.

[42] M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,” Big Data, vol. 4, no. 4, pp. 236–252, 2016, PMID: 27992262.

[43] Carnegie Mellon University, “Cmu graphics lab motion capture database,” http://mocap.cs.cmu.edu/, 2017. [Online]. Available: https://cir.nii.ac.jp/crid/1571417125676818048

[44] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “AMASS: Archive of motion capture as surface shapes,” in IEEE International Conference on Computer Vision, Oct 2019. [Online]. Available: https://amass.is.tue.mpg.de

[45] C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2motion: Conditioned generation of 3d human motions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2021–2029.

[46] G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in European Conference on Computer Vision. Springer, 2022, pp. 358–374.

[47] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “CLIPScore: a reference-free evaluation metric for image captioning,” in Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528.

[48] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” arXiv preprint arXiv:2208.15001, 2022.

[49] E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “MMM: Generative masked motion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[1] [1]

Human motion generation: A survey,

W. Zhu, X. Ma, D. Ro, H. Ci, J. Zhang, J. Shi, F. Gao, Q. Tian, and Y . Wang, “Human motion generation: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 4, pp. 2430– 2449, 2023

work page 2023

[2] [2]

Motion generation: A survey of gen- erative approaches and benchmarks,

A. Khani, A. Rampini, B. Roy, L. Nadela, N. Kaplan, E. Atherton, D. Cheung, and J. Bibliowicz, “Motion generation: A survey of gen- erative approaches and benchmarks,”arXiv preprint arXiv:2507.05419, 2025

work page arXiv 2025

[3] [3]

Text-driven motion generation: Overview, challenges and directions,

A. R. Sahili, N. Neji, and H. Tabia, “Text-driven motion generation: Overview, challenges and directions,”arXiv preprint arXiv:2505.09379, 2025

work page arXiv 2025

[4] [4]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2022, pp. 10 684–10 695

work page 2022

[5] [5]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3813–3824

work page 2023

[6] [6]

Executing your commands via motion diffusion in latent space,

X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 000–18 010

work page 2023

[7] [7]

SoPo: Text-to-motion gener- ation using semi-online preference optimization,

X. Tan, H. Wang, X. Geng, and P. Zhou, “SoPo: Text-to-motion gener- ation using semi-online preference optimization,” inAnnual Conference on Neural Information Processing Systems, 2025

work page 2025

[8] [8]

Realign: text-to-motion generation via step-aware reward-guided alignment,

W. Weng, X. Tan, J. Wang, G.-S. Xie, P. Zhou, and H. Wang, “Realign: text-to-motion generation via step-aware reward-guided alignment,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 13, 2026, pp. 10 621–10 629

work page 2026

[9] [9]

Temporal consistency-aware text-to-motion generation,

H. Wang, W. Yan, Q. Lai, and X. Geng, “Temporal consistency-aware text-to-motion generation,”Visual Intelligence, vol. 4, no. 1, p. 7, 2026

work page 2026

[10] [10]

Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression,

Z. Meng, Y . Xie, X. Peng, Z. Han, and H. Jiang, “Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27 859–27 871

work page 2025

[11] [11]

BAMM: Bidirectional autoregressive motion model,

E. Pinyoanuntapong, M. U. Saleem, P. Wang, M. Lee, S. Das, and C. Chen, “BAMM: Bidirectional autoregressive motion model,” in European Conference on Computer Vision. Springer, 2024, pp. 172– 190

work page 2024

[12] [12]

Generating human motion from textual descriptions with discrete representations,

J. Zhang, Y . Zhang, X. Cun, Y . Zhang, H. Zhao, H. Lu, X. Shen, and Y . Shan, “Generating human motion from textual descriptions with discrete representations,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 730–14 740

work page 2023

[13] [13]

Motiongpt: Finetuned llms are general-purpose motion generators,

Y . Zhang, D. Huang, B. Liu, S. Tang, Y . Lu, L. Chen, L. Bai, Q. Chu, N. Yu, and W. Ouyang, “Motiongpt: Finetuned llms are general-purpose motion generators,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7368–7376

work page 2024

[14] [14]

SnapMoGen: Human motion genera- tion from expressive texts,

I. Hwang, J. Wang, and B. Zhou, “SnapMoGen: Human motion genera- tion from expressive texts,” inAnnual Conference on Neural Information Processing Systems, 2025

work page 2025

[15] [15]

Autoregressive motion generation with gaussian mixture-guided latent sampling,

L. Tu, L. Meng, Z. Li, H. Ling, and S. Huang, “Autoregressive motion generation with gaussian mixture-guided latent sampling,” inAnnual Conference on Neural Information Processing Systems, 2025

work page 2025

[16] [16]

Absolute coordinates make motion generation easy,

Z. Meng, Z. Han, X. Peng, Y. Xie, and H. Jiang, “Absolute coordinates make motion generation easy,” arXiv preprint arXiv:2505.19377, 2025

work page arXiv 2025

[17] [17]

Momask: Generative masked modeling of 3d human motions,

C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1900–1910

work page 2024

[18] [18]

Autoregressive image generation using residual quantization,

D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 513–11 522

work page 2022

[19] [19]

EasyTune: Efficient step-aware fine-tuning for diffusion-based motion generation,

X. Tan, W. Weng, H. Lei, and H. Wang, “EasyTune: Efficient step-aware fine-tuning for diffusion-based motion generation,” in International Conference on Learning Representations, 2026

work page 2026

[20] [20]

Human motion diffusion model,

G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” in International Conference on Learning Representations, 2023

work page 2023

[21] [21]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

work page 2017

[22] [22]

Fg-T2M: Fine-grained text-driven human motion generation via diffusion model,

Y. Wang, Z. Leng, F. W. Li, S.-C. Wu, and X. Liang, “Fg-T2M: Fine-grained text-driven human motion generation via diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 035–22 044

work page 2023

[23] [23]

Remodiffuse: Retrieval-augmented motion diffusion model,

M. Zhang, X. Guo, L. Pan, Z. Cai, F. Hong, H. Li, L. Yang, and Z. Liu, “Remodiffuse: Retrieval-augmented motion diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 364–373

work page 2023

[24] [24]

Mofusion: A framework for denoising-diffusion-based motion synthesis,

R. Dabral, M. H. Mughal, V. Golyanik, and C. Theobalt, “Mofusion: A framework for denoising-diffusion-based motion synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9760–9770

work page 2023

[25] [25]

Less is more: Improving motion diffusion models with sparse keyframes,

J. Bae, I. Hwang, Y.-Y. Lee, Z. Guo, J. Liu, Y. Ben-Shabat, Y. M. Kim, and M. Kapadia, “Less is more: Improving motion diffusion models with sparse keyframes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 11 069–11 078

work page 2025

[26] [26]

AttT2M: Text-driven human motion generation with multi-perspective attention mechanism,

C. Zhong, L. Hu, Z. Zhang, and S. Xia, “AttT2M: Text-driven human motion generation with multi-perspective attention mechanism,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 509–519

work page 2023

[27] [27]

Motiongpt: Human motion as a foreign language,

B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,” Advances in Neural Information Processing Systems, vol. 36, pp. 20 067–20 079, 2023

work page 2023

[28] [28]

AMD: Autoregressive motion diffusion,

B. Han, H. Peng, M. Dong, Y. Ren, Y. Shen, and C. Xu, “AMD: Autoregressive motion diffusion,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2022–2030

work page 2024

[29] [29]

Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,

L. Xiao, S. Lu, H. Pi, K. Fan, L. Pan, Y. Zhou, Z. Feng, X. Zhou, S. Peng, and J. Wang, “Motionstreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 10 086–10 096

work page 2025

[30] [30]

Discord: Discrete tokens to continuous motion via rectified flow decoding,

J. Cho, J. Kim, J. Kim, M. Kim, M. Kang, S. Hong, T.-H. Oh, and Y. Yu, “Discord: Discrete tokens to continuous motion via rectified flow decoding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14 602–14 612

work page 2025

[31] [31]

Guided motion diffusion for controllable human motion synthesis,

K. Karunratanakul, K. Preechakul, S. Suwajanakorn, and S. Tang, “Guided motion diffusion for controllable human motion synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2151–2162

work page 2023

[32] [32]

Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs,

P. Jin, Y. Wu, Y. Fan, Z. Sun, W. Yang, and L. Yuan, “Act as you wish: Fine-grained control of motion diffusion model with hierarchical semantic graphs,” Advances in Neural Information Processing Systems, vol. 36, pp. 15 497–15 518, 2023

work page 2023

[33] [33]

Optimizing diffusion noise can serve as universal motion priors,

K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1334–1345

work page 2024

[34] [34]

Omnicontrol: Control any joint at any time for human motion generation,

Y. Xie, V. Jampani, L. Zhong, D. Sun, and H. Jiang, “Omnicontrol: Control any joint at any time for human motion generation,” in International Conference on Learning Representations, 2024

work page 2024

[35] [35]

MotionLCM: Real-time controllable motion generation via latent consistency model,

W. Dai, L.-H. Chen, J. Wang, J. Liu, B. Dai, and Y. Tang, “MotionLCM: Real-time controllable motion generation via latent consistency model,” in European Conference on Computer Vision. Springer, 2024, pp. 390–408

work page 2024

[36] [36]

SimMotionEdit: Text-based human motion editing with motion similarity prediction,

Z. Li, K. Cheng, A. Ghosh, U. Bhattacharya, L. Gui, and A. Bera, “SimMotionEdit: Text-based human motion editing with motion similarity prediction,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27 827–27 837

work page 2025

[37] [37]

Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control,

K. Zhao, G. Li, and S. Tang, “Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control,” in International Conference on Learning Representations, 2025

work page 2025

[38] [38]

Dynamic motion blending for versatile motion editing,

N. Jiang, H. Li, Z. Yuan, Z. He, Y. Chen, T. Liu, Y. Zhu, and S. Huang, “Dynamic motion blending for versatile motion editing,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22 735–22 745

work page 2025

[39] [39]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

work page 2020

[40] [40]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers,

N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie, “Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers,” in European Conference on Computer Vision. Springer, 2024, pp. 23–40

work page 2024

[41] [41]

Generating diverse and natural 3d human motions from text,

C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161

work page 2022

[42] [42]

The kit motion-language dataset,

M. Plappert, C. Mandery, and T. Asfour, “The kit motion-language dataset,” Big Data, vol. 4, no. 4, pp. 236–252, 2016, PMID: 27992262

work page 2016

[43] [43]

CMU graphics lab motion capture database,

Carnegie Mellon University, “CMU graphics lab motion capture database,” http://mocap.cs.cmu.edu/, 2017. [Online]. Available: https://cir.nii.ac.jp/crid/1571417125676818048

work page arXiv 2017

[44] [44]

AMASS: Archive of motion capture as surface shapes,

N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “AMASS: Archive of motion capture as surface shapes,” in IEEE International Conference on Computer Vision, Oct 2019. [Online]. Available: https://amass.is.tue.mpg.de

work page 2019

[45] [45]

Action2motion: Conditioned generation of 3d human motions,

C. Guo, X. Zuo, S. Wang, S. Zou, Q. Sun, A. Deng, M. Gong, and L. Cheng, “Action2motion: Conditioned generation of 3d human motions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2021–2029

work page 2020

[46] [46]

Motionclip: Exposing human motion generation to clip space,

G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in European Conference on Computer Vision. Springer, 2022, pp. 358–374

work page 2022

[47] [47]

CLIPScore: a reference-free evaluation metric for image captioning,

J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “CLIPScore: a reference-free evaluation metric for image captioning,” in Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528

work page 2021

[48] [48]

Motiondiffuse: Text-driven human motion generation with diffusion model,

M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” arXiv preprint arXiv:2208.15001, 2022

work page arXiv 2022

[49] [49]

MMM: Generative masked motion model,

E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “MMM: Generative masked motion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024