MotionHiFlow: Text-to-motion via hierarchical flow matching
Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3
The pith
MotionHiFlow generates 3D human motions from text by building flows progressively across increasing temporal scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MotionHiFlow argues that flow matching for text-to-motion succeeds when flow paths are built from low to high temporal scales. Lower-scale flows capture semantics and coarse motion structure, while higher-scale flows add temporal detail. A cross-scale transition process links the scales and maintains continuity and noise consistency. The framework is integrated with a Text-Motion Diffusion Transformer and a topology-aware Motion VAE that models joint dependencies through joint-aware positional encoding and skeletal topology.
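The low-to-high construction can be pictured as a temporal pyramid over one motion sequence. The sketch below is illustrative only: average pooling as the downsampling operator, the factors (4, 2, 1), and the names `avg_pool_time` and `temporal_pyramid` are all assumptions, not details taken from the paper.

```python
def avg_pool_time(frames, factor):
    """Average consecutive groups of `factor` frames.

    `frames` is a list of per-frame feature lists (time-major).
    The sequence length must be divisible by `factor`.
    """
    if len(frames) % factor != 0:
        raise ValueError("sequence length must be divisible by the pooling factor")
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        # zip(*group) iterates over feature columns within the group
        pooled.append([sum(column) / factor for column in zip(*group)])
    return pooled


def temporal_pyramid(frames, factors=(4, 2, 1)):
    """Return {factor: pooled sequence}, ordered from coarsest to finest.

    The factors are hypothetical; the paper's scale schedule is not given here.
    """
    return {factor: avg_pool_time(frames, factor) for factor in factors}


# Toy "motion": 8 frames with 2 features each.
toy_motion = [[float(t), 2.0 * t] for t in range(8)]
pyramid = temporal_pyramid(toy_motion)
```

In this toy setup the coarsest level keeps 2 frames and the finest level is the original 8-frame sequence, which mirrors the idea of settling coarse structure before temporal detail.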
What carries the argument
A hierarchical flow matching framework that constructs paths from low to high temporal scales and links them with a cross-scale transition process.
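A toy sketch of what coarse-to-fine sampling across two temporal scales might look like. Everything specific here is an assumption made for illustration: the closed-form velocity stands in for a learned network, the time split at 0.5, the nearest-neighbour upsampling, and the function names are hypothetical, and no noise-consistency correction is applied at the hand-off.

```python
import random


def upsample_nearest(seq, factor):
    """Repeat each frame `factor` times along the time axis."""
    return [x for x in seq for _ in range(factor)]


def toy_velocity(target):
    """Velocity of a straight path toward `target`: v = (target - x) / (1 - t).

    A stand-in for a learned velocity network (assumption).
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


def euler_flow(x, velocity, t_start, t_end, steps):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with Euler steps."""
    dt = (t_end - t_start) / steps
    for k in range(steps):
        t = t_start + k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def sample_coarse_to_fine(fine_target, factor=2, steps=4, seed=0):
    """Run the first half of the flow at a coarse scale, then finish at full scale."""
    rng = random.Random(seed)
    coarse_target = [
        sum(fine_target[i:i + factor]) / factor
        for i in range(0, len(fine_target), factor)
    ]
    x = [rng.gauss(0.0, 1.0) for _ in coarse_target]   # noise at the coarse scale
    x = euler_flow(x, toy_velocity(coarse_target), 0.0, 0.5, steps)
    x = upsample_nearest(x, factor)                     # naive cross-scale hand-off
    x = euler_flow(x, toy_velocity(fine_target), 0.5, 1.0, steps)
    return x


toy_target = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]
sample = sample_coarse_to_fine(toy_target)
```

Because the toy velocity always points straight at its target, the second stage lands on `toy_target`; a real model would replace `toy_velocity` with a text-conditioned network and would need a principled transition in place of plain upsampling.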
If this is right
- Motions achieve tighter semantic alignment with text because high-level concepts are established first before details are added.
- Temporal coherence across entire sequences improves through consistent noise handling during cross-scale transitions.
- Fine-grained joint movements become more precise without sacrificing overall motion structure due to the topology-aware components.
- Ablation results indicate that removing the hierarchy or transition process reduces performance on standard motion datasets.
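One way to see why noise handling at a scale transition needs care: if coarse Gaussian noise is simply copied to neighbouring frames, the upsampled frames are no longer independent. The sketch below measures this empirically. Nearest-neighbour upsampling and all names are assumptions for illustration; the page does not specify the paper's transition operator.

```python
import random


def upsample_nearest(seq, factor):
    """Repeat each frame `factor` times along the time axis."""
    return [x for x in seq for _ in range(factor)]


def correlation(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def adjacent_noise_correlations(trials=2000, coarse_len=2, factor=2, seed=0):
    """Correlation between adjacent frames after upsampling i.i.d. noise."""
    rng = random.Random(seed)
    samples = [
        upsample_nearest([rng.gauss(0.0, 1.0) for _ in range(coarse_len)], factor)
        for _ in range(trials)
    ]

    def frame(k):
        return [s[k] for s in samples]

    within = correlation(frame(0), frame(1))  # copies of the same coarse frame
    across = correlation(frame(1), frame(2))  # frames from different coarse frames
    return within, across


within_corr, across_corr = adjacent_noise_correlations()
```

Frames copied from the same coarse frame are perfectly correlated, while frames from different coarse frames stay roughly uncorrelated, so the upsampled noise no longer looks like the i.i.d. prior a fine-scale flow would normally start from.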
Where Pith is reading between the lines
- The same progressive scale construction could be tested on other sequence generation problems such as text-to-video or speech synthesis.
- If the hierarchy assumption holds, it may guide simpler training schedules that start coarse and increase resolution in flow-based models generally.
- Extending the cross-scale links to architectures that do not use flow matching might reveal whether the continuity benefit is specific to this setup.
Load-bearing premise
Complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system.
What would settle it
Train a non-hierarchical, single-scale flow matching model on the same data and evaluate it on the same benchmarks. If it matches or exceeds the reported performance on HumanML3D and KIT-ML, the hierarchical design is not what drives the results.
Original abstract
Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose MotionHiFlow, a hierarchical flow matching framework that generates motion progressively by constructing flow paths from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components. Code is available at https://github.com/ai-lh/MotionHiFlow.
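For orientation, the single-scale objective that a hierarchical scheme would build on can be written in the standard linear-path form from the flow matching literature. This is background rather than the paper's own formulation: the abstract does not state its loss, and the symbols below (noise sample $x_0$, data or latent sample $x_1$, text condition $c$, velocity network $v_\theta$) are notation chosen here.

```latex
% Linear interpolation between noise and data
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

% Regress the network onto the constant velocity of that path
\mathcal{L}_{\text{FM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2
```

Sampling then integrates $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t, c)$ from $t = 0$ to $t = 1$. A multi-scale variant would presumably split this time interval into stages at different temporal resolutions, but how MotionHiFlow does so is not specified on this page.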
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MotionHiFlow, a hierarchical flow-matching framework for text-to-motion generation. Motions are synthesized progressively across temporal scales, with lower-scale flows capturing high-level semantics and coarse structure while higher-scale flows refine temporal details. A cross-scale transition process maintains continuity and noise consistency between scales. The model integrates a Text-Motion Diffusion Transformer and a topology-aware Motion VAE that incorporates joint-aware positional encoding and skeletal topology. Extensive experiments on the HumanML3D and KIT-ML benchmarks are reported to achieve state-of-the-art performance, with ablation studies validating the hierarchical design and individual components. Code is released at the provided GitHub link.
Significance. If the empirical results hold, the work provides a coherent extension of flow matching to the multi-scale regime for human motion synthesis, directly addressing the single-scale limitation of prior diffusion and flow-based methods. The explicit modeling of skeletal topology and the cross-scale transition mechanism are technically natural additions that preserve the advantages of flow matching while improving semantic alignment and detail. The public code release strengthens reproducibility and enables direct follow-up work. Overall, the contribution is a solid incremental advance that could influence subsequent hierarchical generative models in computer vision and graphics.
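To make "skeletal topology" concrete, the sketch below builds a joint adjacency matrix from a kinematic parent list. The five-joint skeleton and the function name are hypothetical, and this is only a common way to encode topology; how the paper's Motion VAE actually consumes it is not described here.

```python
def skeleton_adjacency(parents):
    """Build a symmetric adjacency matrix with self-loops from a parent list.

    parents[j] is the index of joint j's parent, or -1 for the root.
    """
    n = len(parents)
    adj = [[0] * n for _ in range(n)]
    for child, parent in enumerate(parents):
        adj[child][child] = 1          # self-loop
        if parent >= 0:
            adj[child][parent] = 1     # bone, child -> parent
            adj[parent][child] = 1     # bone, parent -> child
    return adj


# Hypothetical skeleton: 0 pelvis (root), 1 spine, 2 head, 3 left hip, 4 right hip
TOY_PARENTS = [-1, 0, 1, 0, 0]
toy_adjacency = skeleton_adjacency(TOY_PARENTS)
```

A matrix like this can mask or bias attention between joints, or define graph convolutions, so that physically connected joints interact more directly than distant ones.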
minor comments (3)
- [Abstract / §3] The abstract states that lower-scale flows capture semantics while higher scales add detail, but the precise definition of the scale hierarchy (e.g., number of scales, temporal downsampling factors) is not quantified; adding a short table or paragraph in §3 would clarify the construction.
- [§3.2] The claim that the cross-scale transition 'preserves noise consistency' is stated without an explicit equation or proof sketch; a brief derivation or reference to the flow-matching ODE in the methods section would strengthen the technical presentation.
- [§4.3] Ablation results are summarized as 'confirming effectiveness,' yet the main text does not report the exact metric deltas (e.g., FID or R-Precision) when the hierarchical component is removed; including these numbers in Table X would make the contribution of each module immediately verifiable.
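Since the last comment asks for explicit R-Precision deltas, here is a minimal sketch of how top-k R-Precision is typically computed from a motion-to-text distance matrix. The tiny distance matrix is made up for illustration, and the function name is hypothetical; benchmark protocols use learned embeddings and larger candidate pools.

```python
def r_precision(dist, k):
    """Fraction of motions whose matching text ranks within the top k.

    dist[i][j] is the distance between motion i and text j;
    the ground-truth text for motion i is assumed to be text i.
    """
    hits = 0
    for i, row in enumerate(dist):
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist)


# Made-up distances for three motion/text pairs.
toy_dist = [
    [0.1, 0.5, 0.9],   # motion 0: its own text is nearest
    [0.4, 0.6, 0.2],   # motion 1: its own text is farthest
    [0.3, 0.8, 0.7],   # motion 2: its own text is second nearest
]
top1 = r_precision(toy_dist, 1)
top2 = r_precision(toy_dist, 2)
top3 = r_precision(toy_dist, 3)
```

Reporting the change in such a score with and without the hierarchical component would give the per-module evidence the comment asks for.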
Simulated Author's Rebuttal
We thank the referee for the positive and constructive review of our manuscript on MotionHiFlow. The summary accurately captures the core ideas of our hierarchical flow-matching approach, the cross-scale transition process, the Text-Motion Diffusion Transformer, and the topology-aware Motion VAE. We appreciate the recognition of the work as a coherent extension of flow matching to the multi-scale regime and the value placed on the public code release for reproducibility. No specific major comments were provided in the report.
Circularity Check
No significant circularity detected
The paper proposes a hierarchical flow-matching architecture for text-to-motion generation, extending standard flow matching with cross-scale transitions, a Text-Motion Diffusion Transformer, and a topology-aware VAE. These components are introduced as novel combinations of established techniques rather than derived from self-referential equations or fitted parameters. Performance claims rest on empirical results from HumanML3D and KIT-ML benchmarks plus ablations, with no load-bearing steps that reduce by construction to the authors' own prior definitions, self-citations, or ansatzes. The derivation chain remains self-contained against external benchmarks and prior flow-matching literature.