Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition
Pith reviewed 2026-05-10 06:53 UTC · model grok-4.3
The pith
One diffusion model unifies skeleton action recognition and text-to-motion generation by using recognizer feedback for semantic guidance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Coordinates-based Autoregressive Motion Diffusion (CoAMD) together with a Multi-modal Action Recognizer (MAR) that delivers gradient-based semantic guidance can simultaneously solve skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing. By training and evaluating on absolute rather than relative coordinates and testing on 13 benchmarks, the single model reaches state-of-the-art numbers on all four tasks.
What carries the argument
Multi-modal Action Recognizer (MAR) that supplies gradient-based semantic guidance to the Coordinates-based Autoregressive Motion Diffusion (CoAMD) generator.
If this is right
- The same architecture can be used directly for skeleton-based action recognition without retraining from scratch.
- Text-to-motion outputs gain coherence because the recognizer feedback enforces semantic consistency during sampling.
- Text-motion retrieval and motion editing become natural extensions of the shared representation.
- Performance advantages appear across diverse datasets without requiring separate models for each task.
Where Pith is reading between the lines
- Future motion systems may default to including a lightweight recognition head to improve generation quality.
- The absolute-coordinate choice could increase robustness when motions are captured under varying camera positions or in real-time settings.
- The unification pattern suggests similar recognizer-guided diffusion could be tested on related tasks such as motion prediction or anomaly detection.
Load-bearing premise
Gradient signals from the action recognizer reliably steer motion generation toward better semantic alignment without introducing new artifacts or demanding heavy per-task tuning.
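This premise mirrors standard classifier guidance. A minimal, hypothetical sketch (not the paper's implementation; the recognizer, dimensions, and step size are all invented for illustration): a linear recognizer scores a motion vector, and ascending the gradient of its target-class log-probability nudges a noisy sample toward that action.

```python
import numpy as np

def log_prob_target(x, W, target):
    """Log-probability of the target action under a toy linear recognizer."""
    logits = x @ W
    m = logits.max()
    return logits[target] - m - np.log(np.exp(logits - m).sum())

def guided_step(x, W, target, scale=0.1):
    """One guidance update: ascend the gradient of log p(target | x),
    the shape a recognizer-guidance term typically takes while sampling."""
    logits = x @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = W[:, target] - W @ probs  # gradient of log-softmax w.r.t. x
    return x + scale * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # 4-dim "motion" feature, 3 action classes
x = rng.normal(size=4)       # stand-in for a noisy sample mid-denoising
before = log_prob_target(x, W, target=1)
for _ in range(20):
    x = guided_step(x, W, target=1)
after = log_prob_target(x, W, target=1)
```

Because the log-softmax objective is concave in `x`, small guidance steps provably raise the target-class log-probability; whether this transfers to motion quality without artifacts is exactly what the premise asserts.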
What would settle it
An ablation that removes the MAR guidance component and shows no drop (or an increase) in standard text-to-motion metrics such as FID or R-Precision would indicate the claimed benefit of semantic guidance does not hold.
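R-Precision, one of the metrics named above, is simple to compute. A sketch under the usual definition (embedding spaces and distance choice are assumptions, not taken from the paper): for each text query, check whether its paired motion ranks among the k nearest motion embeddings.

```python
import numpy as np

def r_precision(text_emb, motion_emb, k=3):
    """Top-k retrieval accuracy: for each text embedding, does its paired
    motion embedding (same row index) land among the k nearest motions?"""
    # Pairwise Euclidean distances, shape (num_texts, num_motions).
    d = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    topk = np.argsort(d, axis=1)[:, :k]
    hits = [i in topk[i] for i in range(len(text_emb))]
    return float(np.mean(hits))

# Sanity check: identical pairs retrieve themselves perfectly.
emb = np.random.default_rng(1).normal(size=(8, 16))
print(r_precision(emb, emb.copy(), k=1))  # 1.0
```

The ablation described above would compare this score (and FID) with the MAR guidance term enabled versus disabled.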
read the original abstract
Human action recognition and motion generation are two active research problems in human-centric computer vision, both aiming to align motion with textual semantics. However, most existing works study these two problems separately, without uncovering the links between them, namely that motion generation requires semantic comprehension. This work investigates unified action recognition and motion generation by leveraging skeleton coordinates for both motion understanding and generation. We propose Coordinates-based Autoregressive Motion Diffusion (CoAMD), which synthesizes motion in a coarse-to-fine manner. As a core component of CoAMD, we design a Multi-modal Action Recognizer (MAR) that provides gradient-based semantic guidance for motion generation. Furthermore, we establish a rigorous benchmark by evaluating baselines on absolute coordinates. Our model can be applied to four important tasks, including skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing. Extensive experiments on 13 benchmarks across these tasks demonstrate that our approach achieves state-of-the-art performance, highlighting its effectiveness and versatility for human motion modeling. Code is available at https://github.com/jidongkuang/CoAMD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Coordinates-based Autoregressive Motion Diffusion (CoAMD) to unify skeleton-based action recognition and text-to-motion generation. It introduces a Multi-modal Action Recognizer (MAR) that supplies gradient-based semantic guidance inside the CoAMD diffusion sampler, enabling coarse-to-fine motion synthesis from text. The framework is applied to four tasks (skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing) and claims state-of-the-art results on 13 benchmarks after re-evaluating baselines on absolute coordinates.
Significance. If the MAR guidance mechanism is shown to be the load-bearing driver of the reported gains, the work would meaningfully bridge two previously separate lines of human-motion research and provide a reusable semantic prior for generation. Code release is a positive factor for reproducibility. However, the significance is currently limited by the absence of controlled experiments isolating the guidance contribution from the autoregressive coordinate diffusion and the absolute-coordinate benchmark protocol.
major comments (3)
- [Experiments] Experiments section: the manuscript reports SOTA performance across 13 benchmarks but provides no details on baseline implementations, exact metrics, error bars, data splits, or ablation studies. This directly affects verification of the central claim that MAR gradient guidance improves text-motion alignment.
- [§3.2] §3.2 (MAR) and generation experiments: the claim that gradient-based semantic guidance from MAR reliably improves generation quality without new artifacts or instability is load-bearing for the 'marriage' of recognition and generation. No guidance-scale sweeps, disabled-guidance runs, or ablations on the absolute-coordinate benchmark are described, leaving open the possibility that gains arise from CoAMD's coarse-to-fine autoregression or the new evaluation protocol rather than MAR.
- [Benchmark] Benchmark re-evaluation protocol: converting all baselines to absolute coordinates is presented as rigorous, yet no quantitative comparison of relative vs. absolute coordinate performance for the same models is supplied. This makes it difficult to attribute SOTA results specifically to the proposed unification rather than the coordinate choice.
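The relative-versus-absolute distinction can be made concrete. A hedged sketch (the data layout is a common convention, not the paper's exact format): many pipelines store per-frame root velocity plus root-relative joint positions, and absolute coordinates are recovered by integrating the root trajectory.

```python
import numpy as np

def to_absolute(root_vel, rel_joints):
    """Convert per-frame root velocity (T, 3) and root-relative joints
    (T, J, 3) into absolute world-frame joint coordinates (T, J, 3)."""
    root_pos = np.cumsum(root_vel, axis=0)   # integrate velocity over time
    return rel_joints + root_pos[:, None, :]  # shift every joint by the root

T, J = 5, 22
vel = np.tile([0.1, 0.0, 0.0], (T, 1))  # constant walk along x
rel = np.zeros((T, J, 3))               # joints glued to the root, for clarity
abs_joints = to_absolute(vel, rel)      # root x drifts to about 0.5 by frame 5
```

Re-evaluating a relative-coordinate baseline on an absolute-coordinate benchmark implicitly applies a transform like this, which is why the referee asks for a like-for-like comparison.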
minor comments (2)
- [Abstract] The abstract states 'extensive experiments' but the main text should explicitly list the 13 benchmarks with references and splits for each task.
- [Method] Notation for the coarse-to-fine autoregressive process and the precise form of the MAR gradient term should be clarified with an equation or pseudocode block.
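Pending that clarification, the guidance term most plausibly takes the standard classifier-guidance form from the diffusion literature; this is a guess at the notation, not the paper's confirmed expression:

```latex
% Hypothetical MAR guidance term (standard classifier guidance; the
% paper's exact formulation may differ):
\hat{\epsilon}_t \;=\; \epsilon_\theta(x_t, t, c)
  \;-\; s\,\sqrt{1-\bar{\alpha}_t}\;\nabla_{x_t}\log p_{\phi}(a \mid x_t, t)
```

Here $\epsilon_\theta$ is the text-conditioned denoiser, $p_\phi$ the action recognizer, $a$ the target action, and $s$ a guidance scale; the sweep over $s$ requested in the major comments would probe this term directly.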
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that stronger experimental controls and documentation are required to substantiate the role of MAR guidance and the absolute-coordinate protocol. We address each major comment below and commit to the indicated revisions.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the manuscript reports SOTA performance across 13 benchmarks but provides no details on baseline implementations, exact metrics, error bars, data splits, or ablation studies. This directly affects verification of the central claim that MAR gradient guidance improves text-motion alignment.
Authors: We agree that the current Experiments section is insufficiently detailed for independent verification. In the revised manuscript we will expand this section (and the supplementary material) with complete baseline implementation details, the precise metrics and protocols used for each of the 13 benchmarks, error bars computed over multiple random seeds, explicit data splits, and additional ablation studies that isolate the contribution of MAR gradient guidance to text-motion alignment. revision: yes
-
Referee: [§3.2] §3.2 (MAR) and generation experiments: the claim that gradient-based semantic guidance from MAR reliably improves generation quality without new artifacts or instability is load-bearing for the 'marriage' of recognition and generation. No guidance-scale sweeps, disabled-guidance runs, or ablations on the absolute-coordinate benchmark are described, leaving open the possibility that gains arise from CoAMD's coarse-to-fine autoregression or the new evaluation protocol rather than MAR.
Authors: We accept that controlled ablations are necessary to establish MAR guidance as the primary driver. The revised version will include guidance-scale sweeps, results with MAR guidance disabled, and direct ablations of CoAMD with versus without MAR on the absolute-coordinate benchmarks. We will also report any observed artifacts or sampling instability and discuss their relation to the autoregressive coordinate diffusion. revision: yes
-
Referee: [Benchmark] Benchmark re-evaluation protocol: converting all baselines to absolute coordinates is presented as rigorous, yet no quantitative comparison of relative vs. absolute coordinate performance for the same models is supplied. This makes it difficult to attribute SOTA results specifically to the proposed unification rather than the coordinate choice.
Authors: We acknowledge the value of a direct relative-versus-absolute comparison. We will add quantitative results for our model and a representative subset of baselines (where re-implementation is feasible within revision time) to quantify the effect of the coordinate system. We will also expand the discussion of why absolute coordinates are required for the unified recognition-generation framework. A complete re-evaluation of every baseline in relative coordinates is not practical, but the partial comparison will help attribute gains to the proposed unification. revision: partial
Circularity Check
No significant circularity; claims rest on external empirical benchmarks
full rationale
The paper introduces CoAMD for coarse-to-fine motion synthesis and MAR for gradient guidance, then reports SOTA results on 13 external benchmarks for recognition, generation, retrieval, and editing. No load-bearing step reduces a claimed prediction or result to a self-definition, fitted input renamed as prediction, or self-citation chain. Performance assertions are tied to absolute-coordinate re-evaluations and comparisons against independent baselines, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Diffusion models can synthesize coherent human motions in a coarse-to-fine autoregressive manner from skeleton coordinates
- domain assumption Gradient signals from a multi-modal action recognizer provide effective semantic guidance that improves text-conditioned motion quality
invented entities (2)
- CoAMD: no independent evidence
- MAR: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Executing your commands via motion diffusion in latent space
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
2023
-
[2]
Neuron: Learning context-aware evolving representations for zero-shot skeleton action recognition
Yang Chen, Jingcai Guo, Song Guo, and Dacheng Tao. Neuron: Learning context-aware evolving representations for zero-shot skeleton action recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8721–8730, 2025.
2025
-
[3]
Infogcn: Representation learning for human skeleton-based action recognition
Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022.
2022
-
[4]
Motionlcm: Real-time controllable motion generation via latent consistency model
Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. Motionlcm: Real-time controllable motion generation via latent consistency model. In European Conference on Computer Vision, pages 390–408. Springer,
-
[5]
Skateformer: Skeletal-temporal transformer for human action recognition
Jeonghyeok Do and Munchurl Kim. Skateformer: Skeletal-temporal transformer for human action recognition. In European Conference on Computer Vision, pages 401–420. Springer, 2024.
2024
-
[6]
Hierarchical recurrent neural network for skeleton based action recognition
Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
2015
-
[7]
Pyskl: Towards good practices for skeleton action recognition
Haodong Duan, Jiaqi Wang, Kai Chen, and Dahua Lin. Pyskl: Towards good practices for skeleton action recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, 2022.
2022
-
[8]
Revisiting skeleton-based action recognition
Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022.
2022
-
[9]
Generating diverse and natural 3d human motions from text
Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
2022
-
[10]
Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts
Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In European Conference on Computer Vision, pages 580–597. Springer, 2022.
2022
-
[11]
Momask: Generative masked modeling of 3d human motions
Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024.
2024
-
[12]
Xiaohu Huang, Hao Zhou, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, and Bin Feng. Graph contrastive learning for skeleton-based action recognition. arXiv preprint arXiv:2301.10900, 2023.
-
[13]
Stablemofusion: Towards robust and efficient diffusion-based motion generation framework
Yiheng Huang, Hui Yang, Chuanchen Luo, Yuxi Wang, Shibiao Xu, Zhaoxiang Zhang, Man Zhang, and Junran Peng. Stablemofusion: Towards robust and efficient diffusion-based motion generation framework. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 224–232, 2024.
2024
-
[14]
Motiongpt: Human motion as a foreign language
Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36:20067–20079, 2023.
2023
-
[15]
Flame: Free-form language-based motion synthesis & editing
Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free-form language-based motion synthesis & editing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8255–8263, 2023.
2023
-
[16]
Zero-shot skeleton-based action recognition with dual visual-text alignment
Jidong Kuang, Hongsong Wang, Chaolei Han, Yang Zhang, and Jie Gui. Zero-shot skeleton-based action recognition with dual visual-text alignment. Pattern Recognition, page 112342, 2025.
2025
-
[17]
Hierarchically decomposed graph convolutional networks for skeleton-based action recognition
Jungho Lee, Minhyeok Lee, Dogyoon Lee, and Sangyoun Lee. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10444–10453, 2023.
2023
-
[18]
Unimotion: Unifying 3d human motion synthesis and understanding
Chuqiao Li, Julian Chibane, Yannan He, Naama Pearl, Andreas Geiger, and Gerard Pons-Moll. Unimotion: Unifying 3d human motion synthesis and understanding. In International Conference on 3D Vision, pages 240–249. IEEE, 2025.
2025
-
[19]
Autoregressive image generation without vector quantization
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems, 37:56424–56445, 2024.
2024
-
[20]
Lamp: Language-motion pretraining for motion generation, retrieval, and captioning
Zhe Li, Weihao Yuan, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zilong Dong, Laurence Tianruo Yang, et al. Lamp: Language-motion pretraining for motion generation, retrieval, and captioning. In International Conference on Learning Representations, 2025.
2025
-
[21]
Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding
Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2019.
2019
-
[22]
3d skeleton-based action recognition: A review
Mengyuan Liu, Hong Liu, Qianshuo Hu, Bin Ren, Junsong Yuan, Jiaying Lin, and Jiajun Wen. 3d skeleton-based action recognition: A review. arXiv preprint arXiv:2506.00915,
-
[23]
Scamo: Exploring the scaling law in autoregressive motion generation model
Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. Scamo: Exploring the scaling law in autoregressive motion generation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27872–27882, 2025.
2025
-
[24]
Towards unified human motion-language understanding via sparse interpretable characterization
Guangtao Lyu, Chenghao Xu, Jiexi Yan, Muli Yang, and Cheng Deng. Towards unified human motion-language understanding via sparse interpretable characterization. In International Conference on Learning Representations, 2025.
2025
-
[25]
Absolute coordinates make motion generation easy
Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, and Huaizu Jiang. Absolute coordinates make motion generation easy. arXiv preprint arXiv:2505.19377, 2025.
2025
-
[26]
Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression
Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, and Huaizu Jiang. Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27859–27871, 2025.
2025
-
[27]
Temos: Generating diverse human motions from textual descriptions
Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pages 480–
-
[28]
TMR: Text-to-motion retrieval using contrastive 3d human motion synthesis
Mathis Petrovich, Michael J Black, and Gül Varol. TMR: Text-to-motion retrieval using contrastive 3d human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9488–9497, 2023.
2023
-
[29]
BAMM: Bidirectional autoregressive motion model
Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, and Chen Chen. BAMM: Bidirectional autoregressive motion model. In European Conference on Computer Vision, pages 172–190. Springer,
-
[30]
Mmm: Generative masked motion model
Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, and Chen Chen. Mmm: Generative masked motion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1546–1555, 2024.
2024
-
[31]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
2022
-
[32]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
2022
-
[33]
Generalized zero-and few-shot learning via aligned variational autoencoders
Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8247–8255, 2019.
2019
-
[34]
Ntu rgb+d: A large scale dataset for 3d human activity analysis
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016.
2016
-
[35]
Unified multi-modal unsupervised representation learning for skeleton-based action understanding
Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun Yang, Xun Wang, and Meng Wang. Unified multi-modal unsupervised representation learning for skeleton-based action understanding. In Proceedings of the ACM International Conference on Multimedia, pages 2973–2984, 2023.
2023
-
[36]
Human-centric foundation models: Perception, generation and agentic modeling
Shixiang Tang, Yizhou Wang, Lu Chen, Yuan Wang, Sida Peng, Dan Xu, and Wanli Ouyang. Human-centric foundation models: Perception, generation and agentic modeling. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 10678–10686. International Joint Conferences on Artificial Intelligence Organization, 2025. Survey Track.
2025
-
[37]
Human motion diffusion model
Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In International Conference on Learning Representations, 2023.
2023
-
[38]
Closd: Closing the loop between simulation and diffusion for multi-task character control
Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit Haim Bermano, and Michiel van de Panne. Closd: Closing the loop between simulation and diffusion for multi-task character control. In International Conference on Learning Representations, 2025.
2025
-
[39]
Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks
Hongsong Wang and Liang Wang. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 499–508, 2017.
2017
-
[40]
Foundation model for skeleton-based human action understanding
Hongsong Wang, Wanjiang Weng, Junbo Wang, Fang Zhao, Guo-Sen Xie, Xin Geng, and Liang Wang. Foundation model for skeleton-based human action understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
2025
-
[41]
3mformer: Multi-order multi-mode transformer for skeletal action recognition
Lei Wang and Piotr Koniusz. 3mformer: Multi-order multi-mode transformer for skeletal action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5620–5631, 2023.
2023
-
[42]
Hulk: A universal knowledge translator for human-centric tasks
Yizhou Wang, Yixuan Wu, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang, et al. Hulk: A universal knowledge translator for human-centric tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
2025
-
[43]
Mg-motionllm: A unified framework for motion comprehension and generation across multiple granularities
Bizhu Wu, Jinheng Xie, Keming Shen, Zhe Kong, Jianfeng Ren, Ruibin Bai, Rong Qu, and Linlin Shen. Mg-motionllm: A unified framework for motion comprehension and generation across multiple granularities. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27849–27858, 2025.
2025
-
[44]
Motion-agent: A conversational framework for human motion generation with llms
Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Motion-agent: A conversational framework for human motion generation with llms. In International Conference on Learning Representations, 2025.
2025
-
[45]
Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer
Wenhan Wu, Ce Zheng, Zihao Yang, Chen Chen, Srijan Das, and Aidong Lu. Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer. In Proceedings of the ACM International Conference on Multimedia, pages 4660–4669, 2024.
2024
-
[46]
Wenhan Wu, Zhishuai Guo, Chen Chen, Hongfei Xue, and Aidong Lu. Frequency-semantic enhanced variational autoencoder for zero-shot skeleton-based action recognition. arXiv preprint arXiv:2506.22179, 2025.
-
[47]
Jianyang Xie, Yitian Zhao, Yanda Meng, He Zhao, Anh Nguyen, and Yalin Zheng. Are spatial-temporal graph convolution networks for human action recognition over-parameterized? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24309–24319, 2025.
2025
-
[48]
Skeleton mixformer: Multivariate topology representation for skeleton-based action recognition
Wentian Xin, Qiguang Miao, Yi Liu, Ruyi Liu, Chi-Man Pun, and Cheng Shi. Skeleton mixformer: Multivariate topology representation for skeleton-based action recognition. In Proceedings of the ACM International Conference on Multimedia, pages 2211–2220, 2023.
2023
-
[49]
Spatial temporal graph convolutional networks for skeleton-based action recognition
Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
2018
-
[50]
Mogents: Motion generation based on spatial-temporal joint modeling
Weihao Yuan, Yisheng He, Weichao Shen, Yuan Dong, Xiaodong Gu, Zilong Dong, Liefeng Bo, and Qixing Huang. Mogents: Motion generation based on spatial-temporal joint modeling. Advances in Neural Information Processing Systems, 37:130739–130763, 2024.
2024
-
[51]
Physdiff: Physics-guided human motion diffusion model
Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16010–16021, 2023.
2023
-
[52]
Generating human motion from textual descriptions with discrete representations
Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14730–14740, 2023.
2023
-
[53]
Energymogen: Compositional human motion generation with energy-based diffusion model in latent space
Jianrong Zhang, Hehe Fan, and Yi Yang. Energymogen: Compositional human motion generation with energy-based diffusion model in latent space. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17592–17602, 2025.
2025
-
[54]
Remodiffuse: Retrieval-augmented motion diffusion model
Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, and Ziwei Liu. Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 364–373, 2023.
2023
-
[55]
Motiondiffuse: Text-driven human motion generation with diffusion model
Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4115–4128, 2024.
2024
-
[56]
Pengfei Zhang, Pinxin Liu, Pablo Garrido, Hyeongwoo Kim, and Bindita Chaudhuri. Kinmo: Kinematic-aware human motion understanding and generation. arXiv preprint arXiv:2411.15472, 2024.
-
[57]
Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control
Kaifeng Zhao, Gen Li, and Siyu Tang. Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control. In International Conference on Learning Representations, 2025.
2025
-
[58]
Emdm: Efficient motion diffusion model for fast and high-quality motion generation
Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, and Lingjie Liu. Emdm: Efficient motion diffusion model for fast and high-quality motion generation. In European Conference on Computer Vision, pages 18–38. Springer, 2024.
2024
-
[59]
Zero-shot skeleton-based action recognition via mutual information estimation and maximization
Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, and Jiaqi Wang. Zero-shot skeleton-based action recognition via mutual information estimation and maximization. In Proceedings of the ACM International Conference on Multimedia, pages 5302–5310, 2023.
2023
-
[60]
Blockgcn: Redefine topology awareness for skeleton-based action recognition
Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, and Xian-Sheng Hua. Blockgcn: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024.
2024
-
[61]
Semantic-guided cross-modal prompt learning for skeleton-based zero-shot action recognition
Anqi Zhu, Jingmin Zhu, James Bailey, Mingming Gong, and Qiuhong Ke. Semantic-guided cross-modal prompt learning for skeleton-based zero-shot action recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13876–13885, 2025.
2025
-
[62]
Motionbert: A unified perspective on learning human motion representations
Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15085–15099, 2023.
2023
-
[63]
Human motion generation: A survey
Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2430–2449, 2023.
2023