Exploring Motion-Language Alignment for Text-driven Motion Generation
Pith reviewed 2026-05-13 20:49 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Text-to-motion models generate more accurate movements when the attention sink on the initial text token is masked, letting the model draw on the full description.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLA-Gen integrates global motion priors with fine-grained local conditioning while using SinkRatio and alignment-aware masking to counteract the attention sink on the start text token, thereby improving both the realism of generated sequences and their semantic correspondence to the input text.
What carries the argument
MLA-Gen framework that pairs global motion priors with fine-grained local conditioning, together with the SinkRatio metric and alignment-aware masking strategies that regulate disproportionate attention on the first text token.
If this is right
- Generated motions follow the complete textual description instead of defaulting to patterns tied to the first word.
- The same masking approach can be added to existing text-to-motion architectures with little extra cost.
- Motion datasets that contain longer, multi-clause sentences become more usable for training.
- Quantitative results improve consistently across baseline models, with lower FID and higher text-alignment scores.
- The SinkRatio value becomes a standard diagnostic for alignment problems in new models.
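The paper's exact SinkRatio formula is not given in this excerpt; a minimal sketch, assuming SinkRatio is simply the fraction of cross-attention mass that lands on the start text token, could look like:

```python
import numpy as np

def sink_ratio(attn, sink_index=0):
    """Fraction of total cross-attention mass landing on one text token.

    attn: array of shape (heads, motion_steps, text_tokens) with each
    query row normalized to sum to 1.  sink_index is the position of the
    suspected sink (the start token)."""
    attn = np.asarray(attn, dtype=float)
    total = attn.sum()                        # = heads * motion_steps
    return attn[..., sink_index].sum() / total

# Toy 1-head map: the first text token absorbs most of the attention.
attn = np.array([[[0.8, 0.1, 0.1],
                  [0.6, 0.2, 0.2]]])
print(round(sink_ratio(attn), 2))  # 0.7
```

A value near 1/num_tokens would indicate no sink; values approaching 1.0 indicate the degenerate concentration the paper describes.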
Where Pith is reading between the lines
- Similar attention-sink behavior may appear in text-to-video or text-to-3D generation, suggesting the masking technique could transfer.
- Longer or more complex prompts would be a direct test of whether the regulation scales without retraining.
- Combining the masking with contrastive losses might produce even tighter language-motion correspondence.
Load-bearing premise
That the attention sink on the start token is the main driver of weak semantic grounding, and that masking it fixes alignment without creating new artifacts or lowering motion quality.
What would settle it
Training and evaluating masked and unmasked variants on the same benchmarks: if masking yields no measurable gain in standard motion-quality or text-motion alignment metrics, the sink is not the primary cause.
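The intervention being A/B-tested is not spelled out in this excerpt; one hypothetical minimal form of sink masking zeroes the start-token column of a row-normalized cross-attention map and renormalizes, redistributing the sink's mass over the informative tokens:

```python
import numpy as np

def mask_sink(attn, sink_index=0):
    """Zero the sink token's attention column and renormalize each query
    row, redistributing its mass over the remaining text tokens."""
    attn = np.asarray(attn, dtype=float).copy()
    attn[..., sink_index] = 0.0
    row_sums = attn.sum(axis=-1, keepdims=True)
    return attn / np.clip(row_sums, 1e-12, None)

# A query that put 80% of its mass on the sink now splits it evenly
# between the two informative tokens: [[0.0, 0.5, 0.5]].
masked = mask_sink(np.array([[0.8, 0.1, 0.1]]))
```

This is a sketch of the generic technique only; the paper's alignment-aware masking may operate on logits before softmax or apply only at selected layers or timesteps.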
Original abstract
Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MLA-Gen, a text-to-motion generation framework that combines global motion priors with local conditioning to improve motion-language alignment. It identifies an attention sink phenomenon in which attention disproportionately focuses on the initial text token, introduces the SinkRatio metric to quantify this concentration, and develops alignment-aware masking and control strategies to regulate attention during generation. The central claim is that these components yield consistent gains in both motion quality and semantic alignment over strong baselines, supported by extensive experiments.
Significance. If the attention sink is shown to be a primary driver of degraded grounding and the masking strategies demonstrably correct it without side effects, the work would offer a useful diagnostic tool (SinkRatio) and practical intervention for improving controllability in motion synthesis. The integration of global priors with local alignment is a reasonable architectural direction, and the release of code would strengthen reproducibility.
Major comments (3)
- [Experiments / Ablation subsection] The ablation studies do not isolate the contribution of alignment-aware masking while holding the MLA-Gen backbone fixed. Without a controlled comparison that toggles only the masking mechanism (and reports the resulting change in SinkRatio together with motion metrics), it remains unclear whether the reported gains in quality and alignment stem from sink regulation or from the global-prior component alone.
- [§3 (Attention Sink Analysis)] The claim that attention sink is a primary cause of degraded semantic grounding rests on correlation via SinkRatio; no causal intervention (e.g., forced attention redistribution independent of the proposed masking) is presented to test whether lowering SinkRatio directly improves grounding or merely co-occurs with other changes.
- [§5 (Experimental Results)] Quantitative results are presented without sufficient detail on baseline implementations, exact metric definitions (FID, R-Precision, etc.), dataset splits, or hyper-parameter search protocols. This makes it difficult to verify that the improvements are robust rather than sensitive to post-hoc choices.
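The causal probe requested in the second major comment could, under one reading, cap the sink token's weight at a fixed budget instead of masking it, handing the excess to the other tokens in proportion to their current weights. This is a hypothetical sketch; the budget value and the proportional rule are assumptions, not details from the paper:

```python
import numpy as np

def cap_sink(attn, sink_index=0, budget=0.2):
    """Cap the sink token's attention at `budget` per query row and give
    the excess to the other tokens proportionally to their weights."""
    attn = np.asarray(attn, dtype=float).copy()
    s = attn[..., sink_index].copy()           # sink weight per query row
    excess = np.clip(s - budget, 0.0, None)    # mass above the budget
    rest = attn.sum(axis=-1) - s               # mass on all other tokens
    scale = (rest + excess) / np.clip(rest, 1e-12, None)
    attn *= scale[..., None]                   # grow the non-sink columns
    attn[..., sink_index] = s - excess         # shrink the sink column
    return attn

# [[0.8, 0.1, 0.1]] with budget 0.2 becomes [[0.2, 0.4, 0.4]].
redistributed = cap_sink(np.array([[0.8, 0.1, 0.1]]), budget=0.2)
```

Sweeping the budget would trace how grounding metrics respond to SinkRatio alone, separating sink regulation from the rest of the architecture.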
Minor comments (2)
- [Abstract] The abstract refers to 'strong baselines' without naming them; the introduction or experimental section should explicitly list the compared methods and their sources.
- [§3.2] Notation for the masking strategies and the precise formulation of SinkRatio should be introduced with a single equation block rather than scattered across paragraphs.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and outline the revisions we will make to improve the manuscript.
Point-by-point responses
-
Referee: [Experiments / Ablation subsection] The ablation studies do not isolate the contribution of alignment-aware masking while holding the MLA-Gen backbone fixed. Without a controlled comparison that toggles only the masking mechanism (and reports the resulting change in SinkRatio together with motion metrics), it remains unclear whether the reported gains in quality and alignment stem from sink regulation or from the global-prior component alone.
Authors: We agree that a more controlled ablation is needed. In the revised manuscript we will add a dedicated experiment that keeps the full MLA-Gen backbone fixed and toggles only the alignment-aware masking (on/off). We will report the resulting SinkRatio values together with all motion quality and alignment metrics to isolate the masking contribution. revision: yes
-
Referee: [§3 (Attention Sink Analysis)] The claim that attention sink is a primary cause of degraded semantic grounding rests on correlation via SinkRatio; no causal intervention (e.g., forced attention redistribution independent of the proposed masking) is presented to test whether lowering SinkRatio directly improves grounding or merely co-occurs with other changes.
Authors: The alignment-aware masking and control strategies are our causal intervention: they directly reshape the attention distribution during generation. Ablations that enable/disable these strategies already show that SinkRatio reduction produces measurable gains in grounding. We will expand §3 to articulate this causal pathway more explicitly. A completely separate forced-redistribution experiment unrelated to our masking would require an orthogonal experimental setup outside the scope of the present framework. revision: partial
-
Referee: [§5 (Experimental Results)] Quantitative results are presented without sufficient detail on baseline implementations, exact metric definitions (FID, R-Precision, etc.), dataset splits, or hyper-parameter search protocols. This makes it difficult to verify that the improvements are robust rather than sensitive to post-hoc choices.
Authors: We appreciate this observation. The revised manuscript will include expanded descriptions of all baseline implementations, precise definitions and computation procedures for FID, R-Precision and other metrics, the exact training/validation/test splits, and the hyper-parameter search protocol. These details will appear in §5 and the supplementary material. revision: yes
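R-Precision in text-to-motion work is conventionally top-k retrieval accuracy in a shared text-motion embedding space; a minimal sketch of that convention (not the paper's exact evaluation code) is:

```python
import numpy as np

def r_precision(text_emb, motion_emb, k=3):
    """Top-k retrieval accuracy: for each text embedding, count a hit
    when its paired motion (same row index) is among the k most similar
    motions by cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    sim = t @ m.T                              # (n, n) similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]     # k best motions per text
    hits = np.any(topk == np.arange(len(t))[:, None], axis=1)
    return hits.mean()

# Perfectly aligned toy embeddings retrieve their own pair every time.
emb = np.eye(4)
print(r_precision(emb, emb, k=1))  # 1.0
```

In published evaluations the similarity is usually computed in a pretrained contrastive embedding space over batches of 32; those protocol details are exactly what the referee asks the authors to pin down.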
Circularity Check
No circularity in derivation or claims
Full rationale
The paper introduces MLA-Gen as an architectural integration of global priors and local conditioning, plus an empirical attention-sink analysis with SinkRatio metric and masking strategies. All performance claims are presented as outcomes of experiments on baselines rather than any first-principles derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central result to its own inputs. No equations or load-bearing steps in the provided text exhibit self-definition or construction equivalence.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous. Linked passage: "We introduce SinkRatio, a metric that measures the degree of such concentration... sink-mask... sink-ctrl"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous. Linked passage: "global motion priors via memory slots... local fine-grained alignment via cross-attention"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.