
arxiv: 2505.11334 · v4 · submitted 2025-05-16 · 💻 cs.CV

MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Pith reviewed 2026-05-22 14:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords body, units, information, MARRS, unit, autoregressive, diffusion, distinct

The pith

MARRS generates coordinated human reactions by masking tokens and modulating between body and hand units in continuous space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARRS to generate human reactions to another person's action sequence. It replaces vector quantization with continuous representations to avoid information loss and low codebook use. The approach first encodes the body and hands as separate units in a Unit-distinguished Motion Variational AutoEncoder. Random masking of reactive tokens then extracts body and hand information, while Mutual Unit Modulation lets each unit adapt the other. A diffusion model with compact MLPs per unit models the token distributions and produces the final motions.

Core claim

MARRS generates coordinated and fine-grained reaction motions using continuous representations. It starts with a Unit-distinguished Motion Variational AutoEncoder that segments and encodes body and hand units independently. Action-Conditioned Fusion randomly masks a subset of reactive tokens and pulls specific body and hand information from the active ones. Mutual Unit Modulation then lets information from one unit adaptively modulate the other. For the diffusion stage a compact MLP serves as noise predictor for each unit and the diffusion loss models the probability distribution of each token.
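
The per-token diffusion loss described above can be illustrated with a minimal sketch. It assumes the standard noise-prediction objective and a linear beta schedule; neither choice is confirmed by this summary. The names `make_alpha_bars`, `add_noise`, `diffusion_loss`, and `zero_predictor` are hypothetical, and `zero_predictor` only stands in for the compact per-unit MLP.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def diffusion_loss(x0, cond, predictor, alpha_bars, rng):
    """Mean squared error between the sampled noise and the predicted noise."""
    t = rng.randrange(len(alpha_bars))          # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # Gaussian noise, same size as the token
    x_t = add_noise(x0, eps, alpha_bars[t])
    eps_hat = predictor(x_t, t, cond)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / len(x0)

def zero_predictor(x_t, t, cond):
    # Hypothetical stand-in for the compact MLP: always predicts zero noise.
    return [0.0] * len(x_t)

rng = random.Random(0)                  # seeded only for reproducibility
alpha_bars = make_alpha_bars()
token = [0.5, -1.0, 0.25, 2.0]          # one continuous latent token (made-up values)
condition = [0.1, 0.2]                  # conditioning vector (made-up values)
loss = diffusion_loss(token, condition, zero_predictor, alpha_bars, rng)
```

In a real system the predictor would be a trained network and the loss would be averaged over many tokens and timesteps; here the point is only that the loss is computed per token on a continuous vector, with no codebook lookup.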

What carries the argument

Mutual Unit Modulation (MUM) together with Action-Conditioned Fusion (ACF) operating on independently encoded body and hand units inside a continuous diffusion model.
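
One plausible reading of "adaptively modulate" is a scale-and-shift operation in which the features of one unit produce a scale and a shift for the other. The sketch below illustrates that reading only; this summary does not state the exact form of MUM, and the functions `linear`, `modulate`, and `mutual_modulation`, along with all weights, are hypothetical.

```python
def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def modulate(target, source, scale_params, shift_params):
    """Scale and shift the target features using values computed from source."""
    scale = linear(*scale_params, source)
    shift = linear(*shift_params, source)
    return [t * (1.0 + s) + h for t, s, h in zip(target, scale, shift)]

def mutual_modulation(body, hand, body_from_hand, hand_from_body):
    """Both directions read the un-modulated inputs, so the update is symmetric."""
    new_body = modulate(body, hand, *body_from_hand)
    new_hand = modulate(hand, body, *hand_from_body)
    return new_body, new_hand

body = [1.0, 2.0]            # made-up body-unit features
hand = [0.5, -0.5, 1.0]      # made-up hand-unit features
body_from_hand = (
    ([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]),    # scale weights, scale bias
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1.0, -1.0]),   # shift weights, shift bias
)
hand_from_body = (
    ([[0.0, 0.0]] * 3, [0.0] * 3),   # all-zero parameters leave the hand features unchanged
    ([[0.0, 0.0]] * 3, [0.0] * 3),
)
new_body, new_hand = mutual_modulation(body, hand, body_from_hand, hand_from_body)
```

Writing the scale as `1.0 + s` means zero-valued parameters reduce the block to the identity, which is a common design choice for modulation layers and makes the effect of each direction easy to isolate.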

If this is right

  • Produces reaction motions without quantization information loss.
  • Captures inter-person coordination and fine-grained hand details through unit interaction.
  • Achieves superior quantitative and qualitative results over prior VQ-based autoregressive methods.
  • Keeps computational cost manageable by limiting the number of units and using compact predictors per unit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unit masking and cross-modulation pattern could be applied to generate full multi-person scenes rather than pairwise reactions.
  • Continuous representations may permit direct editing or interpolation of reaction motions without decoding to discrete codes first.
  • Similar masking-plus-modulation blocks might improve single-person motion forecasting by letting different body parts condition one another.
  • The framework could be tested on longer sequences to check whether coordination remains stable over time.

Load-bearing premise

Segmenting the body into independent body and hand units, then applying random masking and mutual modulation, will capture inter-person coordination and fine-grained details without introducing coordination artifacts or requiring prohibitive compute.

What would settle it

A test set of complex two-person interactions where the generated hand positions fail to match the body posture or timing required by the conditioning action sequence.

Figures

Figures reproduced from arXiv: 2505.11334 by Jiafu Wu, Jiangning Zhang, Qingdong He, Shuo Wang, Yabiao Wang, Yong Liu.

Figure 1. Left: Paradigm comparison of different frameworks. (a) and (b) present the structures of the VQ-VAE-based and diffusion-based methods, respectively, while (c) shows the framework of our proposed MARRS. L_CE is the cross-entropy loss and L_Diff is the diffusion loss. Right: Result comparison between our method and other methods on eight metrics.
Figure 2. The overall framework of our proposed MARRS. (a) Whole-body motion is divided into two units, body and hands, and each unit is then encoded independently by a VAE. (b) The masked reaction generation model. First, the reactive token of each unit obtains the interaction information from the active token through Action-Conditioned Fusion (ACF). Then different units acquire the coordinated …
Figure 3. Visualization of the inference process. All tokens are generated in an autoregressive manner. The compact diffusion model is very small, consisting of only a 3-layer MLP, so MARRS can achieve fast inference.
Figure 4. Visual comparison with ReGenNet on NTU120-AS. Blue for actors and red for reactors. Our method produces more plausible body movements and relative positions, as well as more natural hand gestures of reactors. The red dashed boxes highlight artifacts, while the green dashed boxes indicate more reasonable results.
Figure 5. Visual comparison with ReGenNet on Inter-X. Blue for actors and red for reactors. Our method produces more plausible body movements and relative positions, as well as more natural hand gestures of reactors. The red dashed boxes highlight artifacts, while the green dashed boxes indicate more reasonable results.
Figure 6. Visual comparison of reconstruction: VQ-VAE vs. UD-VAE (ours). The results in the red dashed box show reconstruction artifacts produced by VQ-VAE, while our results align more closely with the ground truth (GT). Blue for actors and red for reactors.
Figure 7. User study. We use three subjective indicators, Naturalness, Smoothness, and Realism, to compare with ReGenNet.
Original abstract

This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of another person. Currently, autoregressive modeling approaches with vector quantization (VQ) have achieved remarkable performance in motion generation tasks. However, VQ has inherent disadvantages, including quantization information loss, low codebook utilization, etc. In addition, while dividing the body into separate units can be beneficial, the computational complexity needs to be considered. Also, the importance of mutual perception among units is often neglected. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions using continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding each independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Mutual Unit Modulation (MUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Both quantitative and qualitative results demonstrate that our method achieves superior performance. Project page: https://aigc-explorer.github.io/MARRS/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MARRS, a framework for human action-reaction synthesis that encodes body and hand units independently via a Unit-distinguished Motion Variational AutoEncoder (UD-VAE), applies Action-Conditioned Fusion (ACF) with random masking of reactive tokens, uses Mutual Unit Modulation (MUM) for adaptive cross-unit interaction, and employs separate compact MLP diffusion predictors with diffusion loss for each unit. The central claim is that this continuous-representation approach yields superior quantitative and qualitative performance in generating coordinated, fine-grained reactions compared to prior VQ-based autoregressive methods.

Significance. If the empirical claims are substantiated, the work provides a practical alternative to vector-quantization losses in motion synthesis by retaining continuous latents while managing computational cost through unit segmentation and post-encoding modulation. The combination of random masking and mutual modulation offers a lightweight mechanism for inter-person and body-hand coordination that could transfer to related tasks such as two-person interaction generation or fine-motor control in animation.

major comments (2)
  1. [Abstract] The claim that 'both quantitative and qualitative results demonstrate that our method achieves superior performance' is presented without any reported metrics, baselines, error bars, or ablation tables, so the central empirical claim cannot be evaluated from the summary alone and must be verified against the experimental section.
  2. [Section 3.3] In the MUM description, the adaptive modulation of one unit's features by the other is described as sufficient to recover inter-unit dependencies, yet no explicit joint constraint, synchronization loss, or diagnostic metric (e.g., cross-unit velocity correlation or instantaneous pose-velocity consistency) is introduced; if body-hand coupling is non-factorizable, this post-hoc modulation may only approximate rather than enforce coordination, risking artifacts that FID or MPJPE could under-detect.
minor comments (2)
  1. [Section 3.2] The masking ratio is listed among free parameters but no sensitivity analysis or default value is stated; a brief ablation or recommended range would clarify reproducibility.
  2. [Section 3.1] Notation for the continuous latent variables of body versus hand units should be introduced once and used consistently to avoid ambiguity when describing the modulation step.
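
The cross-unit velocity correlation named in major comment 2 is not defined in the report. A minimal sketch of one possible definition follows: the Pearson correlation between the mean joint speed of the body unit and that of the hand unit over time. The definition, the function names `speeds` and `pearson`, and the toy trajectories are all assumptions for illustration.

```python
import math

def speeds(frames):
    """Mean joint speed between consecutive frames; frames[t][j] is an (x, y, z) tuple."""
    out = []
    for prev, cur in zip(frames, frames[1:]):
        dists = [math.dist(p, c) for p, c in zip(prev, cur)]
        out.append(sum(dists) / len(dists))
    return out

def pearson(a, b):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Toy data: one body joint accelerating along x, one hand joint accelerating along y.
body_frames = [[(0, 0, 0)], [(1, 0, 0)], [(3, 0, 0)], [(6, 0, 0)]]
hand_frames = [[(0, 0, 0)], [(0, 2, 0)], [(0, 6, 0)], [(0, 12, 0)]]
static_hand = [[(0, 0, 0)]] * 4

coupled = pearson(speeds(body_frames), speeds(hand_frames))
uncoupled = pearson(speeds(body_frames), speeds(static_hand))
```

Comparing this statistic between generated and ground-truth motions would flag cases where hands and body drift out of sync even when per-unit error metrics look acceptable.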

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the potential impact of MARRS. We address each major comment point by point below, indicating whether revisions have been made to the manuscript.

  1. Referee: [Abstract] The claim that 'both quantitative and qualitative results demonstrate that our method achieves superior performance' is presented without any reported metrics, baselines, error bars, or ablation tables, so the central empirical claim cannot be evaluated from the summary alone and must be verified against the experimental section.

    Authors: We agree that the abstract serves as a high-level summary and does not contain specific numerical results. The detailed quantitative evaluation, including metrics such as FID and MPJPE, comparisons against baselines, error bars, and ablation studies, is fully reported in Section 4 with supporting tables and figures. To address the concern, we have revised the abstract to include a concise reference to the observed performance gains on these metrics.

    revision: yes

  2. Referee: [Section 3.3] In the MUM description, the adaptive modulation of one unit's features by the other is described as sufficient to recover inter-unit dependencies, yet no explicit joint constraint, synchronization loss, or diagnostic metric (e.g., cross-unit velocity correlation or instantaneous pose-velocity consistency) is introduced; if body-hand coupling is non-factorizable, this post-hoc modulation may only approximate rather than enforce coordination, risking artifacts that FID or MPJPE could under-detect.

    Authors: We appreciate the referee's careful analysis of the MUM module. MUM provides adaptive cross-unit modulation within the continuous latent space to facilitate interaction between body and hand units in a computationally efficient manner, complementing the random masking in ACF. No additional synchronization loss was introduced to avoid increasing model complexity, but the joint diffusion training with compact per-unit predictors encourages coordinated outputs, as evidenced by our quantitative results and qualitative motion visualizations. We have revised the description in Section 3.3 for greater clarity on this mechanism and added a diagnostic analysis of cross-unit velocity correlations in the experiments to better validate coordination.

    revision: partial

Circularity Check

0 steps flagged

No circularity: architectural components are constructive extensions without reduction to inputs or self-citations


The paper defines UD-VAE for independent body/hand encoding, ACF for random masking of reactive tokens, MUM for adaptive cross-unit modulation, and per-unit MLP diffusion predictors as sequential novel modules. These are presented as design choices to address VQ limitations and neglected mutual perception, with performance asserted via quantitative/qualitative results rather than any equation that reduces a claimed prediction to a fitted parameter or prior self-result by construction. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or self-definitional loops appear in the derivation chain; the framework remains self-contained against external motion benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on the premise that body-hand segmentation plus masking and modulation will preserve coordination information; several standard VAE and diffusion assumptions are inherited without re-derivation.

free parameters (1)
  • masking ratio
    Proportion of reactive tokens randomly masked in ACF is a design choice that directly affects information flow.
axioms (1)
  • domain assumption Independent encoding of body and hand units preserves all necessary inter-unit dependencies for reaction synthesis.
    Invoked by the UD-VAE design and subsequent modulation step.
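
To make the masking-ratio parameter concrete, here is a minimal sketch of uniform random masking over a token sequence. It assumes a fixed ratio and a placeholder mask token; the ratio value of 0.5, the sampling scheme, and the names `random_mask` and `apply_mask` are illustrative assumptions, since the page states no default.

```python
import random

def random_mask(num_tokens, mask_ratio, rng):
    """Pick round(num_tokens * mask_ratio) positions uniformly at random."""
    num_masked = round(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]

def apply_mask(tokens, mask, mask_token):
    """Replace masked positions with a placeholder; keep the rest unchanged."""
    return [mask_token if m else tok for tok, m in zip(tokens, mask)]

rng = random.Random(0)                     # seeded only for reproducibility
tokens = [float(i) for i in range(8)]      # stand-ins for reactive tokens
mask = random_mask(len(tokens), mask_ratio=0.5, rng=rng)   # 0.5 is an arbitrary example
masked_tokens = apply_mask(tokens, mask, mask_token=None)
```

A higher ratio hides more reactive tokens and forces the model to lean harder on the active tokens, which is why the ledger counts the ratio as a free parameter affecting information flow.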

pith-pipeline@v0.9.0 · 5812 in / 1134 out tokens · 50576 ms · 2026-05-22T14:44:44.502840+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

  [1] R. Parent, Computer animation: algorithms and techniques. Newnes, 2012
  [2] H. Zhang, Z. Chen, H. Xu, L. Hao, X. Wu, S. Xu, R. Xiong, and Y. Wang, “Unified cross-structural motion retargeting for humanoid characters,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 7, pp. 3863–3876, 2025
  [3] N. Magnenat-Thalmann and D. Thalmann, Computer animation. Springer, 1985
  [4] J. Urbain, “Introduction to game development,” Cell, vol. 414, pp. 745–5102, 2010
  [5] Y. Wang, M. Li, Z. Leng, F. W. B. Li, and X. Liang, “Most: Motion diffusion model for rare text via temporal clip banzhaf interaction,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 10, pp. 8994–9007, 2025
  [6] E. Bethke, Game development and production. Wordware Publishing, Inc., 2003
  [7] G. Saridis, “Intelligent robotic control,” IEEE Transactions on Automatic Control, vol. 28, no. 5, pp. 547–557, 1983
  [8] S. Wang, X. Zhao, H.-M. Xu, Z. Chen, D. Yu, J. Chang, Z. Yang, and F. Zhao, “Towards domain generalization for multi-view 3d object detection in bird-eye-view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 333–13 342
  [9] Z. Chen, Z. Li, S. Wang, D. Fu, and F. Zhao, “Learning from noisy data for semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6929–6939
  [10] S. Wang, F. Jia, W. Mao, Y. Liu, Y. Zhao, Z. Chen, T. Wang, C. Zhang, X. Zhang, and F. Zhao, “Stream query denoising for vectorized hd-map construction,” in European Conference on Computer Vision. Springer, 2024, pp. 203–220

  [11] E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “Mmm: Generative masked motion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1546–1555
  [12] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1900–1910
  [13] C. Zheng and A. Vedaldi, “Online clustered codebook,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 798–22 807
  [14] G. Baykal, M. Kandemir, and G. Unal, “Edvae: Mitigating codebook collapse with evidential discrete variational autoencoders,” Pattern Recognition, vol. 156, p. 110792, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320324005430
  [15] Y. Shi, J. Wang, X. Jiang, B. Lin, B. Dai, and X. B. Peng, “Interactive character control with auto-regressive motion diffusion models,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–14, 2024
  [16] T. Li, Y. Tian, H. Li, M. Deng, and K. He, “Autoregressive image generation without vector quantization,” arXiv preprint arXiv:2406.11838, 2024
  [17] M. Petrovich, M. J. Black, and G. Varol, “Temos: Generating diverse human motions from textual descriptions,” in European Conference on Computer Vision. Springer, 2022, pp. 480–497
  [18] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161
  [19] J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, and Y. Shan, “Generating human motion from textual descriptions with discrete representations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14 730–14 740
  [20] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  [21] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. Bermano, “Human motion diffusion model,” arXiv preprint arXiv:2209.14916, 2022
  [22] X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 000–18 010
  [23] M. Zhang, X. Guo, L. Pan, Z. Cai, F. Hong, H. Li, L. Yang, and Z. Liu, “Remodiffuse: Retrieval-augmented motion diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 364–373
  [24] B. Ji, Y. Pan, Z. Liu, S. Tan, and X. Yang, “Sport: From zero-shot prompts to real-time motion generation,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 10, pp. 7171–7183, 2025
  [25] X. Gao, Y. Yang, Z. Xie, S. Du, Z. Sun, and Y. Wu, “Guess: Gradually enriching synthesis for text-driven human motion generation,” IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 12, pp. 7518–7530, 2024
  [26] H. P. Shum, T. Komura, and S. Yamazaki, “Simulating competitive interactions using singly captured motions,” in Proceedings of the 2007 ACM symposium on Virtual reality software and technology, 2007, pp. 65–72
  [27] T. Komura, E. S. Ho, and R. W. Lau, “Animating reactive motion using momentum-based inverse kinematics,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 213–223, 2005
  [28] Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Human motion diffusion as a generative prior,” arXiv preprint arXiv:2303.01418, 2023
  [29] M. Tanaka and K. Fujiwara, “Role-aware interaction generation from textual description,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 15 999–16 009
  [30] Z. Wang, J. Wang, D. Lin, and B. Dai, “Intercontrol: Generate human motion interactions by controlling every joint,” arXiv preprint arXiv:2311.15864, 2023

  [31] H. Liang, W. Zhang, W. Li, J. Yu, and L. Xu, “Intergen: Diffusion-based multi-human motion generation under complex interactions,” International Journal of Computer Vision, pp. 1–21, 2024
  [32] K. Fan, J. Tang, W. Cao, R. Yi, M. Li, J. Gong, J. Zhang, Y. Wang, C. Wang, and L. Ma, “Freemotion: A unified framework for number-free text-to-motion synthesis,” in European Conference on Computer Vision. Springer, 2024, pp. 93–109
  [33] Y. Wang, S. Wang, J. Zhang, K. Fan, J. Wu, Z. Xue, and Y. Liu, “Timotion: Temporal and interactive framework for efficient human-human motion generation,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 7169–7178
  [34] M. G. Javed, C. Guo, L. Cheng, and X. Li, “Intermask: 3d human interaction generation via collaborative masked modeling,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=ZAyuwJYN8N
  [35] L. Xu, Y. Zhou, Y. Yan, X. Jin, W. Zhu, F. Rao, X. Yang, and W. Zeng, “Regennet: Towards human action-reaction synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1759–1769
  [36] Z. Zhang, S. Zhang, Y. Wang, and S. Li, “Reactffusion: Physical contact-guided diffusion model for reaction generation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9677–9685
  IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 12
  [37] H. Liu, S. Liu, Z. Zhou, M. Xu, Y. Xie, X. Han, J. C. Pérez, D. Liu, K. Kahatapitiya, M. Jia et al., “Mardini: Masked autoregressive diffusion for video generation at scale,” arXiv preprint arXiv:2410.20280, 2024
  [38] J. Yang, D. Yin, Y. Zhou, F. Rao, W. Zhai, Y. Cao, and Z.-J. Zha, “Mmar: Towards lossless multi-modal auto-regressive probabilistic modeling,” arXiv preprint arXiv:2410.10798, 2024
  [39] Z. Meng, Y. Xie, X. Peng, Z. Han, and H. Jiang, “Rethinking diffusion for text-driven human motion generation,” arXiv preprint arXiv:2411.16575, 2024
  [40] T. Ren, J. Yu, S. Guo, Y. Ma, Y. Ouyang, Z. Zeng, Y. Zhang, and Y. Qin, “Diverse motion in-betweening from sparse keyframes with dual posture stitching,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 2, pp. 1402–1413, 2025

  [41] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 975–10 985
  [42] Q. Zou, S. Yuan, S. Du, Y. Wang, C. Liu, Y. Xu, J. Chen, and X. Ji, “Parco: Part-coordinating text-to-motion synthesis,” in European Conference on Computer Vision. Springer, 2024, pp. 126–143
  [43] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International conference on machine learning. PMLR, 2021, pp. 8162–8171
  [44] L. Xu, X. Lv, Y. Yan, X. Jin, S. Wu, C. Xu, Y. Liu, Y. Zhou, F. Rao, X. Sheng et al., “Inter-x: Towards versatile human-human interaction analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 260–22 271
  [45] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017
  [46] D. P. Kingma, M. Welling et al., “Auto-encoding variational bayes,” in The 2nd International Conference on Learning Representations, 2014. [Online]. Available: https://openreview.net/forum?id=33X9fd2-9FyZd
  [47] Y. Du, R. Kips, A. Pumarola, S. Starke, A. Thabet, and A. Sanakoyeu, “Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 481–490
  [48] A. Ghosh, N. Cheema, C. Oguz, C. Theobalt, and P. Slusallek, “Synthesis of compositional animations from textual descriptions,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1396–1406
  [49] M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 985–10 995
  [50] J. Voas, Y. Wang, Q. Huang, and R. Mooney, “What is the best automated metric for text to motion generation?” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–11