MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Jiafu Wu; Jiangning Zhang; Qingdong He; Shuo Wang; Yabiao Wang; Yong Liu

arxiv: 2505.11334 · v4 · submitted 2025-05-16 · 💻 cs.CV

Yabiao Wang, Shuo Wang, Jiangning Zhang, Jiafu Wu, Qingdong He, Yong Liu

Pith reviewed 2026-05-22 14:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords body · units · information · marrs · unit · autoregressive · diffusion · distinct

The pith

MARRS generates coordinated human reactions by masking tokens and modulating between body and hand units in continuous space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARRS to generate human reactions to another person's action sequence. It replaces vector quantization with continuous representations to avoid information loss and low codebook use. The approach first encodes the body and hands as separate units in a Unit-distinguished Motion Variational AutoEncoder. Random masking of reactive tokens then extracts body and hand information, while Mutual Unit Modulation lets each unit adapt the other. A diffusion model with compact MLPs per unit models the token distributions and produces the final motions.

Core claim

MARRS generates coordinated and fine-grained reaction motions using continuous representations. It starts with a Unit-distinguished Motion Variational AutoEncoder that segments and encodes body and hand units independently. Action-Conditioned Fusion randomly masks a subset of reactive tokens and pulls specific body and hand information from the active ones. Mutual Unit Modulation then lets information from one unit adaptively modulate the other. For the diffusion stage a compact MLP serves as noise predictor for each unit and the diffusion loss models the probability distribution of each token.
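The diffusion loss on a single continuous token can be illustrated with a short sketch. It assumes the standard noise-prediction objective used for autoregressive modeling without vector quantization [16]; the paper's exact formulation is not given on this page. The names `add_noise`, `diffusion_loss`, `predictor`, and the schedule `alpha_bars` are hypothetical, and `predictor` merely stands in for the compact per-unit MLP.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: mix a clean token with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def diffusion_loss(x0, cond, predictor, alpha_bars, rng):
    """Mean squared error between the sampled noise and the predicted noise.

    x0         : clean continuous token (list of floats) for one unit
    cond       : conditioning vector from the autoregressive backbone (assumed)
    predictor  : callable (x_t, t, cond) -> predicted noise; stands in for the MLP
    alpha_bars : cumulative noise schedule, each value strictly between 0 and 1
    """
    t = rng.randrange(len(alpha_bars))        # random diffusion timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # noise the predictor must recover
    x_t = add_noise(x0, eps, alpha_bars[t])
    eps_hat = predictor(x_t, t, cond)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


# Hypothetical usage: a predictor that always outputs zeros, for illustration only.
alpha_bars = [0.9, 0.5, 0.1]
token = [0.3, -1.2, 0.8]
loss = diffusion_loss(token, cond=[0.0],
                      predictor=lambda x_t, t, c: [0.0] * len(x_t),
                      alpha_bars=alpha_bars, rng=random.Random(0))
```

Because the loss is computed per token, the predictor only ever sees one token-sized vector at a time, which is consistent with the claim that a small MLP suffices.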

What carries the argument

Mutual Unit Modulation (MUM) together with Action-Conditioned Fusion (ACF) operating on independently encoded body and hand units inside a continuous diffusion model.
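One plausible reading of "mutual modulation" is a learned scale and shift predicted from the other unit's features. The sketch below illustrates that reading only; the page says that one unit adaptively modulates the other but does not give the formula, so the scale-and-shift form, the function names, and the toy weights are all assumptions.

```python
def linear(x, weight, bias):
    """Apply y = W x + b with plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def modulate(target, source, w_scale, b_scale, w_shift, b_shift):
    """Modulate `target` features with a scale and shift predicted from `source`."""
    scale = linear(source, w_scale, b_scale)
    shift = linear(source, w_shift, b_shift)
    # (1 + scale) leaves the target unchanged when the projections output zero.
    return [t * (1.0 + s) + h for t, s, h in zip(target, scale, shift)]


def mutual_unit_modulation(body, hand, body_from_hand, hand_from_body):
    """Each unit is modulated by the other; both use the pre-modulation features."""
    new_body = modulate(body, hand, *body_from_hand)
    new_hand = modulate(hand, body, *hand_from_body)
    return new_body, new_hand


# Hypothetical 2-dimensional features and projection weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]
params = (identity, no_bias, zeros, no_bias)  # scale taken from source, no shift
body, hand = mutual_unit_modulation([1.0, 2.0], [0.5, -0.5], params, params)
```

Both directions read the features as they were before modulation, so neither unit has priority over the other in this sketch.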

If this is right

Produces reaction motions without quantization information loss.
Captures inter-person coordination and fine-grained hand details through unit interaction.
Achieves superior quantitative and qualitative results over prior VQ-based autoregressive methods.
Keeps computational cost manageable by limiting the number of units and using compact predictors per unit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same unit masking and cross-modulation pattern could be applied to generate full multi-person scenes rather than pairwise reactions.
Continuous representations may permit direct editing or interpolation of reaction motions without decoding to discrete codes first.
Similar masking-plus-modulation blocks might improve single-person motion forecasting by letting different body parts condition one another.
The framework could be tested on longer sequences to check whether coordination remains stable over time.

Load-bearing premise

Segmenting the body into independent body and hand units, then applying random masking and mutual modulation, will capture inter-person coordination and fine-grained details without introducing coordination artifacts or requiring prohibitive compute.

What would settle it

A test set of complex two-person interactions where the generated hand positions fail to match the body posture or timing required by the conditioning action sequence.

Figures

Figures reproduced from arXiv: 2505.11334 by Jiafu Wu, Jiangning Zhang, Qingdong He, Shuo Wang, Yabiao Wang, Yong Liu.

**Figure 1.** Left: Paradigm comparison of different frameworks. (a) and (b) present the structures of the VQ-VAE-based and diffusion-based methods, respectively, while (c) shows the framework of our proposed MARRS. L_CE is the cross-entropy loss and L_Diff is the diffusion loss. Right: comparison of results between our method and other methods on eight metrics.

**Figure 2.** The overall framework of our proposed MARRS. (a) Whole-body motion is divided into two units, body and hands, and each unit is then encoded independently by a VAE. (b) shows the process of the masked reaction generation model. First, the reactive token of each unit obtains the interaction information from the active token through Action-Conditioned Fusion (ACF). Then different units acquire the coordinated …

**Figure 3.** Visualization of the inference process. The generation of entire tokens is performed in an autoregressive manner. The compact diffusion model is very small, consisting of only a 3-layer MLP, so MARRS can achieve fast inference speed.

TABLE I: Comparison in the online setting on NTU120-AS [35] for human action-reaction synthesis. ± indicates 95% confidence interval, → means that closer to Real is better. Bo…

**Figure 4.** Visual comparison with ReGenNet on NTU120-AS. Blue for actors and red for reactors. Our method produces more plausible body movements and relative positions, as well as more natural hand gestures of reactors. The red dashed boxes highlight artifacts, while the green dashed boxes indicate more reasonable results.

**Figure 5.** Visual comparison with ReGenNet on Inter-X. Blue for actors and red for reactors. Our method produces more plausible body movements and relative positions, as well as more natural hand gestures of reactors. The red dashed boxes highlight artifacts, while the green dashed boxes indicate more reasonable results.

TABLE IV: Comparison in the offline setting on NTU120-AS [35]. ± indicates 95% confidence…

**Figure 6.** Visual comparison of reconstruction: VQ-VAE vs. UD-VAE (ours). The results in the red dashed box show reconstruction artifacts by VQ-VAE, while our results align more closely with the ground truth (GT). Blue for actors and red for reactors.

E. Accuracy of Hand Poses and Global Translation. We used coordinate-based metrics (APE and AVE) [50] to measure the accuracy of hand poses and global translation…

**Figure 7.** User study. We use three subjective indicators, Naturalness, Smoothness, and Realism, to compare with ReGenNet.

Original abstract

This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of another person. Currently, autoregressive modeling approaches with vector quantization (VQ) have achieved remarkable performance in motion generation tasks. However, VQ has inherent disadvantages, including quantization information loss, low codebook utilization, etc. In addition, while dividing the body into separate units can be beneficial, the computational complexity needs to be considered. Also, the importance of mutual perception among units is often neglected. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions using continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding each independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Mutual Unit Modulation (MUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Both quantitative and qualitative results demonstrate that our method achieves superior performance. Project page: https://aigc-explorer.github.io/MARRS/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MARRS adds unit-specific continuous encoding, random masking, and mutual modulation to autoregressive reaction synthesis, but the performance edge rests on experiments not visible in the abstract.

The main point is that this work replaces vector quantization with a continuous VAE split across body and hand units, then adds random masking during fusion and cross-unit modulation before feeding into per-unit MLP diffusion predictors. That trio targets the usual VQ drawbacks, such as information loss, while trying to keep the units coordinated without exploding compute. The masking step pulls action-conditioned features from active tokens, and the modulation lets one unit's features scale the other before prediction. This setup is a reasonable response to the coordination and detail problems that come up when units are handled completely separately. The choice of a compact MLP instead of a heavy transformer for the noise predictor also keeps the model practical for longer sequences. The specific combination for action-reaction pairs does not match the autoregressive VQ baselines they cite, so the architectural move is fresh.

On the downside, the abstract states superior quantitative and qualitative results without any numbers, baselines, or ablation tables shown here. That leaves the central claim hard to assess. The masking ratio is listed as a free parameter, so any gains could partly trace to tuning rather than the modulation itself. The stress-test worry about post-hoc modulation failing to enforce tight body-hand coupling is worth watching; if reactions require instantaneous constraints that separate latents cannot recover, the outputs might show subtle desync that FID or MPJPE overlook. Qualitative examples would need to demonstrate this does not happen.

The work is aimed at motion synthesis researchers who already use diffusion or autoregressive models and want finer reaction detail for animation or robotics. A reader focused on multi-person coordination or unit-based representations would find the design choices useful to examine. It deserves peer review because the problem is concrete, the fixes address documented prior limitations, and the framework is simple enough to test and extend even if the current evidence needs strengthening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MARRS, a framework for human action-reaction synthesis that encodes body and hand units independently via a Unit-distinguished Motion Variational AutoEncoder (UD-VAE), applies Action-Conditioned Fusion (ACF) with random masking of reactive tokens, uses Mutual Unit Modulation (MUM) for adaptive cross-unit interaction, and employs separate compact MLP diffusion predictors with diffusion loss for each unit. The central claim is that this continuous-representation approach yields superior quantitative and qualitative performance in generating coordinated, fine-grained reactions compared to prior VQ-based autoregressive methods.

Significance. If the empirical claims are substantiated, the work provides a practical alternative to vector-quantization losses in motion synthesis by retaining continuous latents while managing computational cost through unit segmentation and post-encoding modulation. The combination of random masking and mutual modulation offers a lightweight mechanism for inter-person and body-hand coordination that could transfer to related tasks such as two-person interaction generation or fine-motor control in animation.

major comments (2)

[Abstract] The claim that 'both quantitative and qualitative results demonstrate that our method achieves superior performance' is presented without any reported metrics, baselines, error bars, or ablation tables, so the central empirical claim cannot be evaluated from the summary alone and must be verified against the experimental section.
[Section 3.3] (MUM description) The adaptive modulation of one unit's features by the other is described as sufficient to recover inter-unit dependencies, yet no explicit joint constraint, synchronization loss, or diagnostic metric (e.g., cross-unit velocity correlation or instantaneous pose-velocity consistency) is introduced; if body-hand coupling is non-factorizable, this post-hoc modulation may only approximate rather than enforce coordination, risking artifacts that FID or MPJPE could under-detect.
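The cross-unit velocity correlation named in the second comment could be computed along the following lines. This is a hedged sketch: the report names the diagnostic but does not define it, so the use of per-frame speed, Pearson correlation, and the helper names below are assumptions made for illustration.

```python
import math


def speeds(track):
    """Per-frame speed of one joint, given its 3D position at each frame."""
    return [math.dist(a, b) for a, b in zip(track, track[1:])]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cross_unit_velocity_correlation(body_track, hand_track):
    """Correlate the speed profile of a body joint with that of a hand joint."""
    return pearson(speeds(body_track), speeds(hand_track))


# Hypothetical tracks: a fingertip that moves twice as far as the wrist each frame.
wrist = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (6, 0, 0)]
fingertip = [(0, 0, 0), (2, 0, 0), (6, 0, 0), (12, 0, 0)]
r = cross_unit_velocity_correlation(wrist, fingertip)
```

Comparing this value between generated and ground-truth motions would give a rough signal of body-hand desynchronization that distribution-level metrics may miss.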

minor comments (2)

[Section 3.2] The masking ratio is listed among free parameters but no sensitivity analysis or default value is stated; a brief ablation or recommended range would clarify reproducibility.
[Section 3.1] Notation for the continuous latent variables of body versus hand units should be introduced once and used consistently to avoid ambiguity when describing the modulation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the potential impact of MARRS. We address each major comment point by point below, indicating whether revisions have been made to the manuscript.

Referee: [Abstract] The claim that 'both quantitative and qualitative results demonstrate that our method achieves superior performance' is presented without any reported metrics, baselines, error bars, or ablation tables, so the central empirical claim cannot be evaluated from the summary alone and must be verified against the experimental section.

Authors: We agree that the abstract serves as a high-level summary and does not contain specific numerical results. The detailed quantitative evaluation, including metrics such as FID and MPJPE, comparisons against baselines, error bars, and ablation studies, is fully reported in Section 4 with supporting tables and figures. To address the concern, we have revised the abstract to include a concise reference to the observed performance gains on these metrics.

revision: yes

Referee: [Section 3.3] (MUM description) The adaptive modulation of one unit's features by the other is described as sufficient to recover inter-unit dependencies, yet no explicit joint constraint, synchronization loss, or diagnostic metric (e.g., cross-unit velocity correlation or instantaneous pose-velocity consistency) is introduced; if body-hand coupling is non-factorizable, this post-hoc modulation may only approximate rather than enforce coordination, risking artifacts that FID or MPJPE could under-detect.

Authors: We appreciate the referee's careful analysis of the MUM module. MUM provides adaptive cross-unit modulation within the continuous latent space to facilitate interaction between body and hand units in a computationally efficient manner, complementing the random masking in ACF. No additional synchronization loss was introduced to avoid increasing model complexity, but the joint diffusion training with compact per-unit predictors encourages coordinated outputs, as evidenced by our quantitative results and qualitative motion visualizations. We have revised the description in Section 3.3 for greater clarity on this mechanism and added a diagnostic analysis of cross-unit velocity correlations in the experiments to better validate coordination.

revision: partial

Circularity Check

0 steps flagged

No circularity: architectural components are constructive extensions without reduction to inputs or self-citations

The paper defines UD-VAE for independent body/hand encoding, ACF for random masking of reactive tokens, MUM for adaptive cross-unit modulation, and per-unit MLP diffusion predictors as sequential novel modules. These are presented as design choices to address VQ limitations and neglected mutual perception, with performance asserted via quantitative/qualitative results rather than any equation that reduces a claimed prediction to a fitted parameter or prior self-result by construction. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or self-definitional loops appear in the derivation chain; the framework remains self-contained against external motion benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the premise that body-hand segmentation plus masking and modulation will preserve coordination information; several standard VAE and diffusion assumptions are inherited without re-derivation.

free parameters (1)

masking ratio
Proportion of reactive tokens randomly masked in ACF is a design choice that directly affects information flow.
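To make the role of this parameter concrete, the sketch below draws a random mask over reactive tokens for a given ratio. It is illustrative only: the function name `mask_reactive_tokens`, the rounding rule, and the example ratio of 0.4 are assumptions, since no default value or sampling scheme is stated here.

```python
import random


def mask_reactive_tokens(num_tokens, mask_ratio, rng):
    """Return a list of booleans; True marks a reactive token that is masked."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    num_masked = round(num_tokens * mask_ratio)
    # Sample distinct positions uniformly at random, without replacement.
    masked_idx = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked_idx for i in range(num_tokens)]


# Hypothetical usage: mask 40% of a 10-token reaction sequence.
rng = random.Random(0)
mask = mask_reactive_tokens(num_tokens=10, mask_ratio=0.4, rng=rng)
visible = [i for i, is_masked in enumerate(mask) if not is_masked]
```

A higher ratio hides more of the reaction from the model during training, which is why the ratio directly controls how much the prediction must rely on the active tokens.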

axioms (1)

domain assumption: Independent encoding of body and hand units preserves all necessary inter-unit dependencies for reaction synthesis.
Invoked by the UD-VAE design and subsequent modulation step.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction (8-tick period forcing D=3) · echoes

"ACF and MUM form the basic model blocks, of which there are N (8 in MARRS-Base)."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost coupling) · echoes

"we propose Mutual Unit Modulation (MUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other"

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1] R. Parent, Computer animation: algorithms and techniques. Newnes, 2012.
[2] H. Zhang, Z. Chen, H. Xu, L. Hao, X. Wu, S. Xu, R. Xiong, and Y. Wang, “Unified cross-structural motion retargeting for humanoid characters,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 7, pp. 3863–3876, 2025.
[3] N. Magnenat-Thalmann and D. Thalmann, Computer animation. Springer, 1985.
[4] J. Urbain, “Introduction to game development,” Cell, vol. 414, pp. 745–5102, 2010.
[5] Y. Wang, M. Li, Z. Leng, F. W. B. Li, and X. Liang, “Most: Motion diffusion model for rare text via temporal clip banzhaf interaction,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 10, pp. 8994–9007, 2025.
[6] E. Bethke, Game development and production. Wordware Publishing, Inc., 2003.
[7] G. Saridis, “Intelligent robotic control,” IEEE Transactions on Automatic Control, vol. 28, no. 5, pp. 547–557, 1983.
[8] S. Wang, X. Zhao, H.-M. Xu, Z. Chen, D. Yu, J. Chang, Z. Yang, and F. Zhao, “Towards domain generalization for multi-view 3d object detection in bird-eye-view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13333–13342.
[9] Z. Chen, Z. Li, S. Wang, D. Fu, and F. Zhao, “Learning from noisy data for semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6929–6939.
[10] S. Wang, F. Jia, W. Mao, Y. Liu, Y. Zhao, Z. Chen, T. Wang, C. Zhang, X. Zhang, and F. Zhao, “Stream query denoising for vectorized hd-map construction,” in European Conference on Computer Vision. Springer, 2024, pp. 203–220.
[11] E. Pinyoanuntapong, P. Wang, M. Lee, and C. Chen, “Mmm: Generative masked motion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1546–1555.
[12] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1900–1910.
[13] C. Zheng and A. Vedaldi, “Online clustered codebook,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22798–22807.
[14] G. Baykal, M. Kandemir, and G. Unal, “Edvae: Mitigating codebook collapse with evidential discrete variational autoencoders,” Pattern Recognition, vol. 156, p. 110792, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320324005430
[15] Y. Shi, J. Wang, X. Jiang, B. Lin, B. Dai, and X. B. Peng, “Interactive character control with auto-regressive motion diffusion models,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–14, 2024.
[16] T. Li, Y. Tian, H. Li, M. Deng, and K. He, “Autoregressive image generation without vector quantization,” arXiv preprint arXiv:2406.11838, 2024.
[17] M. Petrovich, M. J. Black, and G. Varol, “Temos: Generating diverse human motions from textual descriptions,” in European Conference on Computer Vision. Springer, 2022, pp. 480–497.
[18] C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.
[19] J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, and Y. Shan, “Generating human motion from textual descriptions with discrete representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14730–14740.
[20] M. Zhang, Z. Cai, L. Pan, F. Hong, X. Guo, L. Yang, and Z. Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[21] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. Bermano, “Human motion diffusion model,” arXiv preprint arXiv:2209.14916, 2022.
[22] X. Chen, B. Jiang, W. Liu, Z. Huang, B. Fu, T. Chen, and G. Yu, “Executing your commands via motion diffusion in latent space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18000–18010.
[23] M. Zhang, X. Guo, L. Pan, Z. Cai, F. Hong, H. Li, L. Yang, and Z. Liu, “Remodiffuse: Retrieval-augmented motion diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 364–373.
[24] B. Ji, Y. Pan, Z. Liu, S. Tan, and X. Yang, “Sport: From zero-shot prompts to real-time motion generation,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 10, pp. 7171–7183, 2025.
[25] X. Gao, Y. Yang, Z. Xie, S. Du, Z. Sun, and Y. Wu, “Guess: Gradually enriching synthesis for text-driven human motion generation,” IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 12, pp. 7518–7530, 2024.
[26] H. P. Shum, T. Komura, and S. Yamazaki, “Simulating competitive interactions using singly captured motions,” in Proceedings of the 2007 ACM Symposium on Virtual Reality Software and Technology, 2007, pp. 65–72.
[27] T. Komura, E. S. Ho, and R. W. Lau, “Animating reactive motion using momentum-based inverse kinematics,” Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 213–223, 2005.
[28] Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Human motion diffusion as a generative prior,” arXiv preprint arXiv:2303.01418, 2023.
[29] M. Tanaka and K. Fujiwara, “Role-aware interaction generation from textual description,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15999–16009.
[30] Z. Wang, J. Wang, D. Lin, and B. Dai, “Intercontrol: Generate human motion interactions by controlling every joint,” arXiv preprint arXiv:2311.15864, 2023.
[31] H. Liang, W. Zhang, W. Li, J. Yu, and L. Xu, “Intergen: Diffusion-based multi-human motion generation under complex interactions,” International Journal of Computer Vision, pp. 1–21, 2024.
[32] K. Fan, J. Tang, W. Cao, R. Yi, M. Li, J. Gong, J. Zhang, Y. Wang, C. Wang, and L. Ma, “Freemotion: A unified framework for number-free text-to-motion synthesis,” in European Conference on Computer Vision. Springer, 2024, pp. 93–109.
[33] Y. Wang, S. Wang, J. Zhang, K. Fan, J. Wu, Z. Xue, and Y. Liu, “Timotion: Temporal and interactive framework for efficient human-human motion generation,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 7169–7178.
[34] M. G. Javed, C. Guo, L. Cheng, and X. Li, “Intermask: 3d human interaction generation via collaborative masked modeling,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=ZAyuwJYN8N
[35] L. Xu, Y. Zhou, Y. Yan, X. Jin, W. Zhu, F. Rao, X. Yang, and W. Zeng, “Regennet: Towards human action-reaction synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1759–1769.
[36] Z. Zhang, S. Zhang, Y. Wang, and S. Li, “Reactffusion: Physical contact-guided diffusion model for reaction generation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 9677–9685.
[37] H. Liu, S. Liu, Z. Zhou, M. Xu, Y. Xie, X. Han, J. C. Pérez, D. Liu, K. Kahatapitiya, M. Jia et al., “Mardini: Masked autoregressive diffusion for video generation at scale,” arXiv preprint arXiv:2410.20280, 2024.
[38] J. Yang, D. Yin, Y. Zhou, F. Rao, W. Zhai, Y. Cao, and Z.-J. Zha, “Mmar: Towards lossless multi-modal auto-regressive probabilistic modeling,” arXiv preprint arXiv:2410.10798, 2024.
[39] Z. Meng, Y. Xie, X. Peng, Z. Han, and H. Jiang, “Rethinking diffusion for text-driven human motion generation,” arXiv preprint arXiv:2411.16575, 2024.
[40] T. Ren, J. Yu, S. Guo, Y. Ma, Y. Ouyang, Z. Zeng, Y. Zhang, and Y. Qin, “Diverse motion in-betweening from sparse keyframes with dual posture stitching,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, no. 2, pp. 1402–1413, 2025.
[41]

Expressive body capture: 3d hands, face, and body from a single image,

G. Pavlakos, V . Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 975–10 985

work page 2019
[42]

Parco: Part-coordinating text-to-motion synthesis,

Q. Zou, S. Yuan, S. Du, Y . Wang, C. Liu, Y . Xu, J. Chen, and X. Ji, “Parco: Part-coordinating text-to-motion synthesis,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 126–143

work page 2024
[43]

Improved denoising diffusion probabilis- tic models,

A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilis- tic models,” inInternational conference on machine learning. PMLR, 2021, pp. 8162–8171

work page 2021
[44] L. Xu, X. Lv, Y. Yan, X. Jin, S. Wu, C. Xu, Y. Liu, Y. Zhou, F. Rao, X. Shen et al., “Inter-x: Towards versatile human-human interaction analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 260–22 271

[45] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

[46] D. P. Kingma, M. Welling et al., “Auto-encoding variational bayes,” in The 2nd International Conference on Learning Representations, 2014. [Online]. Available: https://openreview.net/forum?id=33X9fd2-9FyZd

[47] Y. Du, R. Kips, A. Pumarola, S. Starke, A. Thabet, and A. Sanakoyeu, “Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 481–490

[48] A. Ghosh, N. Cheema, C. Oguz, C. Theobalt, and P. Slusallek, “Synthesis of compositional animations from textual descriptions,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1396–1406

[49] M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 985–10 995

[50] J. Voas, Y. Wang, Q. Huang, and R. Mooney, “What is the best automated metric for text to motion generation?” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–11
