pith. machine review for the scientific record.

arxiv: 2605.04662 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: unknown

Contact Matrix: Enhancing Dance Motion Synthesis with Precise Interaction Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords dance motion synthesis · reactive motion generation · contact matrix · diffusion model · VQ-VAE · human interaction modeling · duet dance

The pith

A contact-aware diffusion model that jointly generates duet dance motions and an explicit interaction matrix produces more precise physical contacts and better rhythmic alignment than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a two-stage system for creating one dancer's reactive motions in response to another's fixed sequence. The first stage trains a VQ-VAE that encodes body parts separately before decoding them together to maintain consistency across limbs. The second stage runs a diffusion process that outputs both the full-body motion and a contact matrix recording which body regions touch at each moment. A reader should care because duet dance involves tight spatial and timing constraints that current generators often violate, and the added matrix supplies direct guidance during sampling to respect those limits even when training examples are few.
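As a concrete, purely hypothetical illustration of the interaction representation described above: a binary contact matrix can be read off from two skeletons' joint positions by thresholding pairwise distances. The joint counts, the 0.1 m threshold, and the function name below are invented for this sketch, not taken from the paper.

```python
import numpy as np

def contact_matrix(joints_a: np.ndarray, joints_b: np.ndarray,
                   threshold: float = 0.1) -> np.ndarray:
    """joints_a: (Ja, 3), joints_b: (Jb, 3) -> binary (Ja, Jb) matrix."""
    # Pairwise Euclidean distances between every joint of A and every joint of B.
    diff = joints_a[:, None, :] - joints_b[None, :, :]   # (Ja, Jb, 3)
    dist = np.linalg.norm(diff, axis=-1)                 # (Ja, Jb)
    return (dist < threshold).astype(np.float32)

# Toy example: two 3-joint "skeletons"; only the first joints are in contact.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
b = np.array([[0.05, 0.0, 0.0], [5.0, 0.0, 0.0], [6.0, 0.0, 0.0]])
m = contact_matrix(a, b)
```

The same thresholding view also suggests how a predicted matrix could be checked against generated joint positions after sampling.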

Core claim

The paper claims that jointly generating motion and a contact matrix between two individuals inside a contact-aware diffusion model supplies explicit interaction modeling that guides sampling toward more precise and constrained dynamics, producing lower FID_k and FID_cd scores together with higher BED scores than the Duolando baseline.

What carries the argument

The contact matrix, an explicit representation of pairwise body-part contacts that is produced simultaneously with the motion sequence inside the diffusion model and used to constrain the generation trajectory.

If this is right

  • The generated motions exhibit tighter physical contact fidelity as measured by reduced FID_cd.
  • Rhythmic synchronization between dancers improves, reflected in elevated BED scores.
  • The two-stage separation allows the contact signal to steer sampling even when high-quality duet data remains scarce.
  • Body-part inconsistencies are reduced by the joint decoder in the first-stage VQ-VAE.
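The joint-generation idea behind these expectations can be caricatured in a few lines: a single reverse-diffusion trajectory over a state vector that concatenates motion features with flattened contact-matrix entries, so every denoising step updates both modalities together. The linear schedule, the dimensions, and the zero-valued stand-in for the trained noise predictor are all assumptions of this sketch, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10
betas = np.linspace(1e-4, 0.02, T)   # toy noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t):
    return np.zeros_like(x)  # stand-in for the trained denoising network

motion_dim, contact_dim = 24, 9                 # e.g. pose features + 3x3 contacts
x = rng.normal(size=motion_dim + contact_dim)   # start from Gaussian noise

for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # DDPM-style mean update applied to motion and contact channels at once.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:  # fresh noise is added at every step except the last
        x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)

motion, contact_logits = x[:motion_dim], x[motion_dim:]
```

Because both slices come from the same trajectory, each denoising step can in principle keep the motion consistent with the contacts it predicts, which is the mechanism the review's expectations rest on.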

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contact-matrix mechanism could be applied to other paired physical activities such as partner sports or object hand-offs.
  • Extending the matrix to record forces or velocities rather than binary contacts might further tighten the interaction constraints.
  • Because the matrix is generated at inference time, the model could accept user-specified contact patterns to control interaction style without retraining.

Load-bearing premise

Jointly generating the motion and contact matrix supplies enough guidance to enforce precise interactions without extra loss terms or larger datasets.

What would settle it

Retraining the diffusion stage without the contact-matrix output head and measuring whether interaction-specific metrics (FID_cd and BED) fall back to or below the Duolando baseline on the same test set would directly test the claim.

Figures

Figures reproduced from arXiv: 2605.04662 by Huaijin Pi, Sida Peng, Xiaowei Zhou, Xuhai Chen, Yong Liu, Zhi Cen.

Figure 1
Figure 1. We tackle interaction-aware reactive motion generation. First, we introduce (b) PartFusion-VQ to prevent unnatural poses, as highlighted by the green circle in (a) Duolando [45]. Second, in (c), we incorporate a contact matrix to guide motion refinement, correcting unrealistic joint configurations (red circles).
Figure 2
Figure 2. Architecture of PartFusion-VQ. Js, Es, and Zs (s ∈ {U, D, L, R}) denote the joint count, encoder, and codebook of each body part (upper body, lower body, left hand, right hand), respectively. T and T′ denote the input and encoded sequence lengths, C is the feature dimension, and Dp, Dr, and Dg denote the decoders for 3D joint positions, rotations, and global translation.
Figure 3
Figure 3. Overview of the pipeline. We propose a diffusion model named RCDiff to generate reactive motion in latent space. Em denotes the four PartFusion-VQ encoders; Ed, Ec, and Ea denote the encoders for global trajectory, contact matrix, and music, respectively. Dm, Dd, and Dc denote the corresponding decoders. All encoders and decoders except Ea are frozen during diffusion training.
Figure 4
Figure 4. Qualitative comparisons of Duolando [45] and our method RCDiff. The blue mesh represents the leader; the pink mesh shows the generated follower, i.e., the reactive motion produced by each model.
Figure 5
Figure 5. Ablation on PartFusion-VQ. The first row shows results obtained by modeling each body part separately with independent VQ-VAEs; the second row shows results obtained by modeling the overall motion with PartFusion-VQ.
read the original abstract

Generating realistic reactive motions, in which one person reacts to the fixed motions of others, is challenging due to strict interaction constraints and a limited feasible solution space. This paper focuses on a typical scenario: duet dance, where high-quality data is scarce, motion patterns are complex, and the details of human interactions are both intricate and abundant. To tackle these challenges, we propose a novel two-stage framework. In the first stage, we introduce a motion VQ-VAE with separate body-part encoders and a joint decoder, enabling specialized codebooks to enhance representation capacity while dynamically modeling dependencies across body parts during decoding, thereby preventing inconsistencies in the generated motions. In the second stage, we propose a contact-aware diffusion model for reactive motion generation that jointly generates motion and a contact matrix between individuals, enabling explicit interaction modeling and providing guidance toward more precise and constrained interaction dynamics during sampling. Experiments show that our method outperforms Duolando with lower $\text{FID}_k$ (8.89 vs. 25.30) and $\text{FID}_{cd}$ (8.01 vs. 9.97), as well as a higher BED (0.4606 vs. 0.2858), indicating improved interaction fidelity and rhythmic synchronization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a two-stage framework improves reactive duet dance motion synthesis: a VQ-VAE with per-body-part encoders and joint decoder in stage 1, followed by a contact-aware diffusion model in stage 2 that jointly generates motion and an inter-person contact matrix to enforce precise interaction constraints. Experiments report quantitative gains over Duolando (FID_k 8.89 vs. 25.30, FID_cd 8.01 vs. 9.97, BED 0.4606 vs. 0.2858).

Significance. If the contact matrix genuinely supplies effective guidance during sampling, the work would meaningfully advance multi-person motion synthesis in data-scarce, high-constraint domains such as duet dance by addressing both representation capacity and interaction fidelity. The body-part codebook design is a concrete strength for handling complex dependencies.

major comments (2)
  1. [Abstract] Abstract and framework description: the central claim that jointly generating the contact matrix enables 'explicit interaction modeling' and provides 'guidance toward more precise and constrained interaction dynamics during sampling' is not supported by any described mechanism, such as an auxiliary contact loss, a contact-masking schedule, or post-sampling enforcement. Without these, the reported metric improvements cannot be confidently attributed to the contact matrix rather than to the VQ-VAE stage or other training choices.
  2. [Experiments] The assumption that the learned joint distribution over motion and contact matrix will produce accurate, feasible contacts is load-bearing for the interaction-fidelity claim, yet no quantitative evaluation of contact-matrix accuracy (e.g., contact prediction error or intersection volume) is provided to verify that the matrix actually steers samples into valid states.
minor comments (2)
  1. [Abstract] Please add a citation for the Duolando baseline in the abstract and methods.
  2. [Abstract] Notation for FID_k and FID_cd should be defined at first use or in a table caption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate clarifications and additional analyses where needed.

read point-by-point responses
  1. Referee: [Abstract] Abstract and framework description: the central claim that jointly generating the contact matrix enables 'explicit interaction modeling' and provides 'guidance toward more precise and constrained interaction dynamics during sampling' is not supported by any described mechanism, such as an auxiliary contact loss, a contact-masking schedule, or post-sampling enforcement. Without these, the reported metric improvements cannot be confidently attributed to the contact matrix rather than to the VQ-VAE stage or other training choices.

    Authors: The contact matrix is generated jointly with the motion as an additional output channel in the diffusion model. The training objective is the standard diffusion loss applied simultaneously to both modalities, enabling the network to learn their joint distribution and implicit dependencies directly from data. This provides guidance during sampling because each denoising step produces motion and contacts that are consistent with each other by construction, without requiring an auxiliary loss, masking schedule, or post-processing enforcement. We have expanded the abstract and Section 3 to describe this mechanism more explicitly. To strengthen attribution of the gains, we have added an ablation study (new Table X) comparing the full model against a motion-only diffusion variant; the results show clear degradation in FID_k, FID_cd, and BED when the contact matrix is removed, indicating its contribution beyond the VQ-VAE stage. revision: yes

  2. Referee: [Experiments] The assumption that the learned joint distribution over motion and contact matrix will produce accurate, feasible contacts is load-bearing for the interaction-fidelity claim, yet no quantitative evaluation of contact-matrix accuracy (e.g., contact prediction error or intersection volume) is provided to verify that the matrix actually steers samples into valid states.

    Authors: We agree that direct quantitative validation of contact accuracy would provide stronger support for the claim. In the revised manuscript we have added a new evaluation subsection reporting contact prediction accuracy (fraction of correctly classified contact pairs) and average intersection volume (penetration depth) between the two dancers on generated samples. These metrics confirm low error rates and feasible contacts, consistent with the observed improvements in interaction fidelity. revision: yes
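The contact-accuracy metric the simulated rebuttal proposes is straightforward to state: the fraction of contact pairs on which a predicted matrix agrees with a reference one. A minimal sketch with invented toy matrices:

```python
import numpy as np

def contact_accuracy(pred: np.ndarray, ref: np.ndarray) -> float:
    """pred, ref: binary (J, J) contact matrices -> accuracy in [0, 1]."""
    return float((pred == ref).mean())

# Toy data: the prediction flips one of the four entries.
ref  = np.array([[1, 0], [0, 1]])
pred = np.array([[1, 0], [1, 1]])
acc = contact_accuracy(pred, ref)  # 3 of 4 entries agree -> 0.75
```

Penetration depth, the rebuttal's second metric, would additionally require the generated meshes and is not sketched here.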

Circularity Check

0 steps flagged

No circularity: two-stage model evaluated via independent empirical metrics

full rationale

The paper's core contribution is a two-stage architecture (VQ-VAE followed by contact-aware diffusion) whose outputs are assessed on external metrics (FID_k, FID_cd, BED) against a named baseline (Duolando). No equation or claim reduces a result to its own inputs by definition, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation chains. The contact matrix is generated as an explicit joint output rather than presupposed, and performance differences are reported as measured quantities, not derived tautologies.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The paper introduces the contact matrix as its key innovation and otherwise relies on standard assumptions for generative motion models. The limited detail available from the abstract prevents a full enumeration of parameters.

free parameters (1)
  • codebook sizes for body parts
    The VQ-VAE uses separate codebooks whose sizes are hyperparameters likely tuned to data.
axioms (1)
  • domain assumption: joint decoding of body parts prevents motion inconsistencies
    Invoked in the description of the VQ-VAE to ensure coherent full-body motions.
invented entities (1)
  • contact matrix (no independent evidence)
    purpose: to explicitly represent and guide interactions between dancers in the diffusion process
    A new modeling component introduced in the second stage; no external validation is mentioned.
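To make the ledger's "codebook sizes" entry concrete: in a VQ-VAE with separate per-part codebooks, each body part's feature is quantized to its nearest code in that part's own codebook. The part names, codebook sizes, and feature dimension below are invented for illustration, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(1)

# One codebook per body part; sizes are the free hyperparameters in question.
codebooks = {
    part: rng.normal(size=(size, 4))   # (codebook size, feature dim)
    for part, size in [("upper", 8), ("lower", 8),
                       ("left_hand", 4), ("right_hand", 4)]
}

def quantize(feature: np.ndarray, part: str) -> int:
    """Return the index of the nearest code in this part's codebook."""
    codes = codebooks[part]
    return int(np.argmin(np.linalg.norm(codes - feature, axis=1)))

idx = quantize(rng.normal(size=4), "upper")
```

Larger codebooks raise representation capacity at the cost of more parameters to fit from scarce duet data, which is why the ledger flags the sizes as likely tuned.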

pith-pipeline@v0.9.0 · 5530 in / 1476 out tokens · 90755 ms · 2026-05-08T17:53:46.684793+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1] Kfir Aberman, Rundi Wu, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. Learning character-agnostic motion for motion retargeting in 2D. arXiv preprint arXiv:1905.01680, 2019.
  2. [2] Kfir Aberman, Peizhuo Li, Dani Lischinski, Olga Sorkine-Hornung, Daniel Cohen-Or, and Baoquan Chen. Skeleton-aware networks for deep motion retargeting. ACM Transactions on Graphics (TOG), 39(4):62-1, 2020.
  3. [3] Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, and Yaser Sheikh. To react or not to react: End-to-end visual pose forecasting for personalized avatar during dyadic conversations. In 2019 International Conference on Multimodal Interaction, pages 74–84, 2019.
  4. [4] Ijaz Akhter and Michael J Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1455, 2015.
  5. [5] Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and Gustav Eje Henter. Listen, denoise, action! Audio-driven motion synthesis with diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–20, 2023.
  6. [6] German Barquero, Sergio Escalera, and Cristina Palmero. BeLFusion: Latent diffusion for behavior-driven human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2317–2327, 2023.
  7. [7] Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Ready-to-react: Online reaction policy for two-character interaction generation. arXiv preprint arXiv:2502.20370, 2025.
  8. [8] Junuk Cha, Jihyeon Kim, Jae Shin Yoon, and Seungryul Baek. Text2HOI: Text-guided 3D motion generation for hand-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1577–1585, 2024.
  9. [9] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
  10. [10] Baptiste Chopin, Hao Tang, Naima Otberdout, Mohamed Daoudi, and Nicu Sebe. Interaction transformer for human reaction generation. IEEE Transactions on Multimedia, 25:8842–8854, 2023.
  11. [11] Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. MoFusion: A framework for denoising-diffusion-based motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9760–9770, 2023.
  12. [12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  13. [13] Christian Diller and Angela Dai. CG-HOI: Contact-guided 3D human-object interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19888–19901, 2024.
  14. [14] Arjan Egges, George Papagiannakis, and Nadia Magnenat-Thalmann. Presence and interaction in mixed reality environments. The Visual Computer, 23:317–333, 2007.
  15. [15] Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. ReMoS: 3D motion-conditioned reaction synthesis for two-person interactions. In European Conference on Computer Vision. Springer, 2024.
  16. [16] Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. DuetGen: Music driven two-person dance generation via hierarchical masked modeling. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–11, 2025.
  17. [17] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3D human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.
  18. [18] Eva Hanser, Paul Mc Kevitt, Tom Lunney, and Joan Condell. SceneMaker: Intelligent multimodal visualisation of natural language scripts. In Artificial Intelligence and Cognitive Science: 20th Irish Conference, AICS 2009, Dublin, Ireland, August 19-21, 2009, Revised Selected Papers 20. Springer, 2010.
  19. [19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  20. [20] Jaewoo Jeong, Daehee Park, and Kuk-Jin Yoon. Multi-agent long-term 3D human pose forecasting via interaction-aware trajectory conditioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1617–1628, 2024.
  21. [21] Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. MotionGPT: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36:20067–20079, 2023.
  22. [22] Jinwoo Kim, Heeseok Oh, Seongjean Kim, Hoseok Tong, and Sanghoon Lee. A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3490–3500, 2022.
  23. [23] Nhat Le, Thang Pham, Tuong Do, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Music-driven group choreography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8673–8682, 2023.
  24. [24] Ruilong Li, Shan Yang, David A Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13401–13412, 2021.
  25. [25] Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363, 2017.
  26. [26] Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. InterGen: Diffusion-based multi-human motion generation under complex interactions. International Journal of Computer Vision, 132(9):3463–3483, 2024.
  27. [27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  28. [28] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 143–152, 2020.
  29. [29] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866, 2023.
  30. [30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  31. [31] Sihan Ma, Qiong Cao, Jing Zhang, and Dacheng Tao. Contact-aware human motion generation from textual descriptions. arXiv preprint arXiv:2403.15709, 2024.
  32. [32] Vongani Maluleke, Lea Müller, Jathushan Rajasegaran, Georgios Pavlakos, Shiry Ginosar, Angjoo Kanazawa, and Jitendra Malik. Synergy and synchrony in couple dances. arXiv preprint arXiv:2409.04440, 2024.
  33. [33] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9489–9497, 2019.
  34. [34] Wei Mao, Miaomiao Liu, and Mathieu Salzmann. History repeats itself: Human motion prediction via motion attention. In European Conference on Computer Vision, pages 474–489. Springer, 2020.
  35. [35] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. SciPy, 2015:18–24, 2015.
  36. [36] Christos Mousas. Performance-driven dance motion control of a virtual partner character. In 2018 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 57–64. IEEE, 2018.
  37. [37] Meinard Müller, Tido Röder, and Michael Clausen. Efficient content-based retrieval of motion capture data. In ACM SIGGRAPH 2005 Papers, pages 677–685, 2005.
  38. [38] Kensuke Onuma, Christos Faloutsos, and Jessica K Hodgins. FMDistance: A fast and effective distance function for motion capture data. Eurographics (Short Papers), 7(10), 2008.
  39. [39] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
  40. [40] Manoj M Pawar, Gaurav N Pradhan, Kang Zhang, and Balakrishnan Prabhakaran. Content based querying and searching for 3D human motions. In Advances in Multimedia Modeling: 14th International Multimedia Modeling Conference, MMM 2008, Kyoto, Japan, January 9-11, 2008, Proceedings 14, pages 446–455. Springer, 2008.
  41. [41] Huaijin Pi, Sida Peng, Minghui Yang, Xiaowei Zhou, and Hujun Bao. Hierarchical generation of human-object interactions with diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15061–15073, 2023.
  42. [42] Gaurav N Pradhan, Chuanjun Li, and Balakrishnan Prabhakaran. Hierarchical indexing structure for 3D human motions. In International Conference on Multimedia Modeling, pages 386–396. Springer, 2007.
  43. [43] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  44. [44] Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3D dance generation by actor-critic GPT with choreographic memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11050–11059, 2022.
  45. [45] Li Siyao, Tianpei Gu, Zhitao Yang, Zhengyu Lin, Ziwei Liu, Henghui Ding, Lei Yang, and Chen Change Loy. Duolando: Follower GPT with off-policy reinforcement learning for dance accompaniment. arXiv preprint arXiv:2403.18811, 2024.
  46. [46] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  47. [47] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  48. [48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  49. [49] Sebastian Starke, Yiwei Zhao, Taku Komura, and Kazi Zaman. Local motion phases for learning multi-contact character movements. ACM Transactions on Graphics (TOG), 39(4):54-1, 2020.
  50. [50] Mikihiro Tanaka and Kent Fujiwara. Role-aware interaction generation from textual description. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15999–16009, 2023.
  51. [51] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
  52. [52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  53. [53] Ruben Villegas, Jimei Yang, Duygu Ceylan, and Honglak Lee. Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8639–8648, 2018.
  54. [54] Yan Wu, Jiahao Wang, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu, and Siyu Tang. SAGA: Stochastic whole-body grasping with contact. In European Conference on Computer Vision, pages 257–274. Springer, 2022.
  55. [55] Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, and Wenjun Zeng. ReGenNet: Towards human action-reaction synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1759–1769, 2024.
  56. [56] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  57. [57] Hongdi Yang, Chengyang Li, Zhenxuan Wu, Gaozheng Li, Jingya Wang, Jingyi Yu, Zhuo Su, and Lan Xu. SMGDiff: Soccer motion generation using diffusion probabilistic models. arXiv preprint arXiv:2411.16216, 2024.
  58. [58] Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In 2019 International Conference on Robotics and Automation (ICRA), pages 4303–4309. IEEE, 2019.
  59. [59] Ping Yu, Yang Zhao, Chunyuan Li, Junsong Yuan, and Changyou Chen. Structure-aware human-action generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 18–34. Springer, 2020.
  60. [60] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14730–14740, 2023.
  61. [61] Pengfei Zhang, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jianru Xue, and Nanning Zheng. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1112–1121, 2020.
  62. [62] Xiaohan Zhang, Bharat Lal Bhatnagar, Sebastian Starke, Vladimir Guzov, and Gerard Pons-Moll. COUCH: Towards controllable human-chair interactions. In European Conference on Computer Vision, pages 518–535. Springer, 2022.
  63. [63] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.