Social Structure Matters in 3D Human-Human Interaction Generation

Beier Wang; Daoyi Dong; Hongdong Li; Huadong Mo; Pichao Wang; Yatao Bian; Zhenhong Sun; Zhi Wang; Zhongju Wang

arxiv: 2606.24255 · v1 · pith:MUIZTGFVnew · submitted 2026-06-23 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 00:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: human-human interaction · text-to-motion · 3D motion generation · LLM planner · social structure · phase decomposition · partner conditioning

The pith

Text-to-3D human-human interaction requires an LLM to first recover social phases and roles before motion generation can succeed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that extending text-to-motion methods to two-person 3D interactions fails unless the model first extracts the interaction's social organization. Large language models can identify how an encounter unfolds across phases and assign distinct roles to each actor, yet they cannot produce the continuous, physically valid joint movements. The proposed solution therefore splits the task: the LLM outputs a structured plan of phases, roles, and timing, which then conditions a motion model that has been adapted from single-person training data. This separation yields motions that maintain phase order, respect actor responsibilities, and keep the two bodies coordinated throughout the sequence.
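As a rough illustration of what "a structured plan of phases, roles, and timing" could look like as data, the sketch below defines a small plan container with a consistency check. The class names, field names, actor ids "A" and "B", and the handshake example are all hypothetical; the paper's actual plan format (see its planning prompt in Figure 10) is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Phase:
    """One segment of the interaction, with a text role for each actor."""
    label: str               # hypothetical phase name, e.g. "approach"
    start_frame: int         # inclusive
    end_frame: int           # exclusive
    roles: Dict[str, str]    # actor id -> role description for this phase


@dataclass
class InteractionPlan:
    """Hypothetical planner output: ordered phases covering the whole clip."""
    text: str
    num_frames: int
    phases: List[Phase]

    def validate(self) -> None:
        """Check that phases are ordered, contiguous, and name both actors."""
        cursor = 0
        for phase in self.phases:
            if phase.start_frame != cursor:
                raise ValueError(f"gap or overlap before phase {phase.label!r}")
            if phase.end_frame <= phase.start_frame:
                raise ValueError(f"empty phase {phase.label!r}")
            if set(phase.roles) != {"A", "B"}:
                raise ValueError(f"phase {phase.label!r} must assign roles to A and B")
            cursor = phase.end_frame
        if cursor != self.num_frames:
            raise ValueError("phases do not cover the full sequence")


# Illustrative plan (invented example, not taken from the paper's data).
plan = InteractionPlan(
    text="Two people walk toward each other and shake hands.",
    num_frames=120,
    phases=[
        Phase("approach", 0, 70, {"A": "walks forward", "B": "walks forward"}),
        Phase("handshake", 70, 120, {"A": "extends right hand", "B": "grasps the hand"}),
    ],
)
plan.validate()  # raises ValueError if the plan is malformed
```

Validating the plan before it reaches the motion model is one simple way to catch planner errors such as overlapping phases or a missing role, which is the kind of error propagation the review raises later.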

Core claim

HHI generation is a social structure modeling and grounding problem. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with the motion sequence. The motion executor grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning.
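For readers unfamiliar with LoRA, the following minimal sketch shows the standard low-rank update it applies to a frozen weight matrix: the effective weight is W + (alpha / r) · B·A, where only the small factors A and B are trained. The toy matrices and the `alpha` value are illustrative assumptions; which layers of the solo motion model are adapted, and with what rank, is not stated in the text reviewed here.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product (no external dependencies)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen pretrained weight, shape (d_out, d_in)
    b: trainable factor, shape (d_out, r)
    a: trainable factor, shape (r, d_in)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 weight with a rank-1 adapter (values are illustrative only).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[0.5, -0.5]]

# Common LoRA initialisation: B = 0, so the adapted model starts identical
# to the pretrained one.
b_zero = [[0.0], [0.0]]
w_initial = lora_effective_weight(w, b_zero, a, alpha=2.0)

# After training, B is non-zero and the effective weight shifts.
b_trained = [[1.0], [2.0]]
w_adapted = lora_effective_weight(w, b_trained, a, alpha=2.0)
```

Because the pretrained weight `w` is never modified, the same solo model can in principle be kept intact and only the adapter swapped in for the two-person setting.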

What carries the argument

The planner-executor paradigm (Solo-to-Social framework) in which an LLM decomposes text into phased, role-assigned supervision that an adapted motion model then realizes as partner-aware 3D sequences.

If this is right

Generated interactions exhibit higher phase consistency across the full sequence.
Actor roles remain aligned with the original text description throughout the motion.
Partner conditioning produces measurable improvements in inter-actor spatial and temporal coordination.
The same pretrained solo motion model can be reused for social tasks once conditioned on the planned structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The planner-executor split may transfer to other multi-agent generation settings that require both high-level organization and low-level physical control.
Extending the planner to handle more than two actors might require mainly additional role-assignment logic, provided the executor's partner conditioning generalizes to several partners.
Datasets that annotate phase boundaries and role switches would directly test whether the LLM planning step is the current performance bottleneck.

Load-bearing premise

LLMs can recover phase decompositions and partner-aware roles from text that translate into physically plausible, interaction-aware 3D motion when fed to the adapted executor.

What would settle it

A benchmark run in which LLM-generated phase and role plans produce two-person motions that violate coordination metrics or physical plausibility on existing HHI test sets.

Figures

Figures reproduced from arXiv: 2606.24255 by Beier Wang, Daoyi Dong, Hongdong Li, Huadong Mo, Pichao Wang, Yatao Bian, Zhenhong Sun, Zhi Wang, Zhongju Wang.

**Figure 1.** We define social structure as the latent interaction organization that governs how an interaction unfolds over …

**Figure 2.** LLM (Qwen3.5) capability analysis in modeling social structures. (a) t-SNE of phase decomposition. The …

**Figure 3.** (a) Atomic motion generation capacity. (b) Reasoning efficiency. Social Structure Planning: We first test whether LLMs can model p_ϕ(S | y) to capture latent social structure from global interaction text. As shown in …

**Figure 4.** Overview of our proposed planner-executor paradigm for social-structure-centered HHI generation. (a) …

**Figure 5.** Qualitative evaluation results. We compare InterGen, ComMDM, a baseline using only our structure plan …

**Figure 6.** Visualizations of LLM-based human atomic motion execution. Given atomic action prompts, the LLM can …

**Figure 7.** Qualitative ablation results. Removing social structure planning or motion facts weakens semantic faith…

**Figure 8.** User study results across four evaluation aspects. (a) Phase decomposition accuracy of the LLM-generated …

**Figure 9.** User study interface. Each case presents the original global text prompt, the LLM-decomposed phase …

**Figure 10.** Prompt used for LLM-based social structure planning.

Original abstract

Although text-to-motion generation has achieved strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human-human interaction (HHI) remains non-trivial, as HHI requires modeling the underlying **social structure** that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can *think* by recovering phase decompositions and partner-aware roles, but cannot directly *move*, as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner-executor paradigm, **Think with LLM, Move with Motion Skill**. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with the motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper splits text-to-HHI into an LLM planner for phases and roles plus a LoRA-adapted solo motion executor with phase and partner conditioning; the split is reasonable but the advance is mostly integrative.

The main thing here is the planner-executor split for two-person motion. An LLM decomposes the text into interaction phases and assigns partner-aware roles, then a motion model adapted from solo work via LoRA, previous-phase self-conditioning, and ego-relative partner input turns that plan into 3D sequences. This directly tackles why single-person text-to-motion methods do not transfer cleanly to HHI.

The analysis that LLMs can recover phases and roles but fail at producing dynamic, physically plausible motion is a useful boundary check and justifies keeping the motion part separate. The conditioning choices look practical for coordination without full retraining. The paper does a clean job of naming the social structure elements (phase progression, actor roles, inter-actor alignment) that prior work left implicit.

The soft spots are in the evidence base. The abstract states improvements in phase consistency and partner-aware coordination, yet supplies no metrics, ablations, or baseline comparisons in the provided text. If the full experiments show clear gains on standard HHI datasets and hold up under error propagation from the LLM stage, the claims land; otherwise the contribution stays at the level of a sensible architecture rather than a demonstrated advance. The work also inherits whatever limitations exist in the pretrained solo models and LLM, so complex or long-horizon interactions may still expose gaps.

This is for groups already working on text-conditioned human motion or multi-agent animation who need a concrete way to add social structure. A reader looking for new primitives or large-scale benchmarks will find less here.

It deserves peer review. The framing is coherent, the design choices are motivated by the stated LLM limitations, and the problem extension is real even if the quantitative payoff needs checking.

Referee Report

2 major / 1 minor

Summary. The paper claims that text-to-3D human-human interaction (HHI) generation requires explicit modeling of social structure (phase progression, actor roles, inter-actor coordination). It first analyzes the capability boundary of LLMs, finding that they can recover phase decompositions and partner-aware roles from text but cannot directly generate dynamic, physically plausible motion. This motivates a planner-executor framework ('Think with LLM, Move with Motion Skill'): the LLM planner produces motion-aligned social supervision via phase decomposition, role assignment, and sequence alignment; the motion executor then grounds this into coordinated two-person motion by adapting a pretrained solo motion model via LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. The resulting Solo-to-Social approach is reported to yield improved phase consistency, role alignment, and partner-aware coordination.

Significance. If the quantitative results and ablations hold, the work provides a principled separation between high-level social reasoning (handled by LLMs) and low-level motion realization (handled by adapted generative models), addressing a clear gap in extending single-person text-to-motion methods to interactive settings. The explicit planner-executor design and the reported LLM capability analysis constitute a concrete, testable contribution that could influence subsequent HHI generation research.

major comments (2)

[§3] LLM Capability Boundary Analysis: The central motivation for the planner-executor split rests on the claim that LLMs reliably recover accurate phase decompositions and partner-aware roles from text. No quantitative metrics (e.g., phase-boundary F1, role-assignment accuracy, or inter-annotator agreement against human labels) are referenced in the provided description of this analysis; without such numbers the reliability of the generated social supervision cannot be assessed and error propagation to the executor remains unquantified.
[§5] Motion Executor Experiments: The abstract and paradigm description assert improved phase consistency and partner-aware coordination, yet the soundness assessment notes the absence of reported error metrics, ablation tables, or baseline comparisons (e.g., direct fine-tuning without planner supervision). If these results exist in later sections they must be explicitly tied back to the planner output quality to substantiate the load-bearing claim that the social-structure supervision is what drives the gains.
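To make the planner metrics requested in the §3 comment concrete, here is one possible way to compute phase-boundary F1 and role-assignment accuracy. It is a sketch under stated assumptions: boundaries are frame indices, a predicted boundary counts as correct if it lies within `tol` frames of a not-yet-matched annotated boundary (greedy matching, tolerance chosen arbitrarily), and roles are compared as exact labels per (phase, actor) slot. None of these choices come from the paper.

```python
from typing import List, Tuple


def boundary_f1(pred: List[int], gold: List[int], tol: int = 5) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for boundary frames, one-to-one within tol frames."""
    unmatched_gold = sorted(gold)
    hits = 0
    for p in sorted(pred):
        # Greedily match to the closest still-unmatched gold boundary in range.
        candidates = [g for g in unmatched_gold if abs(g - p) <= tol]
        if candidates:
            best = min(candidates, key=lambda g: abs(g - p))
            unmatched_gold.remove(best)
            hits += 1
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def role_accuracy(pred_roles: List[str], gold_roles: List[str]) -> float:
    """Fraction of (phase, actor) slots whose predicted role matches the annotation."""
    if len(pred_roles) != len(gold_roles):
        raise ValueError("role lists must be aligned slot by slot")
    if not gold_roles:
        return 0.0
    return sum(p == g for p, g in zip(pred_roles, gold_roles)) / len(gold_roles)


# Illustrative values only: two predicted boundaries against three annotated ones.
precision, recall, f1 = boundary_f1(pred=[68, 100], gold=[70, 95, 110], tol=5)
# Both predictions match (68~70, 100~95), so precision is 1.0, recall is 2/3, F1 is about 0.8.

acc = role_accuracy(
    ["initiator", "receiver", "initiator", "receiver"],
    ["initiator", "receiver", "receiver", "initiator"],
)
# Two of four slots agree, so acc is 0.5.
```

Free-text role descriptions would need a softer comparison than exact label equality; the exact-match version is used here only to keep the sketch short.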

minor comments (1)

[§4.2] Notation for 'ego-relative partner conditioning' and 'previous-phase self-conditioning' should be formalized with explicit equations or pseudocode in the executor section to clarify how these signals are injected into the adapted diffusion or autoregressive backbone.
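One plausible reading of the two conditioning signals named in this comment is sketched below: the partner's joints are re-expressed in the ego actor's root-centred, heading-aligned frame, and the ego actor's last few frames from the preceding phase are passed along as context. The coordinate convention, function names, and window length `k` are assumptions made for illustration; the paper's actual formulation and how the signals enter the backbone are not given in the text reviewed here.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def to_ego_frame(ego_root: Vec3, ego_yaw: float, partner_joints: List[Vec3]) -> List[Vec3]:
    """Express partner joints relative to the ego actor's root and heading.

    Assumed convention (not from the paper): y is up, and yaw = 0 means the
    ego faces +z, so forward = (sin yaw, 0, cos yaw) and
    lateral = (cos yaw, 0, -sin yaw).
    Each output tuple is (lateral, height above ego root, forward).
    """
    fx, fz = math.sin(ego_yaw), math.cos(ego_yaw)
    lx, lz = math.cos(ego_yaw), -math.sin(ego_yaw)
    out = []
    for x, y, z in partner_joints:
        dx, dy, dz = x - ego_root[0], y - ego_root[1], z - ego_root[2]
        out.append((dx * lx + dz * lz, dy, dx * fx + dz * fz))
    return out


def previous_phase_context(motion: List[List[Vec3]], phase_start: int, k: int) -> List[List[Vec3]]:
    """Return up to k frames of the ego actor's own motion ending right before phase_start."""
    return motion[max(0, phase_start - k):phase_start]


# Ego stands at (1, 0, 1) facing +x (yaw = pi/2); the partner joint is 2 units
# ahead along +x and 1.5 units up, so it should appear straight ahead.
rel = to_ego_frame((1.0, 0.0, 1.0), math.pi / 2, [(3.0, 1.5, 1.0)])

# Ten single-joint frames; the context for a phase starting at frame 6 with
# k = 4 is frames 2 through 5.
motion = [[(float(t), 0.0, 0.0)] for t in range(10)]
context = previous_phase_context(motion, phase_start=6, k=4)
```

The appeal of an ego-relative encoding is that it is unchanged when both actors are translated or rotated together in world space, which is one reason such a signal could be fed to a model trained on single-person data.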

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify how to strengthen the quantitative grounding of our claims. We address the two major comments point-by-point below and will revise the manuscript to incorporate the suggested metrics and explicit linkages.

Referee: [§3] LLM Capability Boundary Analysis: The central motivation for the planner-executor split rests on the claim that LLMs reliably recover accurate phase decompositions and partner-aware roles from text. No quantitative metrics (e.g., phase-boundary F1, role-assignment accuracy, or inter-annotator agreement against human labels) are referenced in the provided description of this analysis; without such numbers the reliability of the generated social supervision cannot be assessed and error propagation to the executor remains unquantified.

Authors: We agree that the current presentation of the LLM analysis in §3 is primarily qualitative (via illustrative examples of phase decomposition and role assignment). This leaves the reliability of the generated supervision unquantified. In the revised manuscript we will add quantitative metrics against human annotations, including phase-boundary F1, role-assignment accuracy, and inter-annotator agreement, to allow direct assessment of supervision quality and error propagation.
Revision: yes

Referee: [§5] Motion Executor Experiments: The abstract and paradigm description assert improved phase consistency and partner-aware coordination, yet the soundness assessment notes the absence of reported error metrics, ablation tables, or baseline comparisons (e.g., direct fine-tuning without planner supervision). If these results exist in later sections they must be explicitly tied back to the planner output quality to substantiate the load-bearing claim that the social-structure supervision is what drives the gains.

Authors: We acknowledge that the experimental section would benefit from more explicit reporting and linkage. While ablations and baseline comparisons are present, we will revise §5 to (i) report additional error metrics, (ii) include a direct fine-tuning baseline without planner supervision, and (iii) add explicit analysis correlating planner output quality with downstream motion gains, thereby directly substantiating that the social-structure supervision drives the observed improvements.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper formulates HHI generation as a planner-executor paradigm justified by an empirical analysis of LLM capabilities (recovering phases/roles but failing at direct motion generation). The LLM planner produces social supervision, while the motion executor adapts external pretrained solo models via LoRA and conditioning. No equations, fitted parameters, or predictions are present that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation relies on external pretrained components and stated capability boundaries without self-referential loops or renaming of known results, rendering it self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; full paper may detail additional parameters or assumptions not visible here.

axioms (1)

domain assumption: LLMs can recover phase decompositions and partner-aware roles from implicit interaction semantics
Invoked in the analysis showing LLMs can think but not move.

invented entities (1)

social structure (no independent evidence)
purpose: Governs phase progression, actor roles, and inter-actor coordination in HHI generation
Central modeling target introduced to bridge semantics and motion; no independent evidence provided in abstract.

Reference graph

Works this paper leans on

67 extracted references · 4 linked inside Pith

[1]

A survey on human inter- action motion generation.International Journal of Computer Vision, 134(3):113, 2026

Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, and Chuan Guo. A survey on human inter- action motion generation.International Journal of Computer Vision, 134(3):113, 2026

2026
[2]

Text-driven motion generation: Overview, challenges and directions

Ali Rida Sahili, Najett Neji, and Hedi Tabia. Text-driven motion generation: Overview, challenges and directions. arXiv preprint arXiv:2505.09379, 2025

arXiv 2025
[3]

3d human interaction generation: A survey.arXiv preprint arXiv:2503.13120, 2025

Siyuan Fan, Wenke Huang, Xiantao Cai, and Bo Du. 3d human interaction generation: A survey.arXiv preprint arXiv:2503.13120, 2025

arXiv 2025
[4]

Hy-motion 1.0: Scaling flow matching models for text-to-motion generation.arXiv preprint arXiv:2512.23464, 2025

Tencent Hunyuan 3D Digital Human Team. Hy-motion 1.0: Scaling flow matching models for text-to-motion generation.arXiv preprint arXiv:2512.23464, 2025

arXiv 2025
[5]

Make-an-animation: Large-scale text-conditional 3d human motion generation

Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, and Sonal Gupta. Make-an-animation: Large-scale text-conditional 3d human motion generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15039–15048, 2023

2023
[6]

Large model em- powered embodied ai: A survey on decision-making and embodied learning.arXiv preprint arXiv:2508.10399, 2025

Wenlong Liang, Rui Zhou, Yang Ma, Bing Zhang, Songlin Li, Yijia Liao, and Ping Kuang. Large model em- powered embodied ai: A survey on decision-making and embodied learning.arXiv preprint arXiv:2508.10399, 2025

arXiv 2025
[7]

Generative multi-agent collaboration in embodied ai: A systematic review.arXiv preprint arXiv:2502.11518, 2025

Di Wu, Xian Wei, Guang Chen, Hao Shen, Xiangfeng Wang, Wenhao Li, and Bo Jin. Generative multi-agent collaboration in embodied ai: A systematic review.arXiv preprint arXiv:2502.11518, 2025

arXiv 2025
[8]

Long-term interactions with social robots: Trends, insights, and recommendations.ACM Transactions on Human-Robot Interaction, 14(3):1–42, 2025

Kayla Matheus, Rebecca Ramnauth, Brian Scassellati, and Nicole Salomons. Long-term interactions with social robots: Trends, insights, and recommendations.ACM Transactions on Human-Robot Interaction, 14(3):1–42, 2025

2025
[9]

in2in: Leveraging individual information to generate human interactions

Pablo Ruiz-Ponce, German Barquero, Cristina Palmero, Sergio Escalera, and Jos ´e Garc´ıa-Rodr´ıguez. in2in: Leveraging individual information to generate human interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1941–1951, June 2024

1941
[10]

Mixer- mdm: Learnable composition of human motion diffusion models

Pablo Ruiz-Ponce, German Barquero, Cristina Palmero, Sergio Escalera, and Jos ´e Garc´ıa-Rodr´ıguez. Mixer- mdm: Learnable composition of human motion diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12380–12390, 2025

2025
[11]

Human motion diffusion as a generative prior

Yoni Shafir, Guy Tevet, Roy Kapon, and Amit Haim Bermano. Human motion diffusion as a generative prior. In The Twelfth International Conference on Learning Representations, 2024

2024
[12]

Intergen: Diffusion-based multi-human motion generation under complex interactions.International Journal of Computer Vision, pages 1–21, 2024

Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion generation under complex interactions.International Journal of Computer Vision, pages 1–21, 2024

2024
[13]

Intermask: 3d human interaction generation via collaborative masked modeling

Muhammad Gohar Javed, Chuan Guo, Li Cheng, and Xingyu Li. Intermask: 3d human interaction generation via collaborative masked modeling. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[14]

Timotion: Temporal and interactive framework for efficient human-human motion generation

Yabiao Wang, Shuo Wang, Jiangning Zhang, Ke Fan, Jiafu Wu, Zhucun Xue, and Yong Liu. Timotion: Temporal and interactive framework for efficient human-human motion generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7169–7178, 2025

2025
[15]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

Pith/arXiv arXiv 2026
[16]

A human-in-the-loop approach to robot action replanning through llm common-sense reasoning.IEEE Robotics and Automation Letters, 2025

Elena Merlo, Marta Lagomarsino, and Arash Ajoudani. A human-in-the-loop approach to robot action replanning through llm common-sense reasoning.IEEE Robotics and Automation Letters, 2025

2025
[17]

Llm-based human-agent collaboration and interaction systems: A survey.arXiv preprint arXiv:2505.00753, 2025

Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, et al. Llm-based human-agent collaboration and interaction systems: A survey.arXiv preprint arXiv:2505.00753, 2025

Pith/arXiv arXiv 2025
[18]

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015

2015
[19]

Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2430–2449, 2023

Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2430–2449, 2023

2023
[20]

Motion generation: A survey of generative approaches and benchmarks.arXiv preprint arXiv:2507.05419, 2025

Aliasghar Khani, Arianna Rampini, Bruno Roy, Larasika Nadela, Noa Kaplan, Evan Atherton, Derek Cheung, and Jacky Bibliowicz. Motion generation: A survey of generative approaches and benchmarks.arXiv preprint arXiv:2507.05419, 2025. 10 PRIME AI paper

arXiv 2025
[21]

The language of motion: Unifying verbal and non-verbal language of 3d human motion

Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6200–6211, 2025

2025
[22]

Ls-gan: Human motion synthesis with latent-space gans

Avinash Amballa, Gayathri Akkinapalli, and Vinitra Muralikrishnan. Ls-gan: Human motion synthesis with latent-space gans. InProceedings of the Winter Conference on Applications of Computer Vision, pages 326–335, 2025

2025
[23]

Learning diverse stochastic human-action generators by learning smooth latent transitions

Zhenyi Wang, Ping Yu, Yang Zhao, Ruiyi Zhang, Yufan Zhou, Junsong Yuan, and Changyou Chen. Learning diverse stochastic human-action generators by learning smooth latent transitions. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 12281–12288, 2020

2020
[24]

Action2motion: Conditioned generation of 3d human motions

Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. InProceedings of the 28th ACM international conference on multimedia, pages 2021–2029, 2020

2021
[25]

Attt2m: Text-driven human motion generation with multi-perspective attention mechanism

Chongyang Zhong, Lei Hu, Zihao Zhang, and Shihong Xia. Attt2m: Text-driven human motion generation with multi-perspective attention mechanism. InProceedings of the IEEE/CVF international conference on computer vision, pages 509–519, 2023

2023
[26]

Generating human motion from textual descriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14730–14740, 2023

2023
[27]

Actformer: A gan-based transformer towards general action-conditioned 3d human motion generation

Liang Xu, Ziyang Song, Dongliang Wang, Jing Su, Zhicheng Fang, Chenjing Ding, Weihao Gan, Yichao Yan, Xin Jin, Xiaokang Yang, et al. Actformer: A gan-based transformer towards general action-conditioned 3d human motion generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2228–2238, 2023

2023
[28]

Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36, 2024

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36, 2024

2024
[29]

Motiongpt-2: A general-purpose motion-language model for motion generation and understanding.arXiv preprint arXiv:2410.21747, 2024

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixi- ang Tang, and Dan Xu. Motiongpt-2: A general-purpose motion-language model for motion generation and understanding.arXiv preprint arXiv:2410.21747, 2024

Pith/arXiv arXiv 2024
[30]

Motiongpt3: Human motion as a second modality.arXiv preprint arXiv:2506.24086, 2025

Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, and Xin Chen. Motiongpt3: Human motion as a second modality.arXiv preprint arXiv:2506.24086, 2025

arXiv 2025
[31]

Human motion diffusion model.arXiv preprint arXiv:2209.14916, 2022

Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model.arXiv preprint arXiv:2209.14916, 2022

Pith/arXiv arXiv 2022
[32]

Guided motion diffusion for controllable human motion synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. InProceedings of the IEEE/CVF international conference on computer vision, pages 2151–2162, 2023

2023
[33]

Tm2d: Bimodality driven 3d dance generation via music-text integration

Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin Zuo, Michael Bi Mi, and Xinchao Wang. Tm2d: Bimodality driven 3d dance generation via music-text integration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9942–9952, 2023

2023
[34]

Motion- diffuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motion- diffuse: Text-driven human motion generation with diffusion model.IEEE transactions on pattern analysis and machine intelligence, 46(6):4115–4128, 2024

2024
[35]

Tm2t: Stochastic and tokenized modeling for the recipro- cal generation of 3d human motions and texts

Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the recipro- cal generation of 3d human motions and texts. InEuropean Conference on Computer Vision, pages 580–597. Springer, 2022

2022
[36]

Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9488– 9497, 2023

2023
[37]

Understanding human-human interactions: a survey.arXiv preprint arXiv:1808.00022, 2, 2018

Alexandros Stergiou and Ronald Poppe. Understanding human-human interactions: a survey.arXiv preprint arXiv:1808.00022, 2, 2018

arXiv 2018
[38]

Regennet: Towards human action-reaction synthesis

Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, and Wenjun Zeng. Regennet: Towards human action-reaction synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1759–1769, 2024

2024
[39]

Bailando: 3d dance generation by actor-critic gpt with choreographic memory

Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3d dance generation by actor-critic gpt with choreographic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11050–11059, 2022. 11 PRIME AI paper

2022
[40]

Dance with you: The diversity controllable dancer generation via diffusion models

Siyue Yao, Mingjie Sun, Bingliang Li, Fengyu Yang, Junle Wang, and Ruimao Zhang. Dance with you: The diversity controllable dancer generation via diffusion models. InProceedings of the 31st ACM International Conference on Multimedia, pages 8504–8514, 2023

2023
[41]

Duolando: Follower gpt with off-policy reinforcement learning for dance accompaniment

Li Siyao, Tianpei Gu, Zhitao Yang, Zhengyu Lin, Ziwei Liu, Henghui Ding, Lei Yang, and Chen Change Loy. Duolando: Follower gpt with off-policy reinforcement learning for dance accompaniment. InThe Twelfth Inter- national Conference on Learning Representations, 2024

2024
[42]

Interdance: Reactive 3d dance generation with realistic duet interactions.arXiv preprint arXiv:2412.16982, 2024

Ronghui Li, Youliang Zhang, Yachao Zhang, Yuxiang Zhang, Mingyang Su, Jie Guo, Ziwei Liu, Yebin Liu, and Xiu Li. Interdance: Reactive 3d dance generation with realistic duet interactions.arXiv preprint arXiv:2412.16982, 2024

arXiv 2024
[43]

Think then react: Towards unconstrained action-to-reaction motion generation

Wenhui Tan, Boyuan Li, Chuhao Jin, Wenbing Huang, Xiting Wang, and Ruihua Song. Think then react: Towards unconstrained action-to-reaction motion generation. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[44]

Ready- to-react: Online reaction policy for two-character interaction generation

Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Ready- to-react: Online reaction policy for two-character interaction generation. InICLR, 2025

2025
[45]

Remos: 3d motion-conditioned reaction synthesis for two-person interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Remos: 3d motion-conditioned reaction synthesis for two-person interactions. InEuropean Conference on Computer Vision (ECCV), 2024

2024
[46]

Baptiste Chopin, Hao Tang, Naima Otberdout, Mohamed Daoudi, and Nicu Sebe. Interaction transformer for human reaction generation. IEEE Transactions on Multimedia, 25:8842–8854, 2023
[47]

Liang Xu, Xintao Lv, Yichao Yan, Xin Jin, Shuwen Wu, Congsheng Xu, Yifan Liu, Yizhou Zhou, Fengyun Rao, Xingdong Sheng, et al. Inter-x: Towards versatile human-human interaction analysis. In CVPR, pages 22260–22271, 2024
[48]

Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, and Rolandos Alexandros Potamias. Interact2ar: Full-body human-human interaction generation via autoregressive diffusion models. arXiv preprint arXiv:2512.19692, 2025
[49]

Jeongeun Park, Sungjoon Choi, and Sangdoo Yun. A unified framework for motion reasoning and generation in human interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10698–10707, 2025
[50]

Aman Goel, Qianhui Men, and Edmond S. L. Ho. Interaction Mix and Match: Synthesizing Close Interaction using Conditional Hierarchical GAN with Multi-Hot Class Embedding. Computer Graphics Forum, 2022
[51]

Zizhao Wu, Yingying Sun, Yiming Chen, Xiaoling Gu, Ruyu Liu, and Jiazhou Chen. Intermamba: Efficient human-human interaction generation with adaptive spatio-temporal mamba. IEEE Transactions on Visualization and Computer Graphics, 2025
[52]

Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, and Ajmal Mian. Disentangled hierarchical vae for 3d human-human interaction generation. arXiv preprint arXiv:2603.00144, 2026
[53]

Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, and Ajmal Mian. Armflow: Autoregressive meanflow for online 3d human reaction generation. arXiv preprint arXiv:2512.16234, 2025
[54]

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022
[55]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019

A Metric Details

• FID. Fréchet Inceptio...
• approach: one or both people move closer or prepare to interact before contact
• contact: physical contact, object transfer, blocking, supporting, or direct interaction occurs
• release: physical contact or interaction ends and one or both people withdraw or return to neutral
• in-place: the interaction happens mostly without clear approach/contact/release progression, or the actors coordinate while staying near their positions.

Instructions:

• Decompose the global prompt into 1 to 4 temporally ordered phases
• Assign exactly one phase type to each phase: approach, contact, release, or in-place
• For each phase, write one action sentence for P1 and one action sentence for P2
• The P1 and P2 actions must be partner-aware
• Preserve the semantics of the global prompt. Do not invent unrelated actions
• If the prompt implies asymmetric roles, make them explicit, such as initiator/receiver, attacker/defender, giver/taker, or supporter/assisted person
• If there is no clear movement toward or away from the partner, use in-place
• Keep each action concise, physically plausible, and suitable for motion generation.

Output format:
Phase 1: (<phase_type>)
P1 action: <one sentence describing Person 1’s action>
P2 action: <one sentence describing Person 2’s action>
Phase 2: (<phase_type>)
P1 action: <one sentence describing Person 1’s action>
P2 action: <one sentence describing Person 2’...

<INPUT_GLOBAL_PROMPT>
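As an illustration only, a plan returned in the output format above could be parsed and checked against the listed instructions (1 to 4 ordered phases, exactly one of the four phase types per phase, one action sentence per person). The sketch below is not the authors' code: `Phase`, `parse_plan`, and the regular expression are hypothetical names and choices, and it assumes the LLM reply follows the template literally.

```python
import re
from dataclasses import dataclass
from typing import List

# Phase types named in the prompt template.
PHASE_TYPES = {"approach", "contact", "release", "in-place"}


@dataclass
class Phase:
    """One planned phase (hypothetical container, not from the paper)."""
    index: int
    phase_type: str
    p1_action: str
    p2_action: str


# Matches "Phase k: (<type>) P1 action: ... P2 action: ..." up to the
# next "Phase k:" header or the end of the text.
_PHASE_RE = re.compile(
    r"Phase\s+(\d+):\s*\(([^)]+)\)\s*"
    r"P1 action:\s*(.+?)\s*"
    r"P2 action:\s*(.+?)\s*(?=Phase\s+\d+:|\Z)",
    re.DOTALL,
)


def parse_plan(text: str) -> List[Phase]:
    """Parse a planner reply and enforce the constraints stated in the prompt."""
    phases = [
        Phase(int(idx), ptype.strip().lower(), p1.strip(), p2.strip())
        for idx, ptype, p1, p2 in _PHASE_RE.findall(text)
    ]
    if not 1 <= len(phases) <= 4:
        raise ValueError(f"expected 1 to 4 phases, found {len(phases)}")
    for expected, phase in enumerate(phases, start=1):
        if phase.index != expected:
            raise ValueError(f"phase {phase.index} is out of order")
        if phase.phase_type not in PHASE_TYPES:
            raise ValueError(f"unknown phase type: {phase.phase_type!r}")
    return phases


if __name__ == "__main__":
    # Made-up reply used only to exercise the parser.
    reply = (
        "Phase 1: (approach)\n"
        "P1 action: Person 1 walks toward Person 2 and extends a hand.\n"
        "P2 action: Person 2 turns to face Person 1 and raises a hand.\n"
        "Phase 2: (contact)\n"
        "P1 action: Person 1 grips Person 2's hand and shakes it.\n"
        "P2 action: Person 2 returns the grip and shakes hands.\n"
    )
    for phase in parse_plan(reply):
        print(phase.index, phase.phase_type, "|", phase.p1_action, "|", phase.p2_action)
```

Rejecting a malformed plan before it reaches the motion executor keeps formatting errors in the LLM reply from being silently grounded into motion; a caller could, for example, re-query the planner when `ValueError` is raised.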
