pith. machine review for the scientific record.

arxiv: 2605.00517 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

PhysiGen: Integrating Collision-Aware Physical Constraints for High-Fidelity Human-Human Interaction Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human-human interaction · collision avoidance · physical constraints · motion generation · 3D human synthesis · optimization · interpenetration reduction · plug-and-play method

The pith

PhysiGen reduces body interpenetration in AI-generated human interactions by using simplified geometric shapes to enforce physical collision constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhysiGen as a general-purpose strategy to fix a common problem in generating realistic multi-person 3D motions from text: bodies passing through each other. Previous approaches either ignored collisions or used slow mesh-level calculations. PhysiGen simplifies each person's high-resolution body mesh into basic geometric shapes to quickly detect collisions, then uses those detected collision areas to steer the optimization toward physically plausible movements. This method can be added to existing generation models without changing their core design. If effective, it means AI-generated interaction videos and animations can look more natural and usable in applications like virtual reality or robotics training.
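The primitive-proxy idea can be illustrated with a minimal sketch (not the authors' code): each body is approximated by a handful of spheres, and a collision is flagged wherever spheres belonging to different people overlap. The joint positions and radii below are hypothetical.

```python
import numpy as np

def detect_collisions(joints_a, joints_b, radii_a, radii_b):
    """Return (i, j, depth) for every overlapping sphere pair.

    joints_a: (Na, 3) sphere centres for person A; radii_a: (Na,) radii
    (likewise for person B). depth > 0 is the penetration depth.
    """
    diff = joints_a[:, None, :] - joints_b[None, :, :]    # (Na, Nb, 3)
    dist = np.linalg.norm(diff, axis=-1)                  # centre-to-centre distances
    depth = (radii_a[:, None] + radii_b[None, :]) - dist  # positive means overlap
    return [(int(i), int(j), float(depth[i, j])) for i, j in np.argwhere(depth > 0)]

# Toy bodies: two spheres each; only the first pair overlaps.
a = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[0.15, 0.0, 0.0], [2.0, 2.0, 2.0]])
ra = np.array([0.1, 0.1])
rb = np.array([0.1, 0.1])
print(detect_collisions(a, b, ra, rb))  # flags the one overlapping pair (0, 0)
```

The pairwise check is O(Na × Nb) over a few dozen primitives instead of thousands of mesh vertices, which is the cost reduction the paper targets.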

Core claim

The central claim is that simplifying human body meshes into geometric primitives for collision detection, combined with identifying collision regions to guide optimization, creates an efficient and effective way to integrate physical constraints into human-human interaction generation models, leading to reduced interpenetration and better visual and physical quality.

What carries the argument

The PhysiGen optimization strategy, which approximates high-resolution meshes with geometric primitives to compute inter-person collisions efficiently and directs the generation process using collision-region information.

Load-bearing premise

Approximating detailed human body meshes with simple geometric primitives captures enough collision information to guide effective optimization without overlooking important contact details or creating false positives.

What would settle it

Running the generation process with and without PhysiGen on the same model and inputs, then measuring the volume of interpenetrating body regions or counting collision events in the output sequences to see if the reduction is statistically significant.
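That test can be sketched as follows, using synthetic trajectories as stand-ins for model outputs: compute the same sphere-proxy penetration score on sequences "with" and "without" guidance-style separation and compare.

```python
import numpy as np

def penetration_score(seq_a, seq_b, radius=0.1):
    """Total positive penetration depth across all frames.

    seq_a, seq_b: (T, N, 3) joint trajectories for the two people, each
    joint approximated by a sphere of the given radius.
    """
    diff = seq_a[:, :, None, :] - seq_b[:, None, :, :]   # (T, N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)
    depth = np.clip(2 * radius - dist, 0.0, None)        # positive where spheres overlap
    return float(depth.sum())

# Synthetic stand-ins: 10 joints spaced 1 apart along y, over 50 frames.
base = np.zeros((50, 10, 3))
base[:, :, 1] = np.arange(10)
near = base + np.array([0.05, 0.0, 0.0])   # "without guidance": bodies interpenetrate
far = base + np.array([0.5, 0.0, 0.0])     # "with guidance": bodies separated
print(penetration_score(base, near), penetration_score(base, far))
```

In a real evaluation the two sequences would come from the same base model with identical prompts and seeds, so that any drop in the score is attributable to the collision guidance alone.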

Figures

Figures reproduced from arXiv: 2605.00517 by Fa-Ting Hong, Hui-Wen Huang, Liang Xu, Ling-An Zeng, Nan Lei, Wei-Shi Zheng, Yuan-Ming Li, Zhi-Wei Xia.

Figure 1
Figure 1: Our proposed PhysiGen can generate realistic human interaction motions with minimal interpenetration. Top: close-up views show physically plausible contact without severe collisions. Bottom: full motion sequence demonstrates fluid interaction and semantic consistency with the text.
Figure 2
Figure 2: Overview of the PhysiGen framework. For each detected collision point (red) on Person 1, we compute its corresponding antipodal target point (green) and derive a guidance direction (light blue arrow) to reduce interpenetration. PhysiGen guides the model to adjust the poses along these guidance directions to eliminate interpenetration.
Figure 3
Figure 3: Overview of the bounding box fitting process and the final volumetric proxy representation. The minimum distance between each bounding-box sampled point and its corresponding mesh-region points is summed as the fitting loss: L_fit = Σ_{j=1}^{M} Σ_{q∈S_j} min_{p∈P_j} ‖q − p‖², where M is the number of bounding boxes, S_j is the set of points sampled on box j, and P_j is the set of mesh-region points for box j.
Figure 4
Figure 4: Illustration of Multi-Region Simultaneous Collision.
Figure 5
Figure 5: Qualitative comparison with state-of-the-art methods.
Figure 6
Figure 6: User study comparing our method with baselines.
Figure 8
Figure 8: Qualitative comparison of generated motions.
Figure 9
Figure 9: Qualitative comparison of generated motions.
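The fitting loss in the Figure 3 caption can be written as a short sketch (toy point sets, not the paper's data): for each bounding box, every sampled point contributes its squared distance to the nearest point of the matched mesh region.

```python
import numpy as np

def fitting_loss(box_samples, mesh_regions):
    """box_samples[j]: (Sj, 3) points sampled on box j;
    mesh_regions[j]: (Pj, 3) points of the matched mesh region."""
    loss = 0.0
    for S, P in zip(box_samples, mesh_regions):
        d2 = ((S[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # (Sj, Pj) squared distances
        loss += d2.min(axis=1).sum()        # nearest mesh point for each sample
    return float(loss)

# Toy box with two sampled points and a two-point mesh region.
S1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P1 = np.array([[0.0, 0.1, 0.0], [1.0, 0.0, 0.0]])
print(fitting_loss([S1], [P1]))  # 0.1^2 + 0 = 0.01 (up to float rounding)
```

Minimizing this loss pulls each box toward its mesh region, which is what makes the volumetric proxy a faithful stand-in for collision purposes.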
read the original abstract

Despite substantial progress in text-driven 3D human motion synthesis, generating realistic multi-person interaction sequences remains challenging. Notably, body inter-penetration is a pervasive issue from both data acquisition to the generated results, which significantly undermines the realism and usability. Previous generative models either ignored this issue or introduced computationally expensive mesh-level loss functions to alleviate inter-body collisions. In this paper, we propose a general-purpose and computationally efficient optimization strategy named PhysiGen to explicitly integrate collision-aware physical constraints for human-human interaction generation. Specifically, we simplify the high-resolution human body mesh into geometric primitives to greatly reduce the cost of inter-person collision detection. Moreover, we identify the collision regions as the guidance of the optimization directions. PhysiGen is plug-and-play and can be readily integrated into existing human interaction generation models. Extensive cross-dataset and cross-model experiments show that our method can effectively reduce interpenetration and significantly improve visual coherence and physical plausibility compared to the state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PhysiGen, a plug-and-play optimization strategy for text-driven 3D human-human interaction generation. It simplifies high-resolution body meshes to geometric primitives to enable efficient collision detection, identifies collision regions to guide optimization directions, and integrates collision-aware physical constraints into existing generative models. Extensive cross-dataset and cross-model experiments are claimed to show reduced interpenetration and improved visual coherence and physical plausibility over state-of-the-art methods.

Significance. If the central claims hold, PhysiGen offers a computationally efficient, general-purpose approach to a pervasive problem in multi-person motion synthesis. The plug-and-play design and cross-model validation are strengths that could make physical plausibility improvements accessible without retraining or heavy mesh-level losses.

major comments (2)
  1. [§3.2] §3.2 (Method, primitive approximation): The claim that simplifying high-resolution meshes to geometric primitives 'greatly reduce[s] the cost of inter-person collision detection' while still providing effective guidance relies on the unquantified assumption that the approximation preserves locations and extents of actual penetrations (especially non-convex contacts involving hands/limbs). No error bounds, missed-collision rates, or ablation on primitive choice (e.g., spheres vs. capsules) are reported; this is load-bearing for the headline improvements in physical plausibility.
  2. [§4] §4 (Experiments): The cross-dataset and cross-model results are presented as showing 'significant' improvements, but without reported statistical significance tests, variance across runs, or direct comparison of collision metrics before/after PhysiGen on the same base model outputs, it is difficult to isolate the contribution of the collision guidance from other optimization factors.
minor comments (2)
  1. [§3] Notation for the collision threshold and optimization strength parameters should be explicitly listed as free parameters in the method section for reproducibility.
  2. [Figures] Figure captions for qualitative results should include the specific base model and dataset for each example to aid comparison.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor in our method and evaluation. We address each major comment point-by-point below, providing our response and indicating planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Method, primitive approximation): The claim that simplifying high-resolution meshes to geometric primitives 'greatly reduce[s] the cost of inter-person collision detection' while still providing effective guidance relies on the unquantified assumption that the approximation preserves locations and extents of actual penetrations (especially non-convex contacts involving hands/limbs). No error bounds, missed-collision rates, or ablation on primitive choice (e.g., spheres vs. capsules) are reported; this is load-bearing for the headline improvements in physical plausibility.

    Authors: We appreciate the referee's emphasis on quantifying the fidelity of the primitive approximation. In the manuscript, we selected capsules as the primary primitive because they efficiently model the cylindrical geometry of limbs and torso while supporting fast signed-distance queries for collision detection. The collision region identification step further focuses optimization on detected contact areas rather than relying solely on global primitive overlap. While the original submission does not include explicit error bounds or missed-collision statistics, the cross-model and cross-dataset results show consistent reductions in interpenetration metrics, indicating practical effectiveness. To strengthen this, we will add in the revision: (1) an ablation comparing spheres, capsules, and ellipsoids on approximation error (Hausdorff distance to original mesh and penetration depth error), (2) missed-collision rates measured against full-mesh ground-truth detection on held-out interaction samples, and (3) visualizations of preserved vs. missed contacts, particularly for hand/limb regions. These additions will provide the requested error bounds and confirm that the approximation supports reliable guidance. revision: yes

  2. Referee: [§4] §4 (Experiments): The cross-dataset and cross-model results are presented as showing 'significant' improvements, but without reported statistical significance tests, variance across runs, or direct comparison of collision metrics before/after PhysiGen on the same base model outputs, it is difficult to isolate the contribution of the collision guidance from other optimization factors.

    Authors: We agree that additional statistical controls and direct before/after comparisons would better isolate PhysiGen's contribution. The current experiments apply PhysiGen as a post-processing step on outputs from multiple base models and report aggregate improvements, but do not include run-to-run variance or formal significance testing. In the revised manuscript we will: (1) report mean ± standard deviation for all quantitative metrics (including collision volume and contact ratio) across at least three random seeds, (2) add a dedicated table comparing collision metrics on identical base-model sequences before and after PhysiGen optimization, and (3) include paired t-test p-values to establish statistical significance of the observed reductions in interpenetration. These changes will clarify the specific impact of the collision-aware constraints. revision: yes
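The capsule signed-distance query referred to in the first response can be sketched as follows (illustrative parameters, not the paper's): the signed distance from a point to the capsule around segment a-b is the distance to the segment minus the radius, negative inside.

```python
import numpy as np

def capsule_sdf(point, a, b, radius):
    """Signed distance from `point` to the capsule around segment a-b
    (negative when the point lies inside the capsule)."""
    ab = b - a
    t = np.clip(np.dot(point - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t * ab                   # nearest point on the core segment
    return float(np.linalg.norm(point - closest) - radius)

# A unit-length vertical "limb" of radius 0.1.
a = np.array([0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(capsule_sdf(np.array([0.3, 0.5, 0.0]), a, b, 0.1))  # positive: outside
print(capsule_sdf(np.array([0.0, 0.5, 0.0]), a, b, 0.1))  # negative: inside
```

A query like this costs a single clamped projection per capsule pair, which is why capsules support fast collision checks compared with mesh-level tests.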
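The reporting promised in the second response can be sketched with hypothetical per-seed collision volumes (not the paper's results); the p-value would then come from the t distribution, e.g. via scipy.stats.ttest_rel.

```python
import math
import statistics

# Hypothetical collision volumes per seed, before/after PhysiGen on the
# same base-model sequences.
before = [4.1, 3.8, 4.4]
after = [1.2, 1.0, 1.5]

for name, xs in [("before", before), ("after", after)]:
    print(f"{name}: {statistics.mean(xs):.2f} ± {statistics.stdev(xs):.2f}")

# Paired t statistic on per-seed differences; significance would be read
# off the t distribution with len(diffs) - 1 degrees of freedom.
diffs = [x - y for x, y in zip(before, after)]
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
print(f"paired t = {t:.2f}, df = {len(diffs) - 1}")
```

Pairing by seed is the key control: it removes between-seed variance and isolates the effect of the collision-aware optimization itself.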

Circularity Check

0 steps flagged

No circularity: PhysiGen is an independent optimization technique with no self-referential derivations

full rationale

The paper introduces PhysiGen as a plug-and-play optimization strategy that simplifies high-resolution meshes to geometric primitives for efficient collision detection and uses identified collision regions to guide optimization directions. This approach is presented as a general-purpose method integrable into existing models, with claims validated through cross-dataset and cross-model experiments rather than any internal derivation chain. No fitted parameters renamed as predictions, self-citations as load-bearing premises, uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described method. The central contribution reduces to a practical engineering choice (primitive approximation for speed) whose effectiveness is externally tested, not defined into existence by the inputs themselves. The claims are therefore grounded in external benchmarks rather than in a self-referential derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper's approach rests on the domain assumption that mesh simplification to primitives is valid for this task, with potential free parameters in the optimization process.

free parameters (1)
  • Collision threshold or optimization strength parameters
    Likely parameters for defining collision regions or optimization strength, though not specified in abstract.
axioms (1)
  • domain assumption: Geometric primitives can accurately represent human body collisions for optimization purposes
    Central to reducing computational cost while maintaining effectiveness.

pith-pipeline@v0.9.0 · 5493 in / 1335 out tokens · 51223 ms · 2026-05-09T19:34:24.503550+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 9 canonical work pages · 5 internal anchors
