pith. machine review for the scientific record.

arxiv: 2605.00517 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

PhysiGen: Integrating Collision-Aware Physical Constraints for High-Fidelity Human-Human Interaction Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human-human interaction · collision avoidance · physical constraints · motion generation · 3D human synthesis · optimization · interpenetration reduction · plug-and-play method

The pith

PhysiGen reduces body interpenetration in AI-generated human interactions by using simplified geometric shapes to enforce physical collision constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhysiGen as a general-purpose strategy to fix a common problem in generating realistic multi-person 3D motions from text: bodies passing through each other. Previous approaches either ignored collisions or used slow mesh-level calculations. PhysiGen simplifies each person's high-resolution body mesh into basic geometric shapes to quickly detect collisions, then uses those detected collision areas to steer the optimization toward physically plausible movements. This method can be added to existing generation models without changing their core design. If effective, it means AI-generated interaction videos and animations can look more natural and usable in applications like virtual reality or robotics training.
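The primitive-proxy idea can be illustrated with a minimal sketch (not the authors' code): each body is approximated by a handful of spheres, and a collision is flagged wherever spheres belonging to different people overlap. The joint positions and radii below are hypothetical.

```python
import numpy as np

def detect_collisions(joints_a, joints_b, radii_a, radii_b):
    """Return (i, j, depth) for every overlapping sphere pair.

    joints_a: (Na, 3) sphere centres for person A; radii_a: (Na,) radii
    (likewise for person B). depth > 0 is the penetration depth.
    """
    diff = joints_a[:, None, :] - joints_b[None, :, :]    # (Na, Nb, 3)
    dist = np.linalg.norm(diff, axis=-1)                  # centre-to-centre distances
    depth = (radii_a[:, None] + radii_b[None, :]) - dist  # positive means overlap
    return [(int(i), int(j), float(depth[i, j])) for i, j in np.argwhere(depth > 0)]

# Toy bodies: two spheres each; only the first pair overlaps.
a = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[0.15, 0.0, 0.0], [2.0, 2.0, 2.0]])
ra = np.array([0.1, 0.1])
rb = np.array([0.1, 0.1])
print(detect_collisions(a, b, ra, rb))  # flags the one overlapping pair (0, 0)
```

The pairwise check is O(Na × Nb) over a few dozen primitives instead of thousands of mesh vertices, which is the cost reduction the paper targets.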

Core claim

The central claim is that simplifying human body meshes into geometric primitives for collision detection, combined with identifying collision regions to guide optimization, creates an efficient and effective way to integrate physical constraints into human-human interaction generation models, leading to reduced interpenetration and better visual and physical quality.

What carries the argument

The PhysiGen optimization strategy, which approximates high-resolution meshes with geometric primitives to compute inter-person collisions efficiently and directs the generation process using collision-region information.

Load-bearing premise

Approximating detailed human body meshes with simple geometric primitives captures enough collision information to guide effective optimization without overlooking important contact details or creating false positives.

What would settle it

Running the generation process with and without PhysiGen on the same model and inputs, then measuring the volume of interpenetrating body regions or counting collision events in the output sequences to see if the reduction is statistically significant.
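That test can be sketched as follows, using synthetic trajectories as stand-ins for model outputs: compute the same sphere-proxy penetration score on sequences "with" and "without" guidance-style separation and compare.

```python
import numpy as np

def penetration_score(seq_a, seq_b, radius=0.1):
    """Total positive penetration depth across all frames.

    seq_a, seq_b: (T, N, 3) joint trajectories for the two people, each
    joint approximated by a sphere of the given radius.
    """
    diff = seq_a[:, :, None, :] - seq_b[:, None, :, :]   # (T, N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)
    depth = np.clip(2 * radius - dist, 0.0, None)        # positive where spheres overlap
    return float(depth.sum())

# Synthetic stand-ins: 10 joints spaced 1 apart along y, over 50 frames.
base = np.zeros((50, 10, 3))
base[:, :, 1] = np.arange(10)
near = base + np.array([0.05, 0.0, 0.0])   # "without guidance": bodies interpenetrate
far = base + np.array([0.5, 0.0, 0.0])     # "with guidance": bodies separated
print(penetration_score(base, near), penetration_score(base, far))
```

In a real evaluation the two sequences would come from the same base model with identical prompts and seeds, so that any drop in the score is attributable to the collision guidance alone.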

Figures

Figures reproduced from arXiv: 2605.00517 by Fa-Ting Hong, Hui-Wen Huang, Liang Xu, Ling-An Zeng, Nan Lei, Wei-Shi Zheng, Yuan-Ming Li, Zhi-Wei Xia.

Figure 1
Figure 1: Our proposed PhysiGen can generate realistic human interaction motions with minimal interpenetration. Top: close-up views show physically plausible contact without severe collisions. Bottom: full motion sequence demonstrates fluid interaction and semantic consistency with the text.
Figure 2
Figure 2: Overview of the PhysiGen framework. For each detected collision point (red) on Person 1, we compute its corresponding antipodal target point (green) and derive a guidance direction (light blue arrow) to reduce interpenetration. PhysiGen guides the model to adjust the poses along these guidance directions to eliminate interpenetration.
Figure 3
Figure 3: Overview of the bounding box fitting process and the final volumetric proxy representation. The minimum distance between each bounding-box sampled point and its corresponding mesh-region points is summed as the fitting loss: L_fit = Σ_{j=1}^{M} Σ_{q∈S_j} min_{p∈P_j} ‖q − p‖², where M is the number of bounding boxes, S_j is the set of points sampled on box j, and P_j is the set of mesh-region points for box j.
Figure 4
Figure 4: Illustration of Multi-Region Simultaneous Collision.
Figure 5
Figure 5: Qualitative comparison with state-of-the-art methods.
Figure 6
Figure 6: User study comparing our method with baselines.
Figure 8
Figure 8: Qualitative comparison of generated motions.
Figure 9
Figure 9: Qualitative comparison of generated motions.
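The fitting loss in the Figure 3 caption can be written as a short sketch (toy point sets, not the paper's data): for each bounding box, every sampled point contributes its squared distance to the nearest point of the matched mesh region.

```python
import numpy as np

def fitting_loss(box_samples, mesh_regions):
    """box_samples[j]: (Sj, 3) points sampled on box j;
    mesh_regions[j]: (Pj, 3) points of the matched mesh region."""
    loss = 0.0
    for S, P in zip(box_samples, mesh_regions):
        d2 = ((S[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # (Sj, Pj) squared distances
        loss += d2.min(axis=1).sum()        # nearest mesh point for each sample
    return float(loss)

# Toy box with two sampled points and a two-point mesh region.
S1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P1 = np.array([[0.0, 0.1, 0.0], [1.0, 0.0, 0.0]])
print(fitting_loss([S1], [P1]))  # 0.1^2 + 0 = 0.01 (up to float rounding)
```

Minimizing this loss pulls each box toward its mesh region, which is what makes the volumetric proxy a faithful stand-in for collision purposes.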
read the original abstract

Despite substantial progress in text-driven 3D human motion synthesis, generating realistic multi-person interaction sequences remains challenging. Notably, body inter-penetration is a pervasive issue from both data acquisition to the generated results, which significantly undermines the realism and usability. Previous generative models either ignored this issue or introduced computationally expensive mesh-level loss functions to alleviate inter-body collisions. In this paper, we propose a general-purpose and computationally efficient optimization strategy named PhysiGen to explicitly integrate collision-aware physical constraints for human-human interaction generation. Specifically, we simplify the high-resolution human body mesh into geometric primitives to greatly reduce the cost of inter-person collision detection. Moreover, we identify the collision regions as the guidance of the optimization directions. PhysiGen is plug-and-play and can be readily integrated into existing human interaction generation models. Extensive cross-dataset and cross-model experiments show that our method can effectively reduce interpenetration and significantly improve visual coherence and physical plausibility compared to the state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PhysiGen, a plug-and-play optimization strategy for text-driven 3D human-human interaction generation. It simplifies high-resolution body meshes to geometric primitives to enable efficient collision detection, identifies collision regions to guide optimization directions, and integrates collision-aware physical constraints into existing generative models. Extensive cross-dataset and cross-model experiments are claimed to show reduced interpenetration and improved visual coherence and physical plausibility over state-of-the-art methods.

Significance. If the central claims hold, PhysiGen offers a computationally efficient, general-purpose approach to a pervasive problem in multi-person motion synthesis. The plug-and-play design and cross-model validation are strengths that could make physical plausibility improvements accessible without retraining or heavy mesh-level losses.

major comments (2)
  1. [§3.2] §3.2 (Method, primitive approximation): The claim that simplifying high-resolution meshes to geometric primitives 'greatly reduce[s] the cost of inter-person collision detection' while still providing effective guidance relies on the unquantified assumption that the approximation preserves locations and extents of actual penetrations (especially non-convex contacts involving hands/limbs). No error bounds, missed-collision rates, or ablation on primitive choice (e.g., spheres vs. capsules) are reported; this is load-bearing for the headline improvements in physical plausibility.
  2. [§4] §4 (Experiments): The cross-dataset and cross-model results are presented as showing 'significant' improvements, but without reported statistical significance tests, variance across runs, or direct comparison of collision metrics before/after PhysiGen on the same base model outputs, it is difficult to isolate the contribution of the collision guidance from other optimization factors.
minor comments (2)
  1. [§3] Notation for the collision threshold and optimization strength parameters should be explicitly listed as free parameters in the method section for reproducibility.
  2. [Figures] Figure captions for qualitative results should include the specific base model and dataset for each example to aid comparison.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor in our method and evaluation. We address each major comment point-by-point below, providing our response and indicating planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Method, primitive approximation): The claim that simplifying high-resolution meshes to geometric primitives 'greatly reduce[s] the cost of inter-person collision detection' while still providing effective guidance relies on the unquantified assumption that the approximation preserves locations and extents of actual penetrations (especially non-convex contacts involving hands/limbs). No error bounds, missed-collision rates, or ablation on primitive choice (e.g., spheres vs. capsules) are reported; this is load-bearing for the headline improvements in physical plausibility.

    Authors: We appreciate the referee's emphasis on quantifying the fidelity of the primitive approximation. In the manuscript, we selected capsules as the primary primitive because they efficiently model the cylindrical geometry of limbs and torso while supporting fast signed-distance queries for collision detection. The collision region identification step further focuses optimization on detected contact areas rather than relying solely on global primitive overlap. While the original submission does not include explicit error bounds or missed-collision statistics, the cross-model and cross-dataset results show consistent reductions in interpenetration metrics, indicating practical effectiveness. To strengthen this, we will add in the revision: (1) an ablation comparing spheres, capsules, and ellipsoids on approximation error (Hausdorff distance to original mesh and penetration depth error), (2) missed-collision rates measured against full-mesh ground-truth detection on held-out interaction samples, and (3) visualizations of preserved vs. missed contacts, particularly for hand/limb regions. These additions will provide the requested error bounds and confirm that the approximation supports reliable guidance. revision: yes

  2. Referee: [§4] §4 (Experiments): The cross-dataset and cross-model results are presented as showing 'significant' improvements, but without reported statistical significance tests, variance across runs, or direct comparison of collision metrics before/after PhysiGen on the same base model outputs, it is difficult to isolate the contribution of the collision guidance from other optimization factors.

    Authors: We agree that additional statistical controls and direct before/after comparisons would better isolate PhysiGen's contribution. The current experiments apply PhysiGen as a post-processing step on outputs from multiple base models and report aggregate improvements, but do not include run-to-run variance or formal significance testing. In the revised manuscript we will: (1) report mean ± standard deviation for all quantitative metrics (including collision volume and contact ratio) across at least three random seeds, (2) add a dedicated table comparing collision metrics on identical base-model sequences before and after PhysiGen optimization, and (3) include paired t-test p-values to establish statistical significance of the observed reductions in interpenetration. These changes will clarify the specific impact of the collision-aware constraints. revision: yes
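The capsule signed-distance query referred to in the first response can be sketched as follows (illustrative parameters, not the paper's): the signed distance from a point to the capsule around segment a-b is the distance to the segment minus the radius, negative inside.

```python
import numpy as np

def capsule_sdf(point, a, b, radius):
    """Signed distance from `point` to the capsule around segment a-b
    (negative when the point lies inside the capsule)."""
    ab = b - a
    t = np.clip(np.dot(point - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t * ab                   # nearest point on the core segment
    return float(np.linalg.norm(point - closest) - radius)

# A unit-length vertical "limb" of radius 0.1.
a = np.array([0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(capsule_sdf(np.array([0.3, 0.5, 0.0]), a, b, 0.1))  # positive: outside
print(capsule_sdf(np.array([0.0, 0.5, 0.0]), a, b, 0.1))  # negative: inside
```

A query like this costs a single clamped projection per capsule pair, which is why capsules support fast collision checks compared with mesh-level tests.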
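The reporting promised in the second response can be sketched with hypothetical per-seed collision volumes (not the paper's results); the p-value would then come from the t distribution, e.g. via scipy.stats.ttest_rel.

```python
import math
import statistics

# Hypothetical collision volumes per seed, before/after PhysiGen on the
# same base-model sequences.
before = [4.1, 3.8, 4.4]
after = [1.2, 1.0, 1.5]

for name, xs in [("before", before), ("after", after)]:
    print(f"{name}: {statistics.mean(xs):.2f} ± {statistics.stdev(xs):.2f}")

# Paired t statistic on per-seed differences; significance would be read
# off the t distribution with len(diffs) - 1 degrees of freedom.
diffs = [x - y for x, y in zip(before, after)]
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
print(f"paired t = {t:.2f}, df = {len(diffs) - 1}")
```

Pairing by seed is the key control: it removes between-seed variance and isolates the effect of the collision-aware optimization itself.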

Circularity Check

0 steps flagged

No circularity: PhysiGen is an independent optimization technique with no self-referential derivations

full rationale

The paper introduces PhysiGen as a plug-and-play optimization strategy that simplifies high-resolution meshes to geometric primitives for efficient collision detection and uses identified collision regions to guide optimization directions. This approach is presented as a general-purpose method integrable into existing models, with claims validated through cross-dataset and cross-model experiments rather than any internal derivation chain. No fitted parameters renamed as predictions, self-citations as load-bearing premises, uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described method. The central contribution reduces to a practical engineering choice (primitive approximation for speed) whose effectiveness is externally tested, not defined into existence by the inputs themselves. The claims are therefore grounded in external benchmarks rather than in a self-referential derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper's approach rests on the domain assumption that mesh simplification to primitives is valid for this task, with potential free parameters in the optimization process.

free parameters (1)
  • Collision threshold or optimization strength parameters
    Likely parameters for defining collision regions or optimization strength, though not specified in abstract.
axioms (1)
  • domain assumption: Geometric primitives can accurately represent human body collisions for optimization purposes
    Central to reducing computational cost while maintaining effectiveness.

pith-pipeline@v0.9.0 · 5493 in / 1335 out tokens · 51223 ms · 2026-05-09T19:34:24.503550+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 9 canonical work pages · 5 internal anchors
