arxiv: 2605.01036 · v1 · submitted 2026-05-01 · 💻 cs.CV

InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene

Chaoyue Xing , Wei Mao , Miaomiao Liu This is my paper

Pith reviewed 2026-05-09 19:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords human motion synthesisphysics-aware generationdynamic scenescontact force modelingforce and torque balancehuman-object interaction

0 comments

The pith

Soft physical constraints and a continuous distance-based force model generate physically plausible human motions in dynamic scenes with moving objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to produce sequences of human movement that obey physical laws when the person interacts with both fixed surroundings and objects that themselves move. Prior methods typically produce implausible results because they model contacts only at the hands or only with motionless surfaces. The authors introduce soft constraints that keep overall force and torque balanced across the body and a new way to compute contact forces from distances that works on any surface shape. If correct, this would let generated motions remain consistent with physics across the full range of human-related forces without relying on hard limits or later corrections.

Core claim

The central claim is that a physics-aware framework can synthesize human motions by explicitly modeling the full spectrum of forces—human-object, human-scene, and internal body dynamics—through soft constraints that enforce force and torque balance together with a novel continuous distance-based force model. This model extends contact handling to arbitrary surfaces and to interactions with dynamic, moving objects, yielding motions that are more physically grounded than those produced by methods limited to static scenes or hand-only contacts.

What carries the argument

The continuous distance-based force model, which computes interaction forces from distances to generalize contacts beyond hands or static surfaces and to include moving objects, paired with soft constraints that maintain force and torque balance.

If this is right

Motions respect the complete set of human-related forces including internal body dynamics.
Contact modeling extends to arbitrary surfaces and dynamic moving objects rather than being restricted to hands or static environments.
Physical plausibility improves markedly in complex scenes compared with earlier limited-contact methods.
The framework generalizes to new scenes while setting a benchmark for consistent human motion generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distance-based contact approach may reduce reliance on explicit collision detection routines in downstream animation pipelines.
Similar force modeling could transfer to generating interactions with additional classes of objects if the distance function is adjusted accordingly.

Load-bearing premise

Soft constraints plus the distance-based force model suffice to keep motions physically plausible in complex dynamic scenes without hard constraints, full rigid-body simulation, or post-processing corrections.

What would settle it

A concrete motion sequence generated by the method in a scene containing a moving object, where the human body penetrates the object or the net force and torque on the body fail to balance, would show the approach does not achieve its claimed physical consistency.

Figures

Figures reproduced from arXiv: 2605.01036 by Chaoyue Xing, Miaomiao Liu, Wei Mao.

**Figure 1.** Figure 1: Our Task. Our method takes 3D object motion and a 3D scene as input (a), to synthesize physically consistent 3D human motion interacting with both the moving object and the static background scene (b). into that of the human, yielding a unified formulation where the human motion is constrained not only by the scene but also by the dynamics of the moving object. To achieve physics-aware human motion synthes… view at source ↗

**Figure 2.** Figure 2: Continous contact force model. a) The PhysPT model assumes a static ground plane and represents contact force with two independent orthogonal springs, b) Our model generalizes to arbitrary 3D surfaces by incorporating local surface normals for the normal force and explicitly modeling tangential static and kinetic friction that are dependent to the normal force, enabling physically consistent interactions i… view at source ↗

**Figure 3.** Figure 3: Overview of our pipeline. The input static scene view at source ↗

**Figure 4.** Figure 4: Qualitative comparison on OMOMO. From left to right: object-only context, ground truth, our prediction, and predictions from view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on Trumans. Each row shows ground truth, the Turmans baseline, and our method. Arrows illustrate view at source ↗

**Figure 6.** Figure 6: Ablation study comparison on OMOMO view at source ↗

read the original abstract

This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Unlike existing works which mainly tend to generate physically unrealistic motions due to limited contact modeling, typically restricted to hands, in this paper, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics.~Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The continuous distance-based force model for arbitrary surfaces and dynamic objects is the actual new piece, but the physical-plausibility claims rest on unshown experiments and soft constraints that may accumulate errors.

read the letter

The main thing to know is that this paper puts forward a continuous distance-based force model to handle contacts with any surface and with moving objects, then layers soft constraints on top to keep force and torque roughly balanced. That combination is meant to produce human motions that feel more physically grounded than the hand-only or static-scene approaches in prior work. The idea of modeling the full set of human-object, human-scene, and internal forces in one framework is a clear step past the usual limited contact models. It also keeps the method lighter than full rigid-body simulation, which is practical for animation or robotics pipelines. The abstract says the experiments back this up with better plausibility and generalization, and the authors do engage with real physics ideas rather than just fitting data. That part is worth credit. The soft spots are more noticeable once you look at what is actually shown. The abstract gives no equations, no quantitative tables, no baselines, and no error analysis, so the “extensive experiments” claim is hard to evaluate from the available text. Soft penalties can let small force or torque residuals add up, especially when objects are moving and contact points shift fast; nothing in the summary demonstrates that those residuals stay bounded over long sequences. The stress-test note flags exactly this risk, and without the full results or ablation it is difficult to dismiss. The paper is aimed at graphics and vision researchers who need motion generators that respect physics without heavy simulation overhead. A reader working on animation, VR, or robot planning could get something useful from the force-modeling idea if the numbers hold. It is not a foundational theoretical advance, but the technical direction is honest and the citation pattern looks normal. I would send it to peer review. The core modeling choice is distinct enough that referees should see the full experiments and equations before deciding on acceptance.

Referee Report

2 major / 2 minor

Summary. The paper introduces InterPhys, a physics-aware framework for human motion synthesis in dynamic scenes. It explicitly models the full spectrum of human-related forces (human-object, human-scene, and internal body dynamics) by imposing soft physical constraints to maintain force and torque balance. A novel continuous distance-based force model is proposed to generalize contact modeling to arbitrary surfaces, including interactions with both static environments and dynamic moving objects. The authors claim that extensive experiments demonstrate significantly improved physical plausibility and generalization, setting a new benchmark for physically consistent motion generation.

Significance. If the central claims hold, this work would advance physics-informed motion synthesis by providing a flexible alternative to hard constraints or full rigid-body simulation, particularly through the continuous force model that handles dynamic object interactions. This could influence downstream applications in animation, robotics, and VR by reducing reliance on post-processing corrections while maintaining physical grounding.

major comments (2)

[Abstract, §3] Abstract and §3 (Method): The central claim that soft physical constraints plus the distance-based force model suffice to maintain force/torque balance in dynamic scenes is load-bearing but rests on an unverified assumption. Small per-step violations permitted by soft penalties can accumulate over time with moving objects, leading to implausibilities such as penetration or unbalanced torques; no section provides bounded residual analysis, long-horizon consistency metrics, or comparison against hard-constraint baselines to refute this risk.
[§4] §4 (Experiments): The abstract asserts that 'extensive experiments' show improved plausibility and generalization, yet the provided text supplies no quantitative results, specific baselines, error tables, or ablation studies on the soft-constraint weights and distance-based scaling parameters. This absence makes it impossible to evaluate whether the method outperforms prior contact-limited approaches in complex dynamic scenes.

minor comments (2)

[Abstract] The abstract would benefit from including one or two key quantitative metrics (e.g., force residual norms or contact accuracy) to ground the claims of 'significantly improves physical plausibility.'
[§3.2] Notation for the continuous distance-based force model parameters (e.g., scaling factors) could be clarified with an explicit symbol table or definition list in §3.2.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed the concerns regarding the long-term stability of our soft-constraint formulation and the clarity of the experimental results. Revisions have been made to include additional analysis and to ensure all quantitative evaluations are explicitly presented.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (Method): The central claim that soft physical constraints plus the distance-based force model suffice to maintain force/torque balance in dynamic scenes is load-bearing but rests on an unverified assumption. Small per-step violations permitted by soft penalties can accumulate over time with moving objects, leading to implausibilities such as penetration or unbalanced torques; no section provides bounded residual analysis, long-horizon consistency metrics, or comparison against hard-constraint baselines to refute this risk.

Authors: We agree that explicit verification of long-term stability is important for soft-constraint methods. Our current experiments demonstrate that motions remain plausible without accumulating visible penetrations or torque imbalances over long sequences, thanks to the continuous distance-based force model that provides smooth gradients even for dynamic objects. However, we acknowledge the lack of formal bounded residual analysis in the original submission. In the revised manuscript we have added a dedicated stability analysis subsection reporting per-step and cumulative residual force/torque norms, maximum penetration depths, and long-horizon consistency metrics across 100+ frame sequences. We also include a limited comparison to a hard-constraint baseline, noting that hard constraints frequently cause solver divergence in scenes with moving objects, which motivated our soft formulation. revision: yes
Referee: [§4] §4 (Experiments): The abstract asserts that 'extensive experiments' show improved plausibility and generalization, yet the provided text supplies no quantitative results, specific baselines, error tables, or ablation studies on the soft-constraint weights and distance-based scaling parameters. This absence makes it impossible to evaluate whether the method outperforms prior contact-limited approaches in complex dynamic scenes.

Authors: We apologize that the quantitative details were not sufficiently prominent in the reviewed version. The full §4 contains error tables comparing against multiple baselines (including prior contact-limited and physics-based methods), reporting metrics such as average contact force error, penetration volume, and torque imbalance. Ablation studies on soft-constraint weights and distance-based scaling parameters are also present and show clear sensitivity trends. We have revised the section to ensure all tables, baseline descriptions, and ablation results are explicitly referenced and placed before the qualitative results, making the performance gains in dynamic scenes immediately verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces a physics-aware framework that imposes soft constraints for force/torque balance and proposes a novel continuous distance-based force model for human-object and human-scene interactions. These elements are framed as extensions of external physics principles rather than reductions of outputs to inputs. No quoted equations or sections in the abstract or description demonstrate self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claim. The derivation remains independent and self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

Ledger estimated from abstract claims only; full paper would likely reveal additional learned parameters and implementation assumptions.

free parameters (2)

soft constraint weights
Weights balancing the soft force and torque penalties are almost certainly tuned on data.
distance-based force scaling parameters
Parameters controlling the continuous force function for arbitrary surfaces must be chosen or fitted.

axioms (2)

domain assumption Force and torque balance is a necessary condition for physically plausible human motion
Directly invoked to justify the soft constraints.
ad hoc to paper A distance-based continuous function can adequately approximate contact forces on arbitrary surfaces
Core modeling choice introduced without derivation from first principles.

invented entities (1)

continuous distance-based force model no independent evidence
purpose: To generalize contact forces beyond discrete points to arbitrary and moving surfaces
New modeling construct proposed in the paper; no independent evidence of its validity is supplied in the abstract.

pith-pipeline@v0.9.0 · 5431 in / 1438 out tokens · 32446 ms · 2026-05-09T19:17:54.533955+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023

Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and Gustav Eje Henter. Listen, denoise, action! audio-driven motion synthesis with diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–20, 2023. 1

2023
[2]

Behave: Dataset and method for tracking human object in- teractions

Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object in- teractions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15935– 15946, 2022. 2

2022
[3]

Esti- mating contact dynamics

Marcus A Brubaker, Leonid Sigal, and David J Fleet. Esti- mating contact dynamics. In2009 IEEE 12th International Conference on Computer Vision, pages 2389–2396. IEEE,
[4]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 1

2023
[5]

Detecting human-object contact in images

Yixin Chen, Sai Kumar Dwivedi, Michael J Black, and Dim- itrios Tzionas. Detecting human-object contact in images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17100–17110, 2023. 3

2023
[6]

Pico: Reconstructing 3d people in con- tact with objects

Alp ´ar Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Ar- jun S Lakshmipathy, Agniv Chatterjee, Michael J Black, and Dimitrios Tzionas. Pico: Reconstructing 3d people in con- tact with objects. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1783–1794, 2025. 3

2025
[7]

Cg-hoi: Contact-guided 3d human-object interaction generation

Christian Diller and Angela Dai. Cg-hoi: Contact-guided 3d human-object interaction generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19888–19901, 2024. 1

2024
[8]

Differentiable dynamics for articu- lated 3d human motion reconstruction

Erik G ¨artner, Mykhaylo Andriluka, Erwin Coumans, and Cristian Sminchisescu. Differentiable dynamics for articu- lated 3d human motion reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13190–13200, 2022. 2

2022
[9]

Tm2d: Bimodality driven 3d dance generation via music-text integration

Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zi- hang Jiang, Xinxin Zuo, Michael Bi Mi, and Xinchao Wang. Tm2d: Bimodality driven 3d dance generation via music-text integration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9942–9952, 2023. 1

2023
[10]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5152–5161, 2022

2022
[11]

Momask: Generative masked model- ing of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked model- ing of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024. 1

1900
[12]

Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3d human pose ambiguities with 3d scene constraints. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2

2019
[13]

Stochas- tic scene-aware motion prediction

Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael J Black. Stochas- tic scene-aware motion prediction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11374–11384, 2021. 2

2021
[14]

Synthesizing phys- ical character-scene interactions

Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Synthesizing phys- ical character-scene interactions. InACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023. 1, 2

2023
[15]

Nemf: Neural motion fields for kinematic an- imation.Advances in Neural Information Processing Sys- tems, 35:4244–4256, 2022

Chengan He, Jun Saito, James Zachary, Holly Rushmeier, and Yi Zhou. Nemf: Neural motion fields for kinematic an- imation.Advances in Neural Information Processing Sys- tems, 35:4244–4256, 2022. 7

2022
[16]

Intercap: joint markerless 3d tracking of hu- mans and objects in interaction from multi-view rgb-d im- ages.International Journal of Computer Vision, 132(7): 2551–2566, 2024

Yinghao Huang, Omid Taheri, Michael J Black, and Dim- itrios Tzionas. Intercap: joint markerless 3d tracking of hu- mans and objects in interaction from multi-view rgb-d im- ages.International Journal of Computer Vision, 132(7): 2551–2566, 2024. 2

2024
[17]

Primhoi: Compositional human-object interaction via reusable primitives

Kai Jia, Tengyu Liu, Mingtao Pei, Yixin Zhu, and Siyuan Huang. Primhoi: Compositional human-object interaction via reusable primitives. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11491– 11501, 2025. 2

2025
[18]

Scaling up dynamic human-scene interaction mod- eling

Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction mod- eling. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 1737–1747,
[19]

Guided motion diffusion for controllable human motion synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2151–2162, 2023. 1

2023
[20]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Object motion guided human motion synthesis.ACM Transactions on Graphics (TOG), 42(6):1–11, 2023

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis.ACM Transactions on Graphics (TOG), 42(6):1–11, 2023. 1, 2, 3, 5, 6, 7, 8

2023
[22]

Karen Liu

Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C. Karen Liu. Controllable human-object interaction synthesis. InECCV, 2024. 1, 2, 3, 6, 7

2024
[23]

Genzi: Zero-shot 3d human-scene in- teraction generation

Lei Li and Angela Dai. Genzi: Zero-shot 3d human-scene in- teraction generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 20465–20474, 2024. 2

2024
[24]

Ai choreographer: Music conditioned 3d dance generation with aist++

Ruilong Li, Shan Yang, David A Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InProceedings of the IEEE/CVF international conference on computer vision, pages 13401– 13412, 2021. 1

2021
[25]

Smpl: A skinned multi- person linear model

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi- person linear model. InSeminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023. 3

2023
[26]

Posegpt: Quantization-based 3d human mo- tion generation and forecasting

Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Gr´egory Rogez. Posegpt: Quantization-based 3d human mo- tion generation and forecasting. InEuropean Conference on Computer Vision, pages 417–435. Springer, 2022. 1

2022
[27]

Himo: A new benchmark for full-body human interacting with multiple objects

Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, et al. Himo: A new benchmark for full-body human interacting with multiple objects. InEuropean Conference on Computer Vision, pages 300–318. Springer, 2024. 1

2024
[28]

Contact-aware human motion forecasting.Ad- vances in Neural Information Processing Systems, 35:7356– 7367, 2022

Wei Mao, Richard I Hartley, Mathieu Salzmann, and Miao- miao Liu. Contact-aware human motion forecasting.Ad- vances in Neural Information Processing Systems, 35:7356– 7367, 2022. 2

2022
[29]

Catch & carry: reusable neural controllers for vision-guided whole-body tasks.ACM Trans- actions on Graphics (TOG), 39(4):39–1, 2020

Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever, Vu Pham, Tom Erez, Greg Wayne, and Nicolas Heess. Catch & carry: reusable neural controllers for vision-guided whole-body tasks.ACM Trans- actions on Graphics (TOG), 39(4):39–1, 2020. 1, 2

2020
[30]

Contact-invariant optimization for hand manipulation

Igor Mordatch, Zoran Popovi ´c, and Emanuel Todorov. Contact-invariant optimization for hand manipulation. In Proceedings of the ACM SIGGRAPH/Eurographics sympo- sium on computer animation, pages 137–144, 2012. 3

2012
[31]

Animating human lower limbs us- ing contact-invariant optimization.ACM Transactions on Graphics (TOG), 32(6):1–8, 2013

Igor Mordatch, Jack M Wang, Emanuel Todorov, and Vladlen Koltun. Animating human lower limbs us- ing contact-invariant optimization.ACM Transactions on Graphics (TOG), 32(6):1–8, 2013. 3

2013
[32]

To- kenhsi: Unified synthesis of physical human-scene inter- actions through task tokenization

Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, and Jingbo Wang. To- kenhsi: Unified synthesis of physical human-scene inter- actions through task tokenization. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5379–5391, 2025. 2

2025
[33]

Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019. 8

2019
[34]

Action- conditioned 3d human motion synthesis with transformer vae

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Action- conditioned 3d human motion synthesis with transformer vae. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 10985–10995, 2021. 1

2021
[35]

Temos: Generating diverse human motions from textual descriptions

Mathis Petrovich, Michael J Black, and G ¨ul Varol. Temos: Generating diverse human motions from textual descriptions. InEuropean Conference on Computer Vision, pages 480–
[36]

Finephys: Fine-grained hu- man action generation by explicitly incorporating physical laws for effective skeletal guidance

Dian Shao, Mingfei Shi, Shengda Xu, Haodong Chen, Yon- gle Huang, and Binglu Wang. Finephys: Fine-grained hu- man action generation by explicitly incorporating physical laws for effective skeletal guidance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1905–1916, 2025. 2

1905
[37]

Bailando: 3d dance generation by actor-critic gpt with choreographic memory

Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3d dance generation by actor-critic gpt with choreographic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11050– 11059, 2022. 1

2022
[38]

Human motion diffu- sion model

Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffu- sion model. InThe Eleventh International Conference on Learning Representations, 2023. 1

2023
[39]

Deco: Dense estimation of 3d human-scene contact in the wild

Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, and Michael J Black. Deco: Dense estimation of 3d human-scene contact in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8001–8013, 2023. 3

2023
[40]

3d hu- man pose estimation via intuitive physics

Shashank Tripathi, Lea M ¨uller, Chun-Hao P Huang, Omid Taheri, Michael J Black, and Dimitrios Tzionas. 3d hu- man pose estimation via intuitive physics. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4713–4725, 2023. 2

2023
[41]

Humos: Human motion model conditioned on body shape

Shashank Tripathi, Omid Taheri, Christoph Lassner, Michael Black, Daniel Holden, and Carsten Stoll. Humos: Human motion model conditioned on body shape. InEuropean Con- ference on Computer Vision, pages 133–152. Springer, 2024. 2

2024
[42]

Edge: Editable dance generation from music

Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 448–458, 2023. 1

2023
[43]

Synthesizing long-term 3d human motion and in- teraction in 3d scenes

Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiao- long Wang. Synthesizing long-term 3d human motion and in- teraction in 3d scenes. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 9401–9411, 2021. 2

2021
[44]

Physhoi: Physics-based imitation of dynamic human-object interaction.arXiv preprint arXiv:2312.04393, 2023

Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-based imita- tion of dynamic human-object interaction.arXiv preprint arXiv:2312.04393, 2023. 1

work page arXiv 2023
[45]

Humanise: Language-conditioned hu- man motion generation in 3d scenes.Advances in Neural Information Processing Systems, 35:14959–14971, 2022

Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Humanise: Language-conditioned hu- man motion generation in 3d scenes.Advances in Neural Information Processing Systems, 35:14959–14971, 2022. 2

2022
[46]

Human- object interaction from human-level instructions

Zhen Wu, Jiaman Li, Pei Xu, and C Karen Liu. Human- object interaction from human-level instructions. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 11176–11186, 2025. 1

2025
[47]

Visibility aware human-object interaction tracking from sin- gle rgb camera

Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Visibility aware human-object interaction tracking from sin- gle rgb camera. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4757– 4768, 2023. 3

2023
[48]

In- tertrack: Tracking human object interaction without object templates

Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. In- tertrack: Tracking human object interaction without object templates. In2025 International Conference on 3D Vision (3DV), pages 1427–1439. IEEE, 2025. 3

2025
[49]

Hierarchical planning and control for box loco-manipulation.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 6(3):1– 18, 2023

Zhaoming Xie, Jonathan Tseng, Sebastian Starke, Michiel van de Panne, and C Karen Liu. Hierarchical planning and control for box loco-manipulation.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 6(3):1– 18, 2023. 1, 2

2023
[50]

Scene-aware human motion forecasting via mutual distance prediction

Chaoyue Xing, Wei Mao, and Miaomiao Liu. Scene-aware human motion forecasting via mutual distance prediction. In European Conference on Computer Vision, pages 128–144. Springer, 2024. 2

2024
[51]

InterDiff: Generating 3d human-object interactions with physics-informed diffusion

Sirui Xu, Zhengyuan Li, Yu-Xiong Wang, and Liang-Yan Gui. InterDiff: Generating 3d human-object interactions with physics-informed diffusion. InICCV, 2023. 1, 2, 3, 6, 7

2023
[52]

Inter- dreamer: Zero-shot text to 3d dynamic human-object inter- action.Advances in Neural Information Processing Systems, 37:52858–52890, 2024

Sirui Xu, Yu-Xiong Wang, Liangyan Gui, et al. Inter- dreamer: Zero-shot text to 3d dynamic human-object inter- action.Advances in Neural Information Processing Systems, 37:52858–52890, 2024. 1, 2

2024
[53]

Interact: Advancing large-scale versatile 3d human-object interaction generation

Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, et al. Interact: Advancing large-scale versatile 3d human-object interaction generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7048–7060, 2025. 1, 2, 6, 7

2025
[54]

Intermimic: Towards universal whole-body control for physics-based human-object interactions

Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, and Liang-Yan Gui. Intermimic: Towards universal whole-body control for physics-based human-object interactions. InCVPR, 2025. 1, 2

2025
[55]

Guiding human-object interactions with rich geometry and relations

Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, and Changxing Ding. Guiding human-object interactions with rich geometry and relations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22714– 22723, 2025. 1

2025
[56]

Chainhoi: Joint-based kinematic chain modeling for human-object in- teraction generation

Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, and Wei-Shi Zheng. Chainhoi: Joint-based kinematic chain modeling for human-object in- teraction generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12358–12369, 2025

2025
[57]

Generating human motion from textual descrip- tions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descrip- tions with discrete representations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14730–14740, 2023. 1

2023
[58]

Physpt: Physics-aware pretrained transformer for estimating human dynamics from monocular videos

Yufei Zhang, Jeffrey O Kephart, Zijun Cui, and Qiang Ji. Physpt: Physics-aware pretrained transformer for estimating human dynamics from monocular videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2305–2317, 2024. 1, 2, 3, 4, 6, 8

2024
[59]

Incorporating physics principles for precise human motion prediction

Yufei Zhang, Jeffrey O Kephart, and Qiang Ji. Incorporating physics principles for precise human motion prediction. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, pages 6164–6174, 2024. 2

2024