pith. machine review for the scientific record.

arxiv: 2605.01036 · v1 · submitted 2026-05-01 · 💻 cs.CV

InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene

Pith reviewed 2026-05-09 19:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords human motion synthesis · physics-aware generation · dynamic scenes · contact force modeling · force and torque balance · human-object interaction

The pith

Soft physical constraints and a continuous distance-based force model generate physically plausible human motions in dynamic scenes with moving objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to produce sequences of human movement that obey physical laws when the person interacts with both fixed surroundings and objects that themselves move. Prior methods typically produce implausible results because they model contacts only at the hands or only with motionless surfaces. The authors introduce soft constraints that keep overall force and torque balanced across the body and a new way to compute contact forces from distances that works on any surface shape. If correct, this would let generated motions remain consistent with physics across the full range of human-related forces without relying on hard limits or later corrections.

Core claim

The central claim is that a physics-aware framework can synthesize human motions by explicitly modeling the full spectrum of forces—human-object, human-scene, and internal body dynamics—through soft constraints that enforce force and torque balance together with a novel continuous distance-based force model. This model extends contact handling to arbitrary surfaces and to interactions with dynamic, moving objects, yielding motions that are more physically grounded than those produced by methods limited to static scenes or hand-only contacts.

What carries the argument

The continuous distance-based force model, which computes interaction forces from distances to generalize contacts beyond hands or static surfaces and to include moving objects, paired with soft constraints that maintain force and torque balance.
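As a rough illustration of what a continuous distance-based contact force could look like, the sketch below computes a normal force that rises smoothly as a body point approaches a surface, directed along the local surface normal, plus a kinetic friction force whose magnitude scales with the normal force. The function name `contact_force`, the softplus-shaped falloff, and the constants `k`, `alpha`, and `mu` are all assumptions made for this example; the paper's actual formulation is not reproduced on this page. Using the velocity relative to the surface is one way such a model could cover moving objects.

```python
import math

def contact_force(dist, normal, rel_tangential_vel, k=100.0, alpha=50.0, mu=0.8):
    """Hypothetical smooth contact force for a single body point.

    dist: signed distance to the surface in metres (negative means penetration)
    normal: unit surface normal at the closest surface point (3-tuple)
    rel_tangential_vel: tangential velocity of the point relative to the
        surface (3-tuple); relative velocity lets the surface itself move
    k, alpha, mu: illustrative stiffness, sharpness, and friction coefficient
    """
    # Softplus magnitude: near zero far from the surface, roughly k * depth
    # once the point penetrates. Written in a numerically stable form.
    x = -alpha * dist
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    f_n_mag = k * softplus / alpha

    # Normal force acts along the local surface normal.
    f_n = tuple(f_n_mag * n for n in normal)

    # Kinetic friction opposes sliding and scales with the normal force.
    speed = math.sqrt(sum(v * v for v in rel_tangential_vel))
    if speed < 1e-9:
        f_t = (0.0, 0.0, 0.0)
    else:
        f_t = tuple(-mu * f_n_mag * v / speed for v in rel_tangential_vel)
    return f_n, f_t
```

Because the magnitude is a smooth function of distance, gradients exist both before and after contact, which is the property a soft-constraint optimizer would rely on. Static friction is left out of this sketch for brevity.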

If this is right

  • Motions respect the complete set of human-related forces including internal body dynamics.
  • Contact modeling extends to arbitrary surfaces and dynamic moving objects rather than being restricted to hands or static environments.
  • Physical plausibility improves markedly in complex scenes compared with earlier limited-contact methods.
  • The framework generalizes to new scenes while setting a benchmark for consistent human motion generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distance-based contact approach may reduce reliance on explicit collision detection routines in downstream animation pipelines.
  • Similar force modeling could transfer to generating interactions with additional classes of objects if the distance function is adjusted accordingly.

Load-bearing premise

Soft constraints plus the distance-based force model suffice to keep motions physically plausible in complex dynamic scenes without hard constraints, full rigid-body simulation, or post-processing corrections.

What would settle it

A concrete motion sequence generated by the method in a scene containing a moving object, where the human body penetrates the object or the net force and torque on the body fail to balance, would show the approach does not achieve its claimed physical consistency.
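A check of this kind could be scripted per frame. The sketch below, under stated assumptions, computes the residual of the Newton-Euler balance for a body treated as a single mass with known contact forces, then flags a frame when the residual or the penetration depth exceeds a tolerance. The function names, the input layout, and the tolerance values are hypothetical and not taken from the paper.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    """Euclidean norm of a 3-vector."""
    return sum(x * x for x in v) ** 0.5

def balance_residuals(mass, com, com_acc, ang_mom_rate, contacts,
                      gravity=(0.0, 0.0, -9.81)):
    """Return (force residual, torque residual) for one frame.

    contacts: list of (contact_point, contact_force) pairs, each a 3-tuple.
    Torques are taken about the centre of mass `com`.
    """
    net_f = [mass * g for g in gravity]   # gravity acts at the COM
    net_tau = [0.0, 0.0, 0.0]             # so it adds no torque about the COM
    for point, force in contacts:
        arm = tuple(p - c for p, c in zip(point, com))
        tau = cross(arm, force)
        for i in range(3):
            net_f[i] += force[i]
            net_tau[i] += tau[i]
    # Residuals vanish when m * a equals net force and dL/dt equals net torque.
    f_res = tuple(mass * a - f for a, f in zip(com_acc, net_f))
    tau_res = tuple(l - t for l, t in zip(ang_mom_rate, net_tau))
    return f_res, tau_res

def violates(f_res, tau_res, min_signed_dist,
             f_tol=1.0, tau_tol=1.0, pen_tol=0.01):
    """Flag a frame as physically inconsistent (tolerances are illustrative)."""
    return (norm(f_res) > f_tol
            or norm(tau_res) > tau_tol
            or min_signed_dist < -pen_tol)
```

For example, a 70 kg body standing still on two feet that each carry half its weight gives residuals near zero, while removing one foot's force without changing the motion leaves a large force residual and the frame is flagged.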

Figures

Figures reproduced from arXiv: 2605.01036 by Chaoyue Xing, Miaomiao Liu, Wei Mao.

Figure 1: Our Task. Our method takes 3D object motion and a 3D scene as input (a), to synthesize physically consistent 3D human motion interacting with both the moving object and the static background scene (b).
Figure 2: Continuous contact force model. a) The PhysPT model assumes a static ground plane and represents contact force with two independent orthogonal springs. b) Our model generalizes to arbitrary 3D surfaces by incorporating local surface normals for the normal force and explicitly modeling tangential static and kinetic friction that depend on the normal force, enabling physically consistent interactions i…
Figure 3: Overview of our pipeline. The input static scene …
Figure 4: Qualitative comparison on OMOMO. From left to right: object-only context, ground truth, our prediction, and predictions from …
Figure 5: Qualitative comparison on Trumans. Each row shows ground truth, the Trumans baseline, and our method. Arrows illustrate …
Figure 6: Ablation study comparison on OMOMO.
read the original abstract

This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Unlike existing works which mainly tend to generate physically unrealistic motions due to limited contact modeling, typically restricted to hands, in this paper, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics. Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InterPhys, a physics-aware framework for human motion synthesis in dynamic scenes. It explicitly models the full spectrum of human-related forces (human-object, human-scene, and internal body dynamics) by imposing soft physical constraints to maintain force and torque balance. A novel continuous distance-based force model is proposed to generalize contact modeling to arbitrary surfaces, including interactions with both static environments and dynamic moving objects. The authors claim that extensive experiments demonstrate significantly improved physical plausibility and generalization, setting a new benchmark for physically consistent motion generation.

Significance. If the central claims hold, this work would advance physics-informed motion synthesis by providing a flexible alternative to hard constraints or full rigid-body simulation, particularly through the continuous force model that handles dynamic object interactions. This could influence downstream applications in animation, robotics, and VR by reducing reliance on post-processing corrections while maintaining physical grounding.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (Method): The central claim that soft physical constraints plus the distance-based force model suffice to maintain force/torque balance in dynamic scenes is load-bearing but rests on an unverified assumption. Small per-step violations permitted by soft penalties can accumulate over time with moving objects, leading to implausibilities such as penetration or unbalanced torques; no section provides bounded residual analysis, long-horizon consistency metrics, or comparison against hard-constraint baselines to refute this risk.
  2. [§4] §4 (Experiments): The abstract asserts that 'extensive experiments' show improved plausibility and generalization, yet the provided text supplies no quantitative results, specific baselines, error tables, or ablation studies on the soft-constraint weights and distance-based scaling parameters. This absence makes it impossible to evaluate whether the method outperforms prior contact-limited approaches in complex dynamic scenes.
minor comments (2)
  1. [Abstract] The abstract would benefit from including one or two key quantitative metrics (e.g., force residual norms or contact accuracy) to ground the claims of 'significantly improves physical plausibility.'
  2. [§3.2] Notation for the continuous distance-based force model parameters (e.g., scaling factors) could be clarified with an explicit symbol table or definition list in §3.2.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed the concerns regarding the long-term stability of our soft-constraint formulation and the clarity of the experimental results. Revisions have been made to include additional analysis and to ensure all quantitative evaluations are explicitly presented.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Method): The central claim that soft physical constraints plus the distance-based force model suffice to maintain force/torque balance in dynamic scenes is load-bearing but rests on an unverified assumption. Small per-step violations permitted by soft penalties can accumulate over time with moving objects, leading to implausibilities such as penetration or unbalanced torques; no section provides bounded residual analysis, long-horizon consistency metrics, or comparison against hard-constraint baselines to refute this risk.

    Authors: We agree that explicit verification of long-term stability is important for soft-constraint methods. Our current experiments demonstrate that motions remain plausible without accumulating visible penetrations or torque imbalances over long sequences, thanks to the continuous distance-based force model that provides smooth gradients even for dynamic objects. However, we acknowledge the lack of formal bounded residual analysis in the original submission. In the revised manuscript we have added a dedicated stability analysis subsection reporting per-step and cumulative residual force/torque norms, maximum penetration depths, and long-horizon consistency metrics across 100+ frame sequences. We also include a limited comparison to a hard-constraint baseline, noting that hard constraints frequently cause solver divergence in scenes with moving objects, which motivated our soft formulation. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts that 'extensive experiments' show improved plausibility and generalization, yet the provided text supplies no quantitative results, specific baselines, error tables, or ablation studies on the soft-constraint weights and distance-based scaling parameters. This absence makes it impossible to evaluate whether the method outperforms prior contact-limited approaches in complex dynamic scenes.

    Authors: We apologize that the quantitative details were not sufficiently prominent in the reviewed version. The full §4 contains error tables comparing against multiple baselines (including prior contact-limited and physics-based methods), reporting metrics such as average contact force error, penetration volume, and torque imbalance. Ablation studies on soft-constraint weights and distance-based scaling parameters are also present and show clear sensitivity trends. We have revised the section to ensure all tables, baseline descriptions, and ablation results are explicitly referenced and placed before the qualitative results, making the performance gains in dynamic scenes immediately verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces a physics-aware framework that imposes soft constraints for force/torque balance and proposes a novel continuous distance-based force model for human-object and human-scene interactions. These elements are framed as extensions of external physics principles rather than reductions of outputs to inputs. No quoted equations or sections in the abstract or description demonstrate self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claim. The derivation remains independent and self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

Ledger estimated from abstract claims only; full paper would likely reveal additional learned parameters and implementation assumptions.

free parameters (2)
  • soft constraint weights
    Weights balancing the soft force and torque penalties are almost certainly tuned on data.
  • distance-based force scaling parameters
    Parameters controlling the continuous force function for arbitrary surfaces must be chosen or fitted.
axioms (2)
  • domain assumption: Force and torque balance is a necessary condition for physically plausible human motion
    Directly invoked to justify the soft constraints.
  • ad hoc to paper: A distance-based continuous function can adequately approximate contact forces on arbitrary surfaces
    Core modeling choice introduced without derivation from first principles.
invented entities (1)
  • continuous distance-based force model (no independent evidence)
    purpose: To generalize contact forces beyond discrete points to arbitrary and moving surfaces
    New modeling construct proposed in the paper; no independent evidence of its validity is supplied in the abstract.
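
To make the role of the two free-parameter groups concrete, one plausible shape for a soft-constraint objective is sketched below. Every symbol here is an assumption introduced for illustration: $\mathcal{L}_{\text{motion}}$ stands for whatever data or generation loss the method uses, $\lambda_F$ and $\lambda_\tau$ are the soft constraint weights, $m$ is body mass, $c_t$ the centre of mass at frame $t$, $g$ the gravitational acceleration, $L_t$ the angular momentum about the centre of mass, and $f_{i,t}$ the contact force at point $p_{i,t}$ as produced by a distance-based force model with its own scaling parameters. The paper's actual loss may differ.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{motion}}
  + \lambda_F \sum_t \Bigl\| m\,\ddot{c}_t - m\,g - \sum_i f_{i,t} \Bigr\|^2
  + \lambda_\tau \sum_t \Bigl\| \dot{L}_t - \sum_i (p_{i,t} - c_t) \times f_{i,t} \Bigr\|^2
```

Under this reading, both penalty terms are zero exactly when force and torque balance hold at every frame, and finite weights allow small residuals, which is what separates a soft constraint from a hard one.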

pith-pipeline@v0.9.0 · 5431 in / 1438 out tokens · 32446 ms · 2026-05-09T19:17:54.533955+00:00 · methodology

