
arxiv: 2606.22806 · v1 · pith:BFDYDMNPnew · submitted 2026-06-22 · 💻 cs.CV

Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics

Pith reviewed 2026-06-26 09:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords human-object interaction · diffusion models · physics simulation · reinforcement learning · data generation · generalization · motion retargeting · HOI synthesis

The pith

A pipeline generating HOI training data from reinforcement learning policies in a physics simulator lets diffusion models generalize to unseen objects and long time horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that training reinforcement learning policies inside a physics simulator can produce large amounts of task-oriented synthetic data for human-object interactions, which after retargeting can train diffusion models more effectively than motion capture data alone. This addresses the limits of expensive, low-diversity real datasets by scaling data generation through simulation. If the approach holds, the resulting models would handle interactions with new objects, sustain physical consistency over extended sequences, and show more varied yet plausible motions. A sympathetic reader would care because it makes creating functional embodied avatars and virtual environments more practical without massive real-world data collection.

Core claim

The paper claims that its Policy-as-Data framework, which trains RL policies in a physics simulator to generate task-oriented HOI data and applies a coarse-to-fine retargeting process to match standard parametric body models, trains diffusion models that achieve enhanced generalization to unseen objects, long-horizon generation capability, greater dynamic diversity, and improved physical plausibility.

What carries the argument

The scalable pipeline that trains reinforcement learning policies in a physics simulator to generate synthetic HOI data and uses coarse-to-fine retargeting to align simulator outputs with generative model requirements.
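
The data flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_training_set` and every callable passed to it are hypothetical placeholders, and the task-success filter is an assumed step that the summary does not confirm.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

SimTraj = TypeVar("SimTraj")    # trajectory in the simulator's simplified body
BodyTraj = TypeVar("BodyTraj")  # trajectory in a parametric body model


def build_training_set(
    objects: Iterable[object],
    rollout_policy: Callable[[object], SimTraj],
    is_task_success: Callable[[SimTraj], bool],
    retarget: Callable[[SimTraj], BodyTraj],
    mocap_data: Sequence[BodyTraj],
) -> List[BodyTraj]:
    """Combine mocap sequences with retargeted simulator rollouts.

    All callables are hypothetical stand-ins for the RL policy rollout,
    a success check, and the coarse-to-fine retargeting step.
    """
    synthetic: List[BodyTraj] = []
    for obj in objects:
        traj = rollout_policy(obj)        # RL expert acts in the simulator
        if is_task_success(traj):         # assumed filter, not stated in the source
            synthetic.append(retarget(traj))
    # The generative model would then be trained on this augmented set.
    return list(mocap_data) + synthetic
```

Passing the stages in as callables keeps the sketch independent of any particular simulator, policy, or body model.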

If this is right

  • The trained diffusion models can produce interactions with objects absent from any real training data.
  • Generated sequences maintain consistency and physical rules across longer time horizons than prior approaches.
  • Motions display increased dynamic variety while respecting simulator-derived constraints.
  • The method reduces dependence on scaled-up motion capture collections for HOI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same simulation-driven data generation could apply to training models for multi-person or tool-use scenarios.
  • It points toward hybrid pipelines where simulation supplies the bulk of data and limited real captures provide fine-tuning.
  • Direct transfer tests onto physical robots would reveal whether the generated motions remain valid outside simulation.
  • The framework suggests simulation can systematically address data scarcity across other physics-constrained generative tasks.

Load-bearing premise

The coarse-to-fine retargeting process accurately maps simplified simulator body representations to standard parametric models while preserving physical validity and task success.

What would settle it

A side-by-side test showing that models trained on the generated data produce more object interpenetrations or lower task completion rates on unseen objects than models trained on motion capture data would falsify the generalization and plausibility claims.
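
One way to make that comparison concrete is sketched below. It assumes each evaluated sequence has already been reduced to a maximum penetration depth and a task-completion flag; the `SequenceResult` schema, the 1 cm tolerance, and the function names are illustrative assumptions rather than anything specified by the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SequenceResult:
    """Per-sequence evaluation record (hypothetical schema)."""
    max_penetration_m: float  # deepest human-object interpenetration, in metres
    task_completed: bool


def summarize(results: List[SequenceResult], tol_m: float = 0.01) -> Tuple[float, float]:
    """Return (penetration_rate, completion_rate) for a set of sequences."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    pen_rate = sum(r.max_penetration_m > tol_m for r in results) / n
    done_rate = sum(r.task_completed for r in results) / n
    return pen_rate, done_rate


def claim_is_contradicted(sim_trained: List[SequenceResult],
                          mocap_trained: List[SequenceResult],
                          tol_m: float = 0.01) -> bool:
    """True if the simulation-trained model penetrates more often or
    completes the task less often than the mocap-trained baseline."""
    sim_pen, sim_done = summarize(sim_trained, tol_m)
    moc_pen, moc_done = summarize(mocap_trained, tol_m)
    return sim_pen > moc_pen or sim_done < moc_done
```

A real evaluation would also need a significance test across seeds and objects; the raw comparison here only encodes the direction of the falsifying outcome.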

Figures

Figures reproduced from arXiv: 2606.22806 by Haiyu Zhang, Haoyuan Jin, Jianshu Hu, Shujia Li, Xinyuan Chen, Yaohui Wang, Yunpeng Jiang, Yutong Ban.

Figure 1: Physically Grounded Long-Horizon HOI Generation.
Figure 2: Overview of PAD-HOI. Our paradigm uses physics simulator to overcome MoCap data scarcity. (a) Physics-Based Data Synthesis: RL experts interact with procedurally randomized object geometries in simulator, generating a massive dataset of physically valid trajectories. (b) Coarse-to-Fine Retargeting: A retargeting module translates the simulator’s rigid-body states into the high-fidelity SMPL pose space to…
Figure 3: Qualitative results on objects in Dsim. PAD-HOI generates highly realistic, physically plausible interactions with procedurally generated objects. The model accurately adapts the human pose to varying object shapes and scales, maintaining strict surface contacts without unnatural penetrations. Ablation on Retargeting Quality. To verify how retargeting quality impacts downstream generation, we train our fr…
Figure 4: Qualitative results of multi skill long horizon generation.
Figure 5: Qualitative comparison of coarse to fine retargeting strategies.
Original abstract

Synthesizing realistic Human-Object Interactions (HOI) is critical for creating embodied avatars and functional virtual environments. However, current data-driven approaches primarily rely on motion capture datasets, which are expensive to scale and limited in functional diversity. Models trained with these datasets fail to generalize to unseen objects and maintain physical consistency over long horizons. In this paper, we propose a novel framework that leverages a physics simulator to overcome the data-scarcity bottleneck in HOI generation. Specifically, we propose a scalable pipeline, called \ours, which leverages policies trained with reinforcement learning in a physics simulator for task-oriented data generation and trains a generative model on the augmented dataset for generalizable HOI generation. To seamlessly utilize the synthetic data, we introduce a coarse-to-fine retargeting process that bridges the representation gap between the simplified model used in physics simulator and the standard parametric body models required for generative training. Validated through comprehensive experiments, our method demonstrates enhanced generalization to unseen objects and the capability of long-horizon generation, while exhibiting greater dynamic diversity and physical plausibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a framework (Policy-as-Data) that trains RL policies inside a physics simulator to produce task-oriented HOI trajectories, applies a coarse-to-fine retargeting step to map the simplified simulator body to standard parametric models, augments existing mocap data with the retargeted trajectories, and trains a diffusion model on the combined corpus, claiming improved generalization to unseen objects, long-horizon generation, greater dynamic diversity, and physical plausibility.

Significance. If the retargeting step demonstrably preserves contact dynamics and long-term consistency, the simulation-driven data pipeline would offer a scalable route to overcoming the limited functional diversity of motion-capture datasets for HOI generation.

major comments (2)
  1. [coarse-to-fine retargeting description] The coarse-to-fine retargeting process is introduced precisely to close the representation gap between the simulator body and parametric models required for diffusion training, yet the manuscript supplies no quantitative validation (contact-force error, penetration statistics, kinematic fidelity, or velocity preservation metrics) that the retargeted sequences retain the task-oriented properties optimized by the RL policies. This validation is load-bearing for the central claim of improved physical plausibility and generalization.
  2. [experimental validation] The abstract states that "comprehensive experiments" support enhanced generalization to unseen objects and long-horizon generation, but the provided text contains no reported metrics, baselines, ablation studies, or error analysis, preventing assessment of whether the claimed advantages are realized.
minor comments (1)
  1. [abstract] The method name appears in the abstract only as the unexpanded LaTeX macro \ours, with no explicit expansion.
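
Two of the validation metrics named in the first major comment, kinematic fidelity and velocity preservation, could be computed roughly as sketched here. This is a generic illustration, not the paper's protocol: it assumes a matched set of corresponding joints has already been extracted from both the simulator body and the parametric model, and the function names and the default frame rate are hypothetical.

```python
import math


def _check_same_shape(a, b):
    """Both sequences must have the same frame count and joints per frame."""
    if len(a) != len(b) or any(len(fa) != len(fb) for fa, fb in zip(a, b)):
        raise ValueError("sequences must have matching frame and joint counts")


def mean_joint_error(src, tgt):
    """Mean Euclidean distance between corresponding joints.

    src, tgt: list of frames; each frame is a list of (x, y, z) tuples.
    """
    _check_same_shape(src, tgt)
    dists = [math.dist(p, q) for fs, ft in zip(src, tgt) for p, q in zip(fs, ft)]
    if not dists:
        raise ValueError("need at least one frame with one joint")
    return sum(dists) / len(dists)


def finite_diff(seq, fps):
    """Per-joint velocities from forward differences between frames."""
    return [
        [tuple((b - a) * fps for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def velocity_error(src, tgt, fps=30.0):
    """Mean joint-velocity discrepancy; requires at least two frames."""
    _check_same_shape(src, tgt)
    return mean_joint_error(finite_diff(src, fps), finite_diff(tgt, fps))
```

Contact-force error and penetration statistics would additionally need simulator contact data and object geometry, which this sketch does not model.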

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important gaps in validation and reporting that we will address in the revision. We respond to each point below.

  1. Referee: [coarse-to-fine retargeting description] The coarse-to-fine retargeting process is introduced precisely to close the representation gap between the simulator body and parametric models required for diffusion training, yet the manuscript supplies no quantitative validation (contact-force error, penetration statistics, kinematic fidelity, or velocity preservation metrics) that the retargeted sequences retain the task-oriented properties optimized by the RL policies. This validation is load-bearing for the central claim of improved physical plausibility and generalization.

    Authors: We agree that quantitative validation of the retargeting step is necessary to substantiate claims of preserved task-oriented dynamics and physical plausibility. The current manuscript does not include these metrics. In the revised version we will add contact-force error, penetration statistics, kinematic fidelity, and velocity preservation metrics comparing retargeted trajectories to the original simulator outputs, along with analysis showing retention of RL-optimized properties. revision: yes

  2. Referee: [experimental validation] The abstract states that "comprehensive experiments" support enhanced generalization to unseen objects and long-horizon generation, but the provided text contains no reported metrics, baselines, ablation studies, or error analysis, preventing assessment of whether the claimed advantages are realized.

    Authors: We acknowledge that the manuscript text as provided lacks the detailed experimental metrics, baselines, ablations, and error analysis referenced in the abstract. This is a reporting omission. The revised manuscript will include a full experimental section with quantitative results, baseline comparisons, ablation studies, and error analysis to support the claims of improved generalization, long-horizon generation, dynamic diversity, and physical plausibility. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline is externally grounded.


The paper presents an engineering pipeline that generates synthetic HOI trajectories via RL policies inside an external physics simulator, applies a coarse-to-fine retargeting step to map simplified bodies onto parametric models, and then trains a diffusion model on the resulting dataset. No equations, fitted parameters, or self-citations are shown that would make any claimed prediction or generalization result equivalent to its own inputs by construction. The retargeting procedure is introduced as an independent preprocessing choice rather than a self-defining or load-bearing assumption, and the central claims rest on the simulator and RL components being independent of the diffusion stage. This is the normal case of a self-contained applied method without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the retargeting process and simulator fidelity are implicit modeling choices whose details are not stated.

pith-pipeline@v0.9.1-grok · 5742 in / 1068 out tokens · 24013 ms · 2026-06-26T09:38:20.091044+00:00 · methodology

