Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics

Haiyu Zhang; Haoyuan Jin; Jianshu Hu; Shujia Li; Xinyuan Chen; Yaohui Wang; Yunpeng Jiang; Yutong Ban

arxiv: 2606.22806 · v1 · pith:BFDYDMNPnew · submitted 2026-06-22 · 💻 cs.CV


Shujia Li, Jianshu Hu, Haiyu Zhang, Yunpeng Jiang, Haoyuan Jin, Xinyuan Chen, Yaohui Wang, Yutong Ban

Pith reviewed 2026-06-26 09:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interaction · diffusion models · physics simulation · reinforcement learning · data generation · generalization · motion retargeting · HOI synthesis


The pith

A pipeline generating HOI training data from reinforcement learning policies in a physics simulator lets diffusion models generalize to unseen objects and long time horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that training reinforcement learning policies inside a physics simulator can produce large amounts of task-oriented synthetic data for human-object interactions, which after retargeting can train diffusion models more effectively than motion capture data alone. This addresses the limits of expensive, low-diversity real datasets by scaling data generation through simulation. If the approach holds, the resulting models would handle interactions with new objects, sustain physical consistency over extended sequences, and show more varied yet plausible motions. A sympathetic reader would care because it makes creating functional embodied avatars and virtual environments more practical without massive real-world data collection.

Core claim

The paper claims that its Policy-as-Data framework, which trains RL policies in a physics simulator to generate task-oriented HOI data and applies a coarse-to-fine retargeting process to match standard parametric body models, trains diffusion models that achieve enhanced generalization to unseen objects, long-horizon generation capability, greater dynamic diversity, and improved physical plausibility.

What carries the argument

The scalable pipeline that trains reinforcement learning policies in a physics simulator to generate synthetic HOI data and uses coarse-to-fine retargeting to align simulator outputs with generative model requirements.
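
The data flow described above can be sketched as a small composition of steps. This is only an illustrative skeleton, not the authors' code: the `Rollout` and `Clip` containers, the `build_training_set` function, the success filter, and the `retarget` callable are all hypothetical names and assumptions introduced here, and the real simulator states and body-model parameters are stood in for by plain tuples.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One policy rollout recorded in the simulator (hypothetical container)."""
    states: List[tuple]  # placeholder for per-frame rigid-body states
    object_id: str
    success: bool        # whether the task goal was reached (assumed to be logged)


@dataclass
class Clip:
    """One training clip in the parametric-body representation (hypothetical)."""
    poses: List[tuple]   # placeholder for per-frame body-model poses
    object_id: str
    source: str          # "sim" or "mocap"


def build_training_set(
    rollouts: Sequence[Rollout],
    retarget: Callable[[List[tuple]], List[tuple]],
    mocap_clips: Sequence[Clip],
) -> List[Clip]:
    """Retarget successful rollouts and append them to the mocap clips.

    Dropping failed rollouts is an assumption made for this sketch; the
    source only says the generated data is task-oriented.
    """
    sim_clips = [
        Clip(poses=retarget(r.states), object_id=r.object_id, source="sim")
        for r in rollouts
        if r.success
    ]
    return list(mocap_clips) + sim_clips
```

Keeping a `source` tag on every clip would make it straightforward to later ablate the synthetic portion or to reweight it against the motion-capture portion.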

If this is right

The trained diffusion models can produce interactions with objects absent from any real training data.
Generated sequences maintain consistency and physical rules across longer time horizons than prior approaches.
Motions display increased dynamic variety while respecting simulator-derived constraints.
The method reduces dependence on scaled-up motion capture collections for HOI tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same simulation-driven data generation could apply to training models for multi-person or tool-use scenarios.
It points toward hybrid pipelines where simulation supplies the bulk of data and limited real captures provide fine-tuning.
Direct transfer tests onto physical robots would reveal whether the generated motions remain valid outside simulation.
The framework suggests simulation can systematically address data scarcity across other physics-constrained generative tasks.

Load-bearing premise

The coarse-to-fine retargeting process accurately maps simplified simulator body representations to standard parametric models while preserving physical validity and task success.

What would settle it

A side-by-side test showing that models trained on the generated data produce more object interpenetrations or lower task completion rates on unseen objects than models trained on motion capture data would falsify the generalization and plausibility claims.
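
A minimal sketch of how such a side-by-side test could be scored is given below. It is not the paper's evaluation protocol. It assumes the object is an axis-aligned box with an analytic signed distance function, that the body is represented by sampled surface points per frame, and that task success is supplied as a boolean per sequence; `box_sdf`, `penetration_stats`, `summarize`, and `claim_contradicted` are hypothetical helper names.

```python
import math


def box_sdf(point, center, half_extents):
    """Signed distance from a 3D point to an axis-aligned box (negative inside)."""
    q = [abs(p - c) - h for p, c, h in zip(point, center, half_extents)]
    outside = math.sqrt(sum(max(x, 0.0) ** 2 for x in q))
    inside = min(max(q), 0.0)
    return outside + inside


def penetration_stats(frames, center, half_extents, tol=1e-3):
    """frames: list of frames, each a list of 3D body surface points.

    Returns (fraction of frames penetrating deeper than tol,
             mean over frames of the deepest penetration).
    """
    depths = [
        max(0.0, -min(box_sdf(p, center, half_extents) for p in frame))
        for frame in frames
    ]
    rate = sum(d > tol for d in depths) / len(depths)
    return rate, sum(depths) / len(depths)


def summarize(sequences, successes, center, half_extents):
    """Aggregate penetration rate and task success over a set of test sequences."""
    rates = [penetration_stats(s, center, half_extents)[0] for s in sequences]
    return {
        "penetration_rate": sum(rates) / len(rates),
        "success_rate": sum(successes) / len(successes),
    }


def claim_contradicted(sim_trained, mocap_trained):
    """True if the simulation-trained model is worse on either criterion."""
    return (
        sim_trained["penetration_rate"] > mocap_trained["penetration_rate"]
        or sim_trained["success_rate"] < mocap_trained["success_rate"]
    )
```

In practice the comparison would be run on objects held out from both training sets, and a real object mesh would need a mesh-based distance query in place of the box function assumed here.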

Figures

Figures reproduced from arXiv: 2606.22806 by Haiyu Zhang, Haoyuan Jin, Jianshu Hu, Shujia Li, Xinyuan Chen, Yaohui Wang, Yunpeng Jiang, Yutong Ban.

**Figure 2.** Overview of PAD-HOI. Our paradigm uses physics simulator to overcome MoCap data scarcity. (a) Physics-Based Data Synthesis: RL experts interact with procedurally randomized object geometries in simulator, generating a massive dataset of physically valid trajectories. (b) Coarse-to-Fine Retargeting: A retargeting module translates the simulator’s rigid-body states into the high-fidelity SMPL pose space to…

**Figure 3.** Qualitative results on objects in Dsim. PAD-HOI generates highly realistic, physically plausible interactions with procedurally generated objects. The model accurately adapts the human pose to varying object shapes and scales, maintaining strict surface contacts without unnatural penetrations.

**Figure 4.** Qualitative results of multi-skill long-horizon generation.

**Figure 5.** Qualitative comparison of coarse-to-fine retargeting strategies.

Original abstract

Synthesizing realistic Human-Object Interactions (HOI) is critical for creating embodied avatars and functional virtual environments. However, current data-driven approaches primarily rely on motion capture datasets, which are expensive to scale and limited in functional diversity. Models trained with these datasets fail to generalize to unseen objects and maintain physical consistency over long horizons. In this paper, we propose a novel framework that leverages a physics simulator to overcome the data-scarcity bottleneck in HOI generation. Specifically, we propose a scalable pipeline, called \ours, which leverages policies trained with reinforcement learning in a physics simulator for task-oriented data generation and trains a generative model on the augmented dataset for generalizable HOI generation. To seamlessly utilize the synthetic data, we introduce a coarse-to-fine retargeting process that bridges the representation gap between the simplified model used in physics simulator and the standard parametric body models required for generative training. Validated through comprehensive experiments, our method demonstrates enhanced generalization to unseen objects and the capability of long-horizon generation, while exhibiting greater dynamic diversity and physical plausibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's main contribution is generating HOI data via RL policies in a physics simulator plus coarse-to-fine retargeting for diffusion training, but the abstract supplies no metrics to check whether retargeting preserves the claimed physical advantages.


The paper proposes training RL policies in a simulator to produce task-oriented HOI trajectories, then retargeting those outputs so they can train a diffusion model on standard body representations. This is presented as a way around the scaling and diversity limits of motion-capture datasets. The policy-as-data framing and the specific retargeting step are the concrete novelties; they directly address generalization to unseen objects and long-horizon consistency.

The approach is sensible on paper. Using a simulator lets you generate functional interactions at scale without new mocap sessions, and the coarse-to-fine retargeting is a practical bridge between simplified physics bodies and the parametric models diffusion models expect.

The main weakness is that none of the quantitative support is visible. The abstract asserts enhanced generalization, dynamic diversity, and physical plausibility, yet gives no baselines, ablation results, contact-error numbers, or penetration statistics. Without those, it is impossible to tell whether the retargeting step actually keeps the contact dynamics and velocities that the policies were optimized to produce. If the mapping introduces distortions, the downstream diffusion model is trained on data whose physical benefits are not real. The stress-test note correctly flags this as the least-secured link.

This work is aimed at researchers in HOI generation and embodied simulation. It deserves peer review because the pipeline is a clear, reproducible idea that targets a recognized bottleneck, even though the current draft will need the experimental details and validation metrics added before the claims can be assessed.

Referee Report

2 major / 1 minor

Summary. The paper introduces a framework (Policy-as-Data) that trains RL policies inside a physics simulator to produce task-oriented HOI trajectories, applies a coarse-to-fine retargeting step to map the simplified simulator body to standard parametric models, augments existing mocap data with the retargeted trajectories, and trains a diffusion model on the combined corpus, claiming improved generalization to unseen objects, long-horizon generation, greater dynamic diversity, and physical plausibility.

Significance. If the retargeting step demonstrably preserves contact dynamics and long-term consistency, the simulation-driven data pipeline would offer a scalable route to overcoming the limited functional diversity of motion-capture datasets for HOI generation.

major comments (2)

[coarse-to-fine retargeting description] The coarse-to-fine retargeting process is introduced precisely to close the representation gap between the simulator body and parametric models required for diffusion training, yet the manuscript supplies no quantitative validation (contact-force error, penetration statistics, kinematic fidelity, or velocity preservation metrics) that the retargeted sequences retain the task-oriented properties optimized by the RL policies. This validation is load-bearing for the central claim of improved physical plausibility and generalization.
[experimental validation] The abstract states that "comprehensive experiments" support enhanced generalization to unseen objects and long-horizon generation, but the provided text contains no reported metrics, baselines, ablation studies, or error analysis, preventing assessment of whether the claimed advantages are realized.
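
Some of the kinematic checks requested in the first major comment can be sketched as follows. This is an illustration under stated assumptions, not a metric suite from the manuscript: it assumes the simulator trajectory and the retargeted trajectory share a joint ordering and frame rate, represents each trajectory as nested lists of 3D points indexed `[frame][joint]`, and approximates contact by a distance threshold to a single object point. The function names are hypothetical. Contact-force error is not covered, since it would need force logs from the simulator rather than positions alone.

```python
import math


def mean_joint_error(src, dst):
    """Mean Euclidean distance between corresponding joints over all frames."""
    total, count = 0.0, 0
    for frame_src, frame_dst in zip(src, dst):
        for a, b in zip(frame_src, frame_dst):
            total += math.dist(a, b)
            count += 1
    return total / count


def velocities(traj, dt):
    """Forward finite-difference joint velocities, one frame shorter than traj."""
    return [
        [tuple((b[k] - a[k]) / dt for k in range(3)) for a, b in zip(f0, f1)]
        for f0, f1 in zip(traj, traj[1:])
    ]


def velocity_error(src, dst, dt):
    """Mean magnitude of the velocity difference between the two trajectories."""
    return mean_joint_error(velocities(src, dt), velocities(dst, dt))


def contact_agreement(src, dst, joint, obj_point, radius):
    """Fraction of frames where both trajectories agree on whether
    `joint` lies within `radius` of `obj_point` (a crude contact proxy)."""
    agree = sum(
        (math.dist(fs[joint], obj_point) <= radius)
        == (math.dist(fd[joint], obj_point) <= radius)
        for fs, fd in zip(src, dst)
    )
    return agree / len(src)
```

A constant offset between the two bodies leaves `velocity_error` at zero while still lowering `contact_agreement`, which is why position, velocity, and contact checks are worth reporting separately.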

minor comments (1)

[abstract] The acronym \ours is used in the abstract without an explicit expansion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important gaps in validation and reporting that we will address in the revision. We respond to each point below.


Referee: [coarse-to-fine retargeting description] The coarse-to-fine retargeting process is introduced precisely to close the representation gap between the simulator body and parametric models required for diffusion training, yet the manuscript supplies no quantitative validation (contact-force error, penetration statistics, kinematic fidelity, or velocity preservation metrics) that the retargeted sequences retain the task-oriented properties optimized by the RL policies. This validation is load-bearing for the central claim of improved physical plausibility and generalization.

Authors: We agree that quantitative validation of the retargeting step is necessary to substantiate claims of preserved task-oriented dynamics and physical plausibility. The current manuscript does not include these metrics. In the revised version we will add contact-force error, penetration statistics, kinematic fidelity, and velocity preservation metrics comparing retargeted trajectories to the original simulator outputs, along with analysis showing retention of RL-optimized properties. revision: yes
Referee: [experimental validation] The abstract states that "comprehensive experiments" support enhanced generalization to unseen objects and long-horizon generation, but the provided text contains no reported metrics, baselines, ablation studies, or error analysis, preventing assessment of whether the claimed advantages are realized.

Authors: We acknowledge that the manuscript text as provided lacks the detailed experimental metrics, baselines, ablations, and error analysis referenced in the abstract. This is a reporting omission. The revised manuscript will include a full experimental section with quantitative results, baseline comparisons, ablation studies, and error analysis to support the claims of improved generalization, long-horizon generation, dynamic diversity, and physical plausibility. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline is externally grounded.


The paper presents an engineering pipeline that generates synthetic HOI trajectories via RL policies inside an external physics simulator, applies a coarse-to-fine retargeting step to map simplified bodies onto parametric models, and then trains a diffusion model on the resulting dataset. No equations, fitted parameters, or self-citations are shown that would make any claimed prediction or generalization result equivalent to its own inputs by construction. The retargeting procedure is introduced as an independent preprocessing choice rather than a self-defining or load-bearing assumption, and the central claims rest on the simulator and RL components being independent of the diffusion stage. This is the normal case of a self-contained applied method without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the retargeting process and simulator fidelity are implicit modeling choices whose details are not stated.



Reference graph

Works this paper leans on

73 extracted references · 6 canonical work pages

[1]

In: Computer Graphics Forum

Aristidou, A., Lasenby, J., Chrysanthou, Y., Shamir, A.: Inverse kinematics tech- niques in computer graphics: A survey. In: Computer Graphics Forum. vol. 37(6), pp. 35–58 (2018)

2018
[2]

In: SIGGRAPH Asia 2024 Conference Papers

Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024
[3]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

Bhatnagar, B.L., Xie, X., Petrov, I.A., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Behave: Dataset and method for tracking human object interactions. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 15935–15946 (2022)

2022
[4]

arXiv preprint arXiv:2311.15127 (2023)

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

Pith/arXiv arXiv 2023
[5]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Cong, P., Wang, Z., Ma, Y., Yue, X.: Semgeomo: Dynamic contextual human motion generation with semantic and geometric guidance. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17561–17570 (2025)

2025
[6]

In: The Fourteenth Inter- national Conference on Learning Representations (2026),https://openreview

Deng, Z., Shi, Y., Ji, K., Xu, L., Huang, S., Wang, J.: Human-object interaction via automatically designed VLM-guided motion policy. In: The Fourteenth Inter- national Conference on Learning Representations (2026),https://openreview. net/forum?id=LfkPlFTfe0

2026
[7]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Diller, C., Dai, A.: Cg-hoi: Contact-guided 3d human-object interaction generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19888–19901 (2024)

2024
[8]

arXiv preprint arXiv:2309.11351 (2023)

Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: C·ase: Learning condi- tional adversarial skill embeddings for physics-based characters. arXiv preprint arXiv:2309.11351 (2023)

arXiv 2023
[9]

In: Forty-first international conference on machine learning (2024)

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

2024
[10]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges, O.: Arctic: A dataset for dexterous bimanual hand-object manipulation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 12943–12954 (2023)

2023
[11]

In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C

Gao, J., Wang, Z., Xiao, Z., Wang, J., Wang, T., Cao, J., Hu, X., Liu, S., Dai, J., Pang, J.: Coohoi: Learning cooperative human-object interaction with 16 S. Li et al. manipulated object dynamics. In: Advances in Neural Information Process- ing Systems. pp. 79741–79763 (2024).https://doi.org/10.52202/079017- 2532,https : / / proceedings . neurips . cc / ...

work page doi:10.52202/079017- 2024
[12]

Geng, Z., Hayder, Z., Liu, W., Mian, A.S.: Auto-regressive diffusion for generating 3dhuman-objectinteractions.In:ProceedingsoftheAAAIConferenceonArtificial Intelligence. vol. 39, pp. 3131–3139 (2025)

2025
[13]

In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision

Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M.J.: Stochastic scene-aware motion prediction. In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision. pp. 11374–11384 (2021)

2021
[14]

In: Proceedings of the IEEE/CVF international conference on computer vision

Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambi- guities with 3d scene constraints. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2282–2292 (2019)

2019
[15]

InACM SIGGRAPH 2023 Conference Proceedings

Hassan, M., Guo, Y., Wang, T., Black, M., Fidler, S., Peng, X.B.: Synthesiz- ing physical character-scene interactions. In: ACM SIGGRAPH 2023 Confer- ence Proceedings. SIGGRAPH ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/3588432.3591525,https: //doi.org/10.1145/3588432.3591525

work page doi:10.1145/3588432.3591525 2023
[16]

arXiv preprint arXiv:2210.02303 (2022)

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

Pith/arXiv arXiv 2022
[17]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020
[18]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, S., Wang, Z., Li, P., Jia, B., Liu, T., Zhu, Y., Liang, W., Zhu, S.C.: Diffusion-based generation, optimization, and planning in 3d scenes. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16750–16761 (2023)

2023
[19]

International Journal of Computer Vision132(7), 2551–2566 (Jul 2024).https: //doi.org/10.1007/s11263-024-01984-1,https://doi.org/10.1007/s11263- 024-01984-1

Huang, Y., Taheri, O., Black, M.J., Tzionas, D.: InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction from Multi-view RGB-D Images. International Journal of Computer Vision132(7), 2551–2566 (Jul 2024).https: //doi.org/10.1007/s11263-024-01984-1,https://doi.org/10.1007/s11263- 024-01984-1

work page doi:10.1007/s11263-024-01984-1 2024
[20]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Ji, B., Pan, Y., Liu, Z., Tan, S., Jin, X., Yang, X.: Pomp: Physics-consistent motion generative model through phase manifolds. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22690–22701 (2025)

2025
[21]

In: Proceedings of European Conference on Computer Vision

Jiang, J., Streli, P., Qiu, H., Fender, A., Laich, L., Snape, P., Holz, C.: Avatarposer: Articulated full-body pose tracking from sparse motion sensing. In: Proceedings of European Conference on Computer Vision. Springer (2022)

2022
[22]

In: ICCV (2023)

Jiang, N., Liu, T., Cao, Z., Cui, J., Chen, Y., Wang, H., Zhu, Y., Huang, S.: Full-body articulated human-object interaction. In: ICCV (2023)

2023
[23]

In: European Conference on Computer Vision

Li, J., Clegg, A., Mottaghi, R., Wu, J., Puig, X., Liu, C.K.: Controllable human- object interaction synthesis. In: European Conference on Computer Vision. pp. 54–72. Springer (2024)

2024
[24]

ACM Transactions on Graphics (TOG)42(6), 1–11 (2023)

Li, J., Wu, J., Liu, C.K.: Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG)42(6), 1–11 (2023)

2023
[25]

arXiv preprint arXiv:2506.15483 (2025)

Li, S., Zhang, H., Chen, X., Wang, Y., Ban, Y.: Genhoi: Generalizing text- driven 4d human-object interaction synthesis for unseen objects. arXiv preprint arXiv:2506.15483 (2025)

arXiv 2025
[26]

arXiv (2025) Abbreviated paper title 17

Lin, Y., Xie, Y., Xie, J., Huang, Y., Wang, R., Lv, J., Ma, Y., Zuo, X.: Sim- genhoi: Physically realistic whole-body humanoid-object interaction via generative modeling and reinforcement learning. arXiv (2025) Abbreviated paper title 17

2025
[27]

In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference

Liu, Y., Zhang, C., Xing, R., Tang, B., Yang, B., Yi, L.: Core4d: A 4d human- object-human interaction dataset for collaborative object rearrangement. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 1769– 1782 (2025)

2025
[28]

In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2

Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2. Association for Computing Machinery, New York, NY, USA, 1 edn. (2023),https://doi.org/10.1145/3596711.3596800

work page doi:10.1145/3596711.3596800 2023
[29]

arXiv preprint arXiv:2503.20118 (2025)

Lou, Y., Wang, Y., Wu, Z., Zhao, R., Wang, W., Shi, M., Komura, T.: Zero- shot human-object interaction synthesis with multimodal priors. arXiv preprint arXiv:2503.20118 (2025)

arXiv 2025
[30]

ACM Transactions on Graphics45(2), 1–18 (2025)

Lu, J., Zhang, H., Ye, Y., Shiratori, T., Starke, S., Komura, T.: Choice: Coordi- nated human-object interaction in cluttered environments for pick-and-place ac- tions. ACM Transactions on Graphics45(2), 1–18 (2025)

2025
[31]

Advances in Neural Information Processing Systems37, 2161–2184 (2024)

Luo, Z., Cao, J., Christen, S., Winkler, A., Kitani, K., Xu, W.: Omnigrasp: Grasp- ing diverse objects with simulated humanoids. Advances in Neural Information Processing Systems37, 2161–2184 (2024)

2024
[32]

Luo, Z., Cao, J., Kitani, K., Xu, W., et al.: Perpetual humanoid control for real- timesimulatedavatars.In:ProceedingsoftheIEEE/CVFInternationalConference on Computer Vision. pp. 10895–10904 (2023)

2023
[33]

In: The Twelfth Inter- national Conference on Learning Representations (2024),https://openreview

Luo, Z., Cao, J., Merel, J., Winkler, A., Huang, J., Kitani, K.M., Xu, W.: Universal humanoid motion representations for physics-based control. In: The Twelfth Inter- national Conference on Learning Representations (2024),https://openreview. net/forum?id=OrOd8PxOO2

2024
[34]

In: Advances in Neural Information Processing Systems (2021)

Luo, Z., Hachiuma, R., Yuan, Y., Kitani, K.: Dynamics-regulated kinematic pol- icy for egocentric pose estimation. In: Advances in Neural Information Processing Systems (2021)

2021
[35]

In: Advances in Neural Information Processing Systems (2022)

Luo, Z., Iwase, S., Yuan, Y., Kitani, K.: Embodied scene-aware human pose esti- mation. In: Advances in Neural Information Processing Systems (2022)

2022
[36]

ArXivabs/2206.09286(2022)

Luo, Z., Yuan, Y., Kitani, K.M.: From universal humanoid control to automatic physically valid character creation. ArXivabs/2206.09286(2022)

arXiv 2022
[37]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Pan, L., Yang, Z., Dou, Z., Wang, W., Huang, B., Dai, B., Komura, T., Wang, J.: Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5379–5391 (2025)

2025
[38]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single im- age. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019)

2019
[39]

arXiv preprint arXiv:2312.06553 (2023)

Peng, X., Xie, Y., Wu, Z., Jampani, V., Sun, D., Jiang, H.: Hoi-diff: Text-driven synthesis of 3d human-object interactions using diffusion models. arXiv preprint arXiv:2312.06553 (2023)

arXiv 2023
[40]

arXiv preprint arXiv:2510.13794 (2025),https://arxiv.org/abs/ 2510.13794

Peng, X.B.: Mimickit: A reinforcement learning framework for motion imitation and control. arXiv preprint arXiv:2510.13794 (2025),https://arxiv.org/abs/ 2510.13794

arXiv 2025
[41]

ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

Peng, X.B., Abbeel, P., Levine, S., Van de Panne, M.: Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

2018
[42]

Li et al

Peng, X.B., Guo, Y., Halper, L., Levine, S., Fidler, S.: Ase: Large-scale reusable adversarialskillembeddingsforphysicallysimulatedcharacters.ACMTransactions On Graphics (TOG)41(4), 1–17 (2022) 18 S. Li et al

2022
[43]

ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

Peng, X.B., Ma, Z., Abbeel, P., Levine, S., Kanazawa, A.: Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

2021
[44]

arXiv preprint arXiv:2209.14988 (2022)

Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)

Pith/arXiv arXiv 2022
[45]

In: Proceedings of the IEEE/CVF international conference on computer vision

Prokudin, S., Lassner, C., Romero, J.: Efficient learning on point clouds with basis point sets. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4332–4341 (2019)

2019
[46]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021
[47]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

2022
[48]

arXiv preprint arXiv:2201.02610 (2022)

Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610 (2022)

arXiv 2022
[49]

Advances in neural information processing systems35, 36479–36494 (2022)

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in neural information processing systems35, 36479–36494 (2022)

2022
[50]

In: ACM SIGGRAPH Asia 2025 Conference Proceedings (2025)

Tessler, C., Jiang, Y., Coumans, E., Luo, Z., Chechik, G., Peng, X.B.: Masked- manipulator: Versatile whole-body manipulation. In: ACM SIGGRAPH Asia 2025 Conference Proceedings (2025)

2025
[51]

In: ACM SIGGRAPH 2023 conference proceedings

Tessler, C., Kasten, Y., Guo, Y., Mannor, S., Chechik, G., Peng, X.B.: Calm: Conditional adversarial latent models for directable virtual characters. In: ACM SIGGRAPH 2023 conference proceedings. pp. 1–9 (2023)

2023
[52]

arXiv preprint arXiv:2410.03441 (2024)

Tevet, G., Raab, S., Cohan, S., Reda, D., Luo, Z., Peng, X.B., Bermano, A.H., van de Panne, M.: Closd: Closing the loop between simulation and diffusion for multi-task character control. arXiv preprint arXiv:2410.03441 (2024)

arXiv 2024
[53]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SJ1kSyO2jwu

Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-or, D., Bermano, A.H.: Human motion diffusion model. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SJ1kSyO2jwu

2023
[54]

Npga: Neural parametric gaussian avatars

Truong, T.E., Piseno, M., Xie, Z., Liu, K.: Pdp: Physics-based character animation via diffusion policy. SA ’24, Association for Computing Machinery, New York, NY, USA (2024).https://doi.org/10.1145/3680528.3687683,https://doi.org/ 10.1145/3680528.3687683

work page doi:10.1145/3680528.3687683 2024
[55]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang, Y., Zhao, Q., Yu, R., Tsui, H.W., Zeng, A., Lin, J., Luo, Z., Yu, J., Li, X., Chen, Q., et al.: Skillmimic: Learning basketball interaction skills from demon- strations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17540–17549 (2025)

2025
[56]

Advances in Neural Informa- tion Processing Systems35, 14959–14971 (2022)

Wang, Z., Chen, Y., Liu, T., Zhu, Y., Liang, W., Huang, S.: Humanise: Language- conditioned human motion generation in 3d scenes. Advances in Neural Informa- tion Processing Systems35, 14959–14971 (2022)

2022
[57]

arXiv preprint arXiv:2403.11208 (2024)

Wu, Q., Shi, Y., Huang, X., Yu, J., Xu, L., Wang, J.: Thor: Text to human-object interaction diffusion via relation intervention. arXiv preprint arXiv:2403.11208 (2024)

arXiv 2024
[58]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025) Abbreviated paper title 19

Wu, Y., Karunratanakul, K., Luo, Z., Tang, S.: Uniphys: Unified planner and con- troller with diffusion for flexible physics-based character control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025) Abbreviated paper title 19

2025
[59]

In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) (October 2025)

Wu, Z., Li, J., Xu, P., Liu, C.K.: Human-object interaction from human-level in- structions. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) (October 2025)

[60]

Xiao, Z., Wang, T., Wang, J., Cao, J., Zhang, W., Dai, B., Lin, D., Pang, J.: Unified human-scene interaction via prompted chain-of-contacts. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=1vCnDyQkjg

[61]

Xu, M., Shi, Y., Yin, K., Peng, X.B.: Parc: Physics-based augmentation with reinforcement learning for character controllers. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–11 (2025)

[62]

Xu, P., Xie, K., Andrews, S., Kry, P.G., Neff, M., McGuire, M., Karamouzas, I., Zordan, V.: AdaptNet: Policy adaptation for physics-based character control. ACM Transactions on Graphics 42(6) (2023). https://doi.org/10.1145/3618375

[63]

Xu, S., Li, D., Zhang, Y., Xu, X., Long, Q., Wang, Z., Lu, Y., Dong, S., Jiang, H., Gupta, A., et al.: Interact: Advancing large-scale versatile 3d human-object interaction generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7048–7060 (2025)

[64]

Xu, S., Li, Z., Wang, Y.X., Gui, L.Y.: Interdiff: Generating 3d human-object interactions with physics-informed diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14928–14940 (2023)

[65]

Yang, J., Niu, X., Jiang, N., Zhang, R., Huang, S.: F-hoi: Toward fine-grained semantic-aligned 3d human-object interactions (2024), https://arxiv.org/abs/2407.12435

[66]

Yi, H., Thies, J., Black, M.J., Peng, X.B., Rempe, D.: Generating human interaction motions in scenes with text control. In: European Conference on Computer Vision. pp. 246–263. Springer (2024)

[67]

Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529 2(3), 5 (2023)

[68]

Yu, R., Wang, Y., Zhao, Q., Tsui, H.W., Wang, J., Tan, P., Chen, Q.: Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demonstrations. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–11 (2025)

[69]

Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: Physdiff: Physics-guided human motion diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16010–16021 (2023)

[70]

Yue, J., Wang, Z., Wang, Y., Zeng, W., Wang, J., Xu, X., Zhang, Y., Zheng, S., Ding, Z., Lu, Z.: Rl from physical feedback: Aligning large motion models with humanoid control. arXiv preprint arXiv:2506.12769 (2025)

[71]

Zeng, L.A., Huang, G., Wei, Y.L., Gu, S., Tang, Y.M., Meng, J., Zheng, W.S.: Chainhoi: Joint-based kinematic chain modeling for human-object interaction generation. arXiv preprint arXiv:2503.13130 (2025)

[72]

Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view video diffusion model for 4d generation. Advances in Neural Information Processing Systems 37, 15272–15295 (2024)

[73]

A person approaches a chair/box/table, picks it up, and places it in the designated location

Zhang, X., Bhatnagar, B.L., Starke, S., Guzov, V., Pons-Moll, G.: Couch: Towards controllable human-chair interactions. In: European Conference on Computer Vision. pp. 518–535. Springer (2022)

Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics

Supplementary

A Expert Policy Training

This section details the...

