ComplexMimic: Human-Scene Interaction Imitation in Complex 3D Environments

Hongwei Zhao; Lu Pan

arxiv: 2607.02034 · v1 · pith:O2F6OJXFnew · submitted 2026-07-02 · 💻 cs.CV

ComplexMimic: Human-Scene Interaction Imitation in Complex 3D Environments

Lu Pan , Hongwei Zhao This is my paper

Pith reviewed 2026-07-03 16:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-scene interactionphysics-based imitationmotion capture datacomplex 3D environmentsdual expert strategydifficulty-aware distillationcollision avoidanceembodied intelligence

0 comments

The pith

A dual-expert framework with difficulty-aware distillation reconstructs diverse human-scene interactions from imperfect motion data in complex 3D environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that physics-based imitation of human interactions with scenes can succeed in cluttered, realistic 3D spaces by recovering useful signal from noisy motion-capture recordings. It notes an inherent tension between following captured motion closely and producing collision-free, physically stable behavior. The solution trains one expert focused on motion fidelity and another focused on scene adaptation, then combines them through a distillation process that gives more weight to difficult trajectories according to observed failures and learning progress. A sympathetic reader would care because prior methods have been limited to simplified rooms or single objects, so extending reliable imitation to everyday environments would make embodied agents far more practical. If the claim holds, the approach would allow training on existing imperfect datasets without per-scene redesign.

Core claim

The authors claim that ComplexMimic reconstructs diverse human-scene interactions in complex environments by interpreting imperfect MoCap data through a Dual Flow Strategy that maintains an imitation expert for accurate motion tracking alongside an interaction expert for collision-aware adaptation, followed by difficulty-aware distillation that adaptively weights supervision toward hard-yet-learnable trajectories using failure statistics and learning progress signals; this combination outperforms prior state-of-the-art methods across three benchmark datasets.

What carries the argument

Dual Flow Strategy consisting of an imitation expert and an interaction expert, combined via difficulty-aware distillation that prioritizes challenging behaviors.

If this is right

Imitation learning becomes feasible in cluttered rather than simplified scenes without extra scene-specific engineering.
Training can leverage existing imperfect MoCap datasets more effectively than uniform multi-expert distillation.
The resulting policies produce both higher motion accuracy and better physical plausibility on standard benchmarks.
Embodied agents gain access to a wider range of natural interaction behaviors in realistic 3D settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of tracking and adaptation experts might transfer to other imitation domains that face noisy demonstration data, such as robot manipulation.
If the distillation weighting proves stable, it could reduce the need for manual curriculum design when scaling to larger scene collections.
Success here suggests that explicit difficulty signals from failure counts could be tested as a general regularizer in multi-task physics simulation.
Deployment in real robots would still require checking whether the learned collision avoidance survives sim-to-real gaps not addressed in the benchmarks.

Load-bearing premise

Imperfect motion-capture recordings contain enough recoverable information to support both precise motion tracking and collision-aware physical adaptation at the same time in complex scenes.

What would settle it

Running the method on a held-out collection of complex scenes with deliberately degraded MoCap data and finding that it produces either unnatural motions or frequent collisions while matching or falling below baseline performance.

Figures

Figures reproduced from arXiv: 2607.02034 by Hongwei Zhao, Lu Pan.

**Figure 1.** Figure 1: (a) Prior humanoid imitation-learning studies [20] typically focus on scenefree or overly simplified environments (upper row), whereas our work targets realistic HSI imitation in complex 3D scenes (lower row). (b) Feasibility–faithfulness trade-off in HSI imitation. Varying the early termination threshold τ during training reveals a clear trade-off: larger τ improves task feasibility (higher success rate)… view at source ↗

**Figure 2.** Figure 2: Overview of ComplexMimic, a two-stage framework for human–scene interaction imitation in complex environments. In Stage I, an imitation expert and an interaction expert are trained to achieve motion-faithful tracking and collision-aware adaptation, respectively. In Stage II, these experts are frozen and distilled into a unified policy via Difficulty-Aware Distillation (DAD). Black arrows denote policy upd… view at source ↗

**Figure 3.** Figure 3: Examples of return improvement under different motion cases. The first reference motion conflicts heavily with the scene meshes, resulting in limited return gains. In contrast, the second motion achieves better tracking performance and larger return gains, demonstrating the effectiveness of our learning progress signal. A simple and effective choice is to set sk(m) proportional to the motion difficulty, s… view at source ↗

**Figure 4.** Figure 4: Qualitative comparisons on the TRUMANS [16] dataset. Red boxes highlight typical failure modes under complicated scenes, such as undesired collisions, unstable contacts, and noticeable deviations from the reference. More visual results can be found in the supplementary material. 4.4 Ablation Study Effectiveness of Dual Flow Strategy. As shown in Tab. 4, removing either teacher degrades the performance sign… view at source ↗

**Figure 5.** Figure 5: Qualitative ablation comparisons on TRUMANS [16]. showing that (i) focusing updates on harder motions within each regime and (ii) balancing training across regimes are both beneficial. The learnability filtering is particularly important: removing it significantly degrades both Succ (from 0.906 to 0.877) and Eg-mpjpe (from 77.840 to 88.497), suggesting that suppressing hard but unlearnable clips is necessa… view at source ↗

**Figure 6.** Figure 6: Visualization of our method in Isaac Lab [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization under different body scales. J Discussion Limitations and Future Work. Although our method demonstrates strong performance on human–scene interaction imitation under complex 3D environments, it still has several limitations. First, our method cannot handle cases where the reference motion severely penetrates the scene meshes, as illustrated [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Examples of failure cases. The reference motion is severely penetrated with the 3D environments. in [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

read the original abstract

Physics-based Human-Scene Interaction (HSI) imitation learning is crucial for embodied intelligence as it bridges the gap between kinematic 3D motions and real-world dynamics. However, most existing methods focus on simplified scene settings, leaving complex environments largely unexplored, which limits their applicability in real-world scenarios. In this paper, we focus on HSI mimicry in complex environments. Under this complex setting, we observe an inherent trade-off between successfully performing interaction and maintaining natural, physically plausible motions. To address this challenge, we propose ComplexMimic, a framework that reconstructs diverse HSI by interpreting imperfect MoCap data. First, we introduce a Dual Flow Strategy, which learns two complementary experts: an imitation expert for accurate motion tracking and an interaction expert for collision-aware adaptation in complex scenes. Second, naive multi-expert distillation, which treats all experts equally, often under-samples challenging behaviors, limiting effective learning. To mitigate this issue, we propose a difficulty-aware distillation strategy that adaptively weights supervision and prioritizes hard-yet-learnable trajectories guided by failure statistics and learning progress signals. Extensive experiments on three benchmark datasets demonstrate that our approach outperforms current state-of-the-art methods. Our implementation is available at https://github.com/LuPan23/ComplexMimic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ComplexMimic targets the gap in HSI imitation for complex scenes with a dual-expert setup and difficulty-aware distillation, but the abstract supplies no metrics or ablations to check the outperformance claim.

read the letter

The main point is that this paper takes on human-scene interaction imitation in messy 3D environments instead of the usual simplified ones, using a Dual Flow Strategy with separate imitation and interaction experts plus a weighting scheme that focuses distillation on harder trajectories based on failure stats and progress.

What looks new is the explicit split into those two experts to handle the tracking-versus-collision trade-off, then the adaptive weighting to avoid under-sampling tough cases during multi-expert distillation. The abstract frames this as a direct response to a recognized limitation in prior work, and the GitHub release helps with checking the implementation.

The paper does a reasonable job naming the practical problem and offering a straightforward engineering fix without inventing new theory. The difficulty-aware part builds on existing curriculum ideas but applies them specifically to this HSI setting.

The soft spot is the evidence. The abstract states outperformance on three benchmarks yet gives no numbers, no ablation tables, no dataset descriptions, and no error bars. Without those details it is impossible to judge whether the gains come from the proposed components or from dataset-specific tuning. The core assumption—that imperfect MoCap data carries enough signal for both experts to succeed simultaneously in complex scenes—also stays untested in the provided text and would need failure analysis or real-world checks to hold up.

This is for people working on physics-based animation and robotics simulation who care about moving beyond toy scenes. A reader in that area could extract the dual-flow and weighting ideas even if the results require closer inspection.

I would send it to peer review because the setting is relevant and the method is described clearly enough to evaluate, though the authors should expect detailed questions on the experiments and validation.

Referee Report

1 major / 0 minor

Summary. The paper proposes ComplexMimic, a framework for physics-based human-scene interaction (HSI) imitation learning in complex 3D environments. It introduces a Dual Flow Strategy consisting of an imitation expert for motion tracking and an interaction expert for collision-aware adaptation, together with a difficulty-aware distillation method that weights supervision based on failure statistics and learning progress. The central claim is that this approach reconstructs diverse HSI from imperfect MoCap data and outperforms current state-of-the-art methods on three benchmark datasets.

Significance. If the empirical claims of outperformance hold with rigorous quantitative support, the work would address an important gap in handling complex scenes for embodied HSI, moving beyond simplified settings. The dual-expert and adaptive-distillation ideas are conceptually plausible for managing the noted trade-off between interaction success and motion plausibility. However, the provided text supplies no metrics, ablations, or dataset details, so significance cannot be assessed.

major comments (1)

[Abstract] Abstract: The assertion that the method 'outperforms current state-of-the-art methods' on three benchmark datasets is unsupported by any quantitative metrics, ablation results, error bars, dataset descriptions, or implementation details. This absence makes the central empirical claim unverifiable from the manuscript.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify our manuscript. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that the method 'outperforms current state-of-the-art methods' on three benchmark datasets is unsupported by any quantitative metrics, ablation results, error bars, dataset descriptions, or implementation details. This absence makes the central empirical claim unverifiable from the manuscript.

Authors: We agree that the abstract would benefit from explicit quantitative support to make the central claim immediately verifiable. The full manuscript contains these elements in the Experiments section (including performance tables on three benchmarks, ablation studies, error analysis, dataset details, and implementation information). We will revise the abstract to incorporate key quantitative highlights from those results. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and available description present an empirical framework (Dual Flow Strategy with imitation and interaction experts, plus difficulty-aware distillation) whose central claims rest on outperformance across three external benchmark datasets. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are referenced that would allow any result to reduce to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5753 in / 1034 out tokens · 25631 ms · 2026-07-03T16:04:47.737495+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 16 canonical work pages · 9 internal anchors

[1]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Collorone, L., Gioia, M., Pappa, M., Leoni, P., Ficarra, G., Litany, O., Spinelli, I., Galasso, F.: Monster: a unified model for motion, scene, text retrieval. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 10940– 10949 (2025)

2025
[2]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating di- verse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5152–5161 (2022)

2022
[3]

In: European Conference on Computer Vision

Guo, C., Zuo, X., Wang, S., Cheng, L.: Tm2t: Stochastic and tokenized model- ing for the reciprocal generation of 3d human motions and texts. In: European Conference on Computer Vision. pp. 580–597. Springer (2022)

2022
[4]

In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision

Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M.J.: Stochastic scene-aware motion prediction. In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision. pp. 11374–11384 (2021)

2021
[5]

In: Proceedings of the IEEE/CVF international conference on computer vision

Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambi- guities with 3d scene constraints. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2282–2292 (2019)

2019
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M.J.: Populating 3d scenes by learning human-scene interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14708–14718 (2021)

2021
[7]

In: ACM SIGGRAPH 2023 Conference Pro- ceedings

Hassan, M., Guo, Y., Wang, T., Black, M., Fidler, S., Peng, X.B.: Synthesizing physical character-scene interactions. In: ACM SIGGRAPH 2023 Conference Pro- ceedings. pp. 1–9 (2023)

2023
[8]

Available: https://arxiv.org/abs/2502.01143

He, T., Gao, J., Xiao, W., Zhang, Y., Wang, Z., Wang, J., Luo, Z., He, G., Soban- bab, N., Pan, C., et al.: Asap: Aligning simulation and real-world physics for learn- ing agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143 (2025)

work page arXiv 2025
[9]

Omnih2o: Universal and dexterous human- to-humanoid whole-body teleoperation and learning,

He,T.,Luo,Z.,He,X.,Xiao,W.,Zhang,C.,Zhang,W.,Kitani,K.,Liu,C.,Shi,G.: Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858 (2024)

work page arXiv 2024
[10]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015) 16 L. Pan and H. Zhao

work page internal anchor Pith review Pith/arXiv arXiv 2015
[11]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020
[12]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13274–13285 (2022)

2022
[13]

In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

Hwang,I.,Zhou,B.,Kim,Y.M.,Wang,J.,Guo,C.:Scenemi:Motionin-betweening for modeling human-scene interaction. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 6034–6045 (2025)

2025
[14]

Neural computation3(1), 79–87 (1991)

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural computation3(1), 79–87 (1991)

1991
[15]

In: SIGGRAPH Asia 2024 Conference Papers

Jiang, N., He, Z., Wang, Z., Li, H., Chen, Y., Huang, S., Zhu, Y.: Autonomous character-scene interaction synthesis from text instruction. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024
[16]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jiang, N., Zhang, Z., Li, H., Ma, X., Wang, Z., Chen, Y., Liu, T., Zhu, Y., Huang, S.: Scaling up dynamic human-scene interaction modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1737– 1747 (2024)

2024
[17]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[18]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Liao, Q., Truong, T.E., Huang, X., Gao, Y., Tevet, G., Sreenath, K., Liu, C.K.: Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion. arXiv preprint arXiv:2508.08241 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

In: European Conference on Computer Vision

Liu, X., Hou, H., Yang, Y., Li, Y.L., Lu, C.: Revisit human-scene interaction via space occupancy. In: European Conference on Computer Vision. pp. 1–19. Springer (2024)

2024
[20]

arXiv preprint arXiv:2412.17730 (2024)

Liu, Y., Yang, B., Zhong, L., Wang, H., Yi, L.: Mimicking-bench: A benchmark for generalizable humanoid-scene interaction learning via human mimicking. arXiv preprint arXiv:2412.17730 (2024)

work page arXiv 2024
[21]

Liu, Z., Ge, J., Xiong, M., Gu, J., Tang, B., Jing, W., Chen, S.: It takes two: Learning interactive whole-body control between humanoid robots (2025)

2025
[22]

Luo, Z., Cao, J., Kitani, K., Xu, W., et al.: Perpetual humanoid control for real- timesimulatedavatars.In:ProceedingsoftheIEEE/CVFInternationalConference on Computer Vision. pp. 10895–10904 (2023)

2023
[23]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al.: Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

In: 2024 International Conference on 3D Vision (3DV)

Pan, L., Wang, J., Huang, B., Zhang, J., Wang, H., Tang, X., Wang, Y.: Syn- thesizing physically plausible human motions in 3d scenes. In: 2024 International Conference on 3D Vision (3DV). pp. 1498–1507. IEEE (2024)

2024
[25]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Pan, L., Yang, Z., Dou, Z., Wang, W., Huang, B., Dai, B., Komura, T., Wang, J.: Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5379–5391 (2025)

2025
[26]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single im- age. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019) ComplexMimic 17

2019
[27]

ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

Peng, X.B., Abbeel, P., Levine, S., Van de Panne, M.: Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

2018
[28]

Peng, X.B., Guo, Y., Halper, L., Levine, S., Fidler, S.: Ase: Large-scale reusable adversarialskillembeddingsforphysicallysimulatedcharacters.ACMTransactions On Graphics (TOG)41(4), 1–17 (2022)

2022
[29]

ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

Peng, X.B., Ma, Z., Abbeel, P., Levine, S., Kanazawa, A.: Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

2021
[30]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[31]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

arXiv preprint arXiv:2508.07842 (2025)

Shen, Y., Liu, H., Zhang, L., Liu, P., Xia, R., Yao, T., Feng, T.: Detach: Cross- domain learning for long-horizon tasks via mixture of disentangled experts. arXiv preprint arXiv:2508.07842 (2025)

work page arXiv 2025
[33]

In: nternational conference on machine learning

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsuper- vised learning using nonequilibrium thermodynamics. In: nternational conference on machine learning. pp. 2256–2265 (2015)

2015
[34]

Denoising Diffusion Implicit Models

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[35]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2011
[36]

ACM Trans- actions On Graphics (TOG)43(6), 1–21 (2024)

Tessler, C., Guo, Y., Nabati, O., Chechik, G., Peng, X.B.: Maskedmimic: Unified physics-based character control through masked motion inpainting. ACM Trans- actions On Graphics (TOG)43(6), 1–21 (2024)

2024
[37]

Human Motion Diffusion Model

Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

Wang, W., Pan, L., Dou, Z., Mei, J., Liao, Z., Lou, Y., Wu, Y., Yang, L., Wang, J., Komura, T.: Sims: Simulating stylized human-scene interactions with retrieval- augmented script generation. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 14117–14127 (2025)

2025
[39]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang, Y., Zhao, Q., Yu, R., Tsui, H.W., Zeng, A., Lin, J., Luo, Z., Yu, J., Li, X., Chen, Q., et al.: Skillmimic: Learning basketball interaction skills from demon- strations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17540–17549 (2025)

2025
[40]

arXiv preprint arXiv:2506.12779 (2025)

Wang, Y., Yang, M., Ding, Z., Zhang, Y., Zeng, W., Xu, X., Jiang, H., Lu, Z.: From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779 (2025)

work page arXiv 2025
[41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang,Z.,Chen,Y.,Jia,B.,Li,P.,Zhang,J.,Zhang,J.,Liu,T.,Zhu,Y.,Liang,W., Huang, S.: Move as you say interact as you can: Language-guided human motion generation with scene affordance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 433–444 (2024)

2024
[42]

Advances in Neural Informa- tion Processing Systems35, 14959–14971 (2022)

Wang, Z., Chen, Y., Liu, T., Zhu, Y., Liang, W., Huang, S.: Humanise: Language- conditioned human motion generation in 3d scenes. Advances in Neural Informa- tion Processing Systems35, 14959–14971 (2022)

2022
[43]

arXiv preprint arXiv:2309.07918 (2023) 18 L

Xiao, Z., Wang, T., Wang, J., Cao, J., Zhang, W., Dai, B., Lin, D., Pang, J.: Unified human-scene interaction via prompted chain-of-contacts. arXiv preprint arXiv:2309.07918 (2023) 18 L. Pan and H. Zhao

work page arXiv 2023
[44]

ACM Transactions on Graphics (TOG)44(6), 1–14 (2025)

Xu, P., Wu, Z., Wang, R., Sarukkai, V., Fatahalian, K., Karamouzas, I., Zordan, V., Liu, C.K.: Learning to ball: Composing policies for long-horizon basketball moves. ACM Transactions on Graphics (TOG)44(6), 1–14 (2025)

2025
[45]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xu, S., Ling, H.Y., Wang, Y.X., Gui, L.Y.: Intermimic: Towards universal whole- body control for physics-based human-object interactions. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12266–12277 (2025)

2025
[46]

In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers

Yu,R.,Wang,Y.,Zhao,Q.,Tsui,H.W.,Wang,J.,Tan,P.,Chen,Q.:Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demon- strations. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–11 (2025)

2025
[47]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Zhang, J., Fan, H., Yang, Y.: Energymogen: Compositional human motion gen- eration with energy-based diffusion model in latent space. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17592–17602 (2025)

2025
[48]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zhang, J., Fan, H., Yang, Y.: Towards decompositional human motion generation with energy-based diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 30650–30660 (2026)

2026
[49]

In: Proceedings of theIEEE/CVF conference on computer vision and pattern recognition

Zhang, J., Zhang, Y., Cun, X., Zhang, Y., Zhao, H., Lu, H., Shen, X., Shan, Y.: Generating human motion from textual descriptions with discrete representa- tions. In: Proceedings of theIEEE/CVF conference on computer vision and pattern recognition. pp. 14730–14740 (2023)

2023
[50]

IEEE transactions on pattern analysis and machine intelligence46(6), 4115–4128 (2024)

Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE transactions on pattern analysis and machine intelligence46(6), 4115–4128 (2024)

2024
[51]

In: European Conference on Computer Vision (ECCV) (2022)

Zhao, K., Wang, S., Zhang, Y., Beeler, T., Tang, S.: Compositional human-scene interaction synthesis with semantic control. In: European Conference on Computer Vision (ECCV) (2022)

2022
[52]

In: Proceedings of the IEEE/CVF international con- ference on computer vision

Zhao, K., Zhang, Y., Wang, S., Beeler, T., Tang, S.: Synthesizing diverse human motions in 3d indoor scenes. In: Proceedings of the IEEE/CVF international con- ference on computer vision. pp. 14738–14749 (2023)

2023
[53]

w/o” indicates removal; “–

Zheng, Y., Yang, Y., Mo, K., Li, J., Yu, T., Liu, Y., Liu, C.K., Guibas, L.J.: Gimo: Gaze-informed human motion prediction in context. In: European Conference on Computer Vision. pp. 676–694. Springer (2022) ComplexMimic 19 Supplementary Material In this supplementary material, we present: –Sec. A: Additional ablation results on LINGO [15] and GIMO [53]. ...

work page arXiv 2022

[1] [1]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Collorone, L., Gioia, M., Pappa, M., Leoni, P., Ficarra, G., Litany, O., Spinelli, I., Galasso, F.: Monster: a unified model for motion, scene, text retrieval. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 10940– 10949 (2025)

2025

[2] [2]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating di- verse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5152–5161 (2022)

2022

[3] [3]

In: European Conference on Computer Vision

Guo, C., Zuo, X., Wang, S., Cheng, L.: Tm2t: Stochastic and tokenized model- ing for the reciprocal generation of 3d human motions and texts. In: European Conference on Computer Vision. pp. 580–597. Springer (2022)

2022

[4] [4]

In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision

Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M.J.: Stochastic scene-aware motion prediction. In: Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision. pp. 11374–11384 (2021)

2021

[5] [5]

In: Proceedings of the IEEE/CVF international conference on computer vision

Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambi- guities with 3d scene constraints. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2282–2292 (2019)

2019

[6] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M.J.: Populating 3d scenes by learning human-scene interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14708–14718 (2021)

2021

[7] [7]

In: ACM SIGGRAPH 2023 Conference Pro- ceedings

Hassan, M., Guo, Y., Wang, T., Black, M., Fidler, S., Peng, X.B.: Synthesizing physical character-scene interactions. In: ACM SIGGRAPH 2023 Conference Pro- ceedings. pp. 1–9 (2023)

2023

[8] [8]

Available: https://arxiv.org/abs/2502.01143

He, T., Gao, J., Xiao, W., Zhang, Y., Wang, Z., Wang, J., Luo, Z., He, G., Soban- bab, N., Pan, C., et al.: Asap: Aligning simulation and real-world physics for learn- ing agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143 (2025)

work page arXiv 2025

[9] [9]

Omnih2o: Universal and dexterous human- to-humanoid whole-body teleoperation and learning,

He,T.,Luo,Z.,He,X.,Xiao,W.,Zhang,C.,Zhang,W.,Kitani,K.,Liu,C.,Shi,G.: Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858 (2024)

work page arXiv 2024

[10] [10]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015) 16 L. Pan and H. Zhao

work page internal anchor Pith review Pith/arXiv arXiv 2015

[11] [11]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020

[12] [12]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13274–13285 (2022)

2022

[13] [13]

In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

Hwang,I.,Zhou,B.,Kim,Y.M.,Wang,J.,Guo,C.:Scenemi:Motionin-betweening for modeling human-scene interaction. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 6034–6045 (2025)

2025

[14] [14]

Neural computation3(1), 79–87 (1991)

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural computation3(1), 79–87 (1991)

1991

[15] [15]

In: SIGGRAPH Asia 2024 Conference Papers

Jiang, N., He, Z., Wang, Z., Li, H., Chen, Y., Huang, S., Zhu, Y.: Autonomous character-scene interaction synthesis from text instruction. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024

[16] [16]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jiang, N., Zhang, Z., Li, H., Ma, X., Wang, Z., Chen, Y., Liu, T., Zhu, Y., Huang, S.: Scaling up dynamic human-scene interaction modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1737– 1747 (2024)

2024

[17] [17]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [18]

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Liao, Q., Truong, T.E., Huang, X., Gao, Y., Tevet, G., Sreenath, K., Liu, C.K.: Beyondmimic: From motion tracking to versatile humanoid control via guided dif- fusion. arXiv preprint arXiv:2508.08241 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

In: European Conference on Computer Vision

Liu, X., Hou, H., Yang, Y., Li, Y.L., Lu, C.: Revisit human-scene interaction via space occupancy. In: European Conference on Computer Vision. pp. 1–19. Springer (2024)

2024

[20] [20]

arXiv preprint arXiv:2412.17730 (2024)

Liu, Y., Yang, B., Zhong, L., Wang, H., Yi, L.: Mimicking-bench: A benchmark for generalizable humanoid-scene interaction learning via human mimicking. arXiv preprint arXiv:2412.17730 (2024)

work page arXiv 2024

[21] [21]

Liu, Z., Ge, J., Xiong, M., Gu, J., Tang, B., Jing, W., Chen, S.: It takes two: Learning interactive whole-body control between humanoid robots (2025)

2025

[22] [22]

Luo, Z., Cao, J., Kitani, K., Xu, W., et al.: Perpetual humanoid control for real- timesimulatedavatars.In:ProceedingsoftheIEEE/CVFInternationalConference on Computer Vision. pp. 10895–10904 (2023)

2023

[23] [23]

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al.: Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[24] [24]

In: 2024 International Conference on 3D Vision (3DV)

Pan, L., Wang, J., Huang, B., Zhang, J., Wang, H., Tang, X., Wang, Y.: Syn- thesizing physically plausible human motions in 3d scenes. In: 2024 International Conference on 3D Vision (3DV). pp. 1498–1507. IEEE (2024)

2024

[25] [25]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Pan, L., Yang, Z., Dou, Z., Wang, W., Huang, B., Dai, B., Komura, T., Wang, J.: Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5379–5391 (2025)

2025

[26] [26]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single im- age. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019) ComplexMimic 17

2019

[27] [27]

ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

Peng, X.B., Abbeel, P., Levine, S., Van de Panne, M.: Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG)37(4), 1–14 (2018)

2018

[28] [28]

Peng, X.B., Guo, Y., Halper, L., Levine, S., Fidler, S.: Ase: Large-scale reusable adversarialskillembeddingsforphysicallysimulatedcharacters.ACMTransactions On Graphics (TOG)41(4), 1–17 (2022)

2022

[29] [29]

ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

Peng, X.B., Ma, Z., Abbeel, P., Levine, S., Kanazawa, A.: Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG)40(4), 1–20 (2021)

2021

[30] [30]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[31] [31]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[32] [32]

arXiv preprint arXiv:2508.07842 (2025)

Shen, Y., Liu, H., Zhang, L., Liu, P., Xia, R., Yao, T., Feng, T.: Detach: Cross- domain learning for long-horizon tasks via mixture of disentangled experts. arXiv preprint arXiv:2508.07842 (2025)

work page arXiv 2025

[33] [33]

In: nternational conference on machine learning

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsuper- vised learning using nonequilibrium thermodynamics. In: nternational conference on machine learning. pp. 2256–2265 (2015)

2015

[34] [34]

Denoising Diffusion Implicit Models

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[35] [35]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2011

[36] [36]

ACM Trans- actions On Graphics (TOG)43(6), 1–21 (2024)

Tessler, C., Guo, Y., Nabati, O., Chechik, G., Peng, X.B.: Maskedmimic: Unified physics-based character control through masked motion inpainting. ACM Trans- actions On Graphics (TOG)43(6), 1–21 (2024)

2024

[37] [37]

Human Motion Diffusion Model

Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[38] [38]

In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

Wang, W., Pan, L., Dou, Z., Mei, J., Liao, Z., Lou, Y., Wu, Y., Yang, L., Wang, J., Komura, T.: Sims: Simulating stylized human-scene interactions with retrieval- augmented script generation. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 14117–14127 (2025)

2025

[39] [39]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang, Y., Zhao, Q., Yu, R., Tsui, H.W., Zeng, A., Lin, J., Luo, Z., Yu, J., Li, X., Chen, Q., et al.: Skillmimic: Learning basketball interaction skills from demon- strations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17540–17549 (2025)

2025

[40] [40]

arXiv preprint arXiv:2506.12779 (2025)

Wang, Y., Yang, M., Ding, Z., Zhang, Y., Zeng, W., Xu, X., Jiang, H., Lu, Z.: From experts to a generalist: Toward general whole-body control for humanoid robots. arXiv preprint arXiv:2506.12779 (2025)

work page arXiv 2025

[41] [41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang,Z.,Chen,Y.,Jia,B.,Li,P.,Zhang,J.,Zhang,J.,Liu,T.,Zhu,Y.,Liang,W., Huang, S.: Move as you say interact as you can: Language-guided human motion generation with scene affordance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 433–444 (2024)

2024

[42] [42]

Advances in Neural Informa- tion Processing Systems35, 14959–14971 (2022)

Wang, Z., Chen, Y., Liu, T., Zhu, Y., Liang, W., Huang, S.: Humanise: Language- conditioned human motion generation in 3d scenes. Advances in Neural Informa- tion Processing Systems35, 14959–14971 (2022)

2022

[43] [43]

arXiv preprint arXiv:2309.07918 (2023) 18 L

Xiao, Z., Wang, T., Wang, J., Cao, J., Zhang, W., Dai, B., Lin, D., Pang, J.: Unified human-scene interaction via prompted chain-of-contacts. arXiv preprint arXiv:2309.07918 (2023) 18 L. Pan and H. Zhao

work page arXiv 2023

[44] [44]

ACM Transactions on Graphics (TOG)44(6), 1–14 (2025)

Xu, P., Wu, Z., Wang, R., Sarukkai, V., Fatahalian, K., Karamouzas, I., Zordan, V., Liu, C.K.: Learning to ball: Composing policies for long-horizon basketball moves. ACM Transactions on Graphics (TOG)44(6), 1–14 (2025)

2025

[45] [45]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xu, S., Ling, H.Y., Wang, Y.X., Gui, L.Y.: Intermimic: Towards universal whole- body control for physics-based human-object interactions. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12266–12277 (2025)

2025

[46] [46]

In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers

Yu,R.,Wang,Y.,Zhao,Q.,Tsui,H.W.,Wang,J.,Tan,P.,Chen,Q.:Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demon- strations. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–11 (2025)

2025

[47] [47]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Zhang, J., Fan, H., Yang, Y.: Energymogen: Compositional human motion gen- eration with energy-based diffusion model in latent space. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17592–17602 (2025)

2025

[48] [48]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zhang, J., Fan, H., Yang, Y.: Towards decompositional human motion generation with energy-based diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 30650–30660 (2026)

2026

[49] [49]

In: Proceedings of theIEEE/CVF conference on computer vision and pattern recognition

Zhang, J., Zhang, Y., Cun, X., Zhang, Y., Zhao, H., Lu, H., Shen, X., Shan, Y.: Generating human motion from textual descriptions with discrete representa- tions. In: Proceedings of theIEEE/CVF conference on computer vision and pattern recognition. pp. 14730–14740 (2023)

2023

[50] [50]

IEEE transactions on pattern analysis and machine intelligence46(6), 4115–4128 (2024)

Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE transactions on pattern analysis and machine intelligence46(6), 4115–4128 (2024)

2024

[51] [51]

In: European Conference on Computer Vision (ECCV) (2022)

Zhao, K., Wang, S., Zhang, Y., Beeler, T., Tang, S.: Compositional human-scene interaction synthesis with semantic control. In: European Conference on Computer Vision (ECCV) (2022)

2022

[52] [52]

In: Proceedings of the IEEE/CVF international con- ference on computer vision

Zhao, K., Zhang, Y., Wang, S., Beeler, T., Tang, S.: Synthesizing diverse human motions in 3d indoor scenes. In: Proceedings of the IEEE/CVF international con- ference on computer vision. pp. 14738–14749 (2023)

2023

[53] [53]

w/o” indicates removal; “–

Zheng, Y., Yang, Y., Mo, K., Li, J., Yu, T., Liu, Y., Liu, C.K., Guibas, L.J.: Gimo: Gaze-informed human motion prediction in context. In: European Conference on Computer Vision. pp. 676–694. Springer (2022) ComplexMimic 19 Supplementary Material In this supplementary material, we present: –Sec. A: Additional ablation results on LINGO [15] and GIMO [53]. ...

work page arXiv 2022