pith. machine review for the scientific record.

arxiv: 2604.04843 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

Pith reviewed 2026-05-10 20:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords human-object-scene interaction · consistency model · dynamic perception · bump-aware guidance · hybrid training · generative modeling · scene-aware generation

The pith

A consistency model with dynamic scene updates and hybrid training generates consistent human-object-scene interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the problem of generating human-object-scene interactions, a task that must account for how objects move and change their relations to the surrounding scene, yet lacks large annotated datasets. It does so by aligning generation with the iterative denoising steps of a consistency model and feeding scene context derived from the previous refinement step into the next one. A bump-aware guidance term reduces collisions and penetrations during sampling even when only coarse occupancy is available. To address data scarcity, the method creates pseudo-HOSI examples by adding voxelized scene occupancy to existing human-object interaction datasets and trains jointly on high-fidelity human-scene interaction data. Experiments show the resulting model reaches state-of-the-art quality on both HOSI and HOI benchmarks and generalizes to scenes not seen during training.

Core claim

By conditioning each denoising step of a consistency model on an instruction and on scene context updated from the trajectory of the preceding refinement, the framework produces interactions that remain consistent with both the object and the scene. Bump-aware guidance further reduces physical violations at sampling time, while a hybrid training regimen that augments HOI data with voxelized occupancy and mixes it with HSI data supplies the necessary scene-aware supervision.

What carries the argument

The dynamic perception strategy, which extracts trajectories from the current refinement step to refresh the scene context supplied to the next denoising step of the consistency model.
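
A minimal sketch of that loop (not the authors' code; encode_scene, extract_trajectory, update_occupancy, and denoise_step are toy stand-ins for components the paper defines elsewhere):

```python
import torch

SEQ_LEN, MOTION_DIM, CTX_DIM = 120, 64, 32

def encode_scene(voxels):
    # Toy encoder: pool the coarse occupancy grid into a context vector.
    return voxels.float().mean().repeat(CTX_DIM)

def extract_trajectory(motion):
    # Toy readout: treat the first three channels as a root trajectory.
    return motion[..., :3]

def update_occupancy(voxels, traj):
    # Toy update; a real system would re-voxelize moved objects along traj.
    return voxels

def denoise_step(motion, step, instr_emb, scene_ctx):
    # Toy consistency step: pull the sample toward the conditioning signal.
    return 0.9 * motion + 0.1 * (instr_emb + scene_ctx).mean()

def generate(instr_emb, voxels, num_steps=4):
    motion = torch.randn(SEQ_LEN, MOTION_DIM)   # start from noise
    scene_ctx = encode_scene(voxels)            # static first pass
    for step in reversed(range(num_steps)):
        motion = denoise_step(motion, step, instr_emb, scene_ctx)
        # Dynamic perception: refresh the scene context from the
        # trajectory implied by the current refinement.
        traj = extract_trajectory(motion)
        scene_ctx = encode_scene(update_occupancy(voxels, traj))
    return motion

sample = generate(torch.zeros(CTX_DIM), torch.zeros(16, 16, 16))
```

The load-bearing detail is that scene_ctx is recomputed inside the sampling loop, so each denoising step conditions on object positions implied by the previous refinement rather than on a frozen initial scan.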

If this is right

  • Real-time generation becomes feasible because bump-aware guidance operates on coarse occupancy rather than full geometry (see the sketch after this list).
  • The same iterative refinement loop can be applied to existing HOI generators to add scene awareness without retraining from scratch.
  • Generalization to novel scenes improves because the pseudo-samples expose the model to varied object-scene configurations during training.
  • Fewer post-processing steps are needed to correct penetrations and floating objects in the output animations.
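
The guidance bullet above can be made concrete with a generic occupancy-penalty sketch. This is not the paper's exact guidance term: it assumes only a coarse occupancy grid stored as a float tensor, a particular point/grid axis alignment, and uses a soft trilinear lookup so the penalty is differentiable; bump_penalty and bump_guide are hypothetical names.

```python
import torch
import torch.nn.functional as F

def bump_penalty(points, occ_grid, scene_min, scene_max):
    """Mean soft occupancy at sampled body/object points.
    points: (N, 3) world coordinates; occ_grid: (1, 1, D, H, W) float in [0, 1]."""
    # Normalize world coordinates into grid_sample's [-1, 1] range.
    norm = 2 * (points - scene_min) / (scene_max - scene_min) - 1
    # grid_sample expects (x, y, z) ordering in the last dim, hence the flip
    # (assumes points are stored (z, y, x)-aligned with the grid axes).
    coords = norm.flip(-1).view(1, -1, 1, 1, 3)
    occ = F.grid_sample(occ_grid, coords, align_corners=True)
    return occ.mean()

def bump_guide(points, occ_grid, scene_min, scene_max, step=0.05, iters=3):
    # Gradient guidance: nudge sample points out of occupied regions.
    points = points.detach().requires_grad_(True)
    for _ in range(iters):
        loss = bump_penalty(points, occ_grid, scene_min, scene_max)
        (grad,) = torch.autograd.grad(loss, points)
        points = (points - step * grad).detach().requires_grad_(True)
    return points.detach()
```

Because the penalty reads a fixed low-resolution grid, each guidance step is a handful of tensor ops, which is at least consistent with the real-time claim in a way that mesh-based penetration tests would not be.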

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-based context update could be inserted into other iterative generative pipelines that must maintain multi-object coherence over time.
  • Voxelized occupancy augmentation may prove useful in any domain where full scene meshes are expensive to obtain but coarse spatial constraints are sufficient.
  • If the consistency-model alignment proves robust, the approach could be tested on longer-horizon tasks such as multi-step object manipulation sequences.

Load-bearing premise

Voxelized scene occupancy injected into HOI datasets yields useful pseudo-HOSI samples that, when mixed with real HSI data, teach consistent interactions without creating artifacts or erasing scene awareness.
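
A minimal sketch of what that premise amounts to in code, with an illustrative grid size, field names, and point-voxelization rule that are assumptions rather than the paper's specification:

```python
import torch

def voxelize(points, scene_min, scene_max, grid=32):
    """Scatter (N, 3) scene points into a coarse boolean occupancy grid."""
    occ = torch.zeros(grid, grid, grid, dtype=torch.bool)
    idx = ((points - scene_min) / (scene_max - scene_min) * grid).long()
    idx = idx.clamp(0, grid - 1)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

def make_pseudo_hosi(hoi_sample, scene_points, scene_min, scene_max):
    # Inject coarse scene occupancy into a scene-free HOI training sample;
    # the flag lets a hybrid sampler balance pseudo and real HSI data.
    return {
        **hoi_sample,                 # human motion, object motion, text
        "scene_occupancy": voxelize(scene_points, scene_min, scene_max),
        "is_pseudo": True,
    }
```

The premise is then that a model trained on a mixture of such pseudo-samples and real HSI clips learns scene-consistent manipulation rather than artifacts of the coarse voxelization.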

What would settle it

A controlled ablation in which the model is trained only on unmodified HOI data plus HSI data and then evaluated on scenes containing movable objects: if collision rates rise sharply or consistency with scene layout collapses without the pseudo-HOSI samples, the hybrid data strategy is supplying the claimed benefit; if performance holds, it is not.
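
One illustrative way to score that test (not the paper's protocol): measure the fraction of frames whose sampled body/object points land inside occupied voxels, and compare across the two training regimes. penetration_rate is a hypothetical helper.

```python
import torch

def penetration_rate(point_seq, occ, scene_min, scene_max):
    """point_seq: (T, N, 3) per-frame sample points; occ: (G, G, G) bool grid."""
    g = occ.shape[0]
    idx = ((point_seq - scene_min) / (scene_max - scene_min) * g).long()
    idx = idx.clamp(0, g - 1)
    hits = occ[idx[..., 0], idx[..., 1], idx[..., 2]]   # (T, N) bool lookups
    return hits.any(dim=1).float().mean().item()        # fraction of colliding frames
```

A large gap in this number between the hybrid-trained model and the HOI-plus-HSI-only variant, on held-out scenes with movable objects, would settle the question either way.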

Figures

Figures reproduced from arXiv: 2604.04843 by Guanjie Zheng, Junji Gong, Tianxing Chen, Xing Gao, Yude Zou, Zixuan Li.

Figure 1. Overview of InfBaGel. Our method operates through an iterative refinement process. (a) Auto-regressive Motion Model generating arbitrary long-sequence motions conditioned on textual instructions, goals, object geometry, and scene context. (b) Dynamic Perception Encoder perceives the evolving environment with the temporal-aligned scene state updated by iterative sampling. (c) Bump-aware Guidance detects co… view at source ↗
Figure 2. Qualitative comparison. Top 2 rows: comparison on human-object interaction in scenes. Bottom row: comparison on a complex multi-stage task involving moving a chair and then sitting on it. InfBaGel's results are physically plausible and semantically correct, outperforming baseline methods. Specifically, both TRUMANS (a/d) and LINGO (b/e) exhibit severe object-scene penetration. In contrast, InfBaGel (c/f) produces nearly collisi… view at source ↗
Figure 3. Qualitative comparison in ablation study. Replacing/removing specific modules: (a) diffusion model instead of consistency model, (b) static perception instead of dynamic perception, and (c) without bump-aware guidance, all resulted in collisions with the scene. view at source ↗
Figure 4. Qualitative results on different scenes, motion types and object types. The top two rows (a/b) show diverse human-object interactions including lifting over head and kicking. The last row shows a static scene interaction. A.2 HOI Experiment. Experiment Settings. To assess InfBaGel against specialized HOI methods, we evaluate on the standard HOI benchmark OMOMO. Following CHOIS, we assess the results from m… view at source ↗
Figure 5. Qualitative results in socially-interactive scenes, including a store and a physical therapy room. These unseen scenes are chosen from LINGO, displayed in default white due to the lack of texture. Baseline Comparison. We compare InfBaGel with LINGO (Jiang et al., 2024a) and, by treating the scene as empty, with CHOIS (Li et al., 2024b) and ROG (Xue et al., 2025), two standard methods for HOI generation… view at source ↗
read the original abstract

Human-object-scene interactions (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training startegy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes InfBaGel, a coarse-to-fine framework for human-object-scene interaction (HOSI) generation aligned with consistency model denoising. It introduces dynamic perception to iteratively update scene context from prior trajectories, bump-aware guidance to mitigate collisions and penetrations at sampling time, and a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets before joint training with HSI data. The central claim is that this yields state-of-the-art performance on both HOSI and HOI generation tasks together with strong generalization to unseen scenes.

Significance. If the empirical claims hold, the work would offer a practical route to scene-aware interaction synthesis under severe data constraints, combining efficient consistency-model sampling with a lightweight data-augmentation trick. This could benefit downstream applications in embodied AI, animation, and simulation where explicit 3-D scene geometry is unavailable or expensive.

major comments (2)
  1. Abstract: the assertion of 'state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes' is presented without any quantitative metrics, baseline tables, ablation results, or error analysis, rendering the load-bearing claim that the hybrid voxel-injection strategy produces usable pseudo-HOSI samples unverifiable from the supplied text.
  2. Abstract (hybrid training strategy paragraph): the central assumption that 'injecting voxelized scene occupancy into HOI datasets' yields pseudo-samples that teach consistent dynamic object-scene reasoning is not accompanied by any reported ablation isolating voxel resolution, penetration-rate measurements on held-out real HOSI data, or comparison against non-voxelized baselines; without such evidence the iterative refinement and bump-aware guidance cannot be shown to correct rather than reinforce artifacts introduced by the coarse, static voxel representation.
minor comments (2)
  1. Abstract: 'startegy' is a typographical error and should read 'strategy'.
  2. Abstract: the phrases 'dynamic perception strategy' and 'bump-aware guidance' are introduced without reference to the corresponding equations or algorithmic steps that would appear in the methods section, reducing immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and have made targeted revisions to strengthen the presentation of our claims and supporting evidence.

read point-by-point responses
  1. Referee: Abstract: the assertion of 'state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes' is presented without any quantitative metrics, baseline tables, ablation results, or error analysis, rendering the load-bearing claim that the hybrid voxel-injection strategy produces usable pseudo-HOSI samples unverifiable from the supplied text.

    Authors: We agree that the abstract would be more informative if it included concrete quantitative support for the state-of-the-art and generalization claims. Although the full manuscript provides detailed tables, baseline comparisons, and ablation studies in Sections 4 and 5, we have revised the abstract to incorporate key performance metrics (e.g., improvements on HOSI and HOI benchmarks) and a brief statement on generalization to unseen scenes. This makes the central claims verifiable from the abstract alone while preserving its concise nature. revision: yes

  2. Referee: Abstract (hybrid training strategy paragraph): the central assumption that 'injecting voxelized scene occupancy into HOI datasets' yields pseudo-samples that teach consistent dynamic object-scene reasoning is not accompanied by any reported ablation isolating voxel resolution, penetration-rate measurements on held-out real HOSI data, or comparison against non-voxelized baselines; without such evidence the iterative refinement and bump-aware guidance cannot be shown to correct rather than reinforce artifacts introduced by the coarse, static voxel representation.

    Authors: The referee correctly identifies that the abstract does not explicitly detail ablations isolating voxel resolution or penetration rates on held-out data. The manuscript already contains ablations on the hybrid training strategy and overall artifact reduction; however, to directly address this concern we have added a concise summary of the relevant ablation results (including voxel-resolution sensitivity and penetration metrics versus non-voxelized baselines) to the abstract. We have also expanded the experimental section with additional held-out evaluations confirming that the pseudo-samples improve dynamic reasoning and that subsequent refinement steps reduce rather than reinforce voxel-induced artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation builds on external consistency models and empirical hybrid training without self-referential reductions.

full rationale

The paper presents a coarse-to-fine instruction-conditioned framework aligned with consistency model denoising, using dynamic perception from prior trajectories and bump-aware guidance at sampling time. The hybrid training synthesizes pseudo-HOSI via voxelized occupancy injection into HOI data and joint training with HSI, but this is an explicit design choice for data augmentation rather than a fitted parameter renamed as prediction or a self-definitional loop. No equations, uniqueness theorems, or ansatzes are shown that reduce by construction to the inputs; SOTA and generalization claims rest on experimental results. The approach extends prior consistency models without load-bearing self-citations or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical components are presented as extensions of existing consistency models and datasets.

pith-pipeline@v0.9.0 · 5543 in / 1154 out tokens · 61120 ms · 2026-05-10T20:06:04.311044+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    coarse-to-fine instruction-conditioned interaction generation framework ... aligned with the iterative denoising process of a consistency model ... dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context ... bump-aware guidance ... hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    hybrid data training strategy ... voxelized scene occupancy ... jointly trains with high-fidelity HSI data

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    Physically plausible full-body hand-object interaction synthesis

    Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Physically plausible full-body hand-object interaction synthesis. In 2024 International Conference on 3D Vision (3DV) ,

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

  3. [3]

    Coohoi: Learning cooperative human-object interaction with manipulated object dynamics

    Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, and Jiangmiao Pang. Coohoi: Learning cooperative human-object interaction with manipulated object dynamics. arXiv preprint arXiv:2406.14558 ,

  4. [4]

    Unihm: Universal human motion generation with object interactions in indoor scenes

    Zichen Geng, Zeeshan Hayder, Wei Liu, and Ajmal Mian. Unihm: Universal human motion generation with object interactions in indoor scenes. arXiv preprint arXiv:2505.12774 ,

  5. [5]

    Synthesizing physical character-scene interactions

    Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Synthesizing physical character-scene interactions. In ACM SIGGRAPH 2023 Conference Proceedings,

  6. [6]

    Autonomous character-scene interaction synthesis from text instruction

    Nan Jiang, Zimo He, Zi Wang, Hongjie Li, Yixin Chen, Siyuan Huang, and Yixin Zhu. Autonomous character-scene interaction synthesis from text instruction. In SIGGRAPH Asia 2024 Conference Papers, 2024a.

  7. [7]

    Zerohsi: Zero-shot 4d human-scene interaction by video generation

    Hongjie Li, Hong-Xing Yu, Jiaman Li, and Jiajun Wu. Zerohsi: Zero-shot 4d human-scene interaction by video generation. arXiv preprint arXiv:2412.18600, 2024a.

  8. [8]

    Controllable human-object interaction synthesis

    Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. In European Conference on Computer Vision (ECCV), 2024b.

  9. [9]

    Task-oriented human-object interactions generation with implicit neural representations

    Quanzhou Li, Jingbo Wang, Chen Change Loy, and Bo Dai. Task-oriented human-object interactions generation with implicit neural representations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024c.

  10. [10]

    Manicm: Real-time 3d diffusion policy via consistency model

    Guanxing Lu, Zifeng Gao, Tianxing Chen, Wenxun Dai, Ziwei Wang, Wenbo Ding, and Yansong Tang. Manicm: Real-time 3d diffusion policy via consistency model. arXiv preprint arXiv:2406.01586, 2024.

  11. [11]

    Humoto: A 4d dataset of mocap human object interactions

    Jiaxin Lu, Chun-Hao Paul Huang, Uttaran Bhattacharya, Qixing Huang, and Yi Zhou. Humoto: A 4d dataset of mocap human object interactions. arXiv preprint arXiv:2504.10414, 2025b.

  12. [12]

    Generating continual human motion in diverse 3d scenes

    Aymen Mir, Xavier Puig, Angjoo Kanazawa, and Gerard Pons-Moll. Generating continual human motion in diverse 3d scenes. In 2024 International Conference on 3D Vision (3DV),

  13. [13]

    Synthesizing physically plausible human motions in 3d scenes

    Liang Pan, Jingbo Wang, Buzhen Huang, Junyu Zhang, Haofan Wang, Xu Tang, and Yangang Wang. Synthesizing physically plausible human motions in 3d scenes. In 2024 International Conference on 3D Vision (3DV) ,

  14. [14]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 ,

  15. [15]

    Towards diverse and natural scene-aware 3d human motion synthesis

    Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Lin, and Bo Dai. Towards diverse and natural scene-aware 3d human motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022a.

  16. [16]

    Human-object interaction from human-level instructions

    Zhen Wu, Jiaman Li, Pei Xu, and C Karen Liu. Human-object interaction from human-level instructions. arXiv preprint arXiv:2406.17840 ,

  17. [17]

    Hosig: Full-body human-object-scene interaction generation with hierarchical scene perception

    Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, and Jinhui Tang. Hosig: Full-body human-object-scene interaction generation with hierarchical scene perception. arXiv preprint arXiv:2506.01579 ,

  18. [18]

    Generating human interaction motions in scenes with text control

    Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, and Davis Rempe. Generating human interaction motions in scenes with text control. arXiv:2404.10685,

  19. [19]

    Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demonstrations

    Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, and Qifeng Chen. Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demonstrations. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conferenc...

  20. [20]

    Humanoidverse: A versatile humanoid for vision-language guided multi-object rearrangement

    Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, and Wei Pan. Humanoidverse: A versatile humanoid for vision-language guided multi-object rearrangement. arXiv preprint arXiv:2508.16943, 2025a.

  21. [21]

    Scenic: Scene-aware semantic navigation with instruction-guided control. arXiv preprint arXiv:2412.15664, 2024

    Siwei Zhang, Yan Zhang, Qianli Ma, Michael J Black, and Siyu Tang. Generating person- scene interactions in 3d scenes. In International Conference on 3D Vision (3DV) , 2020a. Siwei Zhang, Yan Zhang, Qianli Ma, Michael J Black, and Siyu Tang. Place: Proximity learning of articulation and contact in 3d environments. In 2020 International Conference on 3D Vi...

  22. [22]

    arXiv preprint arXiv:2503.12955 (2025)

    Yonghao Zhang, Qiang He, Yanguang Wan, Yinda Zhang, Xiaoming Deng, Cuixia Ma, and Hongan Wang. Diffgrasp: Whole-body grasping synthesis guided by object motion using a diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence , 2025b. Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, and Shiguang Shan. His-gpt: Towards 3d human-in-s...

  23. [23]

    Evolvinggrasp: Evolutionary grasp generation via efficient preference alignment

    Yufei Zhu, Yiming Zhong, Zemin Yang, Peishan Cong, Jingyi Yu, Xinge Zhu, and Yuexin Ma. Evolvinggrasp: Evolutionary grasp generation via efficient preference alignment. arXiv preprint arXiv:2503.14329 ,

  24. [24]

    4, with more scenes, object types and motion types, including lifting over head, kicking and interaction with static scene

    A Appendix A.1 More Qualitative Results More qualitative results are presented in Fig. 4, with more scenes, object types and motion types, including lifting over head, kicking and interaction with static scene. As shown in Fig. 5, InfBaGel is largely environment-independent because it conditions on a voxelized scene representation, rather than relying on ...

  25. [25]

    Finally, the difference from ground truth is quantified by the mean per joint position error (MPJPE), the root translation error (Troot) and the object pose error (Tobj, Oobj)

    and contact percentage (C%), and measure the penetration between the human body and the object (Pbody). Finally, the difference from ground truth is quantified by the mean per joint position error (MPJPE), the root translation error (Troot) and the object pose error (Tobj, Oobj). There are a large number of noise points in the SDF of some objects in ...

  26. [26]

    Competitive results can be achieved through simpler guidance and fewer optimization steps (CHOIS and ROG used 10 optimization steps)

    with the guidance of CHOIS, to demonstrate the effectiveness of our iterative optimization. Competitive results can be achieved through simpler guidance and fewer optimization steps (CHOIS and ROG used 10 optimization steps). Results on HOI. Table 5 presents the quantitative results on the OMOMO dataset. InfBaGel consistently outperforms all baselines a...

  27. [27]

    A balanced ratio of 1:1 or 1:0.5 between synthesized OMOMO data and LINGO data achieves the best overall performance across all metrics

    The results show that the gain from HSI data on scene understanding has a limit, too much HSI data may compromise the model’s ability to learn object manipulation from HOI data, indicating a trade-off between scene-level physical plausibility and task-specific interaction priors. A balanced ratio of 1:1 or 1:0.5 between synthesized OMOMO data and LINGO da...

  28. [28]

    , 2025a)

    or basketball sport (Liu & Hodgins, 2018; Wang et al., 2025a). Recent mimic learning methods (Xu et al., 2025; Yu et al.,

  29. [29]

    With the growing availability of human-object interaction datasets (Bhatnagar et al.

    have achieved generalized character interaction. With the growing availability of human-object interaction datasets (Bhatnagar et al., 2022; Li et al., 2023; Lu et al., 2025b), certain methods (Li et al., 2024b; Cong et al.,

  30. [30]

    Other methods (Xu et al.

    have started generating motions for interactions with large objects, often relying on sequential points or object trajectories, which restricts the model's ability to autonomously generate diverse interactions. Other methods (Xu et al., 2023; Diller & Dai, 2024; Song et al., 2024; Peng et al., 2025; Li et al., 2025; Xue et al., 2025; Zeng et al.,

  31. [31]

    Limitation of HOI dataset

    attempt to simultaneously synthesize human and object motions but require additional models for optimization, unable to achieve real-time generation. Limitation of HOI dataset. Due to the general lack of scene annotations in datasets, human-object interactions are typically placed into scenes by planning collision-free paths (Li et al., 2024b; Wu et al., 2...

  32. [32]

    , 2020b; Xuan et al

    to interacting with static objects (Zhang et al., 2020b; Xuan et al., 2023; Zhao et al., 2022; Wang et al., 2022b), such as sitting and lying down. Subsequently, some methods considered both locomotion and static interaction simultaneously. Hassan et al. (2021), Wang et al. (2022a), Huang et al. (2023), Mir et al. (2024) and Zhang et al. (2024) model...