pith. machine review for the scientific record.

arxiv: 2604.04843 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

Pith reviewed 2026-05-10 20:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords human-object-scene interaction · consistency model · dynamic perception · bump-aware guidance · hybrid training · generative modeling · scene-aware generation

The pith

A consistency model with dynamic scene updates and hybrid training generates consistent human-object-scene interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the problem of generating human-object-scene interactions, a task that must account for how objects move and change their relations to the surrounding scene, yet lacks large annotated datasets. It does so by aligning generation with the iterative denoising steps of a consistency model and feeding scene context derived from the previous refinement step into the next one. A bump-aware guidance term reduces collisions and penetrations during sampling even when only coarse occupancy is available. To address data scarcity, the method creates pseudo-HOSI examples by adding voxelized scene occupancy to existing human-object interaction datasets and trains jointly on high-fidelity human-scene interaction data. Experiments show the resulting model reaches state-of-the-art quality on both HOSI and HOI benchmarks and generalizes to scenes not seen during training.

Core claim

By conditioning each denoising step of a consistency model on an instruction and on scene context updated from the trajectory of the preceding refinement, the framework produces interactions that remain consistent with both the object and the scene. Bump-aware guidance further reduces physical violations at sampling time, while a hybrid training regimen that augments HOI data with voxelized occupancy and mixes it with HSI data supplies the necessary scene-aware supervision.

What carries the argument

The dynamic perception strategy, which extracts trajectories from the current refinement step to refresh the scene context supplied to the next denoising step of the consistency model.
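
A minimal sketch of that loop (not the authors' code; encode_scene, extract_trajectory, update_occupancy, and denoise_step are toy stand-ins for components the paper defines elsewhere):

```python
import torch

SEQ_LEN, MOTION_DIM, CTX_DIM = 120, 64, 32

def encode_scene(voxels):
    # Toy encoder: pool the coarse occupancy grid into a context vector.
    return voxels.float().mean().repeat(CTX_DIM)

def extract_trajectory(motion):
    # Toy readout: treat the first three channels as a root trajectory.
    return motion[..., :3]

def update_occupancy(voxels, traj):
    # Toy update; a real system would re-voxelize moved objects along traj.
    return voxels

def denoise_step(motion, step, instr_emb, scene_ctx):
    # Toy consistency step: pull the sample toward the conditioning signal.
    return 0.9 * motion + 0.1 * (instr_emb + scene_ctx).mean()

def generate(instr_emb, voxels, num_steps=4):
    motion = torch.randn(SEQ_LEN, MOTION_DIM)   # start from noise
    scene_ctx = encode_scene(voxels)            # static first pass
    for step in reversed(range(num_steps)):
        motion = denoise_step(motion, step, instr_emb, scene_ctx)
        # Dynamic perception: refresh the scene context from the
        # trajectory implied by the current refinement.
        traj = extract_trajectory(motion)
        scene_ctx = encode_scene(update_occupancy(voxels, traj))
    return motion

sample = generate(torch.zeros(CTX_DIM), torch.zeros(16, 16, 16))
```

The load-bearing detail is that scene_ctx is recomputed inside the sampling loop, so each denoising step conditions on object positions implied by the previous refinement rather than on a frozen initial scan.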

If this is right

  • Real-time generation becomes feasible because bump-aware guidance operates on coarse occupancy rather than full geometry (see the sketch after this list).
  • The same iterative refinement loop can be applied to existing HOI generators to add scene awareness without retraining from scratch.
  • Generalization to novel scenes improves because the pseudo-samples expose the model to varied object-scene configurations during training.
  • Fewer post-processing steps are needed to correct penetrations and floating objects in the output animations.
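
The guidance bullet above can be made concrete with a generic occupancy-penalty sketch. This is not the paper's exact guidance term: it assumes only a coarse occupancy grid stored as a float tensor, a particular point/grid axis alignment, and uses a soft trilinear lookup so the penalty is differentiable; bump_penalty and bump_guide are hypothetical names.

```python
import torch
import torch.nn.functional as F

def bump_penalty(points, occ_grid, scene_min, scene_max):
    """Mean soft occupancy at sampled body/object points.
    points: (N, 3) world coordinates; occ_grid: (1, 1, D, H, W) float in [0, 1]."""
    # Normalize world coordinates into grid_sample's [-1, 1] range.
    norm = 2 * (points - scene_min) / (scene_max - scene_min) - 1
    # grid_sample expects (x, y, z) ordering in the last dim, hence the flip
    # (assumes points are stored (z, y, x)-aligned with the grid axes).
    coords = norm.flip(-1).view(1, -1, 1, 1, 3)
    occ = F.grid_sample(occ_grid, coords, align_corners=True)
    return occ.mean()

def bump_guide(points, occ_grid, scene_min, scene_max, step=0.05, iters=3):
    # Gradient guidance: nudge sample points out of occupied regions.
    points = points.detach().requires_grad_(True)
    for _ in range(iters):
        loss = bump_penalty(points, occ_grid, scene_min, scene_max)
        (grad,) = torch.autograd.grad(loss, points)
        points = (points - step * grad).detach().requires_grad_(True)
    return points.detach()
```

Because the penalty reads a fixed low-resolution grid, each guidance step is a handful of tensor ops, which is at least consistent with the real-time claim in a way that mesh-based penetration tests would not be.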

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-based context update could be inserted into other iterative generative pipelines that must maintain multi-object coherence over time.
  • Voxelized occupancy augmentation may prove useful in any domain where full scene meshes are expensive to obtain but coarse spatial constraints are sufficient.
  • If the consistency-model alignment proves robust, the approach could be tested on longer-horizon tasks such as multi-step object manipulation sequences.

Load-bearing premise

Voxelized scene occupancy injected into HOI datasets yields useful pseudo-HOSI samples that, when mixed with real HSI data, teach consistent interactions without creating artifacts or erasing scene awareness.
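
A minimal sketch of what that premise amounts to in code, with an illustrative grid size, field names, and point-voxelization rule that are assumptions rather than the paper's specification:

```python
import torch

def voxelize(points, scene_min, scene_max, grid=32):
    """Scatter (N, 3) scene points into a coarse boolean occupancy grid."""
    occ = torch.zeros(grid, grid, grid, dtype=torch.bool)
    idx = ((points - scene_min) / (scene_max - scene_min) * grid).long()
    idx = idx.clamp(0, grid - 1)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

def make_pseudo_hosi(hoi_sample, scene_points, scene_min, scene_max):
    # Inject coarse scene occupancy into a scene-free HOI training sample;
    # the flag lets a hybrid sampler balance pseudo and real HSI data.
    return {
        **hoi_sample,                 # human motion, object motion, text
        "scene_occupancy": voxelize(scene_points, scene_min, scene_max),
        "is_pseudo": True,
    }
```

The premise is then that a model trained on a mixture of such pseudo-samples and real HSI clips learns scene-consistent manipulation rather than artifacts of the coarse voxelization.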

What would settle it

A controlled ablation in which the model is trained only on unmodified HOI data plus HSI data and then evaluated on scenes containing movable objects: if collision rates rise sharply or consistency with scene layout collapses without the pseudo-HOSI samples, the hybrid data strategy is supplying the claimed benefit; if performance holds, it is not.
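
One illustrative way to score that test (not the paper's protocol): measure the fraction of frames whose sampled body/object points land inside occupied voxels, and compare across the two training regimes. penetration_rate is a hypothetical helper.

```python
import torch

def penetration_rate(point_seq, occ, scene_min, scene_max):
    """point_seq: (T, N, 3) per-frame sample points; occ: (G, G, G) bool grid."""
    g = occ.shape[0]
    idx = ((point_seq - scene_min) / (scene_max - scene_min) * g).long()
    idx = idx.clamp(0, g - 1)
    hits = occ[idx[..., 0], idx[..., 1], idx[..., 2]]   # (T, N) bool lookups
    return hits.any(dim=1).float().mean().item()        # fraction of colliding frames
```

A large gap in this number between the hybrid-trained model and the HOI-plus-HSI-only variant, on held-out scenes with movable objects, would settle the question either way.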

Figures

Figures reproduced from arXiv: 2604.04843 by Guanjie Zheng, Junji Gong, Tianxing Chen, Xing Gao, Yude Zou, Zixuan Li.

Figure 1. Overview of InfBaGel. Our method operates through an iterative refinement process. (a) Auto-regressive Motion Model generating arbitrary long-sequence motions conditioned on textual instructions, goals, object geometry, and scene context. (b) Dynamic Perception Encoder perceives the evolving environment with the temporal-aligned scene state updated by iterative sampling. (c) Bump-aware Guidance detects co… view at source ↗
Figure 2. Qualitative comparison. Top 2 rows: comparison on human-object interaction in scenes. Bottom row: comparison on a complex multi-stage task involving moving a chair and then sitting on it. InfBaGel's results are physically plausible and semantically correct, outperforming baseline methods. Specifically, both TRUMANS (a/d) and LINGO (b/e) exhibit severe object-scene penetration. In contrast, InfBaGel (c/f) produces nearly collisi… view at source ↗
Figure 3. Qualitative comparison in ablation study. Replacing/removing specific modules: (a) diffusion model instead of consistency model, (b) static perception instead of dynamic perception, and (c) without bump-aware guidance, all resulted in collisions with the scene. view at source ↗
Figure 4. Qualitative results on different scenes, motion types and object types. The top two rows (a/b) show diverse human-object interactions including lifting over head and kicking. The last row shows a static scene interaction. A.2 HOI Experiment. Experiment Settings. To assess InfBaGel against specialized HOI methods, we evaluate on the standard HOI benchmark OMOMO. Following CHOIS, we assess the results from m… view at source ↗
Figure 5. Qualitative results in socially-interactive scenes, including a store and a physical therapy room. These unseen scenes are chosen from LINGO, displayed in default white due to the lack of texture. Baseline Comparison. We compare InfBaGel with LINGO (Jiang et al., 2024a) and, by treating the scene as empty, with CHOIS (Li et al., 2024b) and ROG (Xue et al., 2025), two standard methods for HOI generation… view at source ↗
read the original abstract

Human-object-scene interactions (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training startegy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes InfBaGel, a coarse-to-fine framework for human-object-scene interaction (HOSI) generation aligned with consistency model denoising. It introduces dynamic perception to iteratively update scene context from prior trajectories, bump-aware guidance to mitigate collisions and penetrations at sampling time, and a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets before joint training with HSI data. The central claim is that this yields state-of-the-art performance on both HOSI and HOI generation tasks together with strong generalization to unseen scenes.

Significance. If the empirical claims hold, the work would offer a practical route to scene-aware interaction synthesis under severe data constraints, combining efficient consistency-model sampling with a lightweight data-augmentation trick. This could benefit downstream applications in embodied AI, animation, and simulation where explicit 3-D scene geometry is unavailable or expensive.

major comments (2)
  1. Abstract: the assertion of 'state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes' is presented without any quantitative metrics, baseline tables, ablation results, or error analysis, rendering the load-bearing claim that the hybrid voxel-injection strategy produces usable pseudo-HOSI samples unverifiable from the supplied text.
  2. Abstract (hybrid training strategy paragraph): the central assumption that 'injecting voxelized scene occupancy into HOI datasets' yields pseudo-samples that teach consistent dynamic object-scene reasoning is not accompanied by any reported ablation isolating voxel resolution, penetration-rate measurements on held-out real HOSI data, or comparison against non-voxelized baselines; without such evidence the iterative refinement and bump-aware guidance cannot be shown to correct rather than reinforce artifacts introduced by the coarse, static voxel representation.
minor comments (2)
  1. Abstract: 'startegy' is a typographical error and should read 'strategy'.
  2. Abstract: the phrases 'dynamic perception strategy' and 'bump-aware guidance' are introduced without reference to the corresponding equations or algorithmic steps that would appear in the methods section, reducing immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and have made targeted revisions to strengthen the presentation of our claims and supporting evidence.

read point-by-point responses
  1. Referee: Abstract: the assertion of 'state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes' is presented without any quantitative metrics, baseline tables, ablation results, or error analysis, rendering the load-bearing claim that the hybrid voxel-injection strategy produces usable pseudo-HOSI samples unverifiable from the supplied text.

    Authors: We agree that the abstract would be more informative if it included concrete quantitative support for the state-of-the-art and generalization claims. Although the full manuscript provides detailed tables, baseline comparisons, and ablation studies in Sections 4 and 5, we have revised the abstract to incorporate key performance metrics (e.g., improvements on HOSI and HOI benchmarks) and a brief statement on generalization to unseen scenes. This makes the central claims verifiable from the abstract alone while preserving its concise nature. revision: yes

  2. Referee: Abstract (hybrid training strategy paragraph): the central assumption that 'injecting voxelized scene occupancy into HOI datasets' yields pseudo-samples that teach consistent dynamic object-scene reasoning is not accompanied by any reported ablation isolating voxel resolution, penetration-rate measurements on held-out real HOSI data, or comparison against non-voxelized baselines; without such evidence the iterative refinement and bump-aware guidance cannot be shown to correct rather than reinforce artifacts introduced by the coarse, static voxel representation.

    Authors: The referee correctly identifies that the abstract does not explicitly detail ablations isolating voxel resolution or penetration rates on held-out data. The manuscript already contains ablations on the hybrid training strategy and overall artifact reduction; however, to directly address this concern we have added a concise summary of the relevant ablation results (including voxel-resolution sensitivity and penetration metrics versus non-voxelized baselines) to the abstract. We have also expanded the experimental section with additional held-out evaluations confirming that the pseudo-samples improve dynamic reasoning and that subsequent refinement steps reduce rather than reinforce voxel-induced artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation builds on external consistency models and empirical hybrid training without self-referential reductions.

full rationale

The paper presents a coarse-to-fine instruction-conditioned framework aligned with consistency model denoising, using dynamic perception from prior trajectories and bump-aware guidance at sampling time. The hybrid training synthesizes pseudo-HOSI via voxelized occupancy injection into HOI data and joint training with HSI, but this is an explicit design choice for data augmentation rather than a fitted parameter renamed as prediction or a self-definitional loop. No equations, uniqueness theorems, or ansatzes are shown that reduce by construction to the inputs; SOTA and generalization claims rest on experimental results. The approach extends prior consistency models without load-bearing self-citations or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical components are presented as extensions of existing consistency models and datasets.

pith-pipeline@v0.9.0 · 5543 in / 1154 out tokens · 61120 ms · 2026-05-10T20:06:04.311044+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    coarse-to-fine instruction-conditioned interaction generation framework ... aligned with the iterative denoising process of a consistency model ... dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context ... bump-aware guidance ... hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    hybrid data training strategy ... voxelized scene occupancy ... jointly trains with high-fidelity HSI data

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

  1. [1]

    Physically plausible full-body hand-object interaction synthesis

    Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Physically plausible full-body hand-object interaction synthesis. In 2024 International Conference on 3D Vision (3DV) ,

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

  3. [3]

    Coohoi: Learning cooperative human-object interaction with manipulated object dynamics

    Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, and Jiangmiao Pang. Coohoi: Learning cooperative human-object interaction with manipulated object dynamics. arXiv preprint arXiv:2406.14558 ,

  4. [4]

    Unihm: Universal human motion generation with object interactions in indoor scenes

    Zichen Geng, Zeeshan Hayder, Wei Liu, and Ajmal Mian. Unihm: Universal human motion generation with object interactions in indoor scenes. arXiv preprint arXiv:2505.12774 ,

  5. [5]

    Synthesizing physical character-scene interactions

    Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Synthesizing physical character-scene interactions. In ACM SIGGRAPH 2023 Conference Proceedings,

  6. [6]

    Autonomous character-scene interaction synthesis from text instruction

    Nan Jiang, Zimo He, Zi Wang, Hongjie Li, Yixin Chen, Siyuan Huang, and Yixin Zhu. Autonomous character-scene interaction synthesis from text instruction. In SIGGRAPH Asia 2024 Conference Papers, 2024a.

  7. [7]

    Zerohsi: Zero-shot 4d human-scene interaction by video generation

    Hongjie Li, Hong-Xing Yu, Jiaman Li, and Jiajun Wu. Zerohsi: Zero-shot 4d human-scene interaction by video generation. arXiv preprint arXiv:2412.18600, 2024a.

  8. [8]

    Controllable human-object interaction synthesis

    Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. In European Conference on Computer Vision (ECCV), 2024b.

  9. [9]

    Task-oriented human-object interactions generation with implicit neural representations

    Quanzhou Li, Jingbo Wang, Chen Change Loy, and Bo Dai. Task-oriented human-object interactions generation with implicit neural representations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024c.

  10. [10]

    Manicm: Real-time 3d diffusion policy via consistency model

    Guanxing Lu, Zifeng Gao, Tianxing Chen, Wenxun Dai, Ziwei Wang, Wenbo Ding, and Yansong Tang. Manicm: Real-time 3d diffusion policy via consistency model. arXiv preprint arXiv:2406.01586, 2024.

  11. [11]

    Humoto: A 4d dataset of mocap human object interactions

    Jiaxin Lu, Chun-Hao Paul Huang, Uttaran Bhattacharya, Qixing Huang, and Yi Zhou. Humoto: A 4d dataset of mocap human object interactions. arXiv preprint arXiv:2504.10414, 2025b.

  12. [12]

    Generating continual human motion in diverse 3d scenes

    Aymen Mir, Xavier Puig, Angjoo Kanazawa, and Gerard Pons-Moll. Generating continual human motion in diverse 3d scenes. In 2024 International Conference on 3D Vision (3DV),

  13. [13]

    Synthesizing physically plausible human motions in 3d scenes

    Liang Pan, Jingbo Wang, Buzhen Huang, Junyu Zhang, Haofan Wang, Xu Tang, and Yangang Wang. Synthesizing physically plausible human motions in 3d scenes. In 2024 International Conference on 3D Vision (3DV) ,

  14. [14]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 ,

  15. [15]

    Towards diverse and natural scene-aware 3d human motion synthesis

    Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Lin, and Bo Dai. Towards diverse and natural scene-aware 3d human motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022a.

  16. [16]

    Human-object interaction from human-level instructions

    Zhen Wu, Jiaman Li, Pei Xu, and C Karen Liu. Human-object interaction from human-level instructions. arXiv preprint arXiv:2406.17840 ,

  17. [17]

    Hosig: Full-body human-object-scene interaction generation with hierarchical scene perception

    Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, and Jinhui Tang. Hosig: Full-body human-object-scene interaction generation with hierarchical scene perception. arXiv preprint arXiv:2506.01579 ,

  18. [18]

    Generating human interaction motions in scenes with text control

    Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, and Davis Rempe. Generating human interaction motions in scenes with text control. arXiv:2404.10685,

  19. [19]

    Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demonstrations

    Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, and Qifeng Chen. Skillmimic-v2: Learning robust and generalizable interaction skills from sparse and noisy demonstrations. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conferenc...

  20. [20]

    Humanoidverse: A versatile humanoid for vision-language guided multi-object rearrangement

    Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, and Wei Pan. Humanoidverse: A versatile humanoid for vision-language guided multi-object rearrangement. arXiv preprint arXiv:2508.16943, 2025a.

  21. [21]

    Scenic: Scene-aware semantic navigation with instruction-guided control. arXiv preprint arXiv:2412.15664, 2024

    Siwei Zhang, Yan Zhang, Qianli Ma, Michael J Black, and Siyu Tang. Generating person- scene interactions in 3d scenes. In International Conference on 3D Vision (3DV) , 2020a. Siwei Zhang, Yan Zhang, Qianli Ma, Michael J Black, and Siyu Tang. Place: Proximity learning of articulation and contact in 3d environments. In 2020 International Conference on 3D Vi...

  22. [22]

    arXiv preprint arXiv:2503.12955 (2025)

    Yonghao Zhang, Qiang He, Yanguang Wan, Yinda Zhang, Xiaoming Deng, Cuixia Ma, and Hongan Wang. Diffgrasp: Whole-body grasping synthesis guided by object motion using a diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence , 2025b. Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, and Shiguang Shan. His-gpt: Towards 3d human-in-s...

  23. [23]

    Evolvinggrasp: Evolutionary grasp generation via efficient preference alignment

    Yufei Zhu, Yiming Zhong, Zemin Yang, Peishan Cong, Jingyi Yu, Xinge Zhu, and Yuexin Ma. Evolvinggrasp: Evolutionary grasp generation via efficient preference alignment. arXiv preprint arXiv:2503.14329 ,

  24. [24]

    4, with more scenes, object types and motion types, including lifting over head, kicking and interaction with static scene

    A Appendix A.1 More Qualitative Results More qualitative results are presented in Fig. 4, with more scenes, object types and motion types, including lifting over head, kicking and interaction with static scene. As shown in Fig. 5, InfBaGel is largely environment-independent because it conditions on a voxelized scene representation, rather than relying on ...

  25. [25]

    Finally, the difference from ground truth is quantified by the mean per joint position error (MPJPE), the root translation error (Troot) and the object pose error (Tobj, Oobj)

    and contact percentage (C%), and measure the penetration between the human body and the object (Pbody). Finally, the difference from ground truth is quantified by the mean per joint position error (MPJPE), the root translation error (Troot) and the object pose error (Tobj, Oobj). There are a large number of noise points in the SDF of some objects in ...

  26. [26]

    Competitive results can be achieved through simpler guidance and fewer optimization steps (CHOIS and ROG used 10 optimization steps)

    with the guidance of CHOIS, to demonstrate the effectiveness of our iterative optimization. Competitive results can be achieved through simpler guidance and fewer optimization steps (CHOIS and ROG used 10 optimization steps). Results on HOI. Table 5 presents the quantitative results on the OMOMO dataset. InfBaGel consistently outperforms all baselines a...

  27. [27]

    A balanced ratio of 1:1 or 1:0.5 between synthesized OMOMO data and LINGO data achieves the best overall performance across all metrics

    The results show that the gain from HSI data on scene understanding has a limit, too much HSI data may compromise the model’s ability to learn object manipulation from HOI data, indicating a trade-off between scene-level physical plausibility and task-specific interaction priors. A balanced ratio of 1:1 or 1:0.5 between synthesized OMOMO data and LINGO da...

  28. [28]

    , 2025a)

    or basketball sport (Liu & Hodgins, 2018; Wang et al., 2025a). Recent mimic learning methods (Xu et al., 2025; Yu et al.,

  29. [29]

    With the growing availability of human-object interaction datasets (Bhatnagar et al.

    have achieved generalized character interaction. With the growing availability of human-object interaction datasets (Bhatnagar et al., 2022; Li et al., 2023; Lu et al., 2025b), certain methods (Li et al., 2024b; Cong et al.,

  30. [30]

    Other methods (Xu et al.

    have started generating motions for interactions with large objects, often relying on sequential points or object trajectories, which restricts the model's ability to autonomously generate diverse interactions. Other methods (Xu et al., 2023; Diller & Dai, 2024; Song et al., 2024; Peng et al., 2025; Li et al., 2025; Xue et al., 2025; Zeng et al.,

  31. [31]

    Limitation of HOI dataset

    attempt to simultaneously synthesize human and object motions but require additional models for optimization, unable to achieve real-time generation. Limitation of HOI dataset. Due to the general lack of scene annotations in datasets, human-object interactions are typically placed into scenes by planning collision-free paths (Li et al., 2024b; Wu et al., 2...

  32. [32]

    , 2020b; Xuan et al

    to interacting with static objects (Zhang et al., 2020b; Xuan et al., 2023; Zhao et al., 2022; Wang et al., 2022b), such as sitting and lying down. Subsequently, some methods considered both locomotion and static interaction simultaneously. Hassan et al. (2021), Wang et al. (2022a), Huang et al. (2023), Mir et al. (2024) and Zhang et al. (2024) model...