IMAGIN-4D: Image-Guided Controllable Interaction Generation

Bu\u{g}ra Tekin; Chenhongyi Yang; Dimitrios Tzionas; Federica Bogo; Michael J. Black; Nadine Bertsch; Sai Kumar Dwivedi; Shreyas Hampali; Tomas Hodan

arxiv: 2606.23675 · v1 · pith:XMIXOLKJnew · submitted 2026-06-22 · 💻 cs.CV

IMAGIN-4D: Image-Guided Controllable Interaction Generation

Sai Kumar Dwivedi , Federica Bogo , Bu\u{g}ra Tekin , Chenhongyi Yang , Nadine Bertsch , Tomas Hodan , Michael J. Black , Dimitrios Tzionas

show 1 more author

Shreyas Hampali

This is my paper

Pith reviewed 2026-06-26 09:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interactiondiffusion modelimage conditioningmotion generationcontrollable synthesisspatio-temporal tokensinteraction control

0 comments

The pith

Decomposing a reference image into spatial interaction-state tokens and temporal frame-aware patches lets a diffusion model generate human-object motions that match specific grasps, contacts and layouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to reduce ambiguity in human-object interaction generation, where text prompts and waypoints alone allow multiple valid outcomes for the same action and trajectory. It introduces a method that conditions a diffusion model on one reference image by extracting tokens for body pose, object pose, contact and spatial relations at the shown moment, plus per-frame patch queries that let different sequence parts draw on distinct parts of the same image. Role-aware streams keep text, waypoint and image signals distinct during generation. Experiments indicate this yields tighter visual adherence than single-token or uniform-image baselines while waypoint following and motion quality stay intact.

Core claim

IMAGIN-4D decomposes image conditioning spatio-temporally: supervised interaction-state tokens capture body pose, object pose, body-object contact and spatial relationships at the depicted frame; frame-aware tokens are formed by querying image patches for each generated frame so sequence segments attend to different visual cues from the identical image. These signals integrate through role-aware conditioning where text, waypoints and interaction-state tokens use separate AdaLN streams while frame-aware tokens cross-attend with motion tokens. The resulting model improves fine-grained interaction control over single-token and uniformly image-conditioned baselines on the tested datasets while p

What carries the argument

Spatio-temporal decomposition of a single reference image into interaction-state tokens for the depicted state and frame-aware tokens queried per generated frame, routed via separate AdaLN streams and cross-attention.

If this is right

Generated motions adhere more closely to the visual snapshot in body pose, object pose, contacts and spatial layout.
Waypoint following and overall motion quality remain comparable to prior methods.
The same image can supply different conditioning signals to different parts of the motion sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Casual photographs could serve as direct specifications for interaction synthesis in animation pipelines.
The token decomposition pattern might apply to other single-image signals such as depth or segmentation maps.
Reduced reliance on lengthy text descriptions could simplify user interfaces for embodied AI task specification.

Load-bearing premise

Tokens extracted from images rendered by the synthetic motion-to-image pipeline supply unambiguous interaction cues that transfer to control real motions.

What would settle it

Running the model on real photographs of interactions and checking whether output contacts, grasps and body-object layouts match the depicted snapshot more closely than baselines would falsify the transfer claim if no improvement appears.

Figures

Figures reproduced from arXiv: 2606.23675 by Bu\u{g}ra Tekin, Chenhongyi Yang, Dimitrios Tzionas, Federica Bogo, Michael J. Black, Nadine Bertsch, Sai Kumar Dwivedi, Shreyas Hampali, Tomas Hodan.

**Figure 1.** Figure 1: Image-guided 4D HOI generation. Given a text prompt, object geometry, object waypoints, and a reference image, IMAGIN-4D synthesizes a 4D human-object interaction sequence. Text and waypoints specify the action and object trajectory, but leave fine-grained interaction details such as pose, contact, and layout ambiguous. We resolve this ambiguity with a reference image that specifies the interaction snapsho… view at source ↗

**Figure 2.** Figure 2: IMAGIN-4D overview (Sec. 3). Given a reference image I, text prompt 𝑦, object geometry O, and sparse waypoints W, IMAGIN-4D generates a 4D human-object motion sequence. A frozen image encoder extracts patch tokens P from I. The Spatially Factorized Image Encoder (SFIE) reads these patches with role-specific queries and produces supervised latent tokens for contact 𝜿, human pose 𝝆, object pose 𝝃 , and body-… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on FullBodyManipulation (Sec. 4.1). Each row shows a reference image (EditImage) and generated motion from CHOIS+Img, ViHOI★, and IMAGIN-4D, all conditioned on the same text prompt, object shape, waypoints, and reference image. Single-token image conditioning often misses fine-grained details, such as contact, hand placement, or body-object layout, while IMAGIN-4D better matches the … view at source ↗

**Figure 4.** Figure 4: Flip-image consistency test (Sec.4.3). We horizontally flip only the reference image at inference time while keeping the text prompt, object geometry, and waypoints fixed. The generated contact side and body-object layout change with the flipped image, showing that IMAGIN-4D uses visual evidence rather than ignoring the image condition. The motion is not an exact mirror because the unchanged non-image cond… view at source ↗

**Figure 5.** Figure 5: Conditioning image domains (Sec. 4.2). We render conditioning images from each ground-truth sequence from the FullBodyManipulation dataset and use the contact-centered frame for evaluation. MeshImg is a clean body-object render, SceneImg adds Replica scenes, body textures, and posed objects, and EditImg applies image editing for more photorealistic references. SceneImg is used for evaluation, while MeshImg… view at source ↗

**Figure 6.** Figure 6: Sketch-to-motion (Sec.4.3). We replace the RGB reference image with a line drawing and retrain the model. Despite removing texture, color, and scene appearance, the model preserves the depicted interaction layout and generates a complete motion sequence. This shows that IMAGIN-4D can also support sketch-based conditioning, where users specify the desired contact and body-object arrangement with a simple dr… view at source ↗

**Figure 7.** Figure 7: Qualitative comparison on FullBodyManipulation (SceneImg) (Sec. 4.1). Each row shows the SceneImg reference and generated motion from CHOIS+Img, ViHOI★, and IMAGIN-4D under the same text prompt, object shape, waypoints, and reference image. Single-token image conditioning often misses contact, hand placement, object orientation, or body-object layout. IMAGIN-4D better matches the depicted interaction state… view at source ↗

**Figure 8.** Figure 8: Why EditImg is derived from SceneImg. Columns 1–2 show direct MeshImg→EditImg results. Columns 3–5 show the same interaction as MeshImg, SceneImg, and SceneImg-derived EditImg. Direct MeshImg editing can preserve the interaction in some cases, but can also change human pose, object pose, contact, or body-object layout. Since image adherence assumes that the reference preserves the target interaction geome… view at source ↗

read the original abstract

Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image as a visual specification of the desired interaction snapshot. However, a single global image representation conflates distinct cues and conditions all frames on identical visual evidence. We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at https://imagin4d.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IMAGIN-4D splits image conditioning into spatial state tokens and per-frame patch queries to tighten HOI control, but the synthetic FBM pipeline leaves the real-data transfer claim thin.

read the letter

The paper's main contribution is a diffusion model that conditions HOI motion on a reference image by splitting the visual signal: supervised interaction-state tokens capture body pose, object pose, contact and layout at one frame, while frame-aware tokens let each generated frame query different patches from the same image. Separate AdaLN streams handle text, waypoints and state tokens, with cross-attention for the frame-aware part. They also release a synthetic motion-to-image pipeline and an adherence metric because paired real images are missing from existing datasets.

This decomposition directly targets the underspecification problem that text-plus-waypoint methods leave open. The architecture choice is concrete and the motivation is clear from the abstract.

The experiments claim better fine-grained control on FBM and BEHAVE without hurting waypoint adherence or motion quality. That would be useful for animation and robotics work if the numbers hold.

The soft spot is the domain gap. All conditioning tokens come from synthetic renders of FBM motions, yet the paper tests on real BEHAVE sequences. Nothing in the abstract shows an explicit transfer test with real reference images on BEHAVE, so it is unclear whether the reported gains survive rendering artifacts or missing real-world variability. The lack of any quantitative numbers, baselines or ablations in the abstract makes it hard to judge effect size.

This is for researchers working on controllable motion synthesis and image-conditioned generation. A reader who needs a practical way to add visual constraints to HOI models would find the token split and conditioning streams worth looking at.

It deserves peer review because the problem is well-posed and the architecture is a direct attempt to solve it, even if the evidence for cross-domain robustness needs checking.

Referee Report

2 major / 0 minor

Summary. The paper introduces IMAGIN-4D, a diffusion-based model for generating human-object interactions (HOI) from text, waypoints, and a reference image. It decomposes image conditioning into supervised interaction-state tokens (body/object pose, contact, spatial relations) extracted at a depicted frame for spatial control and frame-aware tokens obtained by per-frame patch querying for temporal control. Role-aware conditioning separates AdaLN streams for text/waypoints/interaction-state tokens while using cross-attention for frame-aware tokens. Because HOI datasets lack paired images, the authors construct a synthetic motion-to-image rendering pipeline on FullBodyManipulation (FBM) and introduce an image-adherence metric. Experiments on FBM and BEHAVE are reported to show improved fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint following and motion quality; code and models will be released.

Significance. If the central claims hold, the spatio-temporal decomposition of image conditioning offers a concrete mechanism for resolving underspecification in HOI generation beyond text and sparse waypoints, with potential utility in animation, robotics, and embodied AI. The explicit release of code and models strengthens reproducibility and enables follow-up work.

major comments (2)

[Abstract] Abstract (dataset construction paragraph): the central generalization claim—that interaction-state tokens and frame-aware patches extracted from synthetic FBM renderings supply transferable conditioning signals on the real BEHAVE dataset—rests on an untested cross-domain assumption. No experiment is described that applies real reference images to BEHAVE motions or quantifies the domain gap between synthetic and captured interaction cues.
[Abstract] Abstract (experiments paragraph): the claim that IMAGIN-4D “improves fine-grained interaction control” on both FBM and BEHAVE is stated without any reported quantitative values, baseline definitions, error bars, or ablation tables. This absence makes it impossible to evaluate whether the reported gains are load-bearing or merely consistent with the synthetic-domain training distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We will revise the manuscript to improve clarity on the cross-domain aspects and to include key quantitative details in the abstract. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (dataset construction paragraph): the central generalization claim—that interaction-state tokens and frame-aware patches extracted from synthetic FBM renderings supply transferable conditioning signals on the real BEHAVE dataset—rests on an untested cross-domain assumption. No experiment is described that applies real reference images to BEHAVE motions or quantifies the domain gap between synthetic and captured interaction cues.

Authors: We agree the manuscript does not describe experiments that apply real captured reference images to BEHAVE motions, since BEHAVE provides no such paired images. The BEHAVE results instead apply the model (trained exclusively on synthetic FBM renderings) to real motion sequences while using synthetically rendered reference images generated via the same pipeline. This tests transfer of the learned conditioning mechanism but does not quantify the synthetic-to-real image domain gap. We will revise the abstract and add an explicit limitations paragraph discussing this point. revision: yes
Referee: [Abstract] Abstract (experiments paragraph): the claim that IMAGIN-4D “improves fine-grained interaction control” on both FBM and BEHAVE is stated without any reported quantitative values, baseline definitions, error bars, or ablation tables. This absence makes it impossible to evaluate whether the reported gains are load-bearing or merely consistent with the synthetic-domain training distribution.

Authors: The abstract is a high-level summary; the full quantitative results (image-adherence, contact accuracy, waypoint error, FID, with baselines, standard deviations, and ablations) appear in Section 4 and Tables 1–3 of the main paper. To make the abstract self-contained, we will insert concise numerical improvements (e.g., “+12.4% image adherence on FBM, +8.7% on BEHAVE”) while respecting length constraints. revision: yes

Circularity Check

0 steps flagged

No circularity: novel conditioning tokens evaluated on external datasets

full rationale

The paper introduces interaction-state tokens (body/object pose, contact, spatial relations) and frame-aware patch tokens extracted from a newly constructed synthetic motion-to-image pipeline on FBM, then demonstrates gains over single-token and uniform-image baselines via experiments on both FBM and the independent BEHAVE dataset. No equations, predictions, or central claims reduce by construction to fitted parameters from the same data or to self-citations; the image-adherence metric is an external evaluation tool, and the spatio-temporal decomposition is a methodological choice whose effectiveness is tested rather than presupposed. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the validity of the new token extraction process and the assumption that synthetic images can stand in for real paired data; these are introduced without independent verification in the provided abstract.

axioms (2)

domain assumption Diffusion models can be effectively conditioned on decomposed visual tokens extracted from a single reference image for spatio-temporal control.
Invoked in the description of spatial and temporal conditioning mechanisms.
ad hoc to paper A synthetic motion-to-image rendering pipeline can produce reference images whose interaction cues transfer to real motion generation.
Stated as the solution to the lack of paired image-motion datasets.

invented entities (2)

interaction-state tokens no independent evidence
purpose: Extract supervised cues for body pose, object pose, contact, and spatial relationships from the reference image at the depicted frame.
New token type for spatial conditioning; no independent evidence provided beyond the method description.
frame-aware tokens no independent evidence
purpose: Query image patches per generated frame to allow sequence segments to attend to different visual cues from the same image.
New token type for temporal conditioning; no independent evidence provided beyond the method description.

pith-pipeline@v0.9.1-grok · 5900 in / 1665 out tokens · 29465 ms · 2026-06-26T09:07:16.416144+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

112 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Flamingo: a Visual Language Model for Few-Shot Learning , author=
[2]

and Sminchisescu, Cristian and Theobalt, Christian and Pons-Moll, Gerard , booktitle=CVPR, publisher=

Bhatnagar, Bharat Lal and Xie, Xianghui and Petrov, Ilya A. and Sminchisescu, Cristian and Theobalt, Christian and Pons-Moll, Gerard , booktitle=CVPR, publisher=
[3]

Cha, Junuk and Kim, Jihyeon and Yoon, Jae Shin and Baek, Seungryul , booktitle=CVPR, publisher=
[4]

Cai, Songjin and Zhong, Linjie and Guo, Ling and Ding, Changxing , booktitle=CVPR, publisher=
[5]

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven C. H. , booktitle=ICML, publisher=
[6]

Object Motion Guided Human Motion Synthesis , author=
[7]

Karen , booktitle=ECCV, publisher=

Li, Jiaman and Clegg, Alexander and Mottaghi, Roozbeh and Wu, Jiajun and Puig, Xavier and Liu, C. Karen , booktitle=ECCV, publisher=
[8]

Transactions on Machine Learning Research , publisher=

Oquab, Maxime and Darcet, Timoth. Transactions on Machine Learning Research , publisher=
[9]

Scalable Diffusion Models with Transformers , author=
[10]

Xu, Sirui and Li, Dongting and Zhang, Yucheng and Xu, Xiyan and Long, Qi and Wang, Ziyin and Lu, Yunzhi and Dong, Shuchang and Jiang, Hezi and Gupta, Akshat and Wang, Yu-Xiong and Gui, Liang-Yan , booktitle=CVPR, publisher=
[11]

Yao, Wei and Sun, Yunlian and Zhang, Hongwen and Liu, Yebin and Tang, Jinhui , booktitle=AAAI, publisher=
[12]

Zhang, Lvmin and Rao, Anyi and Agrawala, Maneesh , booktitle=ICCV, publisher=
[13]

Diller, Christian and Dai, Angela , booktitle=CVPR, publisher=
[14]

Xu, Sirui and Li, Zhengyuan and Wang, Yu-Xiong and Gui, Liang-Yan , booktitle=ICCV, publisher=
[15]

Xie, Yiming and Jampani, Varun and Zhong, Lei and Sun, Deqing and Jiang, Huaizu , booktitle=ICLR, publisher=
[16]

Flexible Motion In-betweening with Diffusion Models , author=
[17]

Ghosh, Anindita and Dabral, Rishabh and Golyanik, Vladislav and Theobalt, Christian and Slusallek, Philipp , journal=CGF, publisher=
[18]

and Tzionas, Dimitrios , booktitle=GCPR, publisher=

Huang, Yinghao and Taheri, Omid and Black, Michael J. and Tzionas, Dimitrios , booktitle=GCPR, publisher=
[19]

Full-Body Articulated Human-Object Interaction , author=
[20]

Human Motion Diffusion Model , author=
[21]

Guided Motion Diffusion for Controllable Human Motion Synthesis , author=
[22]

Peng, Xiaogang and Xie, Yiming and Wu, Zizhao and Jampani, Varun and Sun, Deqing and Jiang, Huaizu , booktitle=CVPRW, publisher=
[23]

Efficient Learning on Point Clouds with Basis Point Sets , author=
[24]

Learning Transferable Visual Models from Natural Language Supervision , author=
[25]

Hierarchical Text-Conditional Image Generation with

Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark , journal=. Hierarchical Text-Conditional Image Generation with
[26]

Denoising Diffusion Probabilistic Models , author=
[27]

NeurIPS Workshop on Deep Generative Models and Downstream Applications , publisher=

Classifier-Free Diffusion Guidance , author=. NeurIPS Workshop on Deep Generative Models and Downstream Applications , publisher=
[28]

Song, Wenfeng and Zhang, Xinyu and Li, Shuai and Gao, Yang and Hao, Aimin and Hou, Xia and Chen, Chenglizhao and Li, Ning and Qin, Hong , booktitle=CVPR, publisher=
[29]

Pavlakos, Georgios and Choutas, Vasileios and Ghorbani, Nima and Bolkart, Timo and Osman, Ahmed A. A. and Tzionas, Dimitrios and Black, Michael J. , booktitle=CVPR, publisher=
[30]

ACM Transactions on Graphics, (Proc

Embodied Hands: Modeling and Capturing Hands and Bodies Together , author =. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) , volume =. 2017 , month_numeric =

2017
[31]

and Patel, Priyanka and Tesch, Joachim and Yang, Jinlong , booktitle=CVPR, publisher=

Black, Michael J. and Patel, Priyanka and Tesch, Joachim and Yang, Jinlong , booktitle=CVPR, publisher=
[32]

Straub, Julian and Whelan, Thomas and Ma, Lingni and Chen, Yufan and Wijmans, Erik and Green, Simon and Engel, Jakob J. and Mur-Artal, Raul and Ren, Carl and Verma, Shobhit and Clarkson, Anton and Yan, Mingfei and Budge, Brian and Yan, Yajie and Pan, Xiaqing and Yon, June and Zou, Yuyang and Leon, Kimberly and Carter, Nigel and Briales, Jesus and Gillingh...
[33]

and Tzionas, Dimitrios , booktitle=ECCV, publisher=

Taheri, Omid and Ghorbani, Nima and Black, Michael J. and Tzionas, Dimitrios , booktitle=ECCV, publisher=
[34]

and Tzionas, Dimitrios , booktitle=CVPR, publisher=

Taheri, Omid and Choutas, Vasileios and Black, Michael J. and Tzionas, Dimitrios , booktitle=CVPR, publisher=
[35]

Yang, Jie and Niu, Xuesong and Jiang, Nan and Zhang, Ruimao and Huang, Siyuan , booktitle=ECCV, publisher=
[36]

Zhang, Xiaohan and Bhatnagar, Bharat Lal and Starke, Sebastian and Guzov, Vladimir and Pons-Moll, Gerard , booktitle=ECCV, publisher=
[37]

and Hilliges, Otmar , booktitle=CVPR, publisher=

Fan, Zicong and Taheri, Omid and Tzionas, Dimitrios and Kocabas, Muhammed and Kaufmann, Manuel and Black, Michael J. and Hilliges, Otmar , booktitle=CVPR, publisher=
[38]

Liu, Yunze and Liu, Yun and Jiang, Che and Lyu, Kangbo and Wan, Weikang and Shen, Hao and Liang, Boqiang and Fu, Zhoujie and Wang, He and Yi, Li , booktitle=CVPR, publisher=
[39]

Zhan, Xinyu and Yang, Lixin and Zhao, Yifei and Mao, Kangrui and Xu, Hanlin and Lin, Zenan and Li, Kailin and Lu, Cewu , booktitle=CVPR, publisher=
[40]

Liu, Yun and Yang, Haolin and Si, Xu and Liu, Lei and Li, Zipeng and Zhang, Yuxiang and Liu, Yebin and Yi, Li , booktitle=CVPR, publisher=
[41]

Lv, Xintao and Xu, Liang and Yan, Yichao and Jin, Xin and Xu, Congsheng and Wu, Shuwen and Liu, Yifan and Li, Lincheng and Bi, Mengxiao and Zeng, Wenjun and Yang, Xiaokang , booktitle=ECCV, publisher=
[42]

Kim, Jeonghwan and Kim, Jisoo and Na, Jeonghyeon and Joo, Hanbyul , booktitle=CVPR, publisher=
[43]

Lu, Jiaxin and Huang, Chun-Hao Paul and Bhattacharya, Uttaran and Huang, Qixing and Zhou, Yi , booktitle=ICCV, publisher=
[44]

Zhang, Juze and Zhang, Jingyan and Song, Zining and Shi, Zhanhe and Zhao, Chengfeng and Shi, Ye and Yu, Jingyi and Xu, Lan and Wang, Jingya , booktitle=CVPR, publisher=
[45]

Zhao, Chengfeng and Zhang, Juze and Du, Jiashen and Shan, Ziwei and Wang, Junye and Yu, Jingyi and Wang, Jingya and Xu, Lan , booktitle=CVPR, publisher=
[46]

, booktitle=ICCV, publisher=

Hassan, Mohamed and Choutas, Vasileios and Tzionas, Dimitrios and Black, Michael J. , booktitle=ICCV, publisher=. Resolving
[47]

Wang, Zan and Chen, Yixin and Liu, Tengyu and Zhu, Yixin and Liang, Wei and Huang, Siyuan , booktitle=NEURIPS, publisher=
[48]

Scaling Up Dynamic Human-Scene Interaction Modeling , author=
[49]

Zhang, Mingyuan and Cai, Zhongang and Pan, Liang and Hong, Fangzhou and Guo, Xinying and Yang, Lei and Liu, Ziwei , journal=PAMI, publisher=
[50]

Executing your Commands via Motion Diffusion in Latent Space , author=
[51]

Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi , booktitle=CVPR, publisher=
[52]

Jiang, Biao and Chen, Xin and Liu, Wen and Yu, Jingyi and Yu, Gang and Chen, Tao , booktitle=NEURIPS, publisher=
[53]

Guo, Chuan and Mu, Yuxuan and Javed, Muhammad Gohar and Wang, Sen and Cheng, Li , booktitle=CVPR, publisher=
[54]

Zhang, Mingyuan and Guo, Xinying and Pan, Liang and Cai, Zhongang and Hong, Fangzhou and Li, Huirong and Yang, Lei and Liu, Ziwei , booktitle=ICCV, publisher=
[55]

Zhang, Mingyuan and Li, Huirong and Cai, Zhongang and Ren, Jiawei and Yang, Lei and Liu, Ziwei , booktitle=NEURIPS, publisher=
[56]

Human Motion Diffusion as a Generative Prior , author=
[57]

Wan, Weilin and Dou, Zhiyang and Komura, Taku and Wang, Wenping and Jayaraman, Dinesh and Liu, Lingjie , booktitle=ECCV, publisher=
[58]

Pinyoanuntapong, Ekkasit and Saleem, Muhammad Usama and Karunratanakul, Korrawe and Wang, Pu and Xue, Hongfei and Chen, Chen and Guo, Chuan and Cao, Junli and Ren, Jian and Tulyakov, Sergey , booktitle=ICCV, publisher=
[59]

Optimizing Diffusion Noise Can Serve As Universal Motion Priors , author=
[60]

Tessler, Chen and Guo, Yunrong and Nabati, Ofir and Chechik, Gal and Peng, Xue Bin , journal=TOG, publisher=
[61]

Hand-Object Contact Consistency Reasoning for Human Grasps Generation , author=
[62]

Liu, Shaowei and Zhou, Yang and Yang, Jimei and Gupta, Saurabh and Wang, Shenlong , booktitle=ICCV, publisher=
[63]

Tendulkar, Purva and Sur
[64]

, booktitle=I3DV, publisher=

Taheri, Omid and Zhou, Yi and Tzionas, Dimitrios and Zhou, Yang and Ceylan, Duygu and Pirk, Soeren and Black, Michael J. , booktitle=I3DV, publisher=
[65]

Zhou, Keyang and Bhatnagar, Bharat Lal and Lenssen, Jan Eric and Pons-Moll, Gerard , booktitle=ECCV, publisher=
[66]

Zhang, Hui and Christen, Sammy and Fan, Zicong and Hilliges, Otmar and Song, Jie , booktitle=ECCV, publisher=
[67]

Christen, Sammy and Hampali, Shreyas and Sener, Fadime and Remelli, Edoardo and Hodan, Tomas and Sauser, Eric and Ma, Shugao and Tekin, Bugra , booktitle=
[68]

Wu, Qianyang and Shi, Ye and Huang, Xiaoshui and Yu, Jingyi and Xu, Lan and Wang, Jingya , journal=
[69]

, journal=

Ron, Roey and Tevet, Guy and Sawdayee, Haim and Bermano, Amit H. , journal=
[70]

Diffusion-Based Generation, Optimization, and Planning in

Huang, Siyuan and Wang, Zan and Li, Puhao and Jia, Baoxiong and Liu, Tengyu and Zhu, Yixin and Liang, Wei and Zhu, Song-Chun , booktitle=CVPR, publisher=. Diffusion-Based Generation, Optimization, and Planning in
[71]

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance , author=
[72]

SIGGRAPH Asia 2024 Conference Papers , publisher=

Autonomous Character-Scene Interaction Synthesis from Text Instruction , author=. SIGGRAPH Asia 2024 Conference Papers , publisher=

2024
[73]

Generating Human Interaction Motions in Scenes with Text Control , author=
[74]

Kulkarni, Nilesh and Rempe, Davis and Genova, Kyle and Kundu, Abhijit and Johnson, Justin and Fouhey, David and Guibas, Leonidas , booktitle=CVPR, publisher=
[75]

Li, Lei and Dai, Angela , booktitle=CVPR, publisher=
[76]

Xu, Sirui and Wang, Ziyin and Wang, Yu-Xiong and Gui, Liang-Yan , booktitle=NEURIPS, publisher=
[77]

Li, Hongjie and Yu, Hong-Xing and Li, Jiaman and Wu, Jiajun , journal=
[78]

Zhang, Jinlu and Chen, Yixin and Wang, Zan and Yang, Jie and Wang, Yizhou and Huang, Siyuan , booktitle=CVPR, publisher=
[79]

Yuan, Ye and Song, Jiaming and Iqbal, Umar and Vahdat, Arash and Kautz, Jan , booktitle=ICCV, publisher=
[80]

Wang, Yinhuai and Lin, Jing and Zeng, Ailing and Luo, Zhengyi and Zhang, Jian and Zhang, Lei , journal=

Showing first 80 references.

[1] [1]

Flamingo: a Visual Language Model for Few-Shot Learning , author=

[2] [2]

and Sminchisescu, Cristian and Theobalt, Christian and Pons-Moll, Gerard , booktitle=CVPR, publisher=

Bhatnagar, Bharat Lal and Xie, Xianghui and Petrov, Ilya A. and Sminchisescu, Cristian and Theobalt, Christian and Pons-Moll, Gerard , booktitle=CVPR, publisher=

[3] [3]

Cha, Junuk and Kim, Jihyeon and Yoon, Jae Shin and Baek, Seungryul , booktitle=CVPR, publisher=

[4] [4]

Cai, Songjin and Zhong, Linjie and Guo, Ling and Ding, Changxing , booktitle=CVPR, publisher=

[5] [5]

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven C. H. , booktitle=ICML, publisher=

[6] [6]

Object Motion Guided Human Motion Synthesis , author=

[7] [7]

Karen , booktitle=ECCV, publisher=

Li, Jiaman and Clegg, Alexander and Mottaghi, Roozbeh and Wu, Jiajun and Puig, Xavier and Liu, C. Karen , booktitle=ECCV, publisher=

[8] [8]

Transactions on Machine Learning Research , publisher=

Oquab, Maxime and Darcet, Timoth. Transactions on Machine Learning Research , publisher=

[9] [9]

Scalable Diffusion Models with Transformers , author=

[10] [10]

Xu, Sirui and Li, Dongting and Zhang, Yucheng and Xu, Xiyan and Long, Qi and Wang, Ziyin and Lu, Yunzhi and Dong, Shuchang and Jiang, Hezi and Gupta, Akshat and Wang, Yu-Xiong and Gui, Liang-Yan , booktitle=CVPR, publisher=

[11] [11]

Yao, Wei and Sun, Yunlian and Zhang, Hongwen and Liu, Yebin and Tang, Jinhui , booktitle=AAAI, publisher=

[12] [12]

Zhang, Lvmin and Rao, Anyi and Agrawala, Maneesh , booktitle=ICCV, publisher=

[13] [13]

Diller, Christian and Dai, Angela , booktitle=CVPR, publisher=

[14] [14]

Xu, Sirui and Li, Zhengyuan and Wang, Yu-Xiong and Gui, Liang-Yan , booktitle=ICCV, publisher=

[15] [15]

Xie, Yiming and Jampani, Varun and Zhong, Lei and Sun, Deqing and Jiang, Huaizu , booktitle=ICLR, publisher=

[16] [16]

Flexible Motion In-betweening with Diffusion Models , author=

[17] [17]

Ghosh, Anindita and Dabral, Rishabh and Golyanik, Vladislav and Theobalt, Christian and Slusallek, Philipp , journal=CGF, publisher=

[18] [18]

and Tzionas, Dimitrios , booktitle=GCPR, publisher=

Huang, Yinghao and Taheri, Omid and Black, Michael J. and Tzionas, Dimitrios , booktitle=GCPR, publisher=

[19] [19]

Full-Body Articulated Human-Object Interaction , author=

[20] [20]

Human Motion Diffusion Model , author=

[21] [21]

Guided Motion Diffusion for Controllable Human Motion Synthesis , author=

[22] [22]

Peng, Xiaogang and Xie, Yiming and Wu, Zizhao and Jampani, Varun and Sun, Deqing and Jiang, Huaizu , booktitle=CVPRW, publisher=

[23] [23]

Efficient Learning on Point Clouds with Basis Point Sets , author=

[24] [24]

Learning Transferable Visual Models from Natural Language Supervision , author=

[25] [25]

Hierarchical Text-Conditional Image Generation with

Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark , journal=. Hierarchical Text-Conditional Image Generation with

[26] [26]

Denoising Diffusion Probabilistic Models , author=

[27] [27]

NeurIPS Workshop on Deep Generative Models and Downstream Applications , publisher=

Classifier-Free Diffusion Guidance , author=. NeurIPS Workshop on Deep Generative Models and Downstream Applications , publisher=

[28] [28]

Song, Wenfeng and Zhang, Xinyu and Li, Shuai and Gao, Yang and Hao, Aimin and Hou, Xia and Chen, Chenglizhao and Li, Ning and Qin, Hong , booktitle=CVPR, publisher=

[29] [29]

Pavlakos, Georgios and Choutas, Vasileios and Ghorbani, Nima and Bolkart, Timo and Osman, Ahmed A. A. and Tzionas, Dimitrios and Black, Michael J. , booktitle=CVPR, publisher=

[30] [30]

ACM Transactions on Graphics, (Proc

Embodied Hands: Modeling and Capturing Hands and Bodies Together , author =. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) , volume =. 2017 , month_numeric =

2017

[31] [31]

and Patel, Priyanka and Tesch, Joachim and Yang, Jinlong , booktitle=CVPR, publisher=

Black, Michael J. and Patel, Priyanka and Tesch, Joachim and Yang, Jinlong , booktitle=CVPR, publisher=

[32] [32]

Straub, Julian and Whelan, Thomas and Ma, Lingni and Chen, Yufan and Wijmans, Erik and Green, Simon and Engel, Jakob J. and Mur-Artal, Raul and Ren, Carl and Verma, Shobhit and Clarkson, Anton and Yan, Mingfei and Budge, Brian and Yan, Yajie and Pan, Xiaqing and Yon, June and Zou, Yuyang and Leon, Kimberly and Carter, Nigel and Briales, Jesus and Gillingh...

[33] [33]

and Tzionas, Dimitrios , booktitle=ECCV, publisher=

Taheri, Omid and Ghorbani, Nima and Black, Michael J. and Tzionas, Dimitrios , booktitle=ECCV, publisher=

[34] [34]

and Tzionas, Dimitrios , booktitle=CVPR, publisher=

Taheri, Omid and Choutas, Vasileios and Black, Michael J. and Tzionas, Dimitrios , booktitle=CVPR, publisher=

[35] [35]

Yang, Jie and Niu, Xuesong and Jiang, Nan and Zhang, Ruimao and Huang, Siyuan , booktitle=ECCV, publisher=

[36] [36]

Zhang, Xiaohan and Bhatnagar, Bharat Lal and Starke, Sebastian and Guzov, Vladimir and Pons-Moll, Gerard , booktitle=ECCV, publisher=

[37] [37]

and Hilliges, Otmar , booktitle=CVPR, publisher=

Fan, Zicong and Taheri, Omid and Tzionas, Dimitrios and Kocabas, Muhammed and Kaufmann, Manuel and Black, Michael J. and Hilliges, Otmar , booktitle=CVPR, publisher=

[38] [38]

Liu, Yunze and Liu, Yun and Jiang, Che and Lyu, Kangbo and Wan, Weikang and Shen, Hao and Liang, Boqiang and Fu, Zhoujie and Wang, He and Yi, Li , booktitle=CVPR, publisher=

[39] [39]

Zhan, Xinyu and Yang, Lixin and Zhao, Yifei and Mao, Kangrui and Xu, Hanlin and Lin, Zenan and Li, Kailin and Lu, Cewu , booktitle=CVPR, publisher=

[40] [40]

Liu, Yun and Yang, Haolin and Si, Xu and Liu, Lei and Li, Zipeng and Zhang, Yuxiang and Liu, Yebin and Yi, Li , booktitle=CVPR, publisher=

[41] [41]

Lv, Xintao and Xu, Liang and Yan, Yichao and Jin, Xin and Xu, Congsheng and Wu, Shuwen and Liu, Yifan and Li, Lincheng and Bi, Mengxiao and Zeng, Wenjun and Yang, Xiaokang , booktitle=ECCV, publisher=

[42] [42]

Kim, Jeonghwan and Kim, Jisoo and Na, Jeonghyeon and Joo, Hanbyul , booktitle=CVPR, publisher=

[43] [43]

Lu, Jiaxin and Huang, Chun-Hao Paul and Bhattacharya, Uttaran and Huang, Qixing and Zhou, Yi , booktitle=ICCV, publisher=

[44] [44]

Zhang, Juze and Zhang, Jingyan and Song, Zining and Shi, Zhanhe and Zhao, Chengfeng and Shi, Ye and Yu, Jingyi and Xu, Lan and Wang, Jingya , booktitle=CVPR, publisher=

[45] [45]

Zhao, Chengfeng and Zhang, Juze and Du, Jiashen and Shan, Ziwei and Wang, Junye and Yu, Jingyi and Wang, Jingya and Xu, Lan , booktitle=CVPR, publisher=

[46] [46]

, booktitle=ICCV, publisher=

Hassan, Mohamed and Choutas, Vasileios and Tzionas, Dimitrios and Black, Michael J. , booktitle=ICCV, publisher=. Resolving

[47] [47]

Wang, Zan and Chen, Yixin and Liu, Tengyu and Zhu, Yixin and Liang, Wei and Huang, Siyuan , booktitle=NEURIPS, publisher=

[48] [48]

Scaling Up Dynamic Human-Scene Interaction Modeling , author=

[49] [49]

Zhang, Mingyuan and Cai, Zhongang and Pan, Liang and Hong, Fangzhou and Guo, Xinying and Yang, Lei and Liu, Ziwei , journal=PAMI, publisher=

[50] [50]

Executing your Commands via Motion Diffusion in Latent Space , author=

[51] [51]

Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi , booktitle=CVPR, publisher=

[52] [52]

Jiang, Biao and Chen, Xin and Liu, Wen and Yu, Jingyi and Yu, Gang and Chen, Tao , booktitle=NEURIPS, publisher=

[53] [53]

Guo, Chuan and Mu, Yuxuan and Javed, Muhammad Gohar and Wang, Sen and Cheng, Li , booktitle=CVPR, publisher=

[54] [54]

Zhang, Mingyuan and Guo, Xinying and Pan, Liang and Cai, Zhongang and Hong, Fangzhou and Li, Huirong and Yang, Lei and Liu, Ziwei , booktitle=ICCV, publisher=

[55] [55]

Zhang, Mingyuan and Li, Huirong and Cai, Zhongang and Ren, Jiawei and Yang, Lei and Liu, Ziwei , booktitle=NEURIPS, publisher=

[56] [56]

Human Motion Diffusion as a Generative Prior , author=

[57] [57]

Wan, Weilin and Dou, Zhiyang and Komura, Taku and Wang, Wenping and Jayaraman, Dinesh and Liu, Lingjie , booktitle=ECCV, publisher=

[58] [58]

Pinyoanuntapong, Ekkasit and Saleem, Muhammad Usama and Karunratanakul, Korrawe and Wang, Pu and Xue, Hongfei and Chen, Chen and Guo, Chuan and Cao, Junli and Ren, Jian and Tulyakov, Sergey , booktitle=ICCV, publisher=

[59] [59]

Optimizing Diffusion Noise Can Serve As Universal Motion Priors , author=

[60] [60]

Tessler, Chen and Guo, Yunrong and Nabati, Ofir and Chechik, Gal and Peng, Xue Bin , journal=TOG, publisher=

[61] [61]

Hand-Object Contact Consistency Reasoning for Human Grasps Generation , author=

[62] [62]

Liu, Shaowei and Zhou, Yang and Yang, Jimei and Gupta, Saurabh and Wang, Shenlong , booktitle=ICCV, publisher=

[63] [63]

Tendulkar, Purva and Sur

[64] [64]

, booktitle=I3DV, publisher=

Taheri, Omid and Zhou, Yi and Tzionas, Dimitrios and Zhou, Yang and Ceylan, Duygu and Pirk, Soeren and Black, Michael J. , booktitle=I3DV, publisher=

[65] [65]

Zhou, Keyang and Bhatnagar, Bharat Lal and Lenssen, Jan Eric and Pons-Moll, Gerard , booktitle=ECCV, publisher=

[66] [66]

Zhang, Hui and Christen, Sammy and Fan, Zicong and Hilliges, Otmar and Song, Jie , booktitle=ECCV, publisher=

[67] [67]

Christen, Sammy and Hampali, Shreyas and Sener, Fadime and Remelli, Edoardo and Hodan, Tomas and Sauser, Eric and Ma, Shugao and Tekin, Bugra , booktitle=

[68] [68]

Wu, Qianyang and Shi, Ye and Huang, Xiaoshui and Yu, Jingyi and Xu, Lan and Wang, Jingya , journal=

[69] [69]

, journal=

Ron, Roey and Tevet, Guy and Sawdayee, Haim and Bermano, Amit H. , journal=

[70] [70]

Diffusion-Based Generation, Optimization, and Planning in

Huang, Siyuan and Wang, Zan and Li, Puhao and Jia, Baoxiong and Liu, Tengyu and Zhu, Yixin and Liang, Wei and Zhu, Song-Chun , booktitle=CVPR, publisher=. Diffusion-Based Generation, Optimization, and Planning in

[71] [71]

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance , author=

[72] [72]

SIGGRAPH Asia 2024 Conference Papers , publisher=

Autonomous Character-Scene Interaction Synthesis from Text Instruction , author=. SIGGRAPH Asia 2024 Conference Papers , publisher=

2024

[73] [73]

Generating Human Interaction Motions in Scenes with Text Control , author=

[74] [74]

Kulkarni, Nilesh and Rempe, Davis and Genova, Kyle and Kundu, Abhijit and Johnson, Justin and Fouhey, David and Guibas, Leonidas , booktitle=CVPR, publisher=

[75] [75]

Li, Lei and Dai, Angela , booktitle=CVPR, publisher=

[76] [76]

Xu, Sirui and Wang, Ziyin and Wang, Yu-Xiong and Gui, Liang-Yan , booktitle=NEURIPS, publisher=

[77] [77]

Li, Hongjie and Yu, Hong-Xing and Li, Jiaman and Wu, Jiajun , journal=

[78] [78]

Zhang, Jinlu and Chen, Yixin and Wang, Zan and Yang, Jie and Wang, Yizhou and Huang, Siyuan , booktitle=CVPR, publisher=

[79] [79]

Yuan, Ye and Song, Jiaming and Iqbal, Umar and Vahdat, Arash and Kautz, Jan , booktitle=ICCV, publisher=

[80] [80]

Wang, Yinhuai and Lin, Jing and Zeng, Ailing and Luo, Zhengyi and Zhang, Jian and Zhang, Lei , journal=