VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

Christian Theobalt; Lin Geng Foo; Rishabh Dabral; Thabo Beeler; Wanyue Zhang

arxiv: 2512.09646 · v2 · submitted 2025-12-10 · 💻 cs.CV

Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt

Pith reviewed 2026-05-16 23:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interaction · controllable video generation · motion densification · video diffusion model · sparse trajectories · HOI masks · body-part dynamics

The pith

VHOI converts sparse human trajectories into dense color-coded masks that condition a video diffusion model to generate controllable human-object interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage process for creating videos of humans interacting with objects from easy-to-provide sparse trajectory inputs. First, a novel motion representation densifies those trajectories into sequences of masks that use distinct colors to separate human motion, object motion, and even individual body-part dynamics. These masks then serve as conditioning signals for fine-tuning a video diffusion model. The approach seeks to deliver instance-specific realism without requiring users to supply costly dense controls like full 3D meshes or optical flow, while also supporting full-scene navigation that leads into the interactions.

Core claim

VHOI is a two-stage framework that first densifies sparse trajectories into HOI mask sequences via an HOI-aware motion representation using color encodings to distinguish human, object, and body-part-specific dynamics, then fine-tunes a video diffusion model conditioned on these masks to produce controllable, realistic human-object interaction videos, including full navigation sequences.

What carries the argument

The HOI-aware motion representation that applies color encodings to sparse trajectories to produce dense mask sequences distinguishing overall human motion, object motion, and body-part-specific dynamics for use as conditioning input.
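
As a rough illustration of the color-encoding idea only, the sketch below maps a per-pixel class-index mask to RGB colors through a small palette and back. It does not reproduce the learned densification step, and it is not the authors' implementation. The class indices, the RGB values, and the names `PALETTE`, `encode_mask`, and `decode_mask` are all hypothetical; the only detail taken from the paper is that the object class is drawn in light gray (Figure 9), and the real palette has 29 classes.

```python
# Hypothetical palette: class index -> RGB. Values are illustrative only.
PALETTE = {
    0: (0, 0, 0),        # background (assumed)
    1: (211, 211, 211),  # object, light gray as described for Figure 9
    2: (255, 0, 0),      # a human part, e.g. right hand (assumed color)
    3: (0, 0, 255),      # a human part, e.g. left hand (assumed color)
}


def encode_mask(label_map):
    """Turn a 2D grid of class indices into a 2D grid of RGB tuples."""
    return [[PALETTE[c] for c in row] for row in label_map]


def decode_mask(rgb_map):
    """Invert encode_mask; this works only because palette colors are distinct."""
    inverse = {rgb: c for c, rgb in PALETTE.items()}
    return [[inverse[p] for p in row] for row in rgb_map]


# One tiny 3x4 "frame": a hand region (2) touching an object region (1).
frame = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 2, 2, 0],
]
colored = encode_mask(frame)
```

Because every class has its own color, the encoding is lossless at the class level: `decode_mask(colored)` recovers `frame`. Whether a learned model keeps the colors this clean in overlapping regions is exactly the question raised under "Load-bearing premise".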

If this is right

Users can control HOI videos with simple trajectory sketches rather than expensive dense signals.
Generation extends naturally to complete scenes that include human navigation before the interaction occurs.
Body-part color distinctions improve the model's grasp of fine-grained dynamics like hand or foot movements during contact.
The same pipeline supports both isolated interaction clips and longer navigation-to-interaction sequences without separate modules.
Performance reaches state-of-the-art levels on controllable HOI video benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The color-based densification might extend to multi-person or multi-object scenes if the encoding scheme is expanded.
Real-time applications could arise by pairing the method with live skeleton tracking from cameras or wearables.
Efficiency gains over mesh-based methods could be quantified by measuring user effort versus output quality on the same tasks.
Testing on out-of-distribution objects or environments would reveal how much the human prior in the masks helps generalization.

Load-bearing premise

The color-encoding scheme will reliably turn sparse trajectories into clean, instance-specific masks that capture realistic interaction dynamics without introducing artifacts when fed to the diffusion model.

What would settle it

Generated videos showing motion artifacts, incorrect body-part interactions, or loss of object identity when sparse trajectories are provided for complex actions such as grasping or throwing.

Figures

Figures reproduced from arXiv: 2512.09646 by Christian Theobalt, Lin Geng Foo, Rishabh Dabral, Thabo Beeler, Wanyue Zhang.

**Figure 1.** The input to VHOI consists of a text prompt, an input image, an HOI mask image, and trajectories. We first convert the trajectories …

**Figure 2.** Motion Representation. We visualize two frames of colored sparse trajectories alongside the three intermediate motion representations studied in this work. (a) The sparse trajectory representation, where different colors denote different human parts or objects. (b) HOI masks (ours): constructed by combining object masks [63] with part-level human segmentation [34], each assigned a consistent color to enc…

**Figure 3.** The trajectory augmentor A receives sparse trajectories and the corresponding visibility maps (optional) as inputs. The trajectories are processed by a trajectory extractor and fused with transformer latents and visibility cues in the augmentor fuser, producing a sequence of HOI masks that densifies the sparse control signals, used in the dense control model D as shown in …

**Figure 4.** The dense control model D conditions on HOI masks. The masks are encoded by a HOI extractor and fused with transformer latents in the dense control fuser, which also includes a confidence prediction head to modulate reliance on the control signal. The final output is an HOI video that follows the densified motion cues. Orange modules denote learnable components; blue modules are frozen. (Best viewed with…

**Figure 5.** Qualitative comparisons of TORA-finetuned (TORA*), Go-with-the-Flow (Go-Flow), and our method alongside ground-truth …

**Figure 6.** Qualitative ablation of different motion representations. We compare augmentors trained on foreground optical flow, instance masks, and our proposed HOI masks. Flow-based conditioning lacks interaction semantics and fails to capture the grasp in this example. Instance-mask conditioning predicts the interaction but does not preserve object identity. Our HOI mask representation provides richer interaction …

**Figure 7.** The fuser is inserted between the AdaLN scale-and …

**Figure 10.** Prompt template used to process text prompts for aug…

**Figure 9.** HOI Masks Color Palette. We visualize the color encoding of the 29 classes, where each color corresponds to a distinct part. The color scheme for human parts follows Sapiens [34], and light gray is used for the object.

Original abstract

Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VHOI's color-encoded densification step is a reasonable engineering fix for turning sparse trajectories into usable conditioning for HOI video diffusion, but the abstract's SOTA claims rest on unshown experiments.

The main takeaway is that this paper gives a two-stage method: first convert sparse trajectories into dense HOI mask sequences via a color-encoded representation that tags human body parts separately from objects, then fine-tune a video diffusion model on those masks. This lets the authors handle both tight interactions and longer navigation sequences that lead into them, which addresses a real pain point in controllable video synthesis: sparse inputs are easy to specify but weak, and dense ones are informative but expensive to create.

The color encoding is the concrete new piece: it injects a human prior into the signal so the model can better separate instance dynamics without needing full 3D meshes or flow fields upfront. If the masks stay clean, that is a useful incremental improvement over plain keypoint conditioning. The approach builds directly on existing diffusion backbones, so the novelty sits in the representation and the densification pipeline rather than in a new model architecture. What the paper does well is keep the user input cheap while aiming for instance-specific realism, and the end-to-end navigation claim is a practical bonus if the results back it up.

The soft spot is the evidence base. The abstract asserts state-of-the-art performance with no numbers, baselines, ablations, or qualitative failure cases shown, so it is impossible to check from the abstract alone whether the color masks actually avoid crosstalk, quantization artifacts, or loss of fine motion in overlapping regions. The stress-test concern about channel issues in the encoding is worth verifying in the full experiments; if those sections are thin or rely on cherry-picked visuals, the central controllability claim weakens. No load-bearing circularity or invented entities appear in the description.

This is for people working on practical controllable video generation in vision and graphics who already use diffusion models and want better sparse-to-dense tricks. A reader focused on HOI or media production tools would get value from the representation idea. It deserves a serious referee because the method is grounded and the problem is well-motivated, even if the results need closer inspection to confirm the masks deliver reliable control.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VHOI, a two-stage framework for controllable video generation of human-object interactions from sparse trajectories. The first stage converts sparse trajectories into dense HOI mask sequences via a novel color-encoded motion representation that distinguishes human/object motion and body-part-specific dynamics, incorporating a human prior. The second stage fine-tunes a video diffusion model conditioned on these masks. The authors claim state-of-the-art results in controllable HOI video generation and demonstrate end-to-end generation of full human navigation leading to object interactions.

Significance. If the densification stage produces artifact-free, instance-specific masks that faithfully capture realistic HOI dynamics, the work would meaningfully address the sparse-vs-dense control trade-off in video synthesis by enabling easy-to-specify inputs to yield informative conditioning signals. The incorporation of body-part priors and the extension to navigation scenarios are positive aspects. The approach builds on existing diffusion models without introducing free parameters or circular derivations.

major comments (2)

[§3.2] HOI-aware motion representation: The central claim depends on the color encoding reliably producing dense masks without artifacts or loss of fine dynamics in overlapping regions. The manuscript does not provide quantitative validation (e.g., mask IoU or optical-flow consistency metrics) or ablation against non-color encodings to confirm this holds for instance-specific interactions.
[§5] Experiments: The SOTA performance assertion requires explicit comparison tables against recent baselines (e.g., trajectory-conditioned diffusion methods) with standard metrics such as FID, FVD, and controllability scores; without these, the claim that VHOI outperforms prior work on both interaction-only and navigation scenarios cannot be evaluated.
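
For readers unfamiliar with the mask IoU metric named in the first major comment, here is a minimal sketch of a per-class intersection-over-union between a predicted and a reference label map. It assumes masks have already been decoded from colors to integer class indices; the function name `mask_iou` and the toy data are hypothetical, and nothing here reflects an evaluation protocol from the paper.

```python
def mask_iou(pred, target, cls):
    """IoU of one class between two equally sized 2D label maps.

    Returns None when the class is absent from both maps, since the
    ratio is undefined in that case.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == cls
            in_target = t == cls
            inter += int(in_pred and in_target)
            union += int(in_pred or in_target)
    return inter / union if union else None


# Toy 2x3 label maps: 0 = background, 1 = object, 2 = a hand (assumed classes).
pred = [
    [0, 1, 1],
    [0, 2, 2],
]
target = [
    [0, 1, 0],
    [0, 2, 2],
]

object_iou = mask_iou(pred, target, 1)  # 1 shared pixel out of 2 in the union
hand_iou = mask_iou(pred, target, 2)    # identical regions
```

Reporting the score per class rather than as a single average would show whether small parts such as hands degrade more than the torso or the object, which is the kind of evidence the comment asks for.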

minor comments (2)

[Abstract] The abstract and §1 could more clearly state the exact input format of the sparse trajectories (e.g., 2D keypoints per frame) to help readers assess practicality.
[Figure 2] Figure 2 (motion representation) would benefit from explicit channel legends for the color encodings to illustrate how body-part distinctions are encoded.
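
To make the first minor comment concrete, the sketch below shows one plausible way a sparse-trajectory input could be laid out: a labeled track with one 2D point per frame and an optional per-frame visibility flag (Figure 3 mentions optional visibility maps). This layout is an assumption for illustration; the page does not state the paper's actual input format, and `SparseTrack`, `num_frames`, and the label strings are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SparseTrack:
    """One tracked point over time (hypothetical layout, not the paper's)."""
    label: str                             # e.g. "right_hand" or "object"
    points: List[Tuple[float, float]]      # one (x, y) pixel coordinate per frame
    visible: Optional[List[bool]] = None   # optional per-frame visibility

    def __post_init__(self):
        # Visibility, when given, must align frame-by-frame with the points.
        if self.visible is not None and len(self.visible) != len(self.points):
            raise ValueError("visible must have one entry per frame")


def num_frames(tracks):
    """Return the shared frame count, or raise if tracks disagree."""
    lengths = {len(t.points) for t in tracks}
    if len(lengths) != 1:
        raise ValueError("all tracks must cover the same number of frames")
    return lengths.pop()


tracks = [
    SparseTrack("right_hand", [(10.0, 20.0), (12.0, 21.0), (15.0, 23.0)]),
    SparseTrack("object", [(40.0, 22.0), (40.0, 22.0), (38.0, 22.0)],
                visible=[True, True, False]),
]
frames = num_frames(tracks)
```

Stating the format at this level of detail (coordinate space, per-frame density, how occlusion is signaled) is what would let a reader judge how cheap the input really is.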

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We agree with the identified gaps in quantitative validation and experimental comparisons, and we will revise the paper accordingly to strengthen these aspects while preserving the core contributions of the VHOI framework.

Referee: [§3.2] HOI-aware motion representation: The central claim depends on the color encoding reliably producing dense masks without artifacts or loss of fine dynamics in overlapping regions. The manuscript does not provide quantitative validation (e.g., mask IoU or optical-flow consistency metrics) or ablation against non-color encodings to confirm this holds for instance-specific interactions.

Authors: We agree that quantitative validation is needed to rigorously support the reliability of the color-encoded representation. In the revised manuscript, we will add a dedicated evaluation subsection reporting mask IoU and optical-flow consistency metrics computed on held-out test sequences. We will also include an ablation study comparing our color encoding against non-color alternatives (e.g., grayscale or channel-separated masks) to demonstrate its advantages for instance-specific HOI dynamics.
Revision: yes

Referee: [§5] Experiments: The SOTA performance assertion requires explicit comparison tables against recent baselines (e.g., trajectory-conditioned diffusion methods) with standard metrics such as FID, FVD, and controllability scores; without these, the claim that VHOI outperforms prior work on both interaction-only and navigation scenarios cannot be evaluated.

Authors: We acknowledge that the current experimental section would be strengthened by more comprehensive quantitative tables. In the revision, we will expand Section 5 with explicit comparison tables reporting FID, FVD, and controllability scores against recent trajectory-conditioned diffusion baselines. These tables will cover both interaction-only and navigation scenarios, with notes on any baseline limitations for the latter.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes a practical two-stage engineering pipeline (sparse trajectory densification via color-encoded HOI motion representation followed by conditioning a pre-existing video diffusion model) without any equations, derivations, or parameter-fitting steps that reduce to the inputs by construction. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises, and the method is presented as building directly on standard diffusion models with an added conditioning signal. The approach is self-contained and externally falsifiable via the reported experiments on controllability and navigation, yielding no circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that video diffusion models can be conditioned effectively on mask sequences and that the proposed color encoding injects useful human priors; no free parameters or invented entities are specified in the abstract.

axioms (1)

Domain assumption: Video diffusion models can be fine-tuned on dense mask sequences to achieve controllable generation of human-object interactions.
Role: central to the second stage of the proposed framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "VHOI consists of (1) a trajectory augmentor A that converts sparse trajectories ξ into dense HOI mask sequences M_hoi"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
cs.CV 2026-03 unverdicted novelty 6.0

A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] Rick Akkerman, Haiwen Feng, Michael J Black, Dimitrios Tzionas, and Victoria Fernández Abrevaya. Interdyn: Controllable interactive dynamics with video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[2] Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492, 2025.
[3] Ayce Idil Aytekin, Helge Rhodin, Rishabh Dabral, and Christian Theobalt. Follow my hold: Hand-object interaction reconstruction through geometric guidance. arXiv preprint arXiv:2508.18213, 2025.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450.
[5] Simon Baker, Daniel Scharstein, James P Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow. Int. J. Comput. Vis., 2011.
[6] Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[7] Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[8] Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow based video generative foundation models. arXiv preprint arXiv:2502.04896, 2025.
[9] Yixin Chen, Sai Kumar Dwivedi, Michael J. Black, and Dimitrios Tzionas. Detecting human-object contact in images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[10] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[11] Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Xu Chen, Muhammed Kocabas, Michael J Black, and Otmar Hilliges. Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[12] Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. In Int. Conf. Learn. Represent., 2025.
[13] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[14] Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[15] Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael J Black. Stochastic scene-aware motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[16] Wenkun He, Yun Liu, Ruitao Liu, and Li Yi. Syncdiff: Synchronized motion diffusion for multi-body human-object interaction synthesis. arXiv preprint arXiv:2412.20104, 2024.
[17] Wenkun He, Yun Liu, Ruitao Liu, and Li Yi. Syncdiff: Synchronized motion diffusion for multi-body human-object interaction synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[18] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
[19] Chen Hou and Zhibo Chen. Training-free camera control for video generation. In Int. Conf. Learn. Represent., 2025.
[20] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Int. Conf. Learn. Represent., 2022.
[21] Hezhen Hu, Weilun Wang, Wengang Zhou, and Houqiang Li. Hand-object interaction image generation. Adv. Neural Inform. Process. Syst., 2022.
[22] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[23] Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance. arXiv preprint arXiv:2502.06145, 2025.
[24] Xinting Hu, Haoran Wang, Jan Eric Lenssen, and Bernt Schiele. Personahoi: Effortlessly improving face personalization in human-object interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[25] Jiahui Huang, Yuhe Jin, Kwang Moo Yi, and Leonid Sigal. Layered controllable video generation. In Proceedings of the European Conference on Computer Vision (ECCV).
[26] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog...
[27] Ziyao Huang, Zixiang Zhou, Juan Cao, Yifeng Ma, Yi Chen, Zejing Rao, Zhiyong Xu, Hongmei Wang, Qin Lin, Yuan Zhou, Qinglin Lu, and Fan Tang. Hunyuanvideo-homa: Generic human-object interaction in multimodal driven human animation. arXiv preprint arXiv:2506.08797.
[28] Chaofan Huo, Ye Shi, and Jingya Wang. Monocular human-object reconstruction in the wild. In ACM Int. Conf. Multimedia, 2024.
[29] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[30] Sumit Jain and C Karen Liu. Interactive synthesis of human-object interaction. In ACM SIGGRAPH/Eurographics Symp. Computer Animation, 2009.
[31] Nan Jiang, Tengyu Liu, Zhexuan Cao, Jieming Cui, Zhiyuan Zhang, Yixin Chen, He Wang, Yixin Zhu, and Siyuan Huang. Full-body articulated human-object interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[32] Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. Flovd: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[33] Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint, 2024.
[34] Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. arXiv preprint arXiv:2408.12569, 2024.
[35] Taeksoo Kim and Hanbyul Joo. Target-aware video diffusion models. arXiv preprint arXiv:2503.18950, 2025.
[36] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[37] Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, and Leonidas Guibas. Nifty: Neural object interaction fields for guided human motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[38] Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, and Yang Liu. Efficient adaptive human-object interaction detection with concept-guided memory. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[39] Gen Li, Bo Zhao, Jianfei Yang, and Laura Sevilla-Lara. Mask2iv: Interaction-centric video generation via mask trajectories. arXiv preprint arXiv:2510.03135, 2025.
[40] Hongjie Li, Hong-Xing Yu, Jiaman Li, and Jiajun Wu. Zerohsi: Zero-shot 4d human-scene interaction by video generation. arXiv preprint arXiv:2412.18600, 2024.
[41]

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 2023.
[42]

Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. 2, 3
[43]

Lei Li and Angela Dai. GenZI: Zero-shot 3D human-scene interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[44]

Yichen Li and Antonio Torralba. Multimodal action conditioned video generation. arXiv preprint arXiv:2510.02287,
[45]

Zekun Li, Rui Zhou, Rahul Sajnani, Xiaoyan Cong, Daniel Ritchie, and Srinath Sridhar. GenHSI: Controllable generation of human-scene interaction videos. arXiv preprint arXiv:2506.19840, 2025. 1, 3, 6
[46]

Yue Liao, Aixi Zhang, Miao Lu, Yongliang Wang, Xiaobo Li, and Si Liu. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3
[47]

Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, and Wu Liu. Hoigen-1m: A large-scale dataset for human-object interaction video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3, 6
[48]

Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2, 3
[49]

Yun Liu, Bowen Yang, Licheng Zhong, He Wang, and Li Yi. Mimicking-bench: A benchmark for generalizable humanoid-scene interaction learning via human mimicking. arXiv preprint arXiv:2412.17730, 2024. 3
[50]

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024. 2, 3
[51]

Yun Liu, Chengwen Zhang, Ruofan Xing, Bingda Tang, Bowen Yang, and Li Yi. Core4d: A 4d human-object-human interaction dataset for collaborative object rearrangement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
[52]

Wan-Duo Kurt Ma, J. P. Lewis, and W. Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation, 2023. 2

[53]

Yifang Men, Yuan Yao, Miaomiao Cui, and Bo Liefeng. Mimo: Controllable character video synthesis with spatial decomposed modeling. arXiv preprint arXiv:2409.16160,
[54]

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Association for the Advancement of Artificial Intelligence, 2024. 2
[55]

Supreeth Narasimhaswamy, Trung Nguyen, and Minh Hoai Nguyen. Detecting hands and recognizing physical contact in the wild. Adv. Neural Inform. Process. Syst., 2020. 2, 6
[56]

Shan Ning, Longtian Qiu, Yongfei Liu, and Xuming He. Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
[57]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 3
[58]

Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024. 2
[59]

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Association for the Advancement of Artificial Intelligence, 2018. 4
[60]

Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, and Ziwei Liu. Freetraj: Tuning-free trajectory control in video diffusion models, 2024. 2
[61]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning,
[62]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. 7
[63]

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks,
[64]

Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, and Gordon Wetzstein. Isa4d: Interspatial attention for efficient 4d human video generation. ACM Transactions on Graphics (TOG), 2025. 2
[65]

Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. SIGGRAPH Conf. Pap., 2024. 2, 3, 1
[66]

Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG), 2019. 3
[67]

Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, and Nanxuan Zhao. Multicoin: Multi-modal controllable video inbetweening. arXiv preprint arXiv:2510.08561, 2025. 2
[68]

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), 2020. 4, 7, 8
[69]

Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. In SIGGRAPH Conf. Pap., 2025. 2
[70]

Oytun Ulutan, ASM Iftekhar, and Bangalore S Manjunath. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 3
[71]

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6
[72]

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines,
[73]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang,... Wan: Open and Advanced Large-Scale Video Generative Models.
[74]

Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma. ATI: Any trajectory instruction for controllable video generation. arXiv preprint, 2025. 2, 3
[75]

Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation. In SIGGRAPH Conf. Pap., 2025. 2
[76]

Shibo Wang, Haonan He, Maria Parelli, Christoph Gebhardt, Zicong Fan, and Jie Song. Magichoi: Leveraging 3d priors for accurate hand-object reconstruction from short monocular video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2, 3
[77]

Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Adv. Neural Inform. Process. Syst., 2023. 2
[78]

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH Conf. Pap., 2024. 6
[79]

Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, and Dongjiang Li. End-to-end hoi reconstruction transformer with graph-based encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
[80]

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control. arXiv preprint arXiv:2410.13830, 2024. 2


[1]

Rick Akkerman, Haiwen Feng, Michael J Black, Dimitrios Tzionas, and Victoria Fernández Abrevaya. Interdyn: Controllable interactive dynamics with video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 2, 3

[2]

Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492, 2025. 2

[3]

Ayce Idil Aytekin, Helge Rhodin, Rishabh Dabral, and Christian Theobalt. Follow my hold: Hand-object interaction reconstruction through geometric guidance. arXiv preprint arXiv:2508.18213, 2025. 2, 3

[4]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

[5]

Simon Baker, Daniel Scharstein, James P Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow. Int. J. Comput. Vis., 2011. 3

[6]

Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 6

[7]

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 6, 7, 8

[8]

Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow based video generative foundation models. arXiv preprint arXiv:2502.04896, 2025. 2, 3

[9]

Yixin Chen, Sai Kumar Dwivedi, Michael J. Black, and Dimitrios Tzionas. Detecting human-object contact in images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[10]

Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

[11]

Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Xu Chen, Muhammed Kocabas, Michael J Black, and Otmar Hilliges. Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

[12]

Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. In Int. Conf. Learn. Represent., 2025. 2

[13]

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2

[14]

Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 3

[15]

Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael J Black. Stochastic scene-aware motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 3

[16]

Wenkun He, Yun Liu, Ruitao Liu, and Li Yi. Syncdiff: Synchronized motion diffusion for multi-body human-object interaction synthesis. arXiv preprint arXiv:2412.20104, 2024.

[17]

Wenkun He, Yun Liu, Ruitao Liu, and Li Yi. Syncdiff: Synchronized motion diffusion for multi-body human-object interaction synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 3

[18]

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 2, 3

[19]

Chen Hou and Zhibo Chen. Training-free camera control for video generation. In Int. Conf. Learn. Represent., 2025. 2

[20]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Int. Conf. Learn. Represent., 2022. 2

[21]

Hezhen Hu, Weilun Wang, Wengang Zhou, and Houqiang Li. Hand-object interaction image generation. Adv. Neural Inform. Process. Syst., 2022. 3

[22]

Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

[23]

Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, and Liefeng Bo. Animate anyone 2: High-fidelity character image animation with environment affordance. arXiv preprint arXiv:2502.06145, 2025. 2

[24]

Xinting Hu, Haoran Wang, Jan Eric Lenssen, and Bernt Schiele. Personahoi: Effortlessly improving face personalization in human-object interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

[25]

Jiahui Huang, Yuhe Jin, Kwang Moo Yi, and Leonid Sigal. Layered controllable video generation. In Proceedings of the European Conference on Computer Vision (ECCV),

[26]

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog...

[27]

Ziyao Huang, Zixiang Zhou, Juan Cao, Yifeng Ma, Yi Chen, Zejing Rao, Zhiyong Xu, Hongmei Wang, Qin Lin, Yuan Zhou, Qinglin Lu, and Fan Tang. Hunyuanvideo-homa: Generic human-object interaction in multimodal driven human animation. arXiv preprint arXiv:2506.08797,

[28]

Chaofan Huo, Ye Shi, and Jingya Wang. Monocular human-object reconstruction in the wild. In ACM Int. Conf. Multimedia, 2024. 3

[29]

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015. 6

[30]

Sumit Jain and C Karen Liu. Interactive synthesis of human-object interaction. In ACM SIGGRAPH/Eurographics Symp. Computer Animation, 2009. 3

[31]

Nan Jiang, Tengyu Liu, Zhexuan Cao, Jieming Cui, Zhiyuan Zhang, Yixin Chen, He Wang, Yixin Zhu, and Siyuan Huang. Full-body articulated human-object interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 3

[32]

Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho. Flovd: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2

[33]

Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint, 2024. 6, 8

[34] [34]

arXiv preprint arXiv:2408.12569 , year=

Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vi- sion models.arXiv preprint arXiv:2408.12569, 2024. 2, 3, 4, 6, 7, 8

work page arXiv 2024

[35] [35]

arXiv preprint arXiv:2503.18950 (2025)

Taeksoo Kim and Hanbyul Joo. Target-aware video diffu- sion models.arXiv preprint arXiv:2503.18950, 2025. 1, 2, 3, 6

work page arXiv 2025

[36] [36]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jian- wei Zhang, et al. Hunyuanvideo: A systematic frame- work for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Nifty: Neural object interaction fields for guided human motion synthesis

Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhi- jit Kundu, Justin Johnson, David Fouhey, and Leonidas Guibas. Nifty: Neural object interaction fields for guided human motion synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[38] [38]

Efficient adaptive human-object inter- action detection with concept-guided memory

Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, and Yang Liu. Efficient adaptive human-object inter- action detection with concept-guided memory. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV), 2023

work page 2023

[39] [39]

Mask2iv: Interaction-centric video generation via mask tra- jectories.arXiv preprint arXiv:2510.03135, 2025

Gen Li, Bo Zhao, Jianfei Yang, and Laura Sevilla-Lara. Mask2iv: Interaction-centric video generation via mask tra- jectories.arXiv preprint arXiv:2510.03135, 2025. 2

work page arXiv 2025

[40] [40]

Ze- rohsi: Zero-shot 4d human-scene interaction by video gen- eration

Hongjie Li, Hong-Xing Yu, Jiaman Li, and Jiajun Wu. Ze- rohsi: Zero-shot 4d human-scene interaction by video gen- eration.arXiv preprint arXiv:2412.18600, 2024. 3

work page arXiv 2024

[41] [41]

Object motion guided human motion synthesis.ACM Transactions on Graphics (TOG), 2023

Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis.ACM Transactions on Graphics (TOG), 2023

work page 2023

[42] [42]

Controllable human-object interaction synthesis

Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. InProceedings of the European Con- ference on Computer Vision (ECCV), 2024. 2, 3

work page 2024

[43] [43]

GenZI: Zero-shot 3D human-scene interaction generation

Lei Li and Angela Dai. GenZI: Zero-shot 3D human-scene interaction generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[44] [44]

Multimodal action condi- tioned video generation.arXiv preprint arXiv:2510.02287,

Yichen Li and Antonio Torralba. Multimodal action condi- tioned video generation.arXiv preprint arXiv:2510.02287,

work page arXiv

[45] [45]

GenHSI: Controllable Generation of Human-Scene Interaction Videos

Zekun Li, Rui Zhou, Rahul Sajnani, Xiaoyan Cong, Daniel Ritchie, and Srinath Sridhar. Genhsi: Controllable gen- eration of human-scene interaction videos.arXiv preprint arXiv:2506.19840, 2025. 1, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection

Yue Liao, Aixi Zhang, Miao Lu, Yongliang Wang, Xiaobo Li, and Si Liu. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), 2022. 3

work page 2022

[47] [47]

Hoigen-1m: A large- scale dataset for human-object interaction video generation

Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, and Wu Liu. Hoigen-1m: A large- scale dataset for human-object interaction video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3, 6

work page 2025

[48] [48]

Hoi4d: A 4d egocentric dataset for category-level human-object interaction

Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2, 3

work page 2022

[49] [49]

Mimicking-bench: A benchmark for generalizable humanoid-scene interaction learning via human mimicking

Yun Liu, Bowen Yang, Licheng Zhong, He Wang, and Li Yi. Mimicking-bench: A benchmark for generalizable humanoid-scene interaction learning via human mimicking. arXiv preprint arXiv:2412.17730, 2024. 3

work page arXiv 2024

[50] [50]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technol- ogy, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Core4d: A 4d human-object- human interaction dataset for collaborative object rear- rangement

Yun Liu, Chengwen Zhang, Ruofan Xing, Bingda Tang, Bowen Yang, and Li Yi. Core4d: A 4d human-object- human interaction dataset for collaborative object rear- rangement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

work page 2025

[52] [52]

Wan-Duo Kurt Ma, J. P. Lewis, and W. Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation, 2023. 2

work page 2023

[53] [53]

Mimo: Controllable character video synthesis with spatial decomposed modeling.arXiv preprint arXiv:2409.16160,

Yifang Men, Yuan Yao, Miaomiao Cui, and Bo Liefeng. Mimo: Controllable character video synthesis with spatial decomposed modeling.arXiv preprint arXiv:2409.16160,

work page arXiv

[54] [54]

T2i-adapter: Learn- ing adapters to dig out more controllable ability for text-to- image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learn- ing adapters to dig out more controllable ability for text-to- image diffusion models. InAssociation for the Advance- ment of Artificial Intelligence, 2024. 2

work page 2024

[55] [55]

Detecting hands and recognizing physical contact in the wild.Adv

Supreeth Narasimhaswamy, Trung Nguyen, and Minh Hoai Nguyen. Detecting hands and recognizing physical contact in the wild.Adv. Neural Inform. Process. Syst., 2020. 2, 6

work page 2020

[56] [56]

Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models

Shan Ning, Longtian Qiu, Yongfei Liu, and Xuming He. Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

work page 2023

[57] [57]

Scalable diffusion mod- els with transformers

William Peebles and Saining Xie. Scalable diffusion mod- els with transformers. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2023. 3

work page 2023

[58] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024. 2

[59] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Association for the Advancement of Artificial Intelligence, 2018. 4

[60] Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, and Ziwei Liu. Freetraj: Tuning-free trajectory control in video diffusion models, 2024. 2

[61] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning,

[62] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. 7

[63] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks,

[64] Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, and Gordon Wetzstein. Isa4d: Interspatial attention for efficient 4d human video generation. ACM Transactions on Graphics (TOG), 2025. 2

[65] Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. SIGGRAPH Conf. Pap., 2024. 2, 3, 1

[66] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG), 2019. 3

[67] Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, and Nanxuan Zhao. Multicoin: Multi-modal controllable video inbetweening. arXiv preprint arXiv:2510.08561, 2025. 2

[68] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), 2020. 4, 7, 8

[69] Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. In SIGGRAPH Conf. Pap., 2025. 2

[70] Oytun Ulutan, ASM Iftekhar, and Bangalore S Manjunath. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 3

[71] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

[72] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines,

[73] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang,... Wan: Open and advanced large-scale video generative models. arXiv, 2025.

[74] Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma. ATI: Any trajectory instruction for controllable video generation. arXiv preprint, 2025. 2, 3

[75] Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation. In SIGGRAPH Conf. Pap., 2025. 2

[76] Shibo Wang, Haonan He, Maria Parelli, Christoph Gebhardt, Zicong Fan, and Jie Song. Magichoi: Leveraging 3d priors for accurate hand-object reconstruction from short monocular video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2, 3

[77] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Adv. Neural Inform. Process. Syst., 2023. 2

[78] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH Conf. Pap., 2024. 6

[79] Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, and Dongjiang Li. End-to-end hoi reconstruction transformer with graph-based encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

[80] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control. arXiv preprint arXiv:2410.13830, 2024. 2