VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification
Pith reviewed 2026-05-16 23:27 UTC · model grok-4.3
The pith
VHOI converts sparse trajectories into dense color-coded masks that condition a video diffusion model to generate controllable human-object interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VHOI is a two-stage framework that first densifies sparse trajectories into HOI mask sequences via an HOI-aware motion representation using color encodings to distinguish human, object, and body-part-specific dynamics, then fine-tunes a video diffusion model conditioned on these masks to produce controllable, realistic human-object interaction videos, including full navigation sequences.
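The two stages can be written compactly. The symbols A, ξ, and M_hoi follow the paper passage quoted further down this page; the generator G_θ, the video x, and the extra conditioning c are notation assumed here for illustration, since this summary does not specify what other inputs the diffusion model takes.

```latex
% Stage 1: densify sparse trajectories into an HOI mask sequence
M_{\mathrm{hoi}} = \mathcal{A}(\xi)

% Stage 2: sample a video from the fine-tuned diffusion model
% (G_\theta, x, and c are assumed notation, not the paper's)
x \sim G_\theta\left(\,\cdot \mid M_{\mathrm{hoi}},\, c\right)
```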
What carries the argument
The HOI-aware motion representation that applies color encodings to sparse trajectories to produce dense mask sequences distinguishing overall human motion, object motion, and body-part-specific dynamics for use as conditioning input.
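To make the idea of a color-coded conditioning signal concrete, the following is a toy sketch, not the paper's densification stage, whose internals this summary does not describe. It rasterizes each sparse point as a filled disk in a label-specific color. The palette, the disk radius, and every function name are hypothetical.

```python
# Toy illustration only: rasterize sparse 2D trajectories into color-coded mask frames.
# PALETTE, the disk radius, and all names here are hypothetical, not taken from the paper.

PALETTE = {
    "object": (255, 0, 0),
    "torso": (0, 255, 0),
    "left_hand": (0, 0, 255),
    "right_hand": (255, 255, 0),
}
BACKGROUND = (0, 0, 0)


def render_mask_frame(points, height, width, radius=2):
    """Paint one filled disk per labeled point; later labels overwrite earlier ones."""
    frame = [[BACKGROUND for _ in range(width)] for _ in range(height)]
    for label, (cx, cy) in points.items():
        color = PALETTE[label]
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    frame[y][x] = color
    return frame


def densify(trajectories, height, width, radius=2):
    """trajectories: {label: [(x, y), ...]} with one point per frame for every label."""
    num_frames = len(next(iter(trajectories.values())))
    return [
        render_mask_frame({k: v[t] for k, v in trajectories.items()}, height, width, radius)
        for t in range(num_frames)
    ]


if __name__ == "__main__":
    traj = {"object": [(8, 8), (9, 8)], "right_hand": [(2, 8), (5, 8)]}
    masks = densify(traj, height=16, width=16)
    print(len(masks), masks[0][8][8], masks[1][8][5])
```

Where disks overlap, the label listed later in the dictionary wins. That is an arbitrary choice made for this sketch; how overlapping regions are actually resolved is one of the points the referee report below asks the authors to validate.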
If this is right
- Users can control HOI videos with simple trajectory sketches rather than expensive dense signals.
- Generation extends naturally to complete scenes that include human navigation before the interaction occurs.
- Body-part color distinctions improve the model's grasp of fine-grained dynamics like hand or foot movements during contact.
- The same pipeline supports both isolated interaction clips and longer navigation-to-interaction sequences without separate modules.
- The abstract reports state-of-the-art results in controllable HOI video generation.
Where Pith is reading between the lines
- The color-based densification might extend to multi-person or multi-object scenes if the encoding scheme is expanded.
- Real-time applications could arise by pairing the method with live skeleton tracking from cameras or wearables.
- Efficiency gains over mesh-based methods could be quantified by measuring user effort versus output quality on the same tasks.
- Testing on out-of-distribution objects or environments would reveal how much the human prior in the masks helps generalization.
Load-bearing premise
The color-encoding scheme will reliably turn sparse trajectories into clean, instance-specific masks that capture realistic interaction dynamics without introducing artifacts when fed to the diffusion model.
What would settle it
Generated videos showing motion artifacts, incorrect body-part interactions, or loss of object identity when sparse trajectories are provided for complex actions such as grasping or throwing.
Original abstract
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents VHOI, a two-stage framework for controllable video generation of human-object interactions from sparse trajectories. The first stage converts sparse trajectories into dense HOI mask sequences via a novel color-encoded motion representation that distinguishes human/object motion and body-part-specific dynamics, incorporating a human prior. The second stage fine-tunes a video diffusion model conditioned on these masks. The authors claim state-of-the-art results in controllable HOI video generation and demonstrate end-to-end generation of full human navigation leading to object interactions.
Significance. If the densification stage produces artifact-free, instance-specific masks that faithfully capture realistic HOI dynamics, the work would meaningfully address the sparse-vs-dense control trade-off in video synthesis by enabling easy-to-specify inputs to yield informative conditioning signals. The incorporation of body-part priors and the extension to navigation scenarios are positive aspects. The approach builds on existing diffusion models without introducing free parameters or circular derivations.
major comments (2)
- §3.2 (HOI-aware motion representation): The central claim depends on the color encoding reliably producing dense masks without artifacts or loss of fine dynamics in overlapping regions. The manuscript does not provide quantitative validation (e.g., mask IoU or optical-flow consistency metrics) or ablation against non-color encodings to confirm this holds for instance-specific interactions.
- §5 (Experiments): The SOTA performance assertion requires explicit comparison tables against recent baselines (e.g., trajectory-conditioned diffusion methods) with standard metrics such as FID, FVD, and controllability scores; without these, the claim that VHOI outperforms prior work on both interaction-only and navigation scenarios cannot be evaluated.
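The mask IoU that the first comment asks for can be sketched as follows. This is a minimal illustration under stated assumptions: masks are represented as rows of class labels rather than RGB colors, and a class absent from both masks is scored as a perfect match. Neither convention comes from the paper, and the function names are hypothetical.

```python
# Sketch of per-class mask IoU for label-coded masks.
# Representing a mask as rows of class labels is an assumption made for this example.

def mask_iou(pred, target, label):
    """IoU of the pixels carrying `label` in two equally sized label grids."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == label
            in_target = t == label
            intersection += int(in_pred and in_target)
            union += int(in_pred or in_target)
    # Convention chosen here: a class absent from both masks counts as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_sequence_iou(pred_seq, target_seq, label):
    """Average the per-frame IoU over a mask sequence."""
    scores = [mask_iou(p, t, label) for p, t in zip(pred_seq, target_seq)]
    return sum(scores) / len(scores)
```

Reporting the score per class (object, hand, torso, and so on) rather than as a single number would show whether the body-part distinctions survive densification, which is the property the comment is probing.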
minor comments (2)
- Abstract: The abstract and §1 could state the exact input format of the sparse trajectories (e.g., 2D keypoints per frame) more clearly, to help readers assess practicality.
- Figure 2 (pipeline overview): Explicit channel legends for the color encodings would show how body-part distinctions are encoded.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We agree with the identified gaps in quantitative validation and experimental comparisons, and we will revise the paper accordingly to strengthen these aspects while preserving the core contributions of the VHOI framework.
Point-by-point responses
- Referee: §3.2 (HOI-aware motion representation): The central claim depends on the color encoding reliably producing dense masks without artifacts or loss of fine dynamics in overlapping regions. The manuscript does not provide quantitative validation (e.g., mask IoU or optical-flow consistency metrics) or ablation against non-color encodings to confirm this holds for instance-specific interactions.
  Authors: We agree that quantitative validation is needed to rigorously support the reliability of the color-encoded representation. In the revised manuscript, we will add a dedicated evaluation subsection reporting mask IoU and optical-flow consistency metrics computed on held-out test sequences. We will also include an ablation study comparing our color encoding against non-color alternatives (e.g., grayscale or channel-separated masks) to demonstrate its advantages for instance-specific HOI dynamics.
  Revision: yes
- Referee: §5 (Experiments): The SOTA performance assertion requires explicit comparison tables against recent baselines (e.g., trajectory-conditioned diffusion methods) with standard metrics such as FID, FVD, and controllability scores; without these, the claim that VHOI outperforms prior work on both interaction-only and navigation scenarios cannot be evaluated.
  Authors: We acknowledge that the current experimental section would be strengthened by more comprehensive quantitative tables. In the revision, we will expand Section 5 with explicit comparison tables reporting FID, FVD, and controllability scores against recent trajectory-conditioned diffusion baselines. These tables will cover both interaction-only and navigation scenarios, with notes on any baseline limitations for the latter.
  Revision: yes
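For readers unfamiliar with the metrics named in the second exchange: FID and FVD are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples. FID uses image features and FVD uses video features. The subscripts r (real) and g (generated) are notation chosen here.

```latex
d^2\big(\mathcal{N}(\mu_r, \Sigma_r),\, \mathcal{N}(\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

Lower is better for both. Neither metric measures whether the generated motion follows the input trajectories, which is why the referee asks for controllability scores alongside them.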
Circularity Check
No significant circularity in derivation chain
The paper describes a practical two-stage engineering pipeline (sparse trajectory densification via color-encoded HOI motion representation followed by conditioning a pre-existing video diffusion model) without any equations, derivations, or parameter-fitting steps that reduce to the inputs by construction. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises, and the method is presented as building directly on standard diffusion models with an added conditioning signal. The approach is self-contained and externally falsifiable via the reported experiments on controllability and navigation, yielding no circularity under the specified criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Video diffusion models can be fine-tuned on dense mask sequences to achieve controllable generation of human-object interactions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: VHOI consists of (1) a trajectory augmentor A that converts sparse trajectories ξ into dense HOI mask sequences M_hoi
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints: A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.