arxiv: 2606.30308 · v1 · pith:O4SPVR7Znew · submitted 2026-06-29 · 💻 cs.CV

The Surprising Effectiveness of Video Diffusion Models for Hand Motion Reconstruction

Pith reviewed 2026-06-30 05:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords video diffusion models · hand motion reconstruction · egocentric video · 4D hand pose · occlusion reasoning · hand-object interaction · embodied AI

The pith

Video diffusion models enable accurate 4D two-hand pose reconstruction from egocentric video by adapting their learned representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing image-based and video-based methods for 4D hand motion reconstruction struggle under occlusion and with hand-object interactions because they depend on detectors or temporal modules trained only on scarce pose annotations. Video generative models trained at internet scale must develop occlusion reasoning, motion dynamics, and interaction modeling to synthesize coherent video, and this paper shows those capabilities transfer to hand reconstruction. ViDiHand adapts a pretrained video diffusion model with a hand-overlay rendering objective that specializes its features for hands while retaining general priors, then decodes metric-scale poses directly from full frames. The approach requires no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D it substantially outperforms prior methods and points toward scalable collection of in-the-wild hand data for embodied AI.

Core claim

The paper establishes that the representations inside a pretrained video diffusion model, when specialized via a hand-overlay rendering objective, support direct reconstruction of metric-scale 4D two-hand pose from egocentric video. This pipeline operates on full frames and substantially outperforms prior detector-dependent or annotation-limited methods on ARCTIC, HOT3D, and HOI4D, demonstrating that the implicit world knowledge acquired during large-scale video synthesis can be leveraged for hand motion tasks.

What carries the argument

The hand-overlay rendering objective that adapts features from a pretrained video diffusion model for hands while preserving its world priors, followed by a decoder that extracts metric-scale 4D pose.
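
The summary does not spell out how the hand-overlay target frames are produced. As a minimal sketch, assuming a per-frame binary hand silhouette is already available (for example from projecting an annotated hand mesh), one plausible construction alpha-blends a solid tint over the hand pixels. The function names, tint color, and `alpha` value below are hypothetical and are not taken from the paper.

```python
import numpy as np

def overlay_hand_mask(frame, hand_mask, color=(0.0, 1.0, 0.0), alpha=0.6):
    """Alpha-blend a solid tint onto `frame` wherever `hand_mask` is True.

    frame:     float array, shape (H, W, 3), values in [0, 1]
    hand_mask: bool array, shape (H, W); assumed to come from a rendered hand
    color, alpha: illustrative choices, not values from the paper
    """
    frame = np.asarray(frame, dtype=np.float32)
    mask = np.asarray(hand_mask, dtype=bool)[..., None]  # (H, W, 1) for broadcasting
    tint = np.asarray(color, dtype=np.float32)           # (3,)
    blended = (1.0 - alpha) * frame + alpha * tint
    # Keep the original pixel outside the hand, the blended pixel inside it.
    return np.where(mask, blended, frame)

def overlay_clip(frames, masks, **kwargs):
    """Apply the overlay to every frame of a clip; returns shape (T, H, W, 3)."""
    return np.stack([overlay_hand_mask(f, m, **kwargs) for f, m in zip(frames, masks)])

# Toy clip: two black 4x4 frames with a 2x2 "hand" region in the middle.
frames = np.zeros((2, 4, 4, 3), dtype=np.float32)
masks = np.zeros((2, 4, 4), dtype=bool)
masks[:, 1:3, 1:3] = True
targets = overlay_clip(frames, masks)
```

Under this assumption, a model fine-tuned to generate `targets` from the raw frames has to localize the hands in every frame, including occluded ones, which is the kind of specialization the overlay objective is described as providing.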

If this is right

  • Hand reconstruction becomes possible directly from full frames without separate detection or inpainting stages.
  • Occlusion reasoning and hand-object interaction modeling improve because they draw on priors learned from internet-scale video synthesis.
  • Test-time optimization is unnecessary, allowing efficient inference on new sequences.
  • Large-scale in-the-wild hand motion data can be collected more scalably to support embodied AI training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptation strategy could be tested on full-body pose or other articulated objects where video synthesis priors are likely to encode useful 3D structure.
  • Performance may continue to improve with larger or more diverse pretrained video diffusion models, suggesting a scaling route for perception tasks.
  • The method implies that generative video models already encode extractable 3D and interaction knowledge that future work could probe on additional downstream benchmarks.

Load-bearing premise

The representations acquired by a video diffusion model during general video synthesis can be specialized for hands through a hand-overlay rendering objective without losing the priors needed for accurate metric-scale pose recovery.

What would settle it

An experiment in which ViDiHand fails to outperform prior methods on the reported metrics for ARCTIC, HOT3D, or HOI4D would falsify the central performance claim.
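
The text above does not name the reported metrics. As an illustration only, assuming a joint-position error such as mean per-joint position error (MPJPE, a common choice in hand pose work) is among them, a comparison between two methods on the same sequence could be scored as follows. The function name and data layout are hypothetical.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints.

    pred, gt: sequences of frames; each frame is a sequence of (x, y, z)
    joint positions in the same metric units and the same joint order.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of frames")
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        if len(pred_frame) != len(gt_frame):
            raise ValueError("each frame must contain the same number of joints")
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count

# Toy example: one frame, two joints, errors of 3 and 4 units.
gt = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]]
score = mpjpe(pred, gt)
```

The falsification test then reduces to checking whether a baseline's score is lower than ViDiHand's on the same benchmark split and metric.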

Figures

Figures reproduced from arXiv: 2606.30308 by Chengkai Jin, Siyuan Huang, Tianyi Wei, Wenqi Ouyang, Xingang Pan, Yufei Liu, Yuxi Wang, Zhiqi Shen, Zhiwei Zeng.

Figure 1: ViDiHand satisfies all three target properties of 4D hand recovery. On the same egocentric input clip, the per-image baseline WiLoR [19] is sensitive to detection dropouts and suffers from frame-wise pose flicker; the temporal baseline OmniHands [14] reduces flicker through cross-frame attention but still struggles under heavy occlusion and large hand-object motion. ViDiHand extracts features from a hand-a…
Figure 2: ViDiHand pipeline. Top: the VACE branch is finetuned with hand-overlay rendering while the base DiT is frozen, producing a hand-aware video diffusion model. Middle: the dual-branch decoder reads from a single L⋆=15, τ⋆≈0.7 activation: a hand-token branch produces slot-aware summaries for articulated MANO pose, a parallel joint-heatmap branch produces 2D anchors and pooled descriptors for in-plane coordin…
Figure 3: Qualitative comparison on ARCTIC and HOT3D under severe occlusion. Top: one hand fully occluded behind a box. Middle: both hands partially occluded by manipulated objects and the image boundary. Bottom: one hand severely occluded by a bowl.
Figure 4: Qualitative comparison on in-the-wild egocentric video. Top: severe occlusion by a towel and a jar. Middle: top-down camera with one hand reaching into a shelf and the other hanging at the side. Bottom: single-hand scene with grating-like shadows; many baselines hallucinate a second hand (blue) overlapping with the visible one.
Figure 5: Qualitative comparison on ARCTIC.
Figure 6: In-the-wild qualitative comparison on OakInk2 [35]. No ground-truth MANO is available.
Figure 7: Qualitative comparison on HOT3D.
Figure 8: Qualitative comparison on HOT3D.
Figure 9: Qualitative comparison on HOT3D. Note that the ground-truth annotations in this case are inaccurate.
Figure 10: Qualitative comparison on HOT3D.
Figure 11: Qualitative comparison on HOT3D.
Figure 12: Qualitative comparison on HOT3D.
Figure 13: Qualitative comparison on ARCTIC.
Figure 14: Qualitative comparison on ARCTIC.
Figure 15: In-the-wild qualitative on Xperience-10m [22]. No ground-truth MANO is available.
Figure 16: In-the-wild qualitative comparison. No ground-truth MANO is available.
Figure 17: In-the-wild qualitative comparison on HOI4D. No ground-truth MANO is available.
Original abstract

4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes ViDiHand, which adapts a pretrained video diffusion model via a hand-overlay rendering objective to reconstruct 4D two-hand pose from egocentric video. It claims that internet-scale video diffusion training implicitly acquires occlusion reasoning, motion dynamics, and hand-object interaction priors that can be specialized for hands without erasure, enabling a detector-free, optimization-free pipeline that substantially outperforms prior methods on ARCTIC, HOT3D, and HOI4D.

Significance. If validated, the result would establish video diffusion models as a foundation for hand motion reconstruction and address two key bottlenecks: detector-dependent image pipelines and annotation-scarce video methods. The direct full-frame operation and the evaluation on hand-object interaction datasets are strengths. The approach also suggests a scalable route to in-the-wild data collection for embodied AI.

major comments (1)
  1. [Experiments] Experiments/Results sections: No ablation is presented that trains a decoder on the identical hand-overlay rendering objective but using features from a non-diffusion video backbone (or a randomly initialized encoder). This comparison is load-bearing for the central claim that diffusion priors survive adaptation and drive the outperformance, as opposed to the objective or decoder alone producing the gains.
minor comments (1)
  1. [Abstract] Abstract: The claim of 'substantial outperformance' is stated without any quantitative metrics, baseline names, or error values, which delays assessment of the result magnitude.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive comment. We address it directly below and commit to strengthening the manuscript accordingly.

Point-by-point responses
  1. Referee: [Experiments] Experiments/Results sections: No ablation is presented that trains a decoder on the identical hand-overlay rendering objective but using features from a non-diffusion video backbone (or a randomly initialized encoder). This comparison is load-bearing for the central claim that diffusion priors survive adaptation and drive the outperformance, as opposed to the objective or decoder alone producing the gains.

    Authors: We agree this ablation is important for isolating whether the performance gains stem from the diffusion priors rather than the hand-overlay rendering objective or decoder architecture alone. The original submission did not include it. In the revised manuscript we will add results training the identical decoder on the same objective but using features from (i) a randomly initialized video encoder and (ii) a non-diffusion video backbone (e.g., a 3D ResNet or I3D pretrained on Kinetics). These controls will clarify the contribution of the adapted diffusion representations. revision: yes
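
The logic of the requested control can be sketched on toy data: hold the probe (decoder) and the training targets fixed, swap only the frozen feature extractor, and compare held-out error. The following is a minimal sketch under strong assumptions. The ridge-regression probe, the synthetic data, and the two stand-in "backbones" are all hypothetical and bear no relation to the authors' actual decoder, features, or results.

```python
import numpy as np

def fit_ridge(features, targets, lam=1e-3):
    """Closed-form ridge regression: W = (X^T X + lam I)^-1 X^T Y."""
    dim = features.shape[1]
    gram = features.T @ features + lam * np.eye(dim)
    return np.linalg.solve(gram, features.T @ targets)

def probe_error(extract, x_train, y_train, x_test, y_test, lam=1e-3):
    """Fit the shared probe on one backbone's features; return mean L2 test error."""
    weights = fit_ridge(extract(x_train), y_train, lam)
    pred = extract(x_test) @ weights
    return float(np.mean(np.linalg.norm(pred - y_test, axis=1)))

def run_ablation(backbones, splits, lam=1e-3):
    """Same probe, same data, different frozen feature extractors."""
    return {name: probe_error(fn, *splits, lam=lam) for name, fn in backbones.items()}

# Synthetic stand-in data: the 3-D targets depend only on the first 3 input dims.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 16))
y = 2.0 * x[:, :3]
splits = (x[:200], y[:200], x[200:], y[200:])

# Hypothetical stand-ins for "features that retain the useful signal"
# and "features that have lost it".
backbones = {
    "keeps_signal": lambda clips: clips,
    "drops_signal": lambda clips: clips[:, 8:],
}
results = run_ablation(backbones, splits)
```

In the real ablation the two entries would be the adapted diffusion features and the control backbones named in the rebuttal. A gap in held-out error of the kind this toy produces would then be attributable to the features rather than to the objective or the decoder.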

Circularity Check

0 steps flagged

No circularity: empirical method with external pretrained backbone

Full rationale

The paper advances an empirical pipeline that fine-tunes a publicly available pretrained video diffusion model via a hand-overlay rendering loss and decodes pose from the resulting features. No equations, parameter-fitting steps, or derivations appear in the provided text; performance is measured on independent external datasets (ARCTIC, HOT3D, HOI4D). The central motivation—that diffusion models acquire occlusion and interaction priors—is presented as a hypothesis tested by comparative results rather than a self-referential definition or fitted-input prediction. No self-citation chains or uniqueness theorems are invoked to close the argument. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5766 in / 1167 out tokens · 28561 ms · 2026-06-30T05:56:23.882868+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    HOT3D: Hand and object tracking in 3D from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, and Tomas Hodan. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognit...

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Hamba: Single-view 3D hand reconstruction with graph-guided bi-scanning mamba

    Haoye Dong, Aviral Chharia, Wenbo Gou, Francisco Vicente Carrasco, and Fernando De la Torre. Hamba: Single-view 3D hand reconstruction with graph-guided bi-scanning mamba. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  4. [4]

    Hmp: Hand motion priors for pose and shape estimation from video

    Enes Duran, Muhammed Kocabas, Vasileios Choutas, Zicong Fan, and Michael J Black. Hmp: Hand motion priors for pose and shape estimation from video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6353–6363, 2024

  5. [5]

    ARCTIC: A dataset for dexterous bimanual hand-object manipulation

    Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  6. [6]

    Qichen Fu, Xingyu Liu, Ran Xu, Juan Carlos Niebles, and Kris M. Kitani. Deformer: Dynamic fusion transformer for robust hand pose estimation, 2023

  7. [7]

    Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, and Radu...

  8. [8]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  9. [9]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025

  10. [10]

    Zixuan Huang, Xiang Li, Zhaoyang Lv, and James M. Rehg. How much 3d do video foundation models encode?, 2025

  11. [11]

    VACE: All-in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17191–17202, 2025

  12. [12]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  13. [13]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. Oral

  14. [14]

    Omnihands: Towards robust 4d hand mesh recovery via a versatile transformer

    Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Wei Jing, Qi Yan, Qianying Wang, Yebin Liu, and Hongwen Zhang. Omnihands: Towards robust 4d hand mesh recovery via a versatile transformer, 2024

  15. [15]

    HOI4D: A 4D egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. HOI4D: A 4D egocentric dataset for category-level human-object interaction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  16. [16]

    Bringing inputs to shared domains for 3D interacting hands recovery in the wild

    Gyeongsik Moon. Bringing inputs to shared domains for 3D interacting hands recovery in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  17. [17]

    Emergent temporal correspondences from video diffusion transformers

    Jisu Nam, Soowon Son, Dahyun Chung, Jiyoung Kim, Siyoon Jin, Junhwa Hur, and Seungryong Kim. Emergent temporal correspondences from video diffusion transformers, 2025

  18. [18]

    Reconstructing hands in 3d with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  19. [19]

    WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild

    Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  20. [20]

    3D hand pose estimation in everyday egocentric images

    Aditya Prakash, Ruisen Tu, Matthew Chang, and Saurabh Gupta. 3D hand pose estimation in everyday egocentric images. In European Conference on Computer Vision (ECCV), 2024

  21. [21]

    Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (TOG), 36(6), 2017

  22. [22]

    Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations

    Ropedia. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026. Dataset

  23. [23]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  24. [24]

    Repurposing video diffusion transformers for robust point tracking

    Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam, Dahyun Chung, Siyoon Jin, Jung Yi, Jaewon Min, Junhwa Hur, and Seungryong Kim. Repurposing video diffusion transformers for robust point tracking, 2025

  25. [25]

    Emergent correspondence from image diffusion

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  26. [26]

    From image to video: An empirical study of diffusion representations

    Pedro Vélez, Luisa F. Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, and Mehdi S. M. Sajjadi. From image to video: An empirical study of diffusion representations. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  27. [27]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. Alibaba Group

  28. [28]

    Hand2world: Autoregressive egocentric interaction generation via free-space hand gestures

    Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, and Xingang Pan. Hand2world: Autoregressive egocentric interaction generation via free-space hand gestures. arXiv preprint arXiv:2602.09600, 2026

  29. [29]

    Generated reality: Human-centric world simulation using interactive video generation with hand and camera control

    Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, and Gordon Wetzstein. Generated reality: Human-centric world simulation using interactive video generation with hand and camera control, 2026

  30. [30]

    Egovla: Learning vision-language-action models from egocentric human videos

    Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, and Xiaolong Wang. Egovla: Learning vision-language-action models from egocentric human videos, 2025

  31. [31]

    CogVideoX: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations (ICLR), 2025

  32. [32]

    Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, and Michael J. Black. Predicting 4d hand trajectory from monocular videos, 2025

  33. [33]

    Dyn-HaMR: Recovering 4D interacting hand motion from a dynamic camera

    Zhengdi Yu, Stefanos Zafeiriou, and Tolga Birdal. Dyn-HaMR: Recovering 4D interacting hand motion from a dynamic camera. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  34. [34]

    Denoise to track: Harnessing video diffusion priors for robust correspondence

    Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, and Zhuzhong Qian. Denoise to track: Harnessing video diffusion priors for robust correspondence, 2025

  35. [35]

    Oakink2: A dataset of bimanual hands-object manipulation in complex task completion

    Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, and Cewu Lu. Oakink2: A dataset of bimanual hands-object manipulation in complex task completion, 2024

  36. [36]

    HaWoR: World-space hand motion reconstruction from egocentric videos

    Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolandos Alexandros Potamias. HaWoR: World-space hand motion reconstruction from egocentric videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  37. [37]

    Egoscale: Scaling dexterous manipulation with diverse egocentric human data

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu, Danfei Xu, and Linxi Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data, 2026