StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
Pith reviewed 2026-05-20 11:08 UTC · model grok-4.3
pith:YRXDA3IT
The pith
Dual-hand world-space motion from egocentric video improves when flow matching conditions on the quality of wrist and finger observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. We decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization.
What carries the argument
Four-channel quality signals predicted by a learned quality network that modulate the flow-matching process via per-channel forward schedules, adjusted velocity targets, AdaLN modulation, and ODE initialization.
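To make concrete where a per-channel quality signal can enter a flow-matching pipeline, the sketch below shows the standard linear flow-matching path and its velocity target, followed by one hypothetical form of quality-aware initialization. The blend rule in `quality_aware_init`, the channel names, and the one-scalar-per-channel simplification are assumptions made for illustration. This page does not reproduce the paper's equations, so the sketch should not be read as StableHand's actual formulation.

```python
import random

# Illustrative labels for the four quality channels (names are assumptions).
CHANNELS = ("left_wrist", "left_fingers", "right_wrist", "right_fingers")


def linear_path(x0, x1, t):
    """Standard linear flow-matching path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Velocity of the linear path, constant in t: u = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def quality_aware_init(observation, quality, rng):
    """ASSUMPTION (not taken from the paper): start the ODE from a
    per-channel blend of the observation and Gaussian noise, weighted by a
    quality value in [0, 1]. Quality 1 keeps the observation unchanged;
    quality 0 starts from pure noise."""
    return [q * o + (1.0 - q) * rng.gauss(0.0, 1.0)
            for o, q in zip(observation, quality)]


rng = random.Random(0)
obs = [0.30, -0.10, 0.25, 0.05]   # one scalar per channel, for brevity
quality = [1.0, 0.2, 0.9, 0.0]    # hypothetical predicted quality values
start = quality_aware_init(obs, quality, rng)
for name, value in zip(CHANNELS, start):
    print(f"{name}: {value:+.3f}")
```

Under this reading, a channel with quality near 1 begins the integration at its observed value, while a low-quality channel begins close to noise and is filled in by the learned prior.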
If this is right
- The model achieves state-of-the-art results across all reported metrics on the HOT3D and ARCTIC benchmarks.
- It reduces W-MPJPE by 20-25% relative to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.
- Wrist trajectories and finger articulations become reliable for supervising robot policy learning.
- The process handles long missing-hand periods from head motion by relying on the bimanual motion prior.
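For readers unfamiliar with the metric behind the 20-25% figure, the following is a minimal sketch of a mean per-joint position error and of a relative reduction between two error values. This page does not define W-MPJPE, so the sketch computes the plain mean Euclidean joint error in world coordinates and omits any trajectory alignment the paper's protocol may apply. All numeric values are hypothetical and are not results from the paper.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance over all frames and joints.

    pred, gt: lists of frames; each frame is a list of (x, y, z) joints
    expressed in the same world coordinate frame and units.
    """
    total, count = 0.0, 0
    for frame_pred, frame_gt in zip(pred, gt):
        for p, g in zip(frame_pred, frame_gt):
            total += math.dist(p, g)
            count += 1
    return total / count


def relative_reduction(baseline_err, method_err):
    """Fractional error reduction of a method relative to a baseline."""
    return (baseline_err - method_err) / baseline_err


# Hypothetical two-frame, one-joint example (units arbitrary).
gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0)], [(1.0, 0.0, 0.0)]]
print("MPJPE:", mpjpe(pred, gt))
print("relative reduction:", relative_reduction(10.0, 8.0))
```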
Where Pith is reading between the lines
- Quality-aware conditioning of this type could extend to full-body pose estimation or other partially observed video tasks.
- The four-channel split might be refined to capture object-interaction quality or arm motion as additional signals.
- Real-time deployment would benefit from faster quality prediction networks while retaining the reconstruction gains.
Load-bearing premise
The quality of hand motion observations from an off-the-shelf estimator decomposes into four independent channels that a separate network can predict accurately enough to guide the generative reconstruction.
What would settle it
Replacing the learned quality channel predictions with uniform average values and checking whether error reductions on the ARCTIC benchmark disappear.
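The control described above can be sketched as follows. The sketch assumes that quality values are stored per frame as four numbers in [0, 1] and that "uniform average values" means the per-channel mean over the sequence; both are assumptions, and the function name is hypothetical.

```python
def uniform_quality(quality):
    """Replace per-frame quality with the per-channel mean over all frames.

    quality: list of frames, each a list of channel values (assumed to be
    four values in [0, 1]). The output has the same shape, but every frame
    carries the same values, so the signal no longer varies over time.
    """
    n_frames = len(quality)
    n_channels = len(quality[0])
    means = [sum(frame[c] for frame in quality) / n_frames
             for c in range(n_channels)]
    return [list(means) for _ in range(n_frames)]


# Hypothetical predicted quality for two frames.
predicted = [[1.0, 0.0, 0.5, 0.5],
             [0.0, 1.0, 0.5, 0.5]]
control = uniform_quality(predicted)
print(control)
```

If the model fed with `control` matched the model fed with `predicted` on ARCTIC, that would suggest the gains do not depend on frame-level quality information.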
Original abstract
Recovering world space 4D motion of two interacting hands from egocentric video is a fundamental capability for supervising robot policy learning, where wrist trajectories track the end-effector and finger articulations specify the grasp pose. Two major challenges arise in this setting: hands frequently leave the camera view for extended periods due to head motion, and persistent hand-object interactions cause severe occlusions of one or both hands. Existing methods uniformly condition on noisy hand motion observations without accounting for their per-frame reliability, leading to substantial performance degradation. Our key insight is that accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. To this end, we decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization. This unified generative process preserves high-quality observations while reconstructing unreliable ones using a learned bimanual motion prior. Experiments on HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, show that StableHand achieves state-of-the-art performance across all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StableHand, a quality-aware flow-matching framework for world-space dual-hand motion estimation from egocentric video. It decomposes per-frame observation quality from an off-the-shelf hand pose estimator into four channels (wrist global translation and finger articulations for left and right hands), predicts these signals with a learned quality network, and integrates them into the generative process via per-channel forward schedules, quality-adjusted velocity targets, AdaLN modulation of the DiT denoiser, and quality-aware ODE initialization. This preserves reliable observations while imputing unreliable spans using a bimanual motion prior. Experiments on HOT3D and ARCTIC report state-of-the-art results, with 20-25% W-MPJPE reductions over baselines and largest gains on heavily occluded sequences.
Significance. If the quality predictions reliably track observation accuracy, the method offers a principled way to handle extended missing-hand spans and hand-object occlusions in egocentric settings, with direct relevance to robot policy learning from wrist trajectories and grasp poses. The specific conditioning mechanisms in flow matching represent a targeted extension of generative priors to variable-quality inputs.
major comments (1)
- [Abstract and Section 3 (method description)] The central claim that quality-aware conditioning drives the 20-25% W-MPJPE gains requires that the four-channel quality signals accurately identify reliable observations. The manuscript provides no direct validation (e.g., correlation with ground-truth world-space errors, precision-recall on occluded frames, or ablation removing the quality branch) showing that the learned quality network tracks actual reliability rather than dataset biases or weak correlations; without this, the unified generative process risks reducing to standard flow matching.
minor comments (2)
- [Experiments section] Provide more details on quality network training procedure, exact baseline re-implementations, and error analysis per occlusion level to strengthen reproducibility and support for the reported gains.
- [Method equations] Clarify notation for the per-channel forward schedule and quality-adjusted velocity target; ensure equations explicitly show how quality modulates the ODE initialization.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the major comment on validating the quality signals below, and we will incorporate additional analysis in the revision to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and Section 3 (method description)] The central claim that quality-aware conditioning drives the 20-25% W-MPJPE gains requires that the four-channel quality signals accurately identify reliable observations. The manuscript provides no direct validation (e.g., correlation with ground-truth world-space errors, precision-recall on occluded frames, or ablation removing the quality branch) showing that the learned quality network tracks actual reliability rather than dataset biases or weak correlations; without this, the unified generative process risks reducing to standard flow matching.
Authors: We agree that explicit validation of the learned quality network would further substantiate the central claim. The manuscript trains the quality network to predict four-channel observation reliability (wrist translation and finger articulation for each hand) from the off-the-shelf estimator outputs, then integrates these signals via per-channel forward schedules, quality-adjusted velocity targets, AdaLN modulation in the DiT, and quality-aware ODE initialization. This design is intended to preserve reliable observations while imputing unreliable spans with the bimanual motion prior. The reported results show the largest W-MPJPE reductions (20-25%) precisely on the heavily occluded ARCTIC sequences, which is consistent with the quality signals enabling better handling of unreliable frames. Nevertheless, we acknowledge the absence of direct metrics such as correlation with ground-truth world-space errors or precision-recall on occluded frames, as well as a dedicated ablation isolating the quality branch. In the revised manuscript we will add (i) an ablation removing the quality conditioning to quantify its isolated contribution and (ii) correlation and precision-recall analysis against available ground-truth occlusion and error annotations. These additions will clarify that the gains arise from the quality-aware mechanisms rather than reducing to standard flow matching.
Revision: yes
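The validation discussed in this exchange (correlation with ground-truth error, and precision-recall on unreliable frames) could look roughly like the sketch below. The thresholds, the sample values, and the choice of Pearson correlation are assumptions for illustration; the page does not say which statistics or annotations the authors would use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def precision_recall(pred_quality, gt_error, q_thresh, err_thresh):
    """Treat a frame as flagged when predicted quality < q_thresh and as
    truly unreliable when ground-truth error > err_thresh (both thresholds
    are hypothetical). Returns (precision, recall) of the flags."""
    tp = fp = fn = 0
    for q, e in zip(pred_quality, gt_error):
        flagged, bad = q < q_thresh, e > err_thresh
        tp += flagged and bad
        fp += flagged and not bad
        fn += bad and not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical per-frame values for one channel.
pred_quality = [0.9, 0.8, 0.2, 0.1]
gt_error = [5.0, 10.0, 40.0, 60.0]
print("correlation:", pearson(pred_quality, gt_error))
print("precision, recall:", precision_recall(pred_quality, gt_error, 0.5, 30.0))
```

A quality network that tracks reliability should yield a clearly negative correlation between predicted quality and ground-truth error, together with high precision and recall on the frames it flags.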
Circularity Check
No significant circularity; derivation remains self-contained
Full rationale
The paper decomposes hand observation quality into four channels predicted by a separately learned quality network, then conditions a standard flow-matching process on those signals via per-channel schedules, velocity targets, AdaLN modulation, and ODE initialization. No equations, definitions, or steps in the abstract or described method reduce the quality predictions or final motion estimates to fitted inputs by construction, nor do they rely on load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled from prior author work. The central performance gains are presented as empirical outcomes of the quality-aware conditioning rather than tautological equivalences, making the derivation independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- quality network parameters
- DiT denoiser parameters
axioms (2)
- domain assumption: Hand motion observations can be decomposed into four independent quality channels for wrists and fingers of both hands.
- domain assumption: A learned bimanual motion prior can reconstruct unreliable observations while preserving high-quality ones.
invented entities (1)
- four-channel quality signals: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We decompose the quality of hand motion observations ... into four channels: wrist global translation and finger articulations for both hands. ... per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, costAlphaLog_high_calibrated_iff
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "q = exp(-e/σ) ... RBF kernel over a component-specific joint error"
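The quoted expression q = exp(-e/σ) maps a non-negative joint error e to a quality value in (0, 1]. A minimal sketch of that mapping follows. The per-component σ values and the millimetre units are hypothetical, and the passage is elided, so how the paper computes the component-specific error e is not shown here.

```python
import math


def quality_from_error(error, sigma):
    """q = exp(-e / sigma): equals 1 at zero error and decays toward 0
    as the error grows. sigma sets how quickly quality falls off."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-error / sigma)


# Hypothetical component-specific bandwidths (units assumed to be mm).
SIGMA = {"wrist": 50.0, "fingers": 20.0}

for component, sigma in SIGMA.items():
    for err in (0.0, 10.0, 100.0):
        q = quality_from_error(err, sigma)
        print(f"{component}: error={err:6.1f} -> q={q:.3f}")
```

A smaller σ makes the quality score more sensitive: the same error yields a lower q for a component with a tighter bandwidth.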
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.