Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
Pith reviewed 2026-05-17 22:28 UTC · model grok-4.3
The pith
Uni-Hand forecasts hand waypoints in 2D and 3D plus head motion and contact states by fusing vision-language inputs in a dual-branch diffusion model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By harmonizing multiple modalities through vision-language fusion, global context, and task-aware text embedding injection, the framework forecasts hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion model concurrently predicts human head and hand movements to capture their motion synergy. Target indicators allow forecasting of wrist or finger joints in addition to hand centers, while hand-object interaction states are predicted to aid downstream tasks. Experiments on public datasets and new benchmarks show state-of-the-art results in multi-dimensional and multi-target forecasting, with strong transfer to robotic manipulation policies and improved features for action tasks.
What carries the argument
Dual-branch diffusion architecture that runs concurrent head and hand predictions while injecting target indicators and interaction states into a vision-language fused input stream.
If this is right
- Multi-target forecasts of wrist, finger, and center points become available alongside 2D and 3D outputs.
- Head and hand motions are predicted together, reflecting their coordination in egocentric scenes.
- Hand-object contact and separation states are output to support immediate use in manipulation tasks.
- The same model improves both robotic policy transfer and feature quality for action anticipation or recognition.
- Benchmarks now exist that test forecasting directly through downstream task performance rather than isolated trajectory error.
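The outputs listed above can be pictured as one record per forecast. The sketch below is a hypothetical container, not the paper's data format: the names `Target` and `HandForecast`, the field layout, and the helper `first_contact_step` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Target(Enum):
    # Forecast targets named in the summary; this enum itself is hypothetical.
    CENTER = "center"
    WRIST = "wrist"
    FINGER = "finger"


@dataclass
class HandForecast:
    target: Target
    waypoints_2d: List[Tuple[float, float]]         # image-plane points, one per future step
    waypoints_3d: List[Tuple[float, float, float]]  # 3D points, one per future step
    contact: List[bool]                             # True = contact, False = separation

    def __post_init__(self) -> None:
        # Every per-step sequence must cover the same forecast horizon.
        n = len(self.contact)
        if len(self.waypoints_2d) != n or len(self.waypoints_3d) != n:
            raise ValueError("all per-step sequences must share one horizon length")

    @property
    def horizon(self) -> int:
        return len(self.contact)

    def first_contact_step(self) -> int:
        """Index of the first step flagged as contact, or -1 if there is none."""
        for i, in_contact in enumerate(self.contact):
            if in_contact:
                return i
        return -1
```

A downstream manipulation policy could, under these assumptions, read `first_contact_step()` to decide when a gripper should close.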
Where Pith is reading between the lines
- The same fusion and diffusion structure could be tested on full-body egocentric motion if head-hand synergy extends to torso and legs.
- Real-time versions might replace the diffusion steps with faster sampling once the core synergy is confirmed on the new benchmarks.
- Adding depth or audio channels as extra branches could further reduce modality gaps in outdoor or noisy settings.
- Cross-dataset transfer to different camera rigs or user groups would check whether the text-embedding injection remains stable.
Load-bearing premise
Vision-language fusion and task-aware text embeddings close modality gaps sufficiently, and the dual-branch diffusion captures head-hand coordination without introducing artifacts that require heavy post-tuning.
What would settle it
A controlled ablation on the new downstream robotic manipulation benchmark. If removing the dual-branch head prediction or the vision-language fusion step leaves success rates unchanged or raises them, the premise fails; if success rates drop below those of the full model, the premise holds.
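One way to read such an ablation is as a comparison of two success rates. The sketch below uses a pooled two-proportion z statistic with a one-sided 5% threshold. This is an assumed analysis choice, not something the paper or the review specifies, and the function names are hypothetical. Any counts passed in are the caller's own; none come from the paper.

```python
import math


def success_rate_gap(full_successes, full_trials, ablated_successes, ablated_trials):
    """Return (gap, z): gap is full rate minus ablated rate, z is the
    pooled two-proportion z statistic for that gap."""
    p_full = full_successes / full_trials
    p_ablated = ablated_successes / ablated_trials
    pooled = (full_successes + ablated_successes) / (full_trials + ablated_trials)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / full_trials + 1.0 / ablated_trials))
    gap = p_full - p_ablated
    z = gap / se if se > 0 else 0.0
    return gap, z


def component_matters(gap, z, z_threshold=1.645):
    # One-sided check: the full model must beat the ablated variant
    # by more than chance would explain at roughly the 5% level.
    return gap > 0 and z > z_threshold
```

If `component_matters` returns `False` for the variant without the head branch, that would count against the load-bearing premise under this reading.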
Original abstract
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also presents its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Uni-Hand, a framework for forecasting hand motion in egocentric views that integrates vision-language fusion, global context, task-aware text embeddings, and a dual-branch diffusion model to jointly predict head and hand trajectories in 2D/3D. It adds target indicators for wrist/finger joints and hand-object contact states, claims SOTA results on public datasets plus new downstream benchmarks, and reports gains in robotic policy transfer and action anticipation/recognition.
Significance. If the empirical claims hold, the work would be notable for being the first to evaluate hand-motion forecasting on downstream tasks and for attempting a unified multi-modal, multi-target architecture. The downstream validation and explicit handling of head-hand synergy in egocentric settings address real gaps in AR and robotics applications.
major comments (2)
- [§4.2] Dual-branch diffusion: The architecture description does not specify cross-branch conditioning, shared latents, or a joint loss term that would enforce motion synergy between head and hand branches. Without such mechanisms, concurrent prediction could be achieved by independent parallel heads, weakening the justification for the added complexity and the central novelty claim.
- [Table 2, §5.3] Downstream benchmarks: The reported gains in human-robot policy transfer and action anticipation are presented without ablation isolating the contribution of the dual-branch diffusion versus the vision-language fusion or target indicators; this makes it difficult to attribute the downstream improvements to the claimed synergy capture.
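The first major comment turns on the difference between independent parallel heads and branches that condition on each other. The sketch below illustrates that difference only. It is not the paper's architecture: the attention uses identity projections instead of learned weights, and `cross_attend`, `independent_step`, and `coupled_step` are hypothetical names.

```python
import math
from typing import List, Tuple

Vec = List[float]
Mat = List[Vec]


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: Mat, other: Mat) -> Mat:
    """Add to each query token an attention-weighted mix of the other
    branch's tokens (scaled dot-product, identity projections)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in other])
        mixed = [sum(w * tok[d] for w, tok in zip(weights, other)) for d in range(len(q))]
        out.append([qi + mi for qi, mi in zip(q, mixed)])  # residual connection
    return out


def independent_step(hand: Mat, head: Mat) -> Tuple[Mat, Mat]:
    # Parallel heads: neither branch sees the other's tokens.
    return [row[:] for row in hand], [row[:] for row in head]


def coupled_step(hand: Mat, head: Mat) -> Tuple[Mat, Mat]:
    # Cross-branch conditioning: each branch reads the other's tokens.
    return cross_attend(hand, head), cross_attend(head, hand)
```

With `coupled_step`, changing the head tokens changes the hand output; with `independent_step` it does not. That dependence is the property the referee asks §4.2 to make explicit.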
minor comments (2)
- [Abstract, §3.1] The abstract and §3.1 use 'multi-dimensional and multi-target' without an early explicit enumeration of the exact output dimensions (2D/3D waypoints, contact states) and targets (center, wrist, fingers); a short table or bullet list would improve readability.
- [Figure 3] The architecture diagram lacks labels for the cross-branch connections or loss terms; adding these annotations would clarify how synergy is realized.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the description of our dual-branch diffusion model and to better attribute performance gains in the downstream evaluations. We address each major comment below.
point-by-point responses
- Referee: [§4.2] Dual-branch diffusion: The architecture description does not specify cross-branch conditioning, shared latents, or a joint loss term that would enforce motion synergy between head and hand branches. Without such mechanisms, concurrent prediction could be achieved by independent parallel heads, weakening the justification for the added complexity and the central novelty claim.
  Authors: We agree that §4.2 would benefit from greater specificity on the mechanisms that distinguish our dual-branch diffusion from independent parallel heads. The current text states that the model concurrently predicts head and hand movements to capture motion synergy in egocentric vision, but does not detail the implementation. In the revised manuscript we will expand §4.2 to describe the cross-branch attention-based conditioning, the shared latent representations that allow information flow between branches, and the joint loss term that explicitly regularizes consistency between predicted head and hand trajectories. These elements are present in our implementation and are what enable the model to learn the entangled head-hand dynamics that independent branches cannot capture as effectively.
  Revision: yes
- Referee: [Table 2, §5.3] Downstream benchmarks: The reported gains in human-robot policy transfer and action anticipation are presented without ablation isolating the contribution of the dual-branch diffusion versus the vision-language fusion or target indicators; this makes it difficult to attribute the downstream improvements to the claimed synergy capture.
  Authors: We recognize that component ablations would make it easier to attribute downstream gains specifically to the dual-branch synergy modeling. Our reported results reflect the performance of the complete Uni-Hand framework. In the revised version we will add targeted ablations in §5.3 (and an expanded Table 2 or supplementary material) that compare the full model against variants that disable the dual-branch diffusion while retaining vision-language fusion and target indicators. This will provide clearer evidence linking the head-hand synergy capture to the observed improvements in robotic policy transfer and action anticipation/recognition.
  Revision: yes
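The ablation variants discussed in the second exchange can be enumerated mechanically. The sketch below is a hypothetical configuration helper: the component keys `dual_branch`, `vl_fusion`, and `target_indicators` are labels chosen here for the three parts named in the exchange, not identifiers from the paper's code.

```python
from itertools import product

# Hypothetical labels for the three components named in the exchange.
COMPONENTS = ("dual_branch", "vl_fusion", "target_indicators")


def ablation_variants(components=COMPONENTS):
    """Yield one config dict per on/off combination of the components."""
    for flags in product((True, False), repeat=len(components)):
        yield dict(zip(components, flags))


def single_removals(components=COMPONENTS):
    """Return the full model plus each variant with exactly one component off."""
    full = {c: True for c in components}
    variants = [("full", full)]
    for c in components:
        cfg = dict(full)
        cfg[c] = False
        variants.append((f"no_{c}", cfg))
    return variants
```

The `no_dual_branch` entry from `single_removals()` corresponds to the variant the simulated authors describe: dual-branch diffusion disabled, vision-language fusion and target indicators retained.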
Circularity Check
No significant circularity; architecture and claims are self-contained innovations
full rationale
The paper introduces a new framework combining vision-language fusion, task-aware text embeddings, target indicators, and a dual-branch diffusion model for concurrent head-hand prediction. These are architectural proposals validated through experiments on public datasets and new downstream benchmarks, rather than any derivation that reduces outputs to fitted inputs or self-referential definitions by construction. No equations, predictions, or load-bearing steps in the abstract or described contributions collapse to prior fitted quantities or unverified self-citations; the central claims rest on novel synthesis and empirical results.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion timestep schedule
axioms (2)
- domain assumption: Diffusion processes can model the distribution of future hand and head trajectories from egocentric observations.
- domain assumption: Vision-language embeddings can be harmonized to reduce modality gaps in motion forecasting.
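For readers unfamiliar with the free parameter and the first axiom, the sketch below shows the standard DDPM forward (noising) process applied to a short list of 2D waypoints: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear schedule and its endpoint values are the common DDPM defaults and are an assumption here; the schedule Uni-Hand actually uses is not stated in this review. Function names are hypothetical.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not the paper's)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def noise_trajectory(waypoints, t, a_bars, rng):
    """Sample x_t ~ q(x_t | x_0) for a list of coordinate tuples."""
    signal = math.sqrt(a_bars[t])
    sigma = math.sqrt(1.0 - a_bars[t])
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in point)
        for point in waypoints
    ]


# Usage: noise a three-step 2D trajectory at the final timestep.
a_bars = alpha_bars(linear_betas(1000))
noisy = noise_trajectory([(0.1, 0.2), (0.2, 0.3), (0.3, 0.4)], 999, a_bars, random.Random(0))
```

At the last timestep almost no signal remains, which is why the reverse (denoising) network has to be conditioned on the observed video and text to recover a plausible trajectory.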
invented entities (2)
- dual-branch diffusion: no independent evidence
- target indicators: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean (Jcost uniqueness, washburn_uniqueness_aczel), reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision... hybrid Mamba-Transformer module... target indicators... interaction state decoder"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.