DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
Pith reviewed 2026-05-22 15:16 UTC · model grok-4.3
The pith
Co-training robot policies on human hand demonstrations from diverse real-world settings and limited robot data yields policies that generalize to new environments and robot bodies far better than robot data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In DexWild, a diverse team collects hours of human hand interactions across many environments and objects using the low-cost DexWild-System device; the learning framework then co-trains on both this human data and robot demonstrations, producing robot policies that generalize to novel environments, tasks, and embodiments with minimal additional robot-specific data, reaching 68.5 percent success in unseen environments and 5.8 times better cross-embodiment generalization than robot-only training.
What carries the argument
The DexWild co-training framework that combines human hand demonstrations recorded in-the-wild with robot demonstrations to bridge the embodiment gap.
If this is right
- Robot policies reach nearly four times higher success rates in environments never seen during training than policies trained on robot data alone.
- Cross-embodiment generalization improves by a factor of roughly 5.8 over robot-only training.
- Effective policies for new tasks and settings require only minimal extra robot data collection.
- Data gathering for dexterous skills scales by relying on ordinary human hand use rather than full teleoperation.
Where Pith is reading between the lines
- Similar co-training could apply to other robot skills such as navigation or tool use where human examples are easy to gather.
- Larger collections of human data might further shrink the amount of robot data needed for acceptable performance.
- Testing the same co-training recipe across a wider variety of robot arms and grippers would check how far the embodiment bridging extends.
- Pairing this human data source with simulation could create training pipelines that need even less real-world robot time.
Load-bearing premise
Human hand demonstrations recorded by the DexWild-System can be effectively co-trained with robot demonstrations to bridge the embodiment gap and produce superior generalization without requiring substantial new robot data in target environments.
What would settle it
If a policy trained only on robot demonstrations achieves success rates in unseen environments comparable to or higher than the co-trained policy, or shows no meaningful gain in cross-embodiment transfer, the central claim would be falsified.
Original abstract
Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments (nearly four times higher than policies trained with robot data only) and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DexWild, a framework for collecting diverse in-the-wild human hand interaction data via a low-cost mobile device (DexWild-System) and co-training dexterous manipulation policies on both human and robot demonstrations. The central claim is that this yields robust policies that generalize to novel environments, tasks, and embodiments using minimal additional robot-specific data, evidenced by a reported 68.5% success rate in unseen environments (nearly 4x higher than robot-data-only training) and 5.8x better cross-embodiment generalization.
Significance. If the empirical gains hold under rigorous scrutiny, the work could meaningfully advance scalable data collection for dexterous robotics by leveraging everyday human interactions rather than expensive teleoperation. The approach builds on embodiment-bridging ideas and reports substantial quantitative improvements, which—if reproducible—would support the broader goal of generalizable robot policies with reduced robot data requirements.
major comments (2)
- [§4] §4 (Learning Framework): The abstract and methods description state that the framework 'co-trains on both human and robot demonstrations' but provide no explicit mechanism (e.g., retargeting function, shared latent space, or contact-aware loss) for aligning differing kinematics, contact geometry, and observation modalities. This alignment step is load-bearing for the 'minimal additional robot-specific data' and 5.8x cross-embodiment claims; without it, gains may arise from spurious correlations rather than transferable features.
- [§5] §5 (Experiments): The headline metrics (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented without reported trial counts, standard deviations, statistical tests, or full baseline details (including whether baselines also incorporate any form of mixed data). This omission prevents assessment of whether improvements are robustly attributable to the co-training regime rather than implementation specifics or data selection.
minor comments (2)
- [Abstract] The abstract would benefit from one sentence summarizing the specific tasks, objects, and environments used in the quantitative evaluations.
- [Figures] Ensure all figures include clear captions describing axes, error bars, and what each curve represents (human-only, robot-only, co-trained).
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.
-
Referee: [§4] §4 (Learning Framework): The abstract and methods description state that the framework 'co-trains on both human and robot demonstrations' but provide no explicit mechanism (e.g., retargeting function, shared latent space, or contact-aware loss) for aligning differing kinematics, contact geometry, and observation modalities. This alignment step is load-bearing for the 'minimal additional robot-specific data' and 5.8x cross-embodiment claims; without it, gains may arise from spurious correlations rather than transferable features.
Authors: We appreciate the referee drawing attention to this critical detail. The current manuscript describes the co-training objective at a high level in Section 4 but does not provide an explicit account of the alignment procedure between human and robot data. We agree that specifying the mechanism (such as any retargeting, shared representation, or modality-specific losses) is necessary to support the generalization claims. In the revised manuscript we will expand Section 4 with a precise description of the alignment steps used, including the retargeting function and any shared latent components, so that readers can evaluate whether the policy learns transferable features rather than spurious correlations. revision: yes
-
Referee: [§5] §5 (Experiments): The headline metrics (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented without reported trial counts, standard deviations, statistical tests, or full baseline details (including whether baselines also incorporate any form of mixed data). This omission prevents assessment of whether improvements are robustly attributable to the co-training regime rather than implementation specifics or data selection.
Authors: We acknowledge that the experimental reporting in Section 5 is incomplete with respect to statistical rigor. The manuscript currently states aggregate success rates without accompanying trial counts, variance measures, or significance testing, and baseline configurations are not fully specified. In the revision we will add the number of trials per condition, standard deviations across random seeds, and statistical tests comparing DexWild to the robot-only baseline. We will also clarify that the primary baselines are trained exclusively on robot data (with an additional mixed-data ablation if space permits) to isolate the contribution of the co-training regime. revision: yes
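The statistical reporting requested for Section 5 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the trial counts are hypothetical placeholders (the source reports no counts), and the Wilson interval and pooled two-proportion z-test are one reasonable choice among several.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for a success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def two_proportion_z_test(s1, n1, s2, n2):
    """Pooled two-sided z-test for a difference between two success rates."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        return 0.0, 1.0  # every trial had the same outcome in both groups
    z = (s1 / n1 - s2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# HYPOTHETICAL counts, chosen only for illustration; the paper's actual
# trial counts are not given in the text above.
cotrained = (137, 200)   # (successes, trials)
robot_only = (35, 200)

print("co-trained 95% interval:", wilson_interval(*cotrained))
print("robot-only 95% interval:", wilson_interval(*robot_only))
print("z, p-value:", two_proportion_z_test(*cotrained, *robot_only))
```

Reporting the interval alongside each success rate lets a reader judge whether a gap between conditions is larger than the sampling noise implied by the number of trials.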
Circularity Check
No circularity: empirical performance claims rest on experimental comparisons
The paper advances an empirical system for collecting in-the-wild human hand demonstrations via DexWild-System and co-training policies on combined human+robot data. All headline results (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented as outcomes of controlled training-regime ablations rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-definitional quantities, uniqueness theorems, or ansatzes appear in the provided text; the central claim is externally falsifiable via the reported success rates on held-out environments and embodiments. This is the normal, non-circular case for a data-driven robotics paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- co-training balance hyperparameters
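One way such a balance hyperparameter could enter training is as a fixed share of human samples per batch. The sketch below assumes exactly that; the name `human_fraction`, the function `sample_cotraining_batch`, and the per-batch mixing scheme are hypothetical and are not taken from the paper.

```python
import random


def sample_cotraining_batch(human_data, robot_data, batch_size, human_fraction, rng):
    """Draw one mixed batch with a fixed share of human demonstrations.

    human_fraction is a hypothetical balance hyperparameter in [0, 1].
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must lie in [0, 1]")
    n_human = round(batch_size * human_fraction)
    n_robot = batch_size - n_human
    # Sample with replacement so a small robot dataset can still fill its share.
    batch = [("human", rng.choice(human_data)) for _ in range(n_human)]
    batch += [("robot", rng.choice(robot_data)) for _ in range(n_robot)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
human_demos = [f"human_demo_{i}" for i in range(100)]  # placeholder samples
robot_demos = [f"robot_demo_{i}" for i in range(10)]   # placeholder samples
batch = sample_cotraining_batch(human_demos, robot_demos, 8, 0.75, rng)
print(batch)
```

Sweeping `human_fraction` from 0 to 1 would give the robot-only and human-only conditions as the two endpoints of the same training recipe.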
axioms (1)
- domain assumption: Human hand interactions recorded in-the-wild provide demonstrations that transfer effectively to robot embodiments when co-trained
invented entities (1)
-
DexWild-System
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
The DexWild learning framework co-trains on both human and robot demonstrations... achieving a 68.5% success rate in unseen environments
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_injective (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
action space alignment... optimizing robot hand kinematics to match the fingertip positions
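The quoted passage describes optimizing robot hand kinematics to match fingertip positions. A toy version of that idea is sketched below for a hypothetical planar two-joint finger. The link lengths, the damped least-squares solver, and the warm start from a nearby pose are all assumptions made for illustration; the source does not say which optimizer or hand model the authors use.

```python
import math


def fingertip(q, links=(1.0, 0.8)):
    """Forward kinematics of a planar two-joint finger (arbitrary units)."""
    l1, l2 = links
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y


def retarget(target, q0, links=(1.0, 0.8), damping=1e-6, steps=50):
    """Find joint angles whose fingertip matches `target`.

    Uses damped least-squares (Levenberg-Marquardt style) updates and
    assumes q0 is a warm start, e.g. the previous frame's solution.
    """
    l1, l2 = links
    q = [q0[0], q0[1]]
    for _ in range(steps):
        x, y = fingertip(q, links)
        ex, ey = x - target[0], y - target[1]
        s1, c1 = math.sin(q[0]), math.cos(q[0])
        s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
        # Jacobian of the fingertip position with respect to the joint angles.
        j11, j12 = -l1 * s1 - l2 * s12, -l2 * s12
        j21, j22 = l1 * c1 + l2 * c12, l2 * c12
        # Solve (J^T J + damping * I) dq = -J^T e for the 2x2 case.
        a11 = j11 * j11 + j21 * j21 + damping
        a12 = j11 * j12 + j21 * j22
        a22 = j12 * j12 + j22 * j22 + damping
        g0 = j11 * ex + j21 * ey
        g1 = j12 * ex + j22 * ey
        det = a11 * a22 - a12 * a12
        q[0] -= (a22 * g0 - a12 * g1) / det
        q[1] -= (a11 * g1 - a12 * g0) / det
    return tuple(q)


# Hypothetical "human" fingertip target generated from a known pose.
target = fingertip((0.6, 0.9))
solution = retarget(target, q0=(0.5, 0.8))
print(solution, fingertip(solution))
```

A real dexterous hand would need one such objective per finger in 3D plus joint limits; the sketch only shows the shape of the fingertip-matching objective.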
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps
HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
X-Diffusion adapts Ambient Diffusion to selectively train on noised human actions for cross-embodiment robot policies, yielding 16% higher average success rates than naive co-training or manual filtering across five r...
-
OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction
OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.
Reference graph
Works this paper leans on
-
[1]
Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. 2023.
-
[2]
Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. arXiv preprint arXiv:2411.19167,
-
[3]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
-
[4]
Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, et al. Dexycb: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9044–9053, 2021.
-
[5]
Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024.
-
[6]
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
-
[7]
Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024.
-
[8]
Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...
-
[9]
Juan Antonio Corrales, Francisco A Candelas, and Fernando Torres. Hybrid tracking of human operators using imu/uwb data fusion by a kalman filter. In Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, pages 193–200, 2008.
-
[10]
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Epic-kitchens: A large-scale dataset for recognizing, anticipating, and retrieving hand-object interactions. In Proceedings of the European Conference on Computer Vision (ECC...
-
[11]
Sudeep Dasari, Mohan Kumar Srirama, Unnat Jain, and Abhinav Gupta. An unbiased look at datasets for visuo-motor pre-training. In Conference on Robot Learning. PMLR, 2023.
-
[12]
Haritheja Etukuru, Norihito Naka, Zijin Hu, Seungjae Lee, Julian Mehu, Aaron Edsinger, Chris Paxton, Soumith Chintala, Lerrel Pinto, and Nur Muhammad Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments, 2024.
-
[13]
Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12954, 2023.
-
[14]
Authors from UC San Diego and MIT. Open-television: An open-source immersive teleoperation system with stereo visual feedback. The Robot Report, 2024.
-
[15]
Kristen Grauman, Michael Ryoo, Aljoša Smolić, Minh Vo, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11743–11753, 2022.
-
[16]
Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. UMI on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. In Proceedings of the 2024 Conference on Robot Learning,
-
[17]
Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020.
-
[18]
Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Zhibo Zhao, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023.
-
[19]
Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. OPEN TEACH: A versatile teleoperation system for robotic manipulation. arXiv preprint arXiv:2403.07870,
-
[20]
Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024.
-
[21]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
-
[22]
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
-
[23]
Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. In Conference on Robot Learning (CoRL), 2024.
-
[24]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
-
[25]
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298,
-
[26]
Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, June 202...
-
[27]
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, and Chelsea Finn. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601,
-
[28]
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Hamer: Hand mesh recovery for the egoexo4d hand pose challenge.
-
[29]
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
-
[30]
Alexandra Pfister, Alexandre M West, Shaw Bronner, and Jack Adam Noah. Comparative abilities of microsoft kinect and vicon 3d motion capture for gait analysis. Journal of medical engineering & technology, 38(5):274–280, 2014.
-
[31]
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988.
-
[32]
Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025.
-
[33]
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. CoRL,
-
[34]
Nathan D. Ratliff, Jan Issac, Daniel Kappler, Stan Birchfield, and Dieter Fox. Riemannian motion policies, 2018. URL https://arxiv.org/abs/1801.02854.
-
[35]
Yu Rong, Takaaki Shiratori, and Hanbyul Joo. Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1759, 2021.
-
[36]
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
-
[37]
Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
-
[38]
Kenneth Shaw, Ananye Agarwal, and Deepak Pathak. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning. Robotics: Science and Systems (RSS),
-
[39]
Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 654–665. PMLR, 14–18 Dec 2023.
-
[40]
Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, pages 654–665. PMLR,
-
[41]
Kenneth Shaw, Yulong Li, Jiahui Yang, Mohan Kumar Srirama, Ray Liu, Haoyu Xiong, Russell Mendonca, and Deepak Pathak. Bimanual dexterity for complex tasks. In 8th Annual Conference on Robot Learning, 2024.
-
[42]
Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, and Jitendra Malik. Hand-object interaction pretraining from videos, 2024. URL https://arxiv.org/abs/2409.08273.
-
[43]
Ritvik Singh, Arthur Allshire, Ankur Handa, Nathan Ratliff, and Karl Van Wyk. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands. arXiv preprint arXiv:2412.01791, 2024.
-
[44]
Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube, 2022.
-
[45]
Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. RSS, 2022.
-
[46]
Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020.
-
[47]
Mohan Kumar Srirama, Sudeep Dasari, Shikhar Bahl, and Abhinav Gupta. HRP: Human affordances for robotic pre-training. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.
-
[48]
Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, and Xiaohua Zhai. PaliGemma 2: A Family of Versatile VLMs for...
-
[49]
OM Team, D Ghosh, H Walke, K Pertsch, K Black, O Mees, S Dasari, J Hejna, C Xu, J Luo, et al. Octo: An open-source generalist robot policy. Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2023.
-
[50]
Yushuang Tian, Xiaoli Meng, Dapeng Tao, Dongquan Liu, and Chen Feng. Upper limb motion tracking with the integration of imu and kinect. Neurocomputing, 159:207–218, 2015.
-
[51]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
-
[52]
Valve Corporation. https://store.steampowered.com/steamvr. [Virtual reality platform].
-
[53]
A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
-
[54]
Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023.
-
[55]
Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. In Robotics: Science and Systems (RSS), 2024.
-
[56]
Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2023.
-
[57]
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758,
-
[58]
Han Zhang, Songbo Hu, Zhecheng Yuan, and Huazhe Xu. Doglove: Dexterous manipulation with a low-cost open-source haptic force feedback glove. arXiv preprint arXiv:2502.07730, 2025.
-
[59]
Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023.
-
[60]
Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 813–822, 2019.
VII. APPENDIX
Videos of our results, code to recreate our system, and hard...
-
[61]
and Diffusion U-Net [6] as policy classes, which output a sequence of actions. The network outputs actions that consist of relative end-effector actions and absolute hand joint angles. We list the hyperparameters that we used for policy training with behavior cloning in Table V.
E. Low Level Motion Control
For optimal smoothness of our policies a...
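The action format described in this fragment (relative end-effector actions plus absolute hand joint angles) can be illustrated with a small sketch. It assumes each relative action is a position delta applied to the previous commanded position and omits rotation entirely; the source does not specify the reference frame, so `rollout_chunk` and its conventions are hypothetical.

```python
def rollout_chunk(start_position, chunk):
    """Turn a predicted action chunk into absolute command targets.

    Each chunk element is (delta_xyz, hand_joint_angles). ASSUMPTION: deltas
    accumulate step by step from start_position; hand joint angles are
    already absolute and pass through unchanged.
    """
    position = list(start_position)
    targets = []
    for delta, hand_joints in chunk:
        position = [p + d for p, d in zip(position, delta)]
        targets.append((tuple(position), tuple(hand_joints)))
    return targets


# Hypothetical two-step chunk for a hand with two joints.
chunk = [
    ((0.5, 0.0, 0.0), (0.1, 0.2)),
    ((0.25, 0.0, 0.0), (0.3, 0.4)),
]
for wrist_position, hand_joints in rollout_chunk((0.0, 0.0, 0.0), chunk):
    print(wrist_position, hand_joints)
```

Keeping the end-effector part relative makes the same predicted chunk usable from different starting poses, while absolute joint angles leave the hand shape independent of where the wrist happens to be.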