arXiv: 2505.07813 · v2 · pith:XK2EKSW5new · submitted 2025-05-12 · 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.SY · eess.SY

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

Pith reviewed 2026-05-22 15:16 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.SY · eess.SY
keywords dexterous manipulation · human demonstrations · in-the-wild data · robot policy learning · generalization · embodiment gap · co-training

The pith

Co-training robot policies on human hand demonstrations from diverse real-world settings and limited robot data yields policies that generalize to new environments and robot bodies far better than robot data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that people can collect useful dexterous manipulation data simply by using their own hands in everyday environments, recorded with a low-cost mobile device. The method jointly trains policies on these human demonstrations plus a smaller set of robot demonstrations. If the approach holds, it offers a scalable alternative to expensive robot teleoperation while producing policies that succeed more often in unfamiliar places, on new tasks, and even on different robot hardware. A sympathetic reader would care because it points toward using abundant natural human behavior to make robot learning practical without massive new robot datasets in every target setting.

Core claim

In DexWild, a diverse team collects hours of human hand interactions across many environments and objects using the low-cost DexWild-System device. The learning framework then co-trains on this human data together with robot demonstrations, producing robot policies that generalize to novel environments, tasks, and embodiments with minimal additional robot-specific data. The reported results are a 68.5 percent success rate in unseen environments and 5.8 times better cross-embodiment generalization than robot-only training.

What carries the argument

The DexWild co-training framework that combines human hand demonstrations recorded in-the-wild with robot demonstrations to bridge the embodiment gap.
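To make the co-training idea concrete, here is a minimal sketch of one way to build mixed training batches at a fixed Robot:Human ratio. It is not the authors' implementation: the function name `sample_cotraining_batch`, the `robot_fraction` parameter, and the placeholder demonstration lists are all assumptions made for illustration, and the paper's actual sampling scheme may differ.

```python
import random


def sample_cotraining_batch(robot_demos, human_demos, batch_size, robot_fraction, rng):
    """Draw one mixed batch with a fixed share of robot samples (hypothetical helper)."""
    if not 0.0 <= robot_fraction <= 1.0:
        raise ValueError("robot_fraction must be in [0, 1]")
    n_robot = round(batch_size * robot_fraction)
    n_human = batch_size - n_robot
    # Sample with replacement so a small robot dataset can still fill its share.
    batch = [("robot", rng.choice(robot_demos)) for _ in range(n_robot)]
    batch += [("human", rng.choice(human_demos)) for _ in range(n_human)]
    rng.shuffle(batch)
    return batch


# Placeholder datasets: a small robot set and a larger human set (not real data).
rng = random.Random(0)
robot_demos = [f"robot_{i}" for i in range(10)]
human_demos = [f"human_{i}" for i in range(100)]

# A 1:3 Robot:Human ratio corresponds to robot_fraction = 0.25.
batch = sample_cotraining_batch(robot_demos, human_demos, batch_size=12,
                                robot_fraction=0.25, rng=rng)
n_robot_in_batch = sum(1 for source, _ in batch if source == "robot")
print(n_robot_in_batch, len(batch) - n_robot_in_batch)  # 3 9
```

Fixing the ratio per batch, rather than pooling the two datasets, keeps the smaller robot dataset from being swamped by the larger human one. Whether DexWild balances the data this way is not stated in the text above.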

If this is right

  • Co-trained policies reach a 68.5 percent success rate in environments never seen during training, nearly four times the rate of policies trained on robot data alone.
  • Cross-embodiment generalization improves by a factor of roughly 5.8 over robot-only training.
  • Effective policies for new tasks and settings require only minimal extra robot data collection.
  • Data gathering for dexterous skills scales by relying on ordinary human hand use rather than full teleoperation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar co-training could apply to other robot skills such as navigation or tool use where human examples are easy to gather.
  • Larger collections of human data might further shrink the amount of robot data needed for acceptable performance.
  • Testing the same co-training recipe across a wider variety of robot arms and grippers would check how far the embodiment bridging extends.
  • Pairing this human data source with simulation could create training pipelines that need even less real-world robot time.

Load-bearing premise

Human hand demonstrations recorded by the DexWild-System can be effectively co-trained with robot demonstrations to bridge the embodiment gap and produce superior generalization without requiring substantial new robot data in target environments.

What would settle it

If a policy trained only on robot demonstrations achieves success rates in unseen environments comparable to or higher than the co-trained policy, or shows no meaningful gain in cross-embodiment transfer, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2505.07813 by Deepak Pathak, Jason Jingzhou Liu, Kenneth Shaw, Mohan Kumar Srirama, Tony Tao.

Figure 1: DexWild enables dexterous policies to generalize to new objects, scenes, and embodiments. This is achieved by leveraging large-scale, real-world human embodiment data collected in many scenes and co-trained with a smaller robot embodiment dataset for grounding.
Figure 2: Left: DexWild efficiently captures high-fidelity data using an individual's own hands across various environments. Right: Robot hands are equipped with cameras aligned with the human cameras. We test DexWild on two distinct robot hands and robot arms.
    … Instead, DexWild, along with prior works [41, 55], adopts a lightweight glove-based solution that uses electromagnetic field (EMF) sensing to estimate fi…
Figure 3: DexWild aligns the visual observations between humans and robots to bridge the embodiment gap. This incentivizes the model to learn a task-centric rather than embodiment-centric representation.
    … we use the data collected by DexWild to enable dexterous policies to generalize to in-the-wild scenarios. B. Training Data Modalities and Preprocessing. Generalization in dexterous manipulation demands both scale and…
Figure 4: Using DexWild-System, humans can effortlessly collect accurate data with their own hands across a wide range of environments. This data is directly used to train any robot hand to perform dexterous manipulation in a human-like way in any environment. We validate this approach on five representative tasks. Please see videos of these tasks on our website at https://dexwild.github.io
    For bimanual tasks, the o…
Figure 5: We collect data using a diverse set of objects across categories. Spray Bottle Task: 25 Train, 11 Test; Toy Cleanup Task: 64 Train, 9 Test; Pour Task: 35 Train, 5 Test; Florist Task: 6 Train, 2 Test; Clothes Folding Task: 17 Train, 6 Test.
    A. Scaling up Data Collection. Our hardware system was deployed to 10 untrained users to collect data across a wide range of real-world environments. These settings …
Figure 6: How does co-training help with scaling up in-the-wild performance? We evaluate our policy across three scenarios: (a) In-Domain scenes where robot training data was collected but with novel objects, (b) In-the-Wild scenes present in DexWild but not in robot data, and (c) In-the-Wild Extreme scenes absent from both datasets. Displayed ratio is Robot:Human.
    … in complex scenes. However, without robot-specific …
Figure 7: Left (Cross-Task Performance): Evaluating DexWild on the pour task using robot data exclusively from the spray task. Middle (Cross-Embodiment Performance): Testing DexWild policy on the Original LEAP hand and a Franka robot arm. Right (Scaling Performance): Demonstrating improved DexWild performance as dataset size increases. Displayed ratio is Robot:Human.
    … in the 25–50% range, suggesting a critical thresh…
Figure 8: … DexWild-System achieves an average collection rate of 201 demos/hour across five representative tasks, nearly matching the rate of demonstrations collected using bare hands and 4.6× faster than a traditional robot teleoperation system based on Gello [41, 56], which achieves just 43 demos/hour. We identify three key limitations of Gello-based collection that our system overcomes: 1) Lack of haptic feedback…
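The collection-rate comparison quoted for Figure 8 can be checked with a few lines of arithmetic. The sketch below uses only the two rates given in the text (201 and 43 demos/hour); the variable names are illustrative. The ratio comes out at about 4.67, which agrees with the quoted 4.6× if that figure was truncated rather than rounded.

```python
# Collection rates quoted in the text, in demonstrations per hour.
dexwild_rate = 201
gello_rate = 43

# Relative speedup and the implied average time per demonstration.
speedup = dexwild_rate / gello_rate
seconds_per_demo_dexwild = 3600 / dexwild_rate
seconds_per_demo_gello = 3600 / gello_rate

print(f"speedup: {speedup:.2f}x")                                        # speedup: 4.67x
print(f"DexWild-System: {seconds_per_demo_dexwild:.1f} s per demo")      # 17.9 s per demo
print(f"Gello teleoperation: {seconds_per_demo_gello:.1f} s per demo")   # 83.7 s per demo
```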
Figure 9: DexWild-System features a simple and easy-to-use interface for deployment by untrained data collectors.
    …
      • 0.75: Grasp the bouquet, handover
      • 1.00: Grasp the bouquet, handover, insert into vase
    Clothes Folding. This task tests manipulation of deformable objects using both hands. The robot must fold a clothing item placed on a surface.
      • 0.00: Nothing
      • 0.25: Tries grasp but fails
      • 0.50: Grasp with one hand…
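The rubric fragments above suggest that each trial is graded with partial credit rather than as a binary pass or fail. A minimal sketch of how such grades could be averaged into a task score follows. The quarter-step levels, the helper name `average_rubric_score`, and the example grades are assumptions for illustration; they are not data or code from the paper.

```python
from statistics import mean

# Assumed partial-credit levels, following the 0.00 to 1.00 steps visible in the rubric.
RUBRIC_LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)


def average_rubric_score(trial_scores):
    """Return the mean partial-credit score over a list of graded trials."""
    if not trial_scores:
        raise ValueError("need at least one graded trial")
    for score in trial_scores:
        if score not in RUBRIC_LEVELS:
            raise ValueError(f"{score} is not a rubric level")
    return mean(trial_scores)


# Hypothetical grades for four trials of one task (not results from the paper).
example_scores = [1.0, 0.75, 0.5, 1.0]
example_average = average_rubric_score(example_scores)
print(example_average)  # 0.8125
```

If the headline success rates are averages of this kind, they should be read as mean rubric scores rather than as the fraction of fully completed trials.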
Original abstract

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments (nearly four times higher than policies trained with robot data only) and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions at https://dexwild.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DexWild, a framework for collecting diverse in-the-wild human hand interaction data via a low-cost mobile device (DexWild-System) and co-training dexterous manipulation policies on both human and robot demonstrations. The central claim is that this yields robust policies that generalize to novel environments, tasks, and embodiments using minimal additional robot-specific data, evidenced by a reported 68.5% success rate in unseen environments (nearly 4x higher than robot-data-only training) and 5.8x better cross-embodiment generalization.

Significance. If the empirical gains hold under rigorous scrutiny, the work could meaningfully advance scalable data collection for dexterous robotics by leveraging everyday human interactions rather than expensive teleoperation. The approach builds on embodiment-bridging ideas and reports substantial quantitative improvements, which, if reproducible, would support the broader goal of generalizable robot policies with reduced robot data requirements.

major comments (2)
  1. §4 (Learning Framework): The abstract and methods description state that the framework 'co-trains on both human and robot demonstrations' but provide no explicit mechanism (e.g., retargeting function, shared latent space, or contact-aware loss) for aligning differing kinematics, contact geometry, and observation modalities. This alignment step is load-bearing for the 'minimal additional robot-specific data' and 5.8x cross-embodiment claims; without it, gains may arise from spurious correlations rather than transferable features.
  2. §5 (Experiments): The headline metrics (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented without reported trial counts, standard deviations, statistical tests, or full baseline details (including whether baselines also incorporate any form of mixed data). This omission prevents assessment of whether improvements are robustly attributable to the co-training regime rather than implementation specifics or data selection.
minor comments (2)
  1. [Abstract] The abstract would benefit from one sentence summarizing the specific tasks, objects, and environments used in the quantitative evaluations.
  2. [Figures] Ensure all figures include clear captions describing axes, error bars, and what each curve represents (human-only, robot-only, co-trained).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

  1. Referee: §4 (Learning Framework): The abstract and methods description state that the framework 'co-trains on both human and robot demonstrations' but provide no explicit mechanism (e.g., retargeting function, shared latent space, or contact-aware loss) for aligning differing kinematics, contact geometry, and observation modalities. This alignment step is load-bearing for the 'minimal additional robot-specific data' and 5.8x cross-embodiment claims; without it, gains may arise from spurious correlations rather than transferable features.

    Authors: We appreciate the referee drawing attention to this critical detail. The current manuscript describes the co-training objective at a high level in Section 4 but does not provide an explicit account of the alignment procedure between human and robot data. We agree that specifying the mechanism (such as any retargeting, shared representation, or modality-specific losses) is necessary to support the generalization claims. In the revised manuscript we will expand Section 4 with a precise description of the alignment steps used, including the retargeting function and any shared latent components, so that readers can evaluate how transferable features are learned rather than spurious correlations. revision: yes

  2. Referee: §5 (Experiments): The headline metrics (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented without reported trial counts, standard deviations, statistical tests, or full baseline details (including whether baselines also incorporate any form of mixed data). This omission prevents assessment of whether improvements are robustly attributable to the co-training regime rather than implementation specifics or data selection.

    Authors: We acknowledge that the experimental reporting in Section 5 is incomplete with respect to statistical rigor. The manuscript currently states aggregate success rates without accompanying trial counts, variance measures, or significance testing, and baseline configurations are not fully specified. In the revision we will add the number of trials per condition, standard deviations across random seeds, and statistical tests comparing DexWild to the robot-only baseline. We will also clarify that the primary baselines are trained exclusively on robot data (with an additional mixed-data ablation if space permits) to isolate the contribution of the co-training regime. revision: yes
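As an illustration of the kind of reporting discussed in this exchange, the sketch below computes a Wilson score interval for a success rate from a trial count. The counts used (14 successes in 20 trials) are hypothetical, since the text above reports no trial counts, and the function name `wilson_interval` is an assumption. The interval also assumes binary success or failure outcomes, which may not hold if trials are graded with partial credit.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials
                               + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 14 successes out of 20 trials (not data from the paper).
low, high = wilson_interval(14, 20)
print(f"observed 0.70, 95% interval [{low:.3f}, {high:.3f}]")
```

With only 20 trials the interval spans roughly 0.48 to 0.85, which shows why the referee asks for trial counts: the same headline rate can carry very different uncertainty depending on how many trials stand behind it.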

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on experimental comparisons

The paper advances an empirical system for collecting in-the-wild human hand demonstrations via DexWild-System and co-training policies on combined human+robot data. All headline results (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented as outcomes of controlled training-regime ablations rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-definitional quantities, uniqueness theorems, or ansatzes appear in the provided text; the central claim is externally falsifiable via the reported success rates on held-out environments and embodiments. This is the normal, non-circular case for a data-driven robotics paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that human hand data transfers usefully to robot control via co-training; a new data collection device is introduced without independent evidence beyond the paper's claims. No explicit free parameters are detailed in the abstract.

free parameters (1)
  • co-training balance hyperparameters
    Parameters likely used to weight human versus robot data contributions during training.
axioms (1)
  • domain assumption: Human hand interactions recorded in-the-wild provide demonstrations that transfer effectively to robot embodiments when co-trained.
    This premise underpins the claim that minimal additional robot data suffices for generalization.
invented entities (1)
  • DexWild-System (no independent evidence)
    purpose: Low-cost mobile device for recording human hand interactions
    New hardware introduced to enable scalable data collection.

pith-pipeline@v0.9.0 · 5771 in / 1463 out tokens · 90007 ms · 2026-05-22T15:16:54.045647+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

    cs.RO 2026-04 unverdicted novelty 7.0

    HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

  2. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  3. X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

    cs.RO 2025-11 unverdicted novelty 6.0

    X-Diffusion adapts Ambient Diffusion to selectively train on noised human actions for cross-embodiment robot policies, yielding 16% higher average success rates than naive co-training or manual filtering across five r...

  4. OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

    cs.RO 2026-04 unverdicted novelty 5.0

    OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 4 Pith papers · 8 internal anchors

  1. [1]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. 2023. 2

  2. [2]

    Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. arXiv preprint arXiv:2411.19167,

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 1

  4. [4]

    Dexycb: A benchmark for capturing hand grasping of objects

    Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, et al. Dexycb: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9044–9053, 2021. 3

  5. [5]

    Open-television: Teleoperation with immersive active visual feedback

    Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024. 2

  6. [6]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023. 5, 14

  7. [7]

    Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. 2, 4

  8. [8]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...

  9. [9]

    Hybrid tracking of human operators using imu/uwb data fusion by a kalman filter

    Juan Antonio Corrales, Francisco A Candelas, and Fernando Torres. Hybrid tracking of human operators using imu/uwb data fusion by a kalman filter. In Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, pages 193–200, 2008. 2

  10. [10]

    Epic-kitchens: A large-scale dataset for recognizing, anticipating, and retrieving hand-object interactions

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Epic-kitchens: A large-scale dataset for recognizing, anticipating, and retrieving hand-object interactions. In Proceedings of the European Conference on Computer Vision (ECC...

  11. [11]

    An unbiased look at datasets for visuomotor pre-training

    Sudeep Dasari, Mohan Kumar Srirama, Unnat Jain, and Abhinav Gupta. An unbiased look at datasets for visuomotor pre-training. In Conference on Robot Learning. PMLR, 2023. 2, 3, 5, 14

  12. [12]

    Robot utility models: General policies for zero-shot deployment in new environments

    Haritheja Etukuru, Norihito Naka, Zijin Hu, Seungjae Lee, Julian Mehu, Aaron Edsinger, Chris Paxton, Soumith Chintala, Lerrel Pinto, and Nur Muhammad Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments, 2024. 2

  13. [13]

    Arctic: A dataset for dexterous bimanual hand-object manipulation

    Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12954, 2023. 3

  14. [14]

    Open-television: An open-source immersive teleoperation system with stereo visual feedback

    Authors from UC San Diego and MIT. Open-television: An open-source immersive teleoperation system with stereo visual feedback. The Robot Report, 2024. 4

  15. [15]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Michael Ryoo, Aljoša Smolić, Minh Vo, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11743–11753, 2022. 1, 2

  16. [16]

    UMI on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers

    Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. UMI on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. In Proceedings of the 2024 Conference on Robot Learning,

  17. [17]

    Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system

    Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020. 4

  18. [18]

    Toward general-purpose robots via foundation models: A survey and meta-analysis

    Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Zhibo Zhao, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023. 2

  19. [19]

    OPEN TEACH: A versatile teleoperation system for robotic manipulation

    Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. OPEN TEACH: A versatile teleoperation system for robotic manipulation. arXiv preprint arXiv:2403.07870,

  20. [20]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. 2, 7

  21. [21]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 1, 2, 7

  22. [22]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 2

  23. [23]

    Data scaling laws in imitation learning for robotic manipulation

    Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. In Conference on Robot Learning (CoRL), 2024. 2, 5

  24. [24]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1

  25. [25]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298,

  26. [26]

    Orbit: A unified simulation framework for interactive robot learning environments

    Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, June 202...

  27. [27]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, and Chelsea Finn. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601,

  28. [28]

    Hamer: Hand mesh recovery for the egoexo4d hand pose challenge

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Hamer: Hand mesh recovery for the egoexo4d hand pose challenge. 2

  29. [29]

    Reconstructing hands in 3d with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  30. [30]

    Comparative abilities of microsoft kinect and vicon 3d motion capture for gait analysis

    Alexandra Pfister, Alexandre M West, Shaw Bronner, and Jack Adam Noah. Comparative abilities of microsoft kinect and vicon 3d motion capture for gait analysis. Journal of medical engineering & technology, 38(5):274–280, 2014. 2

  31. [31]

    Alvinn: An autonomous land vehicle in a neural network

    Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988. 5

  32. [32]

    Humanoid policy ~ human policy

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025. 2, 4, 7

  33. [33]

    Real-world robot learning with masked visual pre-training

    Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. CoRL,

  34. [34]

    Riemannian Motion Policies

    Nathan D. Ratliff, Jan Issac, Daniel Kappler, Stan Birchfield, and Dieter Fox. Riemannian motion policies, 2018. URL https://arxiv.org/abs/1801.02854. 14

  35. [35]

    Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration

    Yu Rong, Takaaki Shiratori, and Hanbyul Joo. Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1759, 2021. 2

  36. [36]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011. 5

  37. [37]

    Is imitation learning the route to humanoid robots?

    Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999. 5

  38. [38]

    Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning

    Kenneth Shaw, Ananye Agarwal, and Deepak Pathak. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning. Robotics: Science and Systems (RSS),

  39. [39]

    Videodex: Learning dexterity from internet videos

    Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 654–665. PMLR, 14–18 Dec 2023. 2

  40. [40]

    Videodex: Learning dexterity from internet videos

    Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, pages 654–665. PMLR,

  41. [41]

    Bimanual dexterity for complex tasks

    Kenneth Shaw, Yulong Li, Jiahui Yang, Mohan Kumar Srirama, Ray Liu, Haoyu Xiong, Russell Mendonca, and Deepak Pathak. Bimanual dexterity for complex tasks. In 8th Annual Conference on Robot Learning, 2024. 2, 3, 4, 6, 8, 14

  42. [42]

    Hand-object interaction pretraining from videos

    Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, and Jitendra Malik. Hand-object interaction pretraining from videos, 2024. URL https://arxiv.org/abs/2409.08273. 2

  43. [43]

    Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands

    Ritvik Singh, Arthur Allshire, Ankur Handa, Nathan Ratliff, and Karl Van Wyk. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands. arXiv preprint arXiv:2412.01791, 2024. 2

  44. [44]

    Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube

    Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube, 2022. 4

  45. [45]

    Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube

    Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. RSS, 2022. 2

  46. [46]

    Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations

    Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020. 2

  47. [47]

    HRP: Human affordances for robotic pre-training

    Mohan Kumar Srirama, Sudeep Dasari, Shikhar Bahl, and Abhinav Gupta. HRP: Human affordances for robotic pre-training. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024. 2, 5

  48. [48]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, and Xiaohua Zhai. PaliGemma 2: A Family of Versatile VLMs for...

  49. [49]

    Octo: An open-source generalist robot policy

    OM Team, D Ghosh, H Walke, K Pertsch, K Black, O Mees, S Dasari, J Hejna, C Xu, J Luo, et al. Octo: An open-source generalist robot policy. Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2023. 2, 3, 7

  50. [50]

    Upper limb motion tracking with the integration of imu and kinect

    Yushuang Tian, Xiaoli Meng, Dapeng Tao, Dongquan Liu, and Chen Feng. Upper limb motion tracking with the integration of imu and kinect. Neurocomputing, 159:207–218, 2015. 2

  51. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

  52. [52]

    https://store.steampowered.com/steamvr

    Valve Corporation. https://store.steampowered.com/steamvr. [Virtual reality platform]. 3

  53. [53]

    Attention is all you need

    A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017. 1

  54. [54]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023. 1, 2

  55. [55]

    Dexcap: Scalable and portable mocap data collection system for dexterous manipulation

    Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. In Robotics: Science and Systems (RSS), 2024. 2, 3

  56. [56]

    Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators

    Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2023. 2, 8

  57. [57]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758,

  58. [58]

    Doglove: Dexterous manipulation with a low-cost open-source haptic force feedback glove

    Han Zhang, Songbo Hu, Zhecheng Yuan, and Huazhe Xu. Doglove: Dexterous manipulation with a low-cost open-source haptic force feedback glove. arXiv preprint arXiv:2502.07730, 2025. 2

  59. [59]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023. 2, 5, 14

  60. [60]

    Freihand: A dataset for markerless capture of hand pose and shape from single rgb images

    Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 813–822, 2019. 2, 3

    VII. APPENDIX

    Videos of our results, code to recreate our system, and hard...

  61. [61]

    The network outputs actions that consist of relative end-effector actions and absolute hand joint angles

    and Diffusion U-Net [6] as policy classes, which output a sequence of actions. The network outputs actions that consist of relative end-effector actions and absolute hand joint angles. We list the hyperparameters that we used for policy training with behavior cloning in Table V.

    E. Low Level Motion Control

    For optimal smoothness of our policies a...