DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

Deepak Pathak; Jason Jingzhou Liu; Kenneth Shaw; Mohan Kumar Srirama; Tony Tao

arXiv: 2505.07813 · v2 · pith:XK2EKSW5new · submitted 2025-05-12 · 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.SY · eess.SY

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

Pith reviewed 2026-05-22 15:16 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG · cs.SY · eess.SY

keywords dexterous manipulation · human demonstrations · in-the-wild data · robot policy learning · generalization · embodiment gap · co-training

The pith

Co-training robot policies on human hand demonstrations from diverse real-world settings and limited robot data yields policies that generalize to new environments and robot bodies far better than robot data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that people can collect useful dexterous manipulation data simply by using their own hands in everyday environments, recorded with a low-cost mobile device. The method jointly trains policies on these human demonstrations plus a smaller set of robot demonstrations. If the approach holds, it offers a scalable alternative to expensive robot teleoperation while producing policies that succeed more often in unfamiliar places, on new tasks, and even on different robot hardware. A sympathetic reader would care because it points toward using abundant natural human behavior to make robot learning practical without massive new robot datasets in every target setting.

Core claim

In DexWild, a diverse team collects hours of human hand interactions across many environments and objects using the low-cost DexWild-System device. The learning framework then co-trains on this human data together with robot demonstrations, producing robot policies that generalize to novel environments, tasks, and embodiments with minimal additional robot-specific data. The reported results are a 68.5 percent success rate in unseen environments (nearly four times that of robot-only training) and 5.8 times better cross-embodiment generalization than robot-only training.

What carries the argument

The DexWild co-training framework that combines human hand demonstrations recorded in-the-wild with robot demonstrations to bridge the embodiment gap.

If this is right

Robot policies reach nearly four times higher success rates in environments never seen during training than policies trained on robot data alone.
Cross-embodiment generalization improves by a factor of roughly 5.8 over robot-only training.
Effective policies for new tasks and settings require only minimal extra robot data collection.
Data gathering for dexterous skills scales by relying on ordinary human hand use rather than full teleoperation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar co-training could apply to other robot skills such as navigation or tool use where human examples are easy to gather.
Larger collections of human data might further shrink the amount of robot data needed for acceptable performance.
Testing the same co-training recipe across a wider variety of robot arms and grippers would check how far the embodiment bridging extends.
Pairing this human data source with simulation could create training pipelines that need even less real-world robot time.

Load-bearing premise

Human hand demonstrations recorded by the DexWild-System can be effectively co-trained with robot demonstrations to bridge the embodiment gap and produce superior generalization without requiring substantial new robot data in target environments.

What would settle it

If a policy trained only on robot demonstrations achieves success rates in unseen environments comparable to or higher than the co-trained policy, or shows no meaningful gain in cross-embodiment transfer, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2505.07813 by Deepak Pathak, Jason Jingzhou Liu, Kenneth Shaw, Mohan Kumar Srirama, Tony Tao.

**Figure 1.** DexWild enables dexterous policies to generalize to new objects, scenes, and embodiments. This is achieved by leveraging large-scale, real-world human embodiment data collected in many scenes and co-trained with a smaller robot embodiment dataset for grounding.

**Figure 2.** Left: DexWild efficiently captures high-fidelity data using an individual's own hands across various environments. Right: Robot hands are equipped with cameras aligned with the human cameras. We test DexWild on two distinct robot hands and robot arms.

**Figure 3.** DexWild aligns the visual observations between humans and robots to bridge the embodiment gap. This incentivizes the model to learn a task-centric rather than embodiment-centric representation.

**Figure 4.** Using DexWild-System, humans can effortlessly collect accurate data with their own hands across a wide range of environments. This data is directly used to train any robot hand to perform dexterous manipulation in a human-like way in any environment. We validate this approach on five representative tasks. Please see videos of these tasks on our website at https://dexwild.github.io

**Figure 5.** We collect data using a diverse set of objects across categories. Spray Bottle Task: 25 Train, 11 Test; Toy Cleanup Task: 64 Train, 9 Test; Pour Task: 35 Train, 5 Test; Florist Task: 6 Train, 2 Test; Clothes Folding Task: 17 Train, 6 Test.

**Figure 6.** How does co-training help with scaling up in-the-wild performance? We evaluate our policy across three scenarios: (a) In-Domain scenes where robot training data was collected but with novel objects, (b) In-the-Wild scenes present in DexWild but not in robot data, and (c) In-the-Wild Extreme scenes absent from both datasets. Displayed ratio is Robot:Human.

**Figure 7.** Left: Cross-Task Performance – Evaluating DexWild on the pour task using robot data exclusively from the spray task. Middle: Cross-Embodiment Performance – Testing DexWild policy on the Original LEAP hand and a Franka robot arm. Right: Scaling Performance – Demonstrating improved DexWild performance as dataset size increases. Displayed ratio is Robot:Human.

**Figure 8.** DexWild-System achieves an average collection rate of 201 demos/hour across five representative tasks, nearly matching the rate of demonstrations collected using bare hands and 4.6× faster than a traditional robot teleoperation system based on Gello [41, 56], which achieves just 43 demos/hour. We identify three key limitations of Gello-based collection that our system overcomes: 1) Lack of haptic feedback…

**Figure 9.** DexWild-System features a simple and easy-to-use interface for deployment by untrained data collectors.

Original abstract

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments (nearly four times higher than policies trained with robot data only) and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Mixing casual human hand data collected in the wild with robot demos via a cheap mobile device gives clear generalization lifts for dexterous policies, though the exact alignment between hand and robot spaces needs more detail to trust the gains fully.

The key point from this paper is that pairing human hand data collected casually in many real places with standard robot demos can lift dexterous policy performance in new settings by a good margin. They introduce a simple device called DexWild-System that lets people record their own hand actions without special setups. A team of collectors used it to gather diverse interactions. The training then mixes these human trajectories with robot ones, and the results show policies that do better on unseen environments and transfer more easily to different robot bodies.

This stands out because it tackles the data scarcity problem head-on with something cheaper than teleoperation. The numbers they give, like 68.5 percent success where robot-only gets much less, and 5.8 times better cross-embodiment generalization, suggest the mix adds useful variety. What works here is the focus on in-the-wild collection. It feels like a step toward making data gathering more like how humans actually interact with objects every day. If the co-training really lets the policy pick up general features, that could help with the embodiment gap without needing lots of new robot trials in every target spot.

The softer part is how they handle turning human hand movements into something a robot can use. The abstract talks about co-training on both but does not spell out the retargeting or any special losses for the different kinematics and sensors. If that step is not careful, the improvements might come from the policy fitting to mixed noise instead of learning robust behaviors. The concern about spurious correlations is worth checking in the methods.

Overall, this paper targets researchers building imitation learning systems for hands-on tasks. Anyone thinking about scaling robot data collection would find the practical device and the generalization claims relevant. I would recommend sending it to peer review. The idea has clear potential, and the experiments provide a starting point for discussion even if more details on the alignment would strengthen it.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DexWild, a framework for collecting diverse in-the-wild human hand interaction data via a low-cost mobile device (DexWild-System) and co-training dexterous manipulation policies on both human and robot demonstrations. The central claim is that this yields robust policies that generalize to novel environments, tasks, and embodiments using minimal additional robot-specific data, evidenced by a reported 68.5% success rate in unseen environments (nearly 4x higher than robot-data-only training) and 5.8x better cross-embodiment generalization.

Significance. If the empirical gains hold under rigorous scrutiny, the work could meaningfully advance scalable data collection for dexterous robotics by leveraging everyday human interactions rather than expensive teleoperation. The approach builds on embodiment-bridging ideas and reports substantial quantitative improvements, which—if reproducible—would support the broader goal of generalizable robot policies with reduced robot data requirements.

major comments (2)

[§4] §4 (Learning Framework): The abstract and methods description state that the framework 'co-trains on both human and robot demonstrations' but provide no explicit mechanism (e.g., retargeting function, shared latent space, or contact-aware loss) for aligning differing kinematics, contact geometry, and observation modalities. This alignment step is load-bearing for the 'minimal additional robot-specific data' and 5.8x cross-embodiment claims; without it, gains may arise from spurious correlations rather than transferable features.
[§5] §5 (Experiments): The headline metrics (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented without reported trial counts, standard deviations, statistical tests, or full baseline details (including whether baselines also incorporate any form of mixed data). This omission prevents assessment of whether improvements are robustly attributable to the co-training regime rather than implementation specifics or data selection.
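
The reporting gap raised in the second major comment can be made concrete with a small sketch. Assuming each evaluation rollout is scored as a binary success or failure (an assumption; the paper may score partial progress), a Wilson score interval shows how much uncertainty a success rate carries at a given trial count. The function name `wilson_interval` and the counts below are hypothetical and are not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    # Clamp to [0, 1] to absorb floating-point round-off at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts for illustration only; NOT results from the paper.
low, high = wilson_interval(successes=14, trials=20)
print(f"success rate {14 / 20:.1%}, approx. 95% interval [{low:.1%}, {high:.1%}]")
```

With around twenty rollouts per condition the interval spans tens of percentage points, which is why trial counts matter when comparing headline success rates.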

minor comments (2)

[Abstract] The abstract would benefit from one sentence summarizing the specific tasks, objects, and environments used in the quantitative evaluations.
[Figures] Ensure all figures include clear captions describing axes, error bars, and what each curve represents (human-only, robot-only, co-trained).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

Referee: [§4] §4 (Learning Framework): The abstract and methods description state that the framework 'co-trains on both human and robot demonstrations' but provide no explicit mechanism (e.g., retargeting function, shared latent space, or contact-aware loss) for aligning differing kinematics, contact geometry, and observation modalities. This alignment step is load-bearing for the 'minimal additional robot-specific data' and 5.8x cross-embodiment claims; without it, gains may arise from spurious correlations rather than transferable features.

Authors: We appreciate the referee drawing attention to this critical detail. The current manuscript describes the co-training objective at a high level in Section 4 but does not provide an explicit account of the alignment procedure between human and robot data. We agree that specifying the mechanism (such as any retargeting, shared representation, or modality-specific losses) is necessary to support the generalization claims. In the revised manuscript we will expand Section 4 with a precise description of the alignment steps used, including the retargeting function and any shared latent components, so that readers can evaluate whether the policy learns transferable features rather than spurious correlations.
revision: yes

Referee: [§5] §5 (Experiments): The headline metrics (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented without reported trial counts, standard deviations, statistical tests, or full baseline details (including whether baselines also incorporate any form of mixed data). This omission prevents assessment of whether improvements are robustly attributable to the co-training regime rather than implementation specifics or data selection.

Authors: We acknowledge that the experimental reporting in Section 5 is incomplete with respect to statistical rigor. The manuscript currently states aggregate success rates without accompanying trial counts, variance measures, or significance testing, and baseline configurations are not fully specified. In the revision we will add the number of trials per condition, standard deviations across random seeds, and statistical tests comparing DexWild to the robot-only baseline. We will also clarify that the primary baselines are trained exclusively on robot data (with an additional mixed-data ablation if space permits) to isolate the contribution of the co-training regime.
revision: yes
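
For the statistical comparison promised in this response, one option for small binary-outcome evaluations is a one-sided Fisher exact test. The sketch below is an illustration under that assumption, not the test the authors say they will use; `fisher_one_sided` and the example counts are hypothetical.

```python
from math import comb


def fisher_one_sided(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """P-value for the alternative that condition A succeeds more often than B.

    With the table margins fixed, the number of successes in A follows a
    hypergeometric distribution under the null of equal success rates.
    """
    if not (0 <= succ_a <= n_a and 0 <= succ_b <= n_b):
        raise ValueError("success counts must lie within their trial counts")
    total_succ = succ_a + succ_b
    denom = comb(n_a + n_b, total_succ)
    p_value = 0.0
    # Sum the probability of every table at least as favourable to A as observed.
    for k in range(succ_a, min(n_a, total_succ) + 1):
        p_value += comb(n_a, k) * comb(n_b, total_succ - k) / denom
    return p_value


# Hypothetical counts for illustration only; NOT results from the paper.
p = fisher_one_sided(succ_a=14, n_a=20, succ_b=4, n_b=20)
print(f"one-sided p-value: {p:.4f}")
```

An exact test is used here because it makes no large-sample approximation, which suits evaluations with a few dozen rollouts per condition.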

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on experimental comparisons

The paper advances an empirical system for collecting in-the-wild human hand demonstrations via DexWild-System and co-training policies on combined human+robot data. All headline results (68.5% success in unseen environments, 5.8x cross-embodiment gain) are presented as outcomes of controlled training-regime ablations rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-definitional quantities, uniqueness theorems, or ansatzes appear in the provided text; the central claim is externally falsifiable via the reported success rates on held-out environments and embodiments. This is the normal, non-circular case for a data-driven robotics paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that human hand data transfers usefully to robot control via co-training; a new data collection device is introduced without independent evidence beyond the paper's claims. No explicit free parameters are detailed in the abstract.

free parameters (1)

co-training balance hyperparameters
Parameters likely used to weight human versus robot data contributions during training.
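
One way such a balance hyperparameter could enter training is through the composition of each batch. The sketch below is an assumption for illustration, not the paper's data loader: it fixes a Robot:Human ratio per batch and samples demonstrations with replacement from each source. The function name, the string placeholders standing in for demonstrations, and the 1:2 ratio are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_cotraining_batch(
    robot_demos: Sequence[str],
    human_demos: Sequence[str],
    batch_size: int,
    robot_to_human: Tuple[int, int] = (1, 2),  # placeholder ratio, not from the paper
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Return (source, demo) pairs mixed at a fixed Robot:Human ratio."""
    if rng is None:
        rng = random.Random()
    r, h = robot_to_human
    if r < 0 or h < 0 or r + h == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    n_robot = round(batch_size * r / (r + h))
    n_human = batch_size - n_robot
    if (n_robot and not robot_demos) or (n_human and not human_demos):
        raise ValueError("a source with a nonzero share must be non-empty")
    # Sample with replacement so a small robot set can still fill its share.
    batch = [("robot", rng.choice(robot_demos)) for _ in range(n_robot)]
    batch += [("human", rng.choice(human_demos)) for _ in range(n_human)]
    rng.shuffle(batch)
    return batch


# Hypothetical demonstration identifiers.
batch = sample_cotraining_batch(
    ["robot_000", "robot_001"],
    ["human_000", "human_001", "human_002"],
    batch_size=12,
    rng=random.Random(0),
)
print(sum(1 for source, _ in batch if source == "robot"), "robot samples of", len(batch))
```

Fixing the ratio per batch, rather than concatenating the two datasets, keeps the smaller robot set from being swamped when the human dataset is much larger.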

axioms (1)

domain assumption: Human hand interactions recorded in-the-wild provide demonstrations that transfer effectively to robot embodiments when co-trained.
This premise underpins the claim that minimal additional robot data suffices for generalization.

invented entities (1)

DexWild-System (no independent evidence)
purpose: Low-cost mobile device for recording human hand interactions
New hardware introduced to enable scalable data collection.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: The DexWild learning framework co-trains on both human and robot demonstrations... achieving a 68.5% success rate in unseen environments

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Paper passage: action space alignment... optimizing robot hand kinematics to match the fingertip positions
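
The quoted passage describes retargeting as an optimization that matches fingertip positions. As a toy illustration of that idea (not the paper's implementation), the sketch below fits the two joint angles of a hypothetical planar two-link finger to a target fingertip position by gradient descent on the squared position error. Link lengths, step size, iteration count, and all function names are assumptions chosen for the example.

```python
import math
from typing import Tuple

# Hypothetical link lengths in arbitrary units; not a real robot hand model.
LINK_1, LINK_2 = 1.0, 0.8


def fingertip(q1: float, q2: float) -> Tuple[float, float]:
    """Forward kinematics of a planar two-link finger."""
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y


def retarget(
    target: Tuple[float, float],
    q_init: Tuple[float, float] = (0.1, 0.1),
    lr: float = 0.1,
    steps: int = 5000,
) -> Tuple[float, float]:
    """Gradient descent on 0.5 * ||fingertip(q) - target||^2."""
    q1, q2 = q_init
    tx, ty = target
    for _ in range(steps):
        x, y = fingertip(q1, q2)
        ex, ey = x - tx, y - ty
        # Analytic Jacobian of the fingertip position w.r.t. each joint angle.
        dx_dq1, dy_dq1 = -y, x
        dx_dq2 = -LINK_2 * math.sin(q1 + q2)
        dy_dq2 = LINK_2 * math.cos(q1 + q2)
        q1 -= lr * (ex * dx_dq1 + ey * dy_dq1)
        q2 -= lr * (ex * dx_dq2 + ey * dy_dq2)
    return q1, q2


# Stand-in for a measured human fingertip position (generated, so it is reachable).
target = fingertip(0.6, 0.9)
q_fit = retarget(target)
x_fit, y_fit = fingertip(*q_fit)
err = math.hypot(x_fit - target[0], y_fit - target[1])
print(f"residual fingertip error: {err:.2e}")
```

A real hand has many more joints, joint limits, and several fingertips to match at once, so a practical retargeter would typically use a constrained solver; the sketch only shows the shape of the objective.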

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps
cs.RO · 2026-04 · unverdicted · novelty 7.0

HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO · 2026-04 · unverdicted · novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
cs.RO · 2025-11 · unverdicted · novelty 6.0

X-Diffusion adapts Ambient Diffusion to selectively train on noised human actions for cross-embodiment robot policies, yielding 16% higher average success rates than naive co-training or manual filtering across five r...

OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction
cs.RO · 2026-04 · unverdicted · novelty 5.0

OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 4 Pith papers · 8 internal anchors

[1] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. 2023.
[2] Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. arXiv preprint arXiv:2411.19167.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, et al. Dexycb: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9044–9053, 2021.
[5] Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024.
[6] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[7] Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024.
[8] Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, ...
[9] Juan Antonio Corrales, Francisco A Candelas, and Fernando Torres. Hybrid tracking of human operators using imu/uwb data fusion by a kalman filter. In Proceedings of the 3rd ACM/IEEE international conference on Human robot interaction, pages 193–200, 2008.
[10] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Epic-kitchens: A large-scale dataset for recognizing, anticipating, and retrieving hand-object interactions. In Proceedings of the European Conference on Computer Vision (ECC...
[11] Sudeep Dasari, Mohan Kumar Srirama, Unnat Jain, and Abhinav Gupta. An unbiased look at datasets for visuo-motor pre-training. In Conference on Robot Learning. PMLR, 2023.
[12] Haritheja Etukuru, Norihito Naka, Zijin Hu, Seungjae Lee, Julian Mehu, Aaron Edsinger, Chris Paxton, Soumith Chintala, Lerrel Pinto, and Nur Muhammad Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments, 2024.
[13] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12954, 2023.
[14] Authors from UC San Diego and MIT. Open-television: An open-source immersive teleoperation system with stereo visual feedback. The Robot Report, 2024.
[15] Kristen Grauman, Michael Ryoo, Aljoša Smolić, Minh Vo, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11743–11753, 2022.
[16] Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. UMI on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. In Proceedings of the 2024 Conference on Robot Learning.
[17] Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020.
[18] Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Zhibo Zhao, et al. Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023.
[19] Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. OPEN TEACH: A versatile teleoperation system for robotic manipulation. arXiv preprint arXiv:2403.07870.
[20] Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024.
[21] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[22] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[23] Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. In Conference on Robot Learning (CoRL), 2024.
[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[25] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298.
[26] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, June 2023. doi:10.1109/lra.2023.3270034
[27] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, and Chelsea Finn. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601.
[28] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Hamer: Hand mesh recovery for the egoexo4d hand pose challenge.
[29] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[30]

Alexandra Pfister, Alexandre M West, Shaw Bronner, and Jack Adam Noah. Comparative abilities of microsoft kinect and vicon 3d motion capture for gait analysis. Journal of medical engineering & technology, 38(5):274–280, 2014.

[31]

Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988.

[32]

Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025.

[33]

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. CoRL.

[34]

Nathan D. Ratliff, Jan Issac, Daniel Kappler, Stan Birchfield, and Dieter Fox. Riemannian motion policies, 2018. URL https://arxiv.org/abs/1801.02854.

[35]

Yu Rong, Takaaki Shiratori, and Hanbyul Joo. Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1759, 2021.

[36]

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[37]

Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.

[38]

Kenneth Shaw, Ananye Agarwal, and Deepak Pathak. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning. Robotics: Science and Systems (RSS).

[39]

Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 654–665. PMLR, 14–18 Dec 2023.

[40]

Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, pages 654–665. PMLR.

[41]

Kenneth Shaw, Yulong Li, Jiahui Yang, Mohan Kumar Srirama, Ray Liu, Haoyu Xiong, Russell Mendonca, and Deepak Pathak. Bimanual dexterity for complex tasks. In 8th Annual Conference on Robot Learning, 2024.

[42]

Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, and Jitendra Malik. Hand-object interaction pretraining from videos, 2024. URL https://arxiv.org/abs/2409.08273.

[43]

Ritvik Singh, Arthur Allshire, Ankur Handa, Nathan Ratliff, and Karl Van Wyk. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands. arXiv preprint arXiv:2412.01791, 2024.

[44]

Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube, 2022.

[45]

Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on youtube. RSS, 2022.

[46]

Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020.

[47]

Mohan Kumar Srirama, Sudeep Dasari, Shikhar Bahl, and Abhinav Gupta. HRP: Human affordances for robotic pre-training. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

[48]

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, and Xiaohua Zhai. PaliGemma 2: A Family of Versatile VLMs for Transfer. arXiv, 2024.

[49]

OM Team, D Ghosh, H Walke, K Pertsch, K Black, O Mees, S Dasari, J Hejna, C Xu, J Luo, et al. Octo: An open-source generalist robot policy. Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2023.

[50]

Yushuang Tian, Xiaoli Meng, Dapeng Tao, Dongquan Liu, and Chen Feng. Upper limb motion tracking with the integration of imu and kinect. Neurocomputing, 159:207–218, 2015.

[51]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[52]

Valve Corporation. https://store.steampowered.com/steamvr. [Virtual reality platform].

[53]

A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

[54]

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023.

[55]

Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. In Robotics: Science and Systems (RSS), 2024.

[56]

Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators, 2023.

[57]

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758.

[58]

Han Zhang, Songbo Hu, Zhecheng Yuan, and Huazhe Xu. Doglove: Dexterous manipulation with a low-cost open-source haptic force feedback glove. arXiv preprint arXiv:2502.07730, 2025.

[59]

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023.

[60]

Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 813–822, 2019.

VII. APPENDIX

Videos of our results, code to recreate our system, and hard...

and Diffusion U-Net [6] as policy classes, which output a sequence of actions. The network outputs actions that consist of relative end-effector actions and absolute hand joint angles. We list the hyperparameters used for behavior-cloning policy training in Table V.

E. Low Level Motion Control

For optimal smoothness of our policies a...
