pith. machine review for the scientific record.

arxiv: 2402.10329 · v3 · submitted 2024-02-15 · 💻 cs.RO

Recognition: 3 theorem links

· Lean Theorem

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Pith reviewed 2026-05-14 21:31 UTC · model grok-4.3

classification 💻 cs.RO
keywords universal manipulation interface · robot learning · human demonstrations · zero-shot transfer · manipulation policies · bimanual tasks · imitation learning

The pith

UMI lets robots learn complex manipulation from portable human gripper demonstrations with zero-shot transfer to new settings and hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Universal Manipulation Interface as a framework that collects diverse human demonstrations using hand-held grippers and trains policies that transfer directly to robots. Careful interface design handles latency and represents actions as relative trajectories to reduce the gap between human data and robot execution. Policies become hardware-agnostic and generalize to novel objects and environments when trained on varied demonstrations. This matters because it removes the need for costly robot-specific data collection, letting new skills be added simply by gathering more human examples.

Core claim

UMI is a data collection and policy learning framework that enables direct skill transfer from in-the-wild human demonstrations collected with hand-held grippers to deployable robot policies. It adds a policy interface with inference-time latency matching and relative-trajectory action representation so that learned policies remain hardware-agnostic and work across multiple robot platforms while generalizing zero-shot to new environments and objects.
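
The relative-trajectory idea can be illustrated with a small sketch. This is not the paper's implementation: it assumes a planar simplification in which a pose is `(x, y, heading)` rather than a full 6-DoF end-effector pose, and the names `to_relative` and `to_absolute` are hypothetical. The point it shows is that targets expressed in the current gripper frame do not depend on where the robot base or world origin sits.

```python
import math
from typing import List, Tuple

# Planar pose (x, y, heading in radians). A real system would use 6-DoF poses;
# this 2D form is an assumption made to keep the sketch short.
Pose2D = Tuple[float, float, float]


def wrap_angle(theta: float) -> float:
    """Wrap an angle into the range [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def to_relative(current: Pose2D, targets: List[Pose2D]) -> List[Pose2D]:
    """Express absolute target poses in the frame of the current pose."""
    cx, cy, cth = current
    cos_c, sin_c = math.cos(cth), math.sin(cth)
    relative = []
    for x, y, th in targets:
        dx, dy = x - cx, y - cy
        # Rotate the world-frame offset into the current gripper frame.
        rel_x = cos_c * dx + sin_c * dy
        rel_y = -sin_c * dx + cos_c * dy
        relative.append((rel_x, rel_y, wrap_angle(th - cth)))
    return relative


def to_absolute(current: Pose2D, relative: List[Pose2D]) -> List[Pose2D]:
    """Inverse of to_relative: map relative poses back to the world frame."""
    cx, cy, cth = current
    cos_c, sin_c = math.cos(cth), math.sin(cth)
    absolute = []
    for rx, ry, rth in relative:
        x = cx + cos_c * rx - sin_c * ry
        y = cy + sin_c * rx + cos_c * ry
        absolute.append((x, y, wrap_angle(cth + rth)))
    return absolute
```

In this sketch, "move one unit forward" produces the same relative target whether the gripper starts at the origin facing along x or at `(1, 2)` facing along y, which is the property that would let one policy output be replayed on differently placed robots.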

What carries the argument

Hand-held grippers for portable demonstration collection together with a policy interface that performs inference-time latency matching and encodes actions as relative trajectories to close the human-to-robot domain gap.
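
Inference-time latency matching can be pictured as a scheduling step. The sketch below is illustrative only and makes several assumptions not stated on this page: the policy outputs actions spaced `action_dt` seconds apart with the first one aligned to the observation time, and the robot needs `execution_latency` seconds to act on a command. The function name `schedule_actions` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

Action = TypeVar("Action")


def schedule_actions(
    obs_time: float,
    now: float,
    action_dt: float,
    execution_latency: float,
    actions: Sequence[Action],
) -> List[Tuple[float, Action]]:
    """Return (send_time, action) pairs for actions that can still take effect on time.

    Assumption: action k is meant to take effect at obs_time + k * action_dt.
    """
    scheduled = []
    for k, action in enumerate(actions):
        target_time = obs_time + k * action_dt
        # Send early so the command lands at target_time despite actuation delay.
        send_time = target_time - execution_latency
        # Actions whose send time has already passed are stale and are dropped.
        if send_time >= now:
            scheduled.append((send_time, action))
    return scheduled
```

With a 0.25 s inference delay, 0.1 s action spacing, and 0.1 s execution latency, this sketch would drop the first four of six predicted actions and send only the last two.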

If this is right

  • Policies generalize zero-shot to novel environments and objects after training on diverse human demonstrations.
  • The same framework supports dynamic, bimanual, precise, and long-horizon behaviors by swapping only the training data.
  • Learned policies deploy without modification across multiple robot platforms.
  • Data collection becomes portable and low-cost because no robot hardware is required during demonstration gathering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Human-collected data could scale robot training far beyond what robot teleoperation currently allows.
  • Testing the same interface on tasks requiring finer finger control would reveal whether relative trajectories remain sufficient.
  • The method suggests that careful action representation can matter more for transfer than exact hardware matching.

Load-bearing premise

The gripper interface, latency matching, and relative-trajectory encoding are together sufficient to let policies trained on human data execute reliably on robots despite differences in timing and physical form.

What would settle it

A policy trained on diverse human demonstrations that repeatedly fails to complete its task when deployed on a robot facing an object or environment unseen in training would count against the zero-shot transfer claim.

Original abstract

We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Universal Manipulation Interface (UMI), a data collection and policy learning framework that uses hand-held grippers for portable in-the-wild human demonstrations of bimanual and dynamic tasks. It incorporates a policy interface with inference-time latency matching and relative-trajectory action representations to enable hardware-agnostic policies that transfer zero-shot to multiple robot platforms, with experiments showing generalization to novel environments and objects by varying only the training data.

Significance. If the zero-shot transfer results hold under rigorous validation, the work would be significant for scalable robot learning: it decouples data collection from robot hardware, enabling low-cost collection of complex manipulation demonstrations and hardware-agnostic deployment. The open-sourcing of the gripper design and software is a concrete strength that supports reproducibility.

major comments (3)
  1. [Experiments] Experiments section: the central zero-shot generalization claim rests on the assumption that the gripper interface, latency matching, and relative-trajectory representation close the human-robot domain gap, yet no direct quantitative metrics (end-effector trajectory error distributions, residual latency histograms, or kinematic mismatch norms) are reported comparing human demonstrations to robot executions on matched task instances.
  2. [Experiments] Experiments section: the evaluation does not include ablations that isolate the contribution of each interface component (gripper design, latency matching, relative actions) to the reported success rates, leaving open the possibility that observed performance is driven by task selection or demonstration style rather than the proposed gap-closure mechanisms.
  3. [Experiments] The manuscript provides limited detail on baseline methods, exact metrics, data collection protocols, and failure-case analysis, which weakens the support for the hardware-agnostic and zero-shot claims despite the plausible experimental outcomes described in the abstract.
minor comments (2)
  1. Figure captions and axis labels in the results figures could be expanded to include exact success-rate definitions and number of trials per condition for immediate interpretability.
  2. The related-work section would benefit from explicit comparison to prior teleoperation interfaces that also target domain-gap reduction, to better situate the novelty of the latency-matching and relative-action choices.
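
The end-effector trajectory error distributions requested in major comment 1 could be computed along the following lines. This is a minimal sketch under stated assumptions: the human demonstration and the robot execution are already time-aligned and resampled to the same length, positions are in a shared frame, and the helper names `trajectory_errors` and `summarize` are hypothetical. It requires Python 3.8 or later for `math.dist` and `statistics.fmean`.

```python
import math
import statistics
from typing import Dict, List, Sequence

Point = Sequence[float]  # e.g. (x, y, z) in metres, in a shared reference frame


def trajectory_errors(demo: List[Point], robot: List[Point]) -> List[float]:
    """Per-timestep Euclidean distance between two time-aligned trajectories."""
    if len(demo) != len(robot):
        raise ValueError("trajectories must be time-aligned and of equal length")
    return [math.dist(p, q) for p, q in zip(demo, robot)]


def summarize(errors: List[float]) -> Dict[str, float]:
    """Summary statistics that could accompany a histogram of the errors."""
    return {
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "max": max(errors),
    }
```

Reporting the full list of errors (for a histogram) alongside the summary keeps the distribution visible rather than collapsing it to a single number.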

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional quantitative analysis, partial ablations, and expanded experimental details where feasible.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the central zero-shot generalization claim rests on the assumption that the gripper interface, latency matching, and relative-trajectory representation close the human-robot domain gap, yet no direct quantitative metrics (end-effector trajectory error distributions, residual latency histograms, or kinematic mismatch norms) are reported comparing human demonstrations to robot executions on matched task instances.

    Authors: We agree that direct quantitative metrics would strengthen the evidence for domain gap closure. In the revised manuscript, we have added end-effector trajectory error distributions and residual latency histograms comparing human demonstrations to robot executions on matched task instances in the Experiments section and supplementary material. These metrics confirm that the interface components reduce discrepancies, supporting the zero-shot transfer results.
    revision: yes

  2. Referee: [Experiments] Experiments section: the evaluation does not include ablations that isolate the contribution of each interface component (gripper design, latency matching, relative actions) to the reported success rates, leaving open the possibility that observed performance is driven by task selection or demonstration style rather than the proposed gap-closure mechanisms.

    Authors: We acknowledge the value of isolating each component's contribution. Full ablations are challenging due to the integrated nature of the UMI framework, particularly for the gripper design which underpins all data collection. In the revision, we have added partial ablations evaluating the effects of latency matching and relative-trajectory representations on success rates for representative tasks, with discussion of why complete isolation of the gripper is not straightforward.
    revision: partial

  3. Referee: [Experiments] The manuscript provides limited detail on baseline methods, exact metrics, data collection protocols, and failure-case analysis, which weakens the support for the hardware-agnostic and zero-shot claims despite the plausible experimental outcomes described in the abstract.

    Authors: We have expanded the Experiments section in the revised manuscript to include detailed descriptions of baseline methods (specifying imitation learning approaches and comparisons), exact success metrics (binary task completion with definitions), data collection protocols (demonstration counts, environment and object diversity), and a new failure-case analysis subsection with quantitative breakdowns and qualitative examples.
    revision: yes
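
Because the success metric described above is binary task completion, the number of trials per condition determines how much a reported rate can be trusted. The sketch below shows one common way to attach an interval to such a rate, the Wilson score interval. It is an illustration rather than anything the manuscript is stated to use; the function name `wilson_interval` and the default `z = 1.96` (roughly 95% coverage) are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, 18 successes in 20 trials is a 90% success rate, but the interval from this sketch spans roughly 0.70 to 0.97, which is why trial counts belong next to every reported rate.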

Circularity Check

0 steps flagged

No significant circularity; claims grounded in experiments

full rationale

The manuscript presents UMI as a data-collection and policy-learning framework whose central claims (zero-shot generalization to novel environments/objects, hardware-agnostic deployment) are supported by real-world experiments on diverse human demonstrations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or in the structure described. Design choices (gripper interface, latency matching, relative actions) are motivated as domain-gap reducers but are not derived from or equivalent to the target results by construction. The argument is therefore empirical rather than derivational: it rests on the reported experimental outcomes, not on an internal chain of definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard robotics assumptions about demonstration transferability and a new hardware-software interface whose fidelity is not independently verified outside the reported experiments.

axioms (1)
  • domain assumption: Human demonstrations captured via the handheld gripper interface can be mapped to robot actions with minimal unmodeled domain shift.
    Invoked implicitly in the policy learning and zero-shot transfer claims.
invented entities (1)
  • Universal Manipulation Interface (UMI) gripper and policy interface (no independent evidence)
    purpose: Portable data collection device and deployable policy representation
    New hardware and software components introduced to enable the claimed transfer.

pith-pipeline@v0.9.0 · 5509 in / 1328 out tokens · 69333 ms · 2026-05-14T21:31:10.296068+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  2. Tune to Learn: How Controller Gains Shape Robot Policy Learning

    cs.RO 2026-04 conditional novelty 7.0

    Controller gains affect learnability differently for behavior cloning, RL from scratch, and sim-to-real transfer, so optimal gains depend on the learning paradigm rather than desired task behavior.

  3. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  4. Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    A tactile-aware hierarchical policy for quadrupedal loco-manipulation improves real-world contact-rich task performance by 28.54% over vision-only and visuotactile baselines.

  5. FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception

    cs.RO 2026-04 conditional novelty 6.0

    FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.

  6. UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

    cs.RO 2026-04 unverdicted novelty 6.0

    UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.

  7. XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios

    cs.RO 2026-04 unverdicted novelty 6.0

    XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.

  8. WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.

  9. ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

    cs.RO 2026-04 unverdicted novelty 6.0

    ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

  10. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  11. TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.

  12. RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

    cs.RO 2026-04 unverdicted novelty 6.0

    RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.

  13. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  14. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  15. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  16. FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    FlexiTac is a scalable piezoresistive tactile sensing system with flexible FPC-Velostat-FPC pads and a 100 Hz multi-channel readout board that mounts on rigid or soft grippers and supports visuo-tactile learning.

  17. Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

    cs.RO 2026-04 unverdicted novelty 5.0

    A hierarchical tactile-aware policy combines human-demonstration training for contact cue prediction with sim-to-real reinforcement learning to improve quadrupedal loco-manipulation performance by 28.54% over vision b...

  18. OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

    cs.RO 2026-04 unverdicted novelty 5.0

    OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.

  19. Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision

    cs.RO 2026-05 unverdicted novelty 4.0

    Behavior cloning produces active perception in a plant-centering task where a robot arm uses low-resolution egocentric RGB images to predict joint movements, with relative deltas outperforming absolute positions.

  20. Towards Robotic Dexterous Hand Intelligence: A Survey

    cs.RO 2026-05 unverdicted novelty 4.0

    A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

  21. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  22. EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks

    cs.RO 2026-04 unverdicted novelty 4.0

    EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstra...

  23. Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    cs.RO 2026-04 unverdicted novelty 3.0

    A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 22 Pith papers · 2 internal anchors

  1. [1]

    Human-to-robot imitation in the wild

    Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. In Proceedings of Robotics: Science and Systems (RSS) , 2022

  2. [2]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

  3. [3]

    Rt-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Proceedings of Robotics: Science and Systems (RSS) , 2023

  4. [4]

    Humanoid robot teleoperation with vibrotactile based balancing feedback

    Anais Brygo, Ioannis Sarakoglou, Nadia Garcia- Hernandez, and Nikolaos Tsagarakis. Humanoid robot teleoperation with vibrotactile based balancing feedback. In Haptics: Neuroscience, Devices, Modeling, and Ap- plications: 9th International Conference, EuroHaptics 2014, Versailles, France, June 24-26, 2014, Proceedings, Part II 9, pages 266–275. Springer, 2014

  5. [5]

    Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR) , pages 510–517, 2015. doi: 10.1109/ICAR.2015.7251504

  6. [6]

    Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam

    Carlos Campos, Richard Elvira, Juan J G ´omez Rodr´ıguez, Jos ´e MM Montiel, and Juan D Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021

  7. [7]

    G ´omez Rodr´ıguez, Jos´e M

    Carlos Campos, Richard Elvira, Juan J. G ´omez Rodr´ıguez, Jos´e M. M. Montiel, and Juan D. Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO. 2021.3075644

  8. [8]

    in-the- wild

    Annie S Chen, Suraj Nair, and Chelsea Finn. Learn- ing generalizable robotic reward functions from “in-the- wild” human videos. In Proceedings of Robotics: Science and Systems (RSS) , 2021

  9. [9]

    Dif- fusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  10. [10]

    On hand-held grippers and the morphological gap in human manipulation demonstration

    Kiran Doshi, Yijiang Huang, and Stelian Coros. On hand-held grippers and the morphological gap in human manipulation demonstration. arXiv preprint arXiv:2311.01832, 2023

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  12. [12]

    Ar2-d2: Training a robot without a robot

    Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, and Ranjay Krishna. Ar2-d2: Training a robot without a robot. 2023

  13. [13]

    Bridge data: Boosting generalization of robotic skills with cross- domain datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross- domain datasets. In Proceedings of Robotics: Science and Systems (RSS) , 2022

  14. [14]

    Low-cost exoskeletons for learning whole-arm manipulation in the wild

    Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Low-cost exoskeletons for learning whole-arm ma- nipulation in the wild. arXiv preprint arXiv:2309.14975, 2023

  15. [15]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mo- bile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  16. [16]

    Garrido-Jurado, R

    S. Garrido-Jurado, R. Mu ˜noz-Salinas, F.J. Madrid- Cuevas, and M.J. Mar ´ın-Jim´enez. Automatic genera- tion and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292,

  17. [17]

    doi: https://doi.org/10.1016/ j.patcog.2014.01.005

    ISSN 0031-3203. doi: https://doi.org/10.1016/ j.patcog.2014.01.005. URL https://www.sciencedirect. com/science/article/pii/S0031320314000235

  18. [18]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. corr abs/1512.03385 (2015), 2015

  19. [19]

    Gpmf introuction: Parser for gpmf™ format- ted telemetry data used within gopro® cameras

    GoPro Inc. Gpmf introuction: Parser for gpmf™ format- ted telemetry data used within gopro® cameras. https: //gopro.github.io/gpmf-parser/. Accesssed: 2023-01-31

  20. [20]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (CoRL), volume 164, pages 991–1002. PMLR, 2022

  21. [21]

    Giv- ing robots a hand: Broadening generalization via hand- centric human video demonstrations

    Moo Jin Kim, Jiajun Wu, and Chelsea Finn. Giv- ing robots a hand: Broadening generalization via hand- centric human video demonstrations. In Deep Reinforce- ment Learning Workshop NeurIPS , 2022

  22. [22]

    VIP: Towards universal visual reward and representation via value-implicit pre-training

    Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations , 2023

  23. [23]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning (CoRL) , volume 87, pages 879–893. PMLR, 2018

  24. [24]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL) , volume 205, pages 892–909. PMLR, 2022

  25. [25]

    Tax-pose: Task-specific cross-pose estimation for robot manipulation

    Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, and David Held. Tax-pose: Task-specific cross-pose estimation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL) , volume 205, pages 1783–1792. PMLR, 2023

  26. [26]

    The surprising ef- fectiveness of representation learning for visual imitation

    Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pan- dian Arunachalam, and Lerrel Pinto. The surprising ef- fectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS) , 2022

  27. [27]

    Learning of compliant human–robot interaction using full-body haptic interface

    Luka Peternel and Jan Babi ˇc. Learning of compliant human–robot interaction using full-body haptic interface. Advanced Robotics, 27(13):1003–1012, 2013

  28. [28]

    Characterizing input methods for human-to-robot demonstrations

    Pragathi Praveena, Guru Subramani, Bilge Mutlu, and Michael Gleicher. Characterizing input methods for human-to-robot demonstrations. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 344–353. IEEE, 2019

  29. [29]

    Dexmv: Im- itation learning for dexterous manipulation from human videos

    Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Im- itation learning for dexterous manipulation from human videos. In European Conference on Computer Vision , pages 570–587. Springer, 2022

  30. [30]

    Learning transferable visual models from natural lan- guage supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. In International conference on ma- chine learning, pages 8748–8763. PMLR, 2021

  31. [31]

    Recent advances in robot learning from demonstration

    Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual review of control, robotics, and autonomous systems , 3:297–330, 2020

  32. [32]

    Latent plans for task- agnostic offline reinforcement learning

    Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task- agnostic offline reinforcement learning. In Proceedings of The 6th Conference on Robot Learning (CoRL) , vol- ume 205, pages 1838–1849. PMLR, 2023

  33. [33]

    Scalable

    Felipe Sanches, Geng Gao, Nathan Elangovan, Ricardo V Godoy, Jayden Chapman, Ke Wang, Patrick Jarvis, and Minas Liarokapis. Scalable. intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks. In 2023 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS), pages 6318–6325. IEEE, 2023

  34. [34]

    Learning predictive models from observation and interaction

    Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020

  35. [35]

    Reinforcement learn- ing with videos: Combining offline observations with interaction

    Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learn- ing with videos: Combining offline observations with interaction. In Proceedings of the 2020 Conference on Robot Learning (CoRL) , volume 155, pages 339–354. PMLR, 2021

  36. [36]

    Deep imitation learning for humanoid loco-manipulation through human teleoperation

    Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), pages 1–8. IEEE, 2023

  37. [37]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

  38. [38]

    Concept2robot: Learning manipulation concepts from instructions and human demonstrations

    Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021

  39. [39]

    Videodex: Learning dexterity from internet videos

    Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 654–665. PMLR, 2023

  40. [40]

    Distilled feature fields enable few-shot language-guided manipulation

    William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 405–424. PMLR, 2023

  41. [41]

    Neural descriptor fields: SE(3)-equivariant object representations for manipulation

    Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022

  42. [42]

    Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations

    Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020

  43. [43]

    SEED: Series elastic end effectors in 6d for visuotactile tool use

    H.J. Terry Suh, Naveen Kuppuswamy, Tao Pang, Paul Mitiguy, Alex Alspach, and Russ Tedrake. SEED: Series elastic end effectors in 6d for visuotactile tool use. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4684–4691, 2022. doi: 10.1109/IROS47612.2022.9982092

  44. [44]

    A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics

    Alexander Toedtheide, Xiao Chen, Hamid Sadeghian, Abdeldjallil Naceri, and Sami Haddadin. A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12624–12630. IEEE, 2023

  45. [45]

    Mimicplay: Long-horizon imitation learning by watching human play

    Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 201–221. PMLR, 2023

  46. [46]

    Error-aware imitation learning from teleoperation data for mobile manipulation

    Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, and Roberto Martín-Martín. Error-aware imitation learning from teleoperation data for mobile manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164, pages 1367–1378. PMLR, 2022

  47. [47]

    GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators

    Philipp Wu, Fred Shentu, Xingyu Lin, and Pieter Abbeel. GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023

  48. [48]

    Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot

    Keenan A Wyrobek, Eric H Berger, HF Machiel Van der Loos, and J Kenneth Salisbury. Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot. In 2008 IEEE International Conference on Robotics and Automation, pages 2165–2170. IEEE, 2008

  49. [49]

    Masked visual pre-training for motor control

    Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022

  50. [50]

    Learning by watching: Physical imitation of manipulation skills from human videos

    Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021

  51. [51]

    Visual imitation made easy

    Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy. In Conference on Robot Learning (CoRL), volume 155, pages 1992–2005. PMLR, 2021

  52. [52]

    Deep imitation learning for complex manipulation tasks from virtual reality teleoperation

    Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018

  53. [53]

    Benefit of large field-of-view cameras for visual odometry

    Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 801–808, 2016. doi: 10.1109/ICRA.2016.7487210

  54. [54]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023

  55. [55]

    Viola: Imitation learning for vision-based manipulation with object proposal priors

    Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1199–1210. PMLR, 2023

    APPENDIX

    Please check out our website (https://umi-gripper.github.io) for additional results and comparisons. In...

  56. [56]

    Camera Latency Measurement: For policy observation across both the UR5 and Franka FR2 platforms, we equip each robot arm with a single wrist-mounted GoPro Hero 9 camera. To obtain real-time video streams from the GoPro, we use a combination of GoPro Media Mod 1.0 (to convert USB-C to HDMI) and Elgato HD60X external capture card (to convert HDMI to USB-3...

  57. [57]

    Proprioception Latency Measurement: When the robotic hardware directly reports global timestamps, as is the case for the Franka FR2 robot, we measure the proprioception latency by subtracting the robot sending timestamp $t_{robot}$ from the policy-received timestamp $t_{recv}$: $l_{obs} = t_{recv} - t_{robot}$. When the robotic hardware timestamp is unavailable, such as the UR5 ...

  58. [58]

    Gripper Execution Latency Measurement: To obtain the gripper execution latency $l_{action}$, we subtract the proprioception latency $l_{obs}$ from the end-to-end latency $l_{e2e}$. To measure $l_{e2e}$, we send a sequence of sinusoidal position commands to the gripper, and then record a sequence of gripper width proprioception readings. The $l_{e2e}$ can be obtained by computing the optim...

  59. [59]

    GoPro Labs

    Robot Execution Latency Measurement: Similar to the gripper, we also measure the execution latency of the robot (either UR5 or Franka) by calculating $l_{e2e}$ as the optimal alignment between a sequence of desired end-effector poses and the measured actual end-effector poses. Due to safety concerns, we directly teleoperate the robot to generate the desired e...

  60. [60]

    Initial State Selection : For all tasks, we manually select a set of initial states with diverse pose coverage across task scenes (for both the robot and the environment) that are shared across all evaluated methods. During evaluation, we manually match the initial states with a third-person camera to be close to pixel-perfect. We ensure the initial state...

  61. [61]

    Termination Criteria : During evaluation, an operator supervises the robot at all times. An evaluation episode can be terminated due to: • Safety Concern. When the operator deems the robot is about to perform dangerous actions that could potentially break the setup/robot or do any other harm, the episode will be terminated immediately. • Robot Fault. When...

  62. [62]

    espresso cup with saucer

    Success Criteria: It is difficult to define automatic and compact success metrics for the complex manipulation tasks reported in this paper. Therefore, the operator manually judges the success or failure of each episode using the rubric described below. While we try to create a concise and objective rubric, it inevitably contains subjective elements. As...

  63. [63]

    with known sizes to disambiguate possible explanations of feature matches. We found this feature to significantly increase mapping robustness in the wild. Note that demonstration videos will not contain these fiducial markers; they are only used for mapping.

    E. Policy Implementation Details

    We use Diffusion Policy [9] for all tasks. Detailed hyperpara...

  64. [64]

    Vision encoder: We utilize the Vision Transformer (ViT) [11] as the vision encoder due to its substantial capacity in comparison to ResNet [17], which proves crucial for tasks demanding intricate perceptual capabilities. Notably, the dataset collected for each task lacks the scale required for training ViT from scratch. To address this limitation, we e...

  65. [65]

    Frequency: For most quasi-static tasks, a frequency of 10Hz proves sufficient for both observation and action. However, a frequency of 20Hz is employed for the dynamic tossing task, which requires highly reactive behaviors

  66. [66]

    Speed: The output of Diffusion Policy is a sequence of actions, specifically target poses, with an implicit $dt_{output}$ between two steps determined by the demonstration dataset. However, during execution, we are not bound to follow the same $dt$. By adjusting $dt_{execution}$, we can achieve different execution speeds compared to the human demonstration. I...

  67. [67]

    Image Augmentation: We employ a set of image augmentations to enhance the diversity of our training data, thereby improving the robustness and generalization capabilities of our policy. The augmentation pipeline includes a RandomCrop operation with a ratio of 0.95, a RandomRotation operation with degrees ranging from -5.0 to 5.0, and a ColorJitter ...

  68. [68]

    Soft Compliant Fingers: We used the same soft fingers on both the UMI data collection grippers and the deployed robotic grippers. Printed with 95A TPU material, the rib-like pattern on the finger maintains rigidity on the fingertip while conforming to the object geometry for a more secure grasp (Fig. A3). When deployed to robots that lack force-torque co...

  69. [69]

    Franka Mount: Due to FR2’s limited end-effector pitch (FR2 is designed for top-down pick and place, while the UMI gripper is mostly held horizontally), we had to design and 3D print a custom mounting adapter that rotates the WSG50 gripper by 90 degrees with respect to the robot’s end-effector flange
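
The gripper and robot execution latency measurements in the appendix excerpts above both rely on aligning a commanded signal with its measured response and then subtracting the proprioception latency. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the alignment is found by a brute-force search over integer sample shifts that minimizes mean squared error (the excerpts do not say how the optimal alignment is computed). The function names `estimate_lag` and `execution_latency`, the 100 Hz logging rate, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def estimate_lag(commanded, measured, max_lag):
    """Return the integer sample shift that best aligns measured to commanded.

    Assumption: alignment quality is the mean squared error between
    commanded[i] and measured[i + lag]; the lowest error wins.
    """
    best_lag, best_err = 0, float("inf")
    for lag in range(max_lag + 1):
        n = min(len(commanded), len(measured) - lag)
        if n <= 0:
            break
        err = sum((commanded[i] - measured[i + lag]) ** 2 for i in range(n)) / n
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


def execution_latency(l_e2e, l_obs):
    """l_action = l_e2e - l_obs, as described in the text."""
    return l_e2e - l_obs


# Synthetic data standing in for real logs (all values are made up).
rate_hz = 100.0            # hypothetical logging rate
true_delay_samples = 12    # hypothetical end-to-end delay of 0.12 s
t = [i / rate_hz for i in range(400)]
# Sinusoidal gripper width command, in meters.
commanded = [0.04 + 0.02 * math.sin(2 * math.pi * 0.5 * ti) for ti in t]
# Measured width: the command delayed by true_delay_samples.
measured = [commanded[max(i - true_delay_samples, 0)] for i in range(len(t))]

# Keep max_lag below half the sinusoid period so the best shift is unique.
lag = estimate_lag(commanded, measured, max_lag=50)
l_e2e = lag / rate_hz                      # end-to-end latency in seconds
l_obs = 0.02                               # hypothetical proprioception latency
l_action = execution_latency(l_e2e, l_obs)
print(f"lag={lag} samples, l_e2e={l_e2e:.3f} s, l_action={l_action:.3f} s")
```

With real logs, `commanded` and `measured` would be replaced by the recorded command and proprioception sequences sampled at a common rate. A periodic command makes the alignment ambiguous beyond half a period, which is why the search range is bounded here.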