arxiv: 2402.10329 · v3 · submitted 2024-02-15 · 💻 cs.RO

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Cheng Chi , Zhenjia Xu , Chuer Pan , Eric Cousineau , Benjamin Burchfiel , Siyuan Feng , Russ Tedrake , Shuran Song

Pith reviewed 2026-05-14 21:31 UTC · model grok-4.3

keywords: universal manipulation interface · robot learning · human demonstrations · zero-shot transfer · manipulation policies · bimanual tasks · imitation learning

The pith

UMI lets robots learn complex manipulation from portable human gripper demonstrations with zero-shot transfer to new settings and hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Universal Manipulation Interface as a framework that collects diverse human demonstrations using hand-held grippers and trains policies that transfer directly to robots. Careful interface design handles latency and represents actions as relative trajectories to reduce the gap between human data and robot execution. Policies become hardware-agnostic and generalize to novel objects and environments when trained on varied demonstrations. This matters because it removes the need for costly robot-specific data collection, letting new skills be added simply by gathering more human examples.

Core claim

UMI is a data collection and policy learning framework that enables direct skill transfer from in-the-wild human demonstrations collected with hand-held grippers to deployable robot policies. It adds a policy interface with inference-time latency matching and relative-trajectory action representation so that learned policies remain hardware-agnostic and work across multiple robot platforms while generalizing zero-shot to new environments and objects.

What carries the argument

Hand-held grippers for portable demonstration collection together with a policy interface that performs inference-time latency matching and encodes actions as relative trajectories to close the human-to-robot domain gap.
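As a rough illustration of what a relative-trajectory action representation can look like, the sketch below expresses future poses in the frame of the current pose. It is not the paper's implementation: planar poses `(x, y, theta)` are assumed for brevity, and the function names `to_relative`, `to_absolute`, and `relative_trajectory` are hypothetical.

```python
import math

def to_relative(base, target):
    """Express the planar pose `target` in the frame of the pose `base`."""
    bx, by, bth = base
    tx, ty, tth = target
    dx, dy = tx - bx, ty - by
    c, s = math.cos(bth), math.sin(bth)
    # Rotate the world-frame offset into the base frame (transpose of the rotation).
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    # Wrap the heading difference into [-pi, pi].
    dth = math.atan2(math.sin(tth - bth), math.cos(tth - bth))
    return (rx, ry, dth)

def to_absolute(base, rel):
    """Inverse of to_relative: map a relative pose back into the world frame."""
    bx, by, bth = base
    rx, ry, rth = rel
    c, s = math.cos(bth), math.sin(bth)
    return (bx + c * rx - s * ry, by + s * rx + c * ry, bth + rth)

def relative_trajectory(current, future_poses):
    """Encode a sequence of future poses relative to the current pose."""
    return [to_relative(current, p) for p in future_poses]

# The same "move 1 unit forward" motion, recorded from two different start poses.
demo_a = relative_trajectory((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0)])
demo_b = relative_trajectory((5.0, 5.0, math.pi / 2), [(5.0, 6.0, math.pi / 2)])
print(demo_a[0], demo_b[0])  # both are approximately (1.0, 0.0, 0.0)
```

Under this encoding the two demonstrations produce the same action target even though their world-frame coordinates differ, which is one reason such a representation does not depend on where the demonstrator or the robot base happens to be.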

If this is right

Policies generalize zero-shot to novel environments and objects after training on diverse human demonstrations.
The same framework supports dynamic, bimanual, precise, and long-horizon behaviors by swapping only the training data.
Learned policies deploy without modification across multiple robot platforms.
Data collection becomes portable and low-cost because no robot hardware is required during demonstration gathering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Human-collected data could scale robot training far beyond what robot teleoperation currently allows.
Testing the same interface on tasks requiring finer finger control would reveal whether relative trajectories remain sufficient.
The method suggests that careful action representation can matter more for transfer than exact hardware matching.

Load-bearing premise

The gripper interface, latency matching, and relative-trajectory encoding are together sufficient to let policies trained on human data execute reliably on robots despite differences in timing and physical form.
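One plausible reading of inference-time latency matching on the action side is sketched below: the policy predicts a sequence of timed actions, and any action whose target time can no longer be met once inference and execution delays are accounted for is dropped. This is an assumption-laden illustration, not the paper's code; `drop_stale_actions` and all of its parameters are hypothetical names.

```python
def drop_stale_actions(actions, t_obs, dt, t_now, exec_latency):
    """Keep only the actions whose target time can still be met.

    actions      : list of action targets; actions[k] is meant for time t_obs + k * dt
    t_obs        : timestamp of the observation the prediction was based on
    dt           : spacing between consecutive predicted actions, in seconds
    t_now        : time at which inference finished
    exec_latency : assumed delay between sending a command and the robot acting on it

    Returns a list of (send_time, action) pairs.
    """
    earliest_reachable = t_now + exec_latency
    schedule = []
    for k, action in enumerate(actions):
        t_target = t_obs + k * dt
        if t_target >= earliest_reachable:
            # Send early by the execution latency so the action lands on time.
            schedule.append((t_target - exec_latency, action))
    return schedule

# Five actions predicted at 10 Hz; inference took 0.15 s, the robot lags by 0.1 s.
plan = drop_stale_actions(["a0", "a1", "a2", "a3", "a4"],
                          t_obs=0.0, dt=0.1, t_now=0.15, exec_latency=0.1)
print([a for _, a in plan])  # ['a3', 'a4']
```

The first three actions are discarded because their target times fall before the earliest moment the robot could act; the remaining ones are sent early by the assumed execution latency.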

What would settle it

A policy trained on diverse human demonstrations fails to complete the task when deployed on a robot facing a new object or environment that was not seen in training.

Original abstract

We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

UMI gives a practical way to collect bimanual and dynamic manipulation data in the wild with a handheld gripper and transfer policies zero-shot across robots via latency matching and relative actions, but the domain gap claims rest on indirect evidence.

The main advance is the combination of a portable handheld gripper for low-cost in-the-wild data collection with a policy interface that adds inference-time latency matching and relative-trajectory actions. This setup lets human demonstrations train policies that run on multiple robot platforms without retraining, and the paper shows real-world results where those policies handle novel objects and environments on tasks that include bimanual and dynamic elements. The open-source release of both hardware and code is a clear positive for the community. The experiments appear comprehensive enough on the surface to support the versatility claim, and the design choices are motivated by the practical bottlenecks in robot data collection.

The soft spot is the lack of direct measurements that would confirm the interface actually closes the human-robot gap. There are no reported trajectory error distributions, latency histograms, or component ablations that isolate how much each element (gripper, latency match, relative actions) contributes to the zero-shot success. Without those, it remains possible that results reflect task selection or demonstration quality rather than reliable transfer. The central argument still holds up as plausible given the reported outcomes, but tighter validation would strengthen it.

This work is for researchers in imitation learning and scalable robot manipulation who need better data pipelines. It deserves peer review because the idea is concrete, the artifacts are available, and the problem it targets is central to the field.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Universal Manipulation Interface (UMI), a data collection and policy learning framework that uses hand-held grippers for portable in-the-wild human demonstrations of bimanual and dynamic tasks. It incorporates a policy interface with inference-time latency matching and relative-trajectory action representations to enable hardware-agnostic policies that transfer zero-shot to multiple robot platforms, with experiments showing generalization to novel environments and objects by varying only the training data.

Significance. If the zero-shot transfer results hold under rigorous validation, the work would be significant for scalable robot learning: it decouples data collection from robot hardware, enabling low-cost collection of complex manipulation demonstrations and hardware-agnostic deployment. The open-sourcing of the gripper design and software is a concrete strength that supports reproducibility.

major comments (3)

[Experiments] Experiments section: the central zero-shot generalization claim rests on the assumption that the gripper interface, latency matching, and relative-trajectory representation close the human-robot domain gap, yet no direct quantitative metrics (end-effector trajectory error distributions, residual latency histograms, or kinematic mismatch norms) are reported comparing human demonstrations to robot executions on matched task instances.
[Experiments] Experiments section: the evaluation does not include ablations that isolate the contribution of each interface component (gripper design, latency matching, relative actions) to the reported success rates, leaving open the possibility that observed performance is driven by task selection or demonstration style rather than the proposed gap-closure mechanisms.
[Experiments] The manuscript provides limited detail on baseline methods, exact metrics, data collection protocols, and failure-case analysis, which weakens the support for the hardware-agnostic and zero-shot claims despite the plausible experimental outcomes described in the abstract.
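To make the first major comment concrete, the following sketch shows one simple way an end-effector trajectory error distribution could be summarized. It assumes the demonstrated and executed trajectories are already time-aligned lists of 3D positions in a common frame; the function names and the sample data are hypothetical and do not come from the paper.

```python
import math
import statistics

def position_errors(demo, executed):
    """Per-timestep Euclidean distance between two time-aligned trajectories."""
    if len(demo) != len(executed):
        raise ValueError("trajectories must be time-aligned and of equal length")
    return [math.dist(p, q) for p, q in zip(demo, executed)]

def summarize(errors):
    """Reduce an error distribution to a few summary statistics."""
    return {
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "max": max(errors),
    }

# Made-up positions in metres, for illustration only.
demo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
executed = [(0.0, 0.0, 0.0), (1.0, 0.03, 0.04), (2.0, 0.0, 0.1)]
print(summarize(position_errors(demo, executed)))
```

A full report would likely show the whole distribution (for example as a histogram per task) rather than three numbers, and would add an orientation error term alongside position.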

minor comments (2)

Figure captions and axis labels in the results figures could be expanded to include exact success-rate definitions and number of trials per condition for immediate interpretability.
The related-work section would benefit from explicit comparison to prior teleoperation interfaces that also target domain-gap reduction, to better situate the novelty of the latency-matching and relative-action choices.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional quantitative analysis, partial ablations, and expanded experimental details where feasible.

Referee: [Experiments] Experiments section: the central zero-shot generalization claim rests on the assumption that the gripper interface, latency matching, and relative-trajectory representation close the human-robot domain gap, yet no direct quantitative metrics (end-effector trajectory error distributions, residual latency histograms, or kinematic mismatch norms) are reported comparing human demonstrations to robot executions on matched task instances.

Authors: We agree that direct quantitative metrics would strengthen the evidence for domain gap closure. In the revised manuscript, we have added end-effector trajectory error distributions and residual latency histograms comparing human demonstrations to robot executions on matched task instances in the Experiments section and supplementary material. These metrics confirm that the interface components reduce discrepancies, supporting the zero-shot transfer results.
revision: yes

Referee: [Experiments] Experiments section: the evaluation does not include ablations that isolate the contribution of each interface component (gripper design, latency matching, relative actions) to the reported success rates, leaving open the possibility that observed performance is driven by task selection or demonstration style rather than the proposed gap-closure mechanisms.

Authors: We acknowledge the value of isolating each component's contribution. Full ablations are challenging due to the integrated nature of the UMI framework, particularly for the gripper design which underpins all data collection. In the revision, we have added partial ablations evaluating the effects of latency matching and relative-trajectory representations on success rates for representative tasks, with discussion of why complete isolation of the gripper is not straightforward.
revision: partial

Referee: [Experiments] The manuscript provides limited detail on baseline methods, exact metrics, data collection protocols, and failure-case analysis, which weakens the support for the hardware-agnostic and zero-shot claims despite the plausible experimental outcomes described in the abstract.

Authors: We have expanded the Experiments section in the revised manuscript to include detailed descriptions of baseline methods (specifying imitation learning approaches and comparisons), exact success metrics (binary task completion with definitions), data collection protocols (demonstration counts, environment and object diversity), and a new failure-case analysis subsection with quantitative breakdowns and qualitative examples.
revision: yes
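
Because the success metric discussed here is binary task completion over a limited number of trials, a reported rate is easier to interpret with an interval attached. The sketch below computes a Wilson score interval; it is a generic statistical helper offered under that assumption, not something taken from the manuscript, and the trial counts are invented for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical result: 18 successful rollouts out of 20.
low, high = wilson_interval(18, 20)
print(f"success rate 0.90, interval ({low:.3f}, {high:.3f})")
```

With only 20 trials the interval is wide, which is why stating the number of trials per condition next to each success rate matters.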

Circularity Check

0 steps flagged

No significant circularity; claims grounded in experiments

The manuscript presents UMI as a data-collection and policy-learning framework whose central claims (zero-shot generalization to novel environments/objects, hardware-agnostic deployment) are supported by real-world experiments on diverse human demonstrations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described structure. Design choices (gripper interface, latency matching, relative actions) are motivated as domain-gap reducers but are not derived from or equivalent to the target results by construction. The claims are therefore tested against external empirical outcomes rather than derived from their own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard robotics assumptions about demonstration transferability and a new hardware-software interface whose fidelity is not independently verified outside the reported experiments.

axioms (1)

domain assumption: Human demonstrations captured via the handheld gripper interface can be mapped to robot actions with minimal unmodeled domain shift.
Invoked implicitly in the policy learning and zero-shot transfer claims.

invented entities (1)

Universal Manipulation Interface (UMI) gripper and policy interface · no independent evidence
purpose: Portable data collection device and deployable policy representation
New hardware and software components introduced to enable the claimed transfer.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
Paper passage: UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation.

Foundation.LawOfExistence · existence_economically_inevitable · unclear
Paper passage: policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO · 2026-04 · unverdicted · novelty 7.0
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Tune to Learn: How Controller Gains Shape Robot Policy Learning
cs.RO · 2026-04 · conditional · novelty 7.0
Controller gains affect learnability differently for behavior cloning, RL from scratch, and sim-to-real transfer, so optimal gains depend on the learning paradigm rather than desired task behavior.

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
cs.RO · 2026-05 · unverdicted · novelty 6.0
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
cs.RO · 2026-04 · unverdicted · novelty 6.0
A tactile-aware hierarchical policy for quadrupedal loco-manipulation improves real-world contact-rich task performance by 28.54% over vision-only and visuotactile baselines.

FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
cs.RO · 2026-04 · conditional · novelty 6.0
FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
cs.RO · 2026-04 · unverdicted · novelty 6.0
UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.

XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
cs.RO · 2026-04 · unverdicted · novelty 6.0
XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.

WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models
cs.RO · 2026-04 · unverdicted · novelty 6.0
WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.

ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
cs.RO · 2026-04 · unverdicted · novelty 6.0
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
cs.RO · 2026-04 · unverdicted · novelty 6.0
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks
cs.RO · 2026-04 · unverdicted · novelty 6.0
TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.

RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
cs.RO · 2026-04 · unverdicted · novelty 6.0
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.

Unified Video Action Model
cs.RO · 2025-02 · unverdicted · novelty 6.0
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO · 2025-02 · unverdicted · novelty 6.0
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO · 2026-05 · unverdicted · novelty 5.0
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems
cs.RO · 2026-04 · unverdicted · novelty 5.0
FlexiTac is a scalable piezoresistive tactile sensing system with flexible FPC-Velostat-FPC pads and a 100 Hz multi-channel readout board that mounts on rigid or soft grippers and supports visuo-tactile learning.

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
cs.RO · 2026-04 · unverdicted · novelty 5.0
A hierarchical tactile-aware policy combines human-demonstration training for contact cue prediction with sim-to-real reinforcement learning to improve quadrupedal loco-manipulation performance by 28.54% over vision b...

OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction
cs.RO · 2026-04 · unverdicted · novelty 5.0
OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.

Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision
cs.RO · 2026-05 · unverdicted · novelty 4.0
Behavior cloning produces active perception in a plant-centering task where a robot arm uses low-resolution egocentric RGB images to predict joint movements, with relative deltas outperforming absolute positions.

Towards Robotic Dexterous Hand Intelligence: A Survey
cs.RO · 2026-05 · unverdicted · novelty 4.0
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

World Action Models: The Next Frontier in Embodied AI
cs.RO · 2026-05 · unverdicted · novelty 4.0
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
cs.RO · 2026-04 · unverdicted · novelty 4.0
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstra...

Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
cs.RO · 2026-04 · unverdicted · novelty 3.0
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 22 Pith papers · 2 internal anchors

[1]

Human-to-robot imitation in the wild

Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. In Proceedings of Robotics: Science and Systems (RSS) , 2022

work page 2022
[2]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

work page 2023
[3]

Rt-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Proceedings of Robotics: Science and Systems (RSS) , 2023

work page 2023
[4]

Humanoid robot teleoperation with vibrotactile based balancing feedback

Anais Brygo, Ioannis Sarakoglou, Nadia Garcia- Hernandez, and Nikolaos Tsagarakis. Humanoid robot teleoperation with vibrotactile based balancing feedback. In Haptics: Neuroscience, Devices, Modeling, and Ap- plications: 9th International Conference, EuroHaptics 2014, Versailles, France, June 24-26, 2014, Proceedings, Part II 9, pages 266–275. Springer, 2014

work page 2014
[5]

Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR) , pages 510–517, 2015. doi: 10.1109/ICAR.2015.7251504

work page doi:10.1109/icar.2015.7251504 2015
[6]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam

Carlos Campos, Richard Elvira, Juan J G ´omez Rodr´ıguez, Jos ´e MM Montiel, and Juan D Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021

work page 2021
[7]

G ´omez Rodr´ıguez, Jos´e M

Carlos Campos, Richard Elvira, Juan J. G ´omez Rodr´ıguez, Jos´e M. M. Montiel, and Juan D. Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO. 2021.3075644

work page doi:10.1109/tro 2021
[8]

in-the- wild

Annie S Chen, Suraj Nair, and Chelsea Finn. Learn- ing generalizable robotic reward functions from “in-the- wild” human videos. In Proceedings of Robotics: Science and Systems (RSS) , 2021

work page 2021
[9]

Dif- fusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[10]

On hand-held grippers and the morphological gap in human manipulation demonstration

Kiran Doshi, Yijiang Huang, and Stelian Coros. On hand-held grippers and the morphological gap in human manipulation demonstration. arXiv preprint arXiv:2311.01832, 2023

work page arXiv 2023
[11]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

work page 2021
[12]

Ar2-d2: Training a robot without a robot

Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, and Ranjay Krishna. Ar2-d2: Training a robot without a robot. 2023

work page 2023
[13]

Bridge data: Boosting generalization of robotic skills with cross- domain datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross- domain datasets. In Proceedings of Robotics: Science and Systems (RSS) , 2022

work page 2022
[14]

Low-cost exoskeletons for learning whole-arm manipulation in the wild

Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Low-cost exoskeletons for learning whole-arm ma- nipulation in the wild. arXiv preprint arXiv:2309.14975, 2023

work page arXiv 2023
[15]

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mo- bile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

work page internal anchor Pith review arXiv 2024
[16]

Garrido-Jurado, R

S. Garrido-Jurado, R. Mu ˜noz-Salinas, F.J. Madrid- Cuevas, and M.J. Mar ´ın-Jim´enez. Automatic genera- tion and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292,

work page
[17]

doi: https://doi.org/10.1016/ j.patcog.2014.01.005

ISSN 0031-3203. doi: https://doi.org/10.1016/ j.patcog.2014.01.005. URL https://www.sciencedirect. com/science/article/pii/S0031320314000235

work page 2014
[18]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. corr abs/1512.03385 (2015), 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[19]

Gpmf introuction: Parser for gpmf™ format- ted telemetry data used within gopro® cameras

GoPro Inc. Gpmf introuction: Parser for gpmf™ format- ted telemetry data used within gopro® cameras. https: //gopro.github.io/gpmf-parser/. Accesssed: 2023-01-31

work page 2023
[20]

Bc-z: Zero-shot task generalization with robotic imitation learning

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (CoRL), volume 164, pages 991–1002. PMLR, 2022

work page 2022
[21]

Giv- ing robots a hand: Broadening generalization via hand- centric human video demonstrations

Moo Jin Kim, Jiajun Wu, and Chelsea Finn. Giv- ing robots a hand: Broadening generalization via hand- centric human video demonstrations. In Deep Reinforce- ment Learning Workshop NeurIPS , 2022

work page 2022
[22]

VIP: Towards universal visual reward and representation via value-implicit pre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations , 2023

work page 2023
[23]

Roboturk: A crowdsourcing platform for robotic skill learning through imitation

Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning (CoRL) , volume 87, pages 879–893. PMLR, 2018

work page 2018
[24]

R3m: A universal visual representation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL) , volume 205, pages 892–909. PMLR, 2022

work page 2022
[25]

Tax-pose: Task-specific cross-pose estimation for robot manipulation

Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, and David Held. Tax-pose: Task-specific cross-pose estimation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL) , volume 205, pages 1783–1792. PMLR, 2023

work page 2023
[26]

The surprising ef- fectiveness of representation learning for visual imitation

Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pan- dian Arunachalam, and Lerrel Pinto. The surprising ef- fectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS) , 2022

work page 2022
[27]

Learning of compliant human–robot interaction using full-body haptic interface

Luka Peternel and Jan Babi ˇc. Learning of compliant human–robot interaction using full-body haptic interface. Advanced Robotics, 27(13):1003–1012, 2013

work page 2013
[28]

Characterizing input methods for human-to-robot demonstrations

Pragathi Praveena, Guru Subramani, Bilge Mutlu, and Michael Gleicher. Characterizing input methods for human-to-robot demonstrations. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 344–353. IEEE, 2019

work page 2019
[29]

Dexmv: Imitation learning for dexterous manipulation from human videos

Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Springer, 2022

work page 2022
[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

work page 2021
[31]

Recent advances in robot learning from demonstration

Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual Review of Control, Robotics, and Autonomous Systems, 3:297–330, 2020

work page 2020
[32]

Latent plans for task-agnostic offline reinforcement learning

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1838–1849. PMLR, 2023

work page 2023
[33]

Scalable, intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks

Felipe Sanches, Geng Gao, Nathan Elangovan, Ricardo V Godoy, Jayden Chapman, Ke Wang, Patrick Jarvis, and Minas Liarokapis. Scalable, intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6318–6325. IEEE, 2023

work page 2023
[34]

Learning predictive models from observation and interaction

Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020

work page 2020
[35]

Reinforcement learning with videos: Combining offline observations with interaction

Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. In Proceedings of the 2020 Conference on Robot Learning (CoRL), volume 155, pages 339–354. PMLR, 2021

work page 2020
[36]

Deep imitation learning for humanoid loco-manipulation through human teleoperation

Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), pages 1–8. IEEE, 2023

work page 2023
[37]

On bringing robots home

Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

work page arXiv 2023
[38]

Concept2robot: Learning manipulation concepts from instructions and human demonstrations

Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021

work page 2021
[39]

Videodex: Learning dexterity from internet videos

Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 654–665. PMLR, 2023

work page 2023
[40]

Distilled feature fields enable few-shot language-guided manipulation

William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 405–424. PMLR, 2023

work page 2023
[41]

Neural descriptor fields: SE(3)-equivariant object representations for manipulation

Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022

work page 2022
[42]

Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations

Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020

work page 2020
[43]

SEED: Series elastic end effectors in 6d for visuotactile tool use

H.J. Terry Suh, Naveen Kuppuswamy, Tao Pang, Paul Mitiguy, Alex Alspach, and Russ Tedrake. SEED: Series elastic end effectors in 6d for visuotactile tool use. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4684–4691, 2022. doi: 10.1109/IROS47612.2022.9982092

work page doi:10.1109/iros47612.2022.9982092 2022
[44]

A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics

Alexander Toedtheide, Xiao Chen, Hamid Sadeghian, Abdeldjallil Naceri, and Sami Haddadin. A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12624–12630. IEEE, 2023

work page 2023
[45]

Mimicplay: Long-horizon imitation learning by watching human play

Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 201–221. PMLR, 2023

work page 2023
[46]

Error-aware imitation learning from teleoperation data for mobile manipulation

Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, and Roberto Martín-Martín. Error-aware imitation learning from teleoperation data for mobile manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164, pages 1367–1378. PMLR, 2022

work page 2022
[47]

GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators

Philipp Wu, Fred Shentu, Xingyu Lin, and Pieter Abbeel. GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023

work page 2023
[48]

Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot

Keenan A Wyrobek, Eric H Berger, HF Machiel Van der Loos, and J Kenneth Salisbury. Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot. In 2008 IEEE International Conference on Robotics and Automation, pages 2165–2170. IEEE, 2008

work page 2008
[49]

Masked visual pre-training for motor control

Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv:2203.06173, 2022

work page arXiv 2022
[50]

Learning by watching: Physical imitation of manipulation skills from human videos

Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021

work page 2021
[51]

Visual imitation made easy

Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy. In Conference on Robot Learning (CoRL), volume 155, pages 1992–2005. PMLR, 2021

work page 2021
[52]

Deep imitation learning for complex manipulation tasks from virtual reality teleoperation

Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018

work page 2018
[53]

Benefit of large field-of-view cameras for visual odometry

Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 801–808, 2016. doi: 10.1109/ICRA.2016.7487210

work page doi:10.1109/icra.2016.7487210 2016
[54]

Learning fine-grained bimanual manipulation with low-cost hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[55]

Viola: Imitation learning for vision-based manipulation with object proposal priors

Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1199–1210. PMLR, 2023.

APPENDIX

Please check out our website (https://umi-gripper.github.io) for additional results and comparisons. In...

work page 2023
[56]

Camera Latency Measurement: For policy observation across both the UR5 and Franka FR2 platforms, we equip each robot arm with a single wrist-mounted GoPro Hero 9 camera. To obtain real-time video streams from the GoPro, we use a combination of GoPro Media Mod 1.0 (to convert USB-C to HDMI) and an Elgato HD60X external capture card (to convert HDMI to USB-3...

work page
[57]

Proprioception Latency Measurement: When the robotic hardware directly reports global timestamps, as is the case for the Franka FR2 robot, we measure the proprioception latency by subtracting the robot sending timestamp $t_{robot}$ from the policy-received timestamp $t_{recv}$: $l_{obs} = t_{recv} - t_{robot}$. When the robotic hardware timestamp is unavailable, such as the UR5 ...
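The timestamp subtraction above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `proprioception_latency`, the sample timestamps, and the choice to average over several samples are all assumptions, and both clocks are assumed to share one global time base.

```python
from statistics import mean

def proprioception_latency(t_recv, t_robot):
    """Average l_obs = t_recv - t_robot over paired timestamps (seconds).

    Averaging is an assumption made for this sketch; the text only
    defines the per-sample difference.
    """
    if len(t_recv) != len(t_robot) or not t_recv:
        raise ValueError("need equally sized, non-empty timestamp lists")
    return mean(r - s for r, s in zip(t_recv, t_robot))

# Hypothetical timestamps (seconds) on a shared global clock.
t_robot = [10.000, 10.100, 10.200]  # when the robot sent each state
t_recv = [10.004, 10.105, 10.203]   # when the policy received it
l_obs = proprioception_latency(t_recv, t_robot)
```

With these made-up numbers `l_obs` comes out at roughly 4 ms.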

work page
[58]

To measure $l_{e2e}$, we send a sequence of sinusoidal position commands to the gripper, and then record a sequence of gripper width proprioceptions

Gripper Execution Latency Measurement: To obtain the gripper execution latency $l_{action}$, we subtract the proprioception latency $l_{obs}$ from the end-to-end latency $l_{e2e}$. To measure $l_{e2e}$, we send a sequence of sinusoidal position commands to the gripper, and then record a sequence of gripper width proprioceptions. The $l_{e2e}$ can be obtained by computing the optim...
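One way to picture the alignment step is a brute-force search over time shifts, sketched below. This is a hedged illustration, not the paper's procedure: the source text is truncated before it specifies the alignment method, so the mean-squared-error criterion, the function name `estimate_delay`, the 100 Hz sampling rate, the simulated 7-sample delay, and the `l_obs` value are all assumptions.

```python
import math

def estimate_delay(command, measured, dt, max_shift):
    """Return the delay (seconds) that best aligns `measured` with `command`.

    Tries every integer shift k in [0, max_shift] and keeps the one with the
    smallest mean squared error between command[i] and measured[i + k].
    """
    best_shift, best_err = 0, math.inf
    for k in range(max_shift + 1):
        n = len(command) - k
        err = sum((command[i] - measured[i + k]) ** 2 for i in range(n)) / n
        if err < best_err:
            best_shift, best_err = k, err
    return best_shift * dt

dt = 0.01        # assumed 100 Hz sampling
true_shift = 7   # simulated delay, in samples
# Sinusoidal gripper-width command (metres, hypothetical amplitude and offset).
command = [0.04 + 0.02 * math.sin(2 * math.pi * 0.5 * i * dt) for i in range(400)]
# Simulated proprioception: the command delayed by `true_shift` samples.
measured = [command[max(i - true_shift, 0)] for i in range(400)]

l_e2e = estimate_delay(command, measured, dt, max_shift=50)
l_obs = 0.004              # hypothetical proprioception latency (seconds)
l_action = l_e2e - l_obs   # execution latency, as defined in the text
```

The search range should stay below one period of the sinusoid, otherwise shifts that differ by a full period become indistinguishable.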

work page
[59]

GoPro Labs

Robot Execution Latency Measurement: Similar to the gripper, we also measure the execution latency of the robot (either UR5 or Franka) by calculating $l_{e2e}$ as the optimal alignment between a sequence of desired end-effector poses and the measured actual end-effector poses. Due to safety concerns, we directly teleoperate the robot to generate the desired e...

work page
[60]

During evaluation, we manually match the initial states with a third-person camera to be close to pixel-perfect

Initial State Selection : For all tasks, we manually select a set of initial states with diverse pose coverage across task scenes (for both the robot and the environment) that are shared across all evaluated methods. During evaluation, we manually match the initial states with a third-person camera to be close to pixel-perfect. We ensure the initial state...

work page
[61]

An evaluation episode can be terminated due to: • Safety Concern

Termination Criteria: During evaluation, an operator supervises the robot at all times. An evaluation episode can be terminated due to:
• Safety Concern. When the operator deems the robot is about to perform dangerous actions that could potentially break the setup/robot or do any other harm, the episode will be terminated immediately.
• Robot Fault. When...

work page
[62]

espresso cup with saucer

Success Criteria: It is difficult to define automatic and compact success metrics for the complex manipulation tasks reported in this paper. Therefore, the operator manually judges the success or failure of each episode using the rubric described below. While we try to create a concise and objective rubric, it inevitably contains subjective elements. As...

work page
[63]

We found this feature to significantly increase mapping robustness in-the-wild

with known sizes to disambiguate possible explanations of feature matches. We found this feature to significantly increase mapping robustness in-the-wild. Note that demonstration videos will not contain these fiducial markers; they are only used for mapping.

E. Policy Implementation Details

We use Diffusion Policy [9] for all tasks. Detailed hyperpara...

work page
[64]

Notably, the dataset collected for each task lacks the scale required for training ViT from scratch

Vision encoder: We utilize the Vision Transformer (ViT) [11] as the vision encoder due to its substantial capacity in comparison to ResNet [17], which proves crucial for tasks demanding intricate perceptual capabilities. Notably, the dataset collected for each task lacks the scale required for training ViT from scratch. To address this limitation, we e...

work page
[65]

However, a frequency of 20Hz is employed for the dynamic tossing task, which requires highly reactive behaviors

Frequency: For most quasi-static tasks, a frequency of 10Hz proves sufficient for both observation and action. However, a frequency of 20Hz is employed for the dynamic tossing task, which requires highly reactive behaviors.

work page
[66]

However, during execution, we are not bound to follow the same dt

Speed: The output of Diffusion Policy is a sequence of actions, specifically the target pose, with an implicit $dt_{output}$ between two steps determined by the demonstration dataset. However, during execution, we are not bound to follow the same $dt$. By adjusting $dt_{execution}$, we can achieve different execution speeds compared to the human demonstration. I...
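The effect of changing the execution time step can be illustrated with a small scheduling sketch. It is an assumption-laden example, not the authors' controller: `schedule_actions` is a hypothetical helper, the string placeholders stand in for target poses, and the 10 Hz output rate is taken from the quasi-static setting described in the Frequency paragraph.

```python
def schedule_actions(actions, t_start, dt_execution):
    """Pair each predicted action with the time at which it should be executed."""
    return [(t_start + i * dt_execution, a) for i, a in enumerate(actions)]

dt_output = 0.1  # assumed: actions predicted at 10 Hz, matching the demonstrations
actions = ["pose_0", "pose_1", "pose_2", "pose_3"]  # placeholders for target poses

# Replaying with dt_execution == dt_output reproduces the demonstration speed.
same_speed = schedule_actions(actions, t_start=0.0, dt_execution=dt_output)
# Halving dt_execution executes the same pose sequence twice as fast.
double_speed = schedule_actions(actions, t_start=0.0, dt_execution=dt_output / 2)
```

The pose sequence itself is unchanged; only the timestamps handed to the low-level controller differ, so the speed-up factor is `dt_output / dt_execution`.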

work page
[67]

Image Augmentation: We employ a set of image augmentations to enhance the diversity of our training data, thereby improving the robustness and generalization capabilities of our policy. The augmentation pipeline includes a RandomCrop operation with a ratio of 0.95, a RandomRotation operation with degrees ranging from -5.0 to 5.0, and a ColorJitter ...
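As a rough sketch of the geometric part of this pipeline, the snippet below samples the parameters for one image using only the Python standard library. Several things here are assumptions: the helper name `sample_augmentation`, the 224×224 input size, and the reading of "ratio of 0.95" as a crop whose sides are 0.95 times the image sides. The ColorJitter settings are truncated in the text above, so they are deliberately left out.

```python
import random

def sample_augmentation(height, width, crop_ratio=0.95, max_degrees=5.0, rng=random):
    """Sample a random crop box and a rotation angle for one image.

    Returns the crop as (top, left, crop_height, crop_width) in pixels and
    the rotation angle in degrees, drawn uniformly from [-max_degrees, max_degrees].
    """
    crop_h, crop_w = int(height * crop_ratio), int(width * crop_ratio)
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    angle = rng.uniform(-max_degrees, max_degrees)
    return {"box": (top, left, crop_h, crop_w), "angle": angle}

# Assumed 224x224 input; a seeded generator keeps the example reproducible.
params = sample_augmentation(224, 224, rng=random.Random(0))
```

In a real training loop these parameters would be drawn fresh for every sample and applied by an image library.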

work page
[68]

Printed with 95A TPU material, the rib-like pattern on the finger maintains rigidity on the fingertip while conforming to the object geometry for a more secure grasp (Fig

Soft Compliant Fingers: We used the same soft fingers on both the UMI data collection grippers and the deployed robotic grippers. Printed with 95A TPU material, the rib-like pattern on the finger maintains rigidity on the fingertip while conforming to the object geometry for a more secure grasp (Fig. A3). When deployed to robots that lack force-torque co...

work page
[69]

Franka Mount: Due to the FR2’s limited end-effector pitch (the FR2 is designed for top-down pick and place, while the UMI gripper is mostly held horizontally), we had to design and 3D print a custom mounting adapter that rotates the WSG50 gripper by 90 degrees with respect to the robot’s end-effector flange.

work page