HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Botao He; Furong Huang; Kelin Yu; Ruohan Gao; Seungjae Lee; Yiannis Aloimonos; Zhi Wang

arxiv: 2605.24934 · v2 · pith:FKO37AYXnew · submitted 2026-05-24 · cs.RO · cs.AI · cs.CV · cs.LG

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos

classification cs.RO · cs.AI · cs.CV · cs.LG

keywords human · humanego · robot · minutes · zero-shot · across · egocentric · embodiment

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ForceBand: Learning Forceful Manipulation with sEMG
cs.RO 2026-06 unverdicted novelty 6.0

ForceBand uses sEMG and IMU signals to predict fingertip forces from human demos, producing force-augmented data that lets robot policies reach 87% success on pick-squeeze-place tasks across varied objects.

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition
cs.RO 2026-06 unverdicted novelty 5.0

LUCID learns embodiment-agnostic intent models from unstructured human videos to train dexterous robot policies in simulation, enabling zero-shot transfer on real-world tasks like stirring and wiping.