pith. sign in

arxiv: 2602.10106 · v2 · pith:TMORVASZnew · submitted 2026-02-10 · 💻 cs.RO

EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration

classification 💻 cs.RO
keywords datahumanalignmentegocentricloco-manipulationactioncollectiondemonstrations
0
0 comments X
read the original abstract

Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lies two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51\%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

    cs.CV 2026-05 unverdicted novelty 6.0

    EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.

  2. BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

  3. ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

    cs.RO 2026-04 unverdicted novelty 6.0

    ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.

  4. HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...

  5. HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.

  6. TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.

  7. Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

    cs.CV 2026-03 unverdicted novelty 6.0

    A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.

  8. EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks

    cs.RO 2026-04 unverdicted novelty 4.0

    EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstra...