HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Aditya Bhat; Dian Wang; Eric Cousineau; Han Zhang; Jeannette Bohg; Jisang Park; Jose Barreiros; Shuran Song; Xiaomeng Xu

arxiv: 2603.03243 · v2 · pith:7QLGLLWGnew · submitted 2026-03-03 · 💻 cs.RO

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Xiaomeng Xu , Jisang Park , Han Zhang , Eric Cousineau , Aditya Bhat , Jose Barreiros , Dian Wang , Jeannette Bohg

show 1 more author

Shuran Song

This is my paper

classification 💻 cs.RO

keywords whole-bodymanipulationmobilepolicyactioncollectiondatademonstrations

0 comments

read the original abstract

We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices
cs.CV 2026-05 unverdicted novelty 6.0

EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.
GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
cs.RO 2026-04 unverdicted novelty 6.0

UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
cs.RO 2026-04 unverdicted novelty 6.0

ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.