EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
background 4polarities
background 4representative citing papers
BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
citing papers explorer
-
EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices
EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
-
BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
-
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.