pith. sign in

arxiv: 2603.03243 · v2 · pith:7QLGLLWGnew · submitted 2026-03-03 · 💻 cs.RO

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Pith reviewed 2026-05-21 11:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords whole-body mobile manipulationhuman demonstrationsimitation learningcross-embodiment transferegocentric sensingrobot-free data collectionbimanual coordinationwhole-body control
0
0 comments X

The pith

A cross-embodiment hand-eye policy design lets whole-body mobile manipulation policies be learned directly from robot-free human demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HoMMI, a framework that collects human demonstrations using portable UMI interfaces augmented with egocentric cameras to capture the global context needed for mobile tasks. It tackles the resulting larger human-to-robot embodiment gap through three specific choices: an embodiment-agnostic visual representation, a relaxed head action representation, and a whole-body controller that converts hand-eye trajectories into coordinated robot motion while respecting physical limits. These elements together support policy transfer for long-horizon tasks that combine bimanual coordination, navigation, and active perception. A reader would care because the method removes the need for robot-specific demonstration data, making data collection cheaper and more scalable for complex mobile manipulation.

Core claim

The central claim is that augmenting human demonstration interfaces with egocentric sensing and then applying a cross-embodiment hand-eye policy design—consisting of an embodiment-agnostic visual representation, relaxed head actions, and a whole-body controller—bridges the observation and action gaps well enough for direct policy transfer, enabling successful execution of long-horizon mobile manipulation that requires bimanual coordination, navigation, and active perception.

What carries the argument

The cross-embodiment hand-eye policy design, which converts human egocentric demonstrations into robot-executable whole-body trajectories via agnostic visual features, relaxed head commands, and constraint-aware motion coordination.

If this is right

  • Long-horizon mobile manipulation tasks become learnable without any robot demonstrations.
  • Data collection for such tasks can be performed portably with human operators and standard cameras.
  • Whole-body controllers can realize hand-eye trajectories while obeying robot-specific physical constraints.
  • Policies gain the ability to combine manipulation with navigation and active perception in one model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same design pattern could be tested on additional robot morphologies to check how much the controller must be retuned.
  • Removing the need for robot data might lower overall training costs enough to support larger-scale imitation learning datasets.
  • The approach opens a route to collecting demonstrations in unstructured homes or outdoor settings where robots are hard to deploy.
  • Future extensions could explore whether the agnostic visual representation transfers across different camera hardware.

Load-bearing premise

The larger embodiment gap from egocentric human sensing can be closed adequately by the three design choices alone, without robot-specific demonstrations or large performance drops.

What would settle it

A robot executing the learned policy fails to complete long-horizon tasks that require switching between navigation, bimanual grasping, and viewpoint adjustment when tested in the same environments used for human data collection.

Figures

Figures reproduced from arXiv: 2603.03243 by Aditya Bhat, Dian Wang, Eric Cousineau, Han Zhang, Jeannette Bohg, Jisang Park, Jose Barreiros, Shuran Song, Xiaomeng Xu.

Figure 1
Figure 1. Figure 1: Whole-Body Mobile Manipulation Interface (HoMMI). (a) We extend UMI with egocentric sensing to enable scalable mobile manipulation with active perception – capabilities that cannot be achieved with the original UMI. (b) However, the new egocentric view creates a substantial embodiment gap in both observation and action space, making policy transfer difficult. (c) We bridge this embodiment gap by carefully … view at source ↗
Figure 2
Figure 2. Figure 2: HoMMI System Overview. We learn whole-body mobile manipulation from human demonstrations with an intuitive data collection interface (§ IV), a cross-embodiment policy design with an embodiment-agnostic visual representation and a relaxed head action representation (§ V), and a whole-body controller that achieves hand-eye tracking through whole-body motions respecting physical constraints (§ VI-B). data sca… view at source ↗
Figure 3
Figure 3. Figure 3: Embodiment-Agnostic Visual Representation. We use a 3D representation for egocentric observations that allows using an embodiment-agnostic gripper coordinate frame, and masking out embodiment-specific arms and body observations. the operator and avoiding the motion-sickness often associated with VR-based data collection [43, 45, 36]. V. CROSS-EMBODIMENT HAND-EYE POLICY Leveraging the collected data, we tra… view at source ↗
Figure 4
Figure 4. Figure 4: Look-at Point Action Rep￾resentation. To bridge the kine￾matic gap (e.g., height and neck DoF), we relax the head action constraint by representing the robot gaze as a “3D look-at point”. This representation allows effective ac￾tive perception for gathering task￾relevant information without over￾constraining the robot to mimic hu￾man head motions exactly. Mobile robots have dif￾ferent kinematics than hu￾ma… view at source ↗
Figure 5
Figure 5. Figure 5: HoMMI Whole-Body Controller is designed to achieve precise end-effector tracking for accurate manipulation and effective active perception for information gathering. To do so, it uses (a) a relaxed head look-at point action representation that allows accurate bimanual end-effectors SE(3) tracking, circumventing the infeasibility and increased error associated with simultaneous 6-DoF head-hand tracking. In … view at source ↗
Figure 7
Figure 7. Figure 7: Laundry Task. (a) Our cross-embodiment hand-eye policy rollout, highlighting our system’s capability of whole-body coordination and active perception. (b) Different test scenarios with different objects and bin locations. (c) Typical failure cases of the baselines. Wrist-Only RGB-Only Head-Only w/o Neck Ours 0 20 40 60 80 100 Success Rate (%) 20 0 0 75 90 Laundry Wrist-Only RGB-Only Head-Only w/o Neck Ours… view at source ↗
Figure 8
Figure 8. Figure 8: Quantitative Results. Ours consistently outperforms base￾lines across all three long-horizon mobile manipulation tasks. • Cross-embodiment transfer: deploying policies learned from robot-free human demonstrations on a robot with a different appearance and kinematics. Required for all tasks. • Bimanual / Whole-body coordination: coordinating two arms, mobile base, torso, and head for mobile manipulation. • … view at source ↗
Figure 9
Figure 9. Figure 9: Delivery Task. (a) Our policy rollout, demonstrating long-horizon navigation over a large workspace and active perception. (b) Different test scenarios with different trolley locations and initial base positions and orientations. (c) Typical failure cases of the baselines. in human and robot observations, causing the policy to go OOD. (3) Head-Only’s success rate is also 0%, failing due to missing the clot… view at source ↗
Figure 10
Figure 10. Figure 10: Tablescape Task. (a) Our policy rollout, demonstrating precise bimanual and whole-body coordination. (b) Different test scenarios with different initial base positions and mat placement. (c) Typical failure cases of the baselines. the initial alignment is inaccurate. The remaining failures are due to slight misalignment at the end after long navigation. Typical baseline failure modes are shown in Fig. 9c.… view at source ↗
Figure 11
Figure 11. Figure 11: Egocentric Attention Comparison. We visualize attention maps for egocentric observations with yellow representing higher attention values. Ours exhibits clean attention highlighted around task-relevant objects, while baselines’ attentions are less informative. observations as policy input and jointly finetuning the vision encoder on both wrist and egocentric images helps the policy learn cleaner egocentri… view at source ↗
Figure 12
Figure 12. Figure 12: Hardware Schematic. multi-camera streaming and high-frequency robot closed-loop control [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
read the original abstract

We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents HoMMI, a data collection and policy learning framework for whole-body mobile manipulation that uses augmented UMI interfaces with egocentric sensing to capture human demonstrations without robots. It identifies a larger embodiment gap from egocentric sensing and proposes to bridge it via a cross-embodiment hand-eye policy consisting of an embodiment-agnostic visual representation, a relaxed head action representation, and a whole-body controller that converts hand-eye trajectories into feasible robot motion respecting kinematics, dynamics, and constraints. The central claim is that these three design choices together enable successful policy transfer and execution on long-horizon tasks requiring bimanual coordination, navigation, and active perception.

Significance. If the empirical results hold, the work offers a practical route to scalable imitation learning for complex mobile manipulation by removing the need for robot-specific demonstrations. The explicit decomposition of the embodiment gap into observation, action, and execution layers, together with the whole-body controller as the final realization step, provides a concrete engineering template that could be reused across platforms. The project page is referenced for qualitative results, which is a positive step toward reproducibility.

major comments (2)
  1. [§4] §4 (Whole-Body Controller): The controller is described as realizing hand-eye trajectories under robot-specific physical constraints, yet the manuscript provides no quantitative metrics (success rate, failure modes, or horizon length) on how often the controller rejects or deviates from policy outputs in long-horizon active-perception scenarios. This is load-bearing for the central claim because the other two design choices only generate target trajectories; transfer success ultimately depends on reliable execution over extended sequences.
  2. [Evaluation] Evaluation section: The abstract and results point to a project page for demonstrations, but the main text lacks reported quantitative comparisons against baselines, ablations of the three design choices, or robot-specific demonstration data. Without these numbers, it is difficult to assess whether the embodiment gap is closed sufficiently or whether performance degrades on tasks with tight navigation-manipulation coupling.
minor comments (2)
  1. [§3.3] Notation for the relaxed head action representation could be clarified with an explicit equation or pseudocode showing how the relaxation is implemented during policy inference.
  2. [Figures] Figure captions should explicitly state whether the visualized trajectories are from human demonstrations, policy rollouts, or controller outputs to aid reader interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting key areas where additional evidence would strengthen the manuscript. We address the two major comments point by point below and commit to incorporating the requested quantitative analysis in a revised version.

read point-by-point responses
  1. Referee: [§4] §4 (Whole-Body Controller): The controller is described as realizing hand-eye trajectories under robot-specific physical constraints, yet the manuscript provides no quantitative metrics (success rate, failure modes, or horizon length) on how often the controller rejects or deviates from policy outputs in long-horizon active-perception scenarios. This is load-bearing for the central claim because the other two design choices only generate target trajectories; transfer success ultimately depends on reliable execution over extended sequences.

    Authors: We agree that quantitative characterization of the whole-body controller is necessary to substantiate reliable long-horizon execution. In the revised manuscript we will add a dedicated subsection to §4 that reports (i) success rate of trajectory realization, (ii) frequency and types of rejections or deviations (e.g., kinematic, dynamic, or collision constraints), and (iii) performance as a function of horizon length on the active-perception tasks. These metrics will be obtained from the same evaluation rollouts already collected for the policy results. revision: yes

  2. Referee: [Evaluation] Evaluation section: The abstract and results point to a project page for demonstrations, but the main text lacks reported quantitative comparisons against baselines, ablations of the three design choices, or robot-specific demonstration data. Without these numbers, it is difficult to assess whether the embodiment gap is closed sufficiently or whether performance degrades on tasks with tight navigation-manipulation coupling.

    Authors: We acknowledge that the current main-text evaluation relies primarily on qualitative results hosted on the project page. In the revision we will move quantitative results into the Evaluation section, adding (a) success-rate tables comparing HoMMI against relevant baselines, (b) ablations that isolate the contribution of each of the three design choices (embodiment-agnostic visual representation, relaxed head actions, and whole-body controller), and (c) where feasible, a comparison against policies trained on robot-specific demonstrations. These additions will directly address questions of embodiment-gap closure and performance under tight navigation-manipulation coupling. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering design framework with independent experimental validation

full rationale

The paper describes HoMMI as a practical data-collection and policy-learning framework that augments UMI interfaces with egocentric sensing and bridges the human-to-robot gap via three explicit design choices (embodiment-agnostic visual representation, relaxed head action representation, and a whole-body controller). No equations, fitted parameters, or predictions are presented that reduce the claimed transfer success to a self-referential quantity or to a self-citation chain. The central claim is supported by experimental results on long-horizon tasks rather than by internal re-derivation of inputs, satisfying the criterion for a self-contained engineering contribution against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that egocentric sensing plus the three policy components suffice to close the embodiment gap; no explicit free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption Augmenting UMI interfaces with egocentric sensing captures the global context required for mobile manipulation.
    Stated directly in the abstract as the motivation for the augmentation.
invented entities (1)
  • Cross-embodiment hand-eye policy no independent evidence
    purpose: Bridge the human-to-robot embodiment gap in observation and action spaces for whole-body control.
    Introduced as the key technical contribution that realizes hand-eye trajectories under robot constraints.

pith-pipeline@v0.9.0 · 5713 in / 1415 out tokens · 43280 ms · 2026-05-21T11:27:36.679301+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

    cs.CV 2026-05 unverdicted novelty 6.0

    EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.

  2. BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

  3. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  4. UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

    cs.RO 2026-04 unverdicted novelty 6.0

    UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.

  5. ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

    cs.RO 2026-04 unverdicted novelty 6.0

    ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

  6. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 6 Pith papers · 5 internal anchors

  1. [1]

    Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manip- ulation

    Arpit Bahety, Arnav Balaji, Ben Abbatematteo, and Roberto Mart ´ın-Mart´ın. Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manip- ulation. InRSS 2025 Workshop: Mobile Manipulation: Emerging Opportunities & Contemporary Challenges, 2025

  2. [2]

    A careful examination of large behavior models for multitask dexterous manipulation

    Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

  3. [3]

    In- n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

    Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-n-on: Scaling egocentric manipulation with in-the- wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

  4. [4]

    Open-television: Teleoperation with immersive active visual feedback

    Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. InConference on Robot Learning, pages 2729–2749. PMLR, 2025

  5. [5]

    Universal manipulation interface: In- the-wild robot teaching without in-the-wild robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In- the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS), 2024

  6. [6]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  7. [7]

    In-the-wild compliant manipulation with umi-ft.arXiv preprint arXiv:2601.09988, 2026

    Hojung Choi, Yifan Hou, Chuer Pan, Seongheon Hong, Austin Patel, Xiaomeng Xu, Mark R Cutkosky, and Shuran Song. In-the-wild compliant manipulation with umi-ft.arXiv preprint arXiv:2601.09988, 2026

  8. [8]

    Look, focus, act: Efficient and robust robot learning via human gaze and foveated vision transform- ers.arXiv preprint arXiv:2507.15833, 2025

    Ian Chuang, Jinyu Zou, Andrew Lee, Dechen Gao, and Iman Soltani. Look, focus, act: Efficient and robust robot learning via human gaze and foveated vision transform- ers.arXiv preprint arXiv:2507.15833, 2025

  9. [9]

    Telemoma: A modular and versatile teleoperation system for mobile manipulation

    Shivin Dass, Wensi Ai, Yuqian Jiang, Samik Singh, Jia- heng Hu, Ruohan Zhang, Peter Stone, Ben Abbatematteo, and Roberto Mart ´ın-Mart´ın. Telemoma: A modular and versatile teleoperation system for mobile manipulation. InRSS 2024 Workshop: Data Generation for Robotics

  10. [10]

    Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In8th Annual Con- ference on Robot Learning, 2024

  11. [11]

    Humanplus: Humanoid shadowing and imitation from humans

    Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. InConference on Robot Learning, pages 2828–2844. PMLR, 2025

  12. [12]

    Umi-on-legs: Making manipulation policies mo- bile with manipulation-centric whole-body controllers

    Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. Umi-on-legs: Making manipulation policies mo- bile with manipulation-centric whole-body controllers. InConference on Robot Learning, pages 5254–5270. PMLR, 2025

  13. [13]

    Stream- ingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Stream- ingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568– 2577, 2025

  14. [14]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Siva- purapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

  15. [15]

    BEHA VIOR robot suite: Stream- lining real-world whole-body manipulation for everyday household activities

    Yunfan Jiang, Ruohan Zhang, Josiah Wong, Chen Wang, Yanjie Ze, Hang Yin, Cem Gokmen, Shuran Song, Jiajun Wu, and Li Fei-Fei. BEHA VIOR robot suite: Stream- lining real-world whole-body manipulation for everyday household activities. In9th Annual Conference on Robot Learning, 2025

  16. [16]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 13226– 13233. IEEE, 2025

  17. [17]

    Generative-ai-driven jumping robot design using dif- fusion models

    Byungchul Kim, Tsun-Hsuan Wang, and Daniela Rus. Generative-ai-driven jumping robot design using dif- fusion models. In2025 International Conference on Robotics and Automation (ICRA), 2025

  18. [18]

    Mas- querade: Learning from in-the-wild human videos using data-editing

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Mas- querade: Learning from in-the-wild human videos using data-editing. InHuman to Robot: Workshop on Sensoriz- ing, Modeling, and Learning from Humans, 2025

  19. [19]

    Momagen: Gener- ating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation

    Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yun- fan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, et al. Momagen: Gener- ating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation. InRSS 2025 Workshop: Mobile Manipulation: Emerging Opportuni- ties and Contemporary Challenges

  20. [20]

    Vitamin: Learning contact- rich tasks through robot-free visuo-tactile manipulation interface.arXiv preprint arXiv:2504.06156, 2025

    Fangchen Liu, Chuanyu Li, Yihua Qin, Jing Xu, Pieter Abbeel, and Rui Chen. Vitamin: Learning contact- rich tasks through robot-free visuo-tactile manipulation interface.arXiv preprint arXiv:2504.06156, 2025

  21. [21]

    Enhancing generalizable 6d pose tracking of an in-hand object with tactile sensing.IEEE Robotics and Automation Letters, 9(2):1106–1113, 2023

    Yun Liu, Xiaomeng Xu, Weihang Chen, Haocheng Yuan, He Wang, Jing Xu, Rui Chen, and Li Yi. Enhancing generalizable 6d pose tracking of an in-hand object with tactile sensing.IEEE Robotics and Automation Letters, 9(2):1106–1113, 2023

  22. [22]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  23. [23]

    Lookout: Real- world humanoid egocentric navigation

    Boxiao Pan, Adam W Harley, Francis Engelmann, C Karen Liu, and Leonidas J Guibas. Lookout: Real- world humanoid egocentric navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24977–24988, 2025

  24. [24]

    Egobridge: Domain adaptation for generalizable imitation from egocentric human data

    Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Y Zhu, Simar Kareer, Judy Hoffman, and Danfei Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data. InHuman to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025

  25. [25]

    Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang. Humanoid policy ∼human policy. In9th Annual Conference on Robot Learning, 2025

  26. [26]

    Mv-umi: A scalable multi- view interface for cross-embodiment learning.arXiv preprint arXiv:2509.18757, 2025

    Omar Rayyan, John Abanes, Mahmoud Hafez, Anthony Tzes, and Fares Abu-Dakka. Mv-umi: A scalable multi- view interface for cross-embodiment learning.arXiv preprint arXiv:2509.18757, 2025

  27. [27]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

  28. [28]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  29. [29]

    Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

    Ajay Sridhar, Jennifer Pan, Satvik Sharma, and Chelsea Finn. Memer: Scaling up memory for robot control via experience retrieval.arXiv preprint arXiv:2510.20328, 2025

  30. [30]

    Homer: Learning in-the-wild mobile manipulation via hybrid imitation and whole-body control.arXiv preprint arXiv:2506.01185, 2025

    Priya Sundaresan, Rhea Malhotra, Phillip Miao, Jingyun Yang, Jimmy Wu, Hengyuan Hu, Rika Antonova, Fran- cis Engelmann, Dorsa Sadigh, and Jeannette Bohg. Homer: Learning in-the-wild mobile manipulation via hybrid imitation and whole-body control.arXiv preprint arXiv:2506.01185, 2025

  31. [31]

    Spin: Simultaneous percep- tion interaction and navigation

    Shagun Uppal, Ananye Agarwal, Haoyu Xiong, Kenneth Shaw, and Deepak Pathak. Spin: Simultaneous percep- tion interaction and navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18133–18142, 2024

  32. [32]

    Foundationstereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 5249–5260, 2025

  33. [33]

    Adapt3r: Adaptive 3d scene representation for domain transfer in imitation learning

    Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, and Animesh Garg. Adapt3r: Adaptive 3d scene representation for domain transfer in imitation learning. In9th Annual Conference on Robot Learning, 2025

  34. [34]

    Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators

    Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12156–12163. IEEE, 2024

  35. [35]

    Adaptive mobile manipulation for ar- ticulated objects in the open world.arXiv preprint arXiv:2401.14403, 2024

    Haoyu Xiong, Russell Mendonca, Kenneth Shaw, and Deepak Pathak. Adaptive mobile manipulation for ar- ticulated objects in the open world.arXiv preprint arXiv:2401.14403, 2024

  36. [36]

    Vision in action: Learning active perception from human demonstrations

    Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, and Shuran Song. Vision in action: Learning active perception from human demonstrations. In9th Annual Conference on Robot Learning, 2025

  37. [37]

    Jacobinerf: Nerf shaping with mutual information gradients

    Xiaomeng Xu, Yanchao Yang, Kaichun Mo, Boxiao Pan, Li Yi, and Leonidas Guibas. Jacobinerf: Nerf shaping with mutual information gradients. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16498–16507, 2023

  38. [38]

    RoboPanoptes: The All-Seeing Robot with Whole-body Dexterity

    Xiaomeng Xu, Dominik Bauer, and Shuran Song. RoboPanoptes: The All-Seeing Robot with Whole-body Dexterity. InProceedings of Robotics: Science and Systems, 2025

  39. [39]

    Dynamics- guided diffusion model for sensor-less robot manipulator design

    Xiaomeng Xu, Huy Ha, and Shuran Song. Dynamics- guided diffusion model for sensor-less robot manipulator design. InConference on Robot Learning, pages 4446–

  40. [40]

    Compliant residual DAgger: Improving real-world contact-rich manipulation with human corrections

    Xiaomeng Xu, Yifan Hou, Zeyi Liu, and Shuran Song. Compliant residual DAgger: Improving real-world contact-rich manipulation with human corrections. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  41. [41]

    Mobi-π: Mobilizing your robot learning policy

    Jingyun Yang, Isabella Huang, Brandon Vu, Max Ba- jracharya, Rika Antonova, and Jeannette Bohg. Mobi-π: Mobilizing your robot learning policy. In9th Annual Conference on Robot Learning, 2025

  42. [42]

    EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

    Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. Egovla: Learning vision- language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025

  43. [43]

    Egomi: Learning active vision and whole-body manipulation from egocentric human demonstrations.arXiv preprint arXiv:2511.00153, 2025

    Justin Yu, Yide Shentu, Di Wu, Pieter Abbeel, Ken Goldberg, and Philipp Wu. Egomi: Learning active vision and whole-body manipulation from egocentric human demonstrations.arXiv preprint arXiv:2511.00153, 2025

  44. [44]

    Mink: Python inverse kinematics based on MuJoCo, December 2025

    Kevin Zakka. Mink: Python inverse kinematics based on MuJoCo, December 2025. URL https://github.com/ kevinzakka/mink

  45. [45]

    Activeumi: Robotic manipulation with active perception from robot-free human demonstrations.arXiv preprint arXiv:2510.01607, 2025

    Qiyuan Zeng, Chengmeng Li, Jude St John, Zhongyi Zhou, Junjie Wen, Guorui Feng, Yichen Zhu, and Yi Xu. Activeumi: Robotic manipulation with active perception from robot-free human demonstrations.arXiv preprint arXiv:2510.01607, 2025

  46. [46]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  47. [47]

    Fastumi: A scalable and hardware-independent universal manip- ulation interface with dataset

    Zhaxizhuom Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan CHEN, Pingrui Zhang, et al. Fastumi: A scalable and hardware-independent universal manip- ulation interface with dataset. InConference on Robot Learning, pages 3069–3093. PMLR, 2025

  48. [48]

    On the continuity of rotation representations in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019

  49. [49]

    Emma: Scaling mobile manipulation via egocentric human data.IEEE Robotics and Automation Letters, 2026

    Lawrence Y Zhu, Pranav Kuppili, Ryan Punamiya, Patcharapong Aphiwetsa, Dhruv Patel, Simar Kareer, Sehoon Ha, and Danfei Xu. Emma: Scaling mobile manipulation via egocentric human data.IEEE Robotics and Automation Letters, 2026

  50. [50]

    Touch in the wild: Learning fine-grained manipulation with a portable visuo-tactile gripper.arXiv preprint arXiv:2507.15062, 2025

    Xinyue Zhu, Binghao Huang, and Yunzhu Li. Touch in the wild: Learning fine-grained manipulation with a portable visuo-tactile gripper.arXiv preprint arXiv:2507.15062, 2025. . APPENDIX I. POLICYTRAININGDETAILS Observations and actions.We use a short observation history ofT o=2 steps and predict an action horizon ofT p=32 steps at 20 Hz (downsampled from 60...