HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
Pith reviewed 2026-05-21 11:27 UTC · model grok-4.3
The pith
A cross-embodiment hand-eye policy design lets whole-body mobile manipulation policies be learned directly from robot-free human demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that augmenting human demonstration interfaces with egocentric sensing, then applying a cross-embodiment hand-eye policy design (an embodiment-agnostic visual representation, relaxed head actions, and a whole-body controller), bridges the observation and action gaps well enough for direct policy transfer. The resulting policies execute long-horizon mobile manipulation that requires bimanual coordination, navigation, and active perception.
What carries the argument
The cross-embodiment hand-eye policy design carries the argument: it converts human egocentric demonstrations into robot-executable whole-body trajectories via embodiment-agnostic visual features, relaxed head commands, and constraint-aware motion coordination.
If this is right
- Long-horizon mobile manipulation tasks become learnable without any robot demonstrations.
- Data collection for such tasks can be performed portably with human operators and standard cameras.
- Whole-body controllers can realize hand-eye trajectories while obeying robot-specific physical constraints.
- Policies gain the ability to combine manipulation with navigation and active perception in one model.
Where Pith is reading between the lines
- The same design pattern could be tested on additional robot morphologies to check how much the controller must be retuned.
- Removing the need for robot data might lower overall training costs enough to support larger-scale imitation learning datasets.
- The approach opens a route to collecting demonstrations in unstructured homes or outdoor settings where robots are hard to deploy.
- Future extensions could explore whether the agnostic visual representation transfers across different camera hardware.
Load-bearing premise
The larger embodiment gap from egocentric human sensing can be closed adequately by the three design choices alone, without robot-specific demonstrations or large performance drops.
What would settle it
The claim would be falsified if a robot executing the learned policy failed to complete long-horizon tasks that require switching between navigation, bimanual grasping, and viewpoint adjustment, even when tested in the same environments used for human data collection.
Original abstract
We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HoMMI, a data collection and policy learning framework for whole-body mobile manipulation that uses augmented UMI interfaces with egocentric sensing to capture human demonstrations without robots. It identifies a larger embodiment gap from egocentric sensing and proposes to bridge it via a cross-embodiment hand-eye policy consisting of an embodiment-agnostic visual representation, a relaxed head action representation, and a whole-body controller that converts hand-eye trajectories into feasible robot motion respecting kinematics, dynamics, and constraints. The central claim is that these three design choices together enable successful policy transfer and execution on long-horizon tasks requiring bimanual coordination, navigation, and active perception.
Significance. If the empirical results hold, the work offers a practical route to scalable imitation learning for complex mobile manipulation by removing the need for robot-specific demonstrations. The explicit decomposition of the embodiment gap into observation, action, and execution layers, together with the whole-body controller as the final realization step, provides a concrete engineering template that could be reused across platforms. The project page is referenced for qualitative results, which is a positive step toward reproducibility.
major comments (2)
- [§4] §4 (Whole-Body Controller): The controller is described as realizing hand-eye trajectories under robot-specific physical constraints, yet the manuscript provides no quantitative metrics (success rate, failure modes, or horizon length) on how often the controller rejects or deviates from policy outputs in long-horizon active-perception scenarios. This is load-bearing for the central claim because the other two design choices only generate target trajectories; transfer success ultimately depends on reliable execution over extended sequences.
- [Evaluation] Evaluation section: The abstract and results point to a project page for demonstrations, but the main text lacks reported quantitative comparisons against baselines, ablations of the three design choices, or robot-specific demonstration data. Without these numbers, it is difficult to assess whether the embodiment gap is closed sufficiently or whether performance degrades on tasks with tight navigation-manipulation coupling.
minor comments (2)
- [§3.3] Notation for the relaxed head action representation could be clarified with an explicit equation or pseudocode showing how the relaxation is implemented during policy inference.
- [Figures] Figure captions should explicitly state whether the visualized trajectories are from human demonstrations, policy rollouts, or controller outputs to aid reader interpretation.
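As an illustration of the kind of pseudocode the [§3.3] comment asks for, the sketch below shows one hypothetical way a head action could be relaxed: the policy outputs only a 3D look-at point, and any head orientation whose viewing axis falls within an angular tolerance of that point is accepted. This is an assumption made for illustration, not the paper's formulation; the function names and the 15-degree tolerance are invented.

```python
import math


def _normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def gaze_error_rad(head_pos, view_dir, look_at):
    """Angle between the head's viewing axis and the ray to the look-at point."""
    to_target = _normalize(tuple(t - p for t, p in zip(look_at, head_pos)))
    view = _normalize(view_dir)
    # Clamp to guard against rounding slightly outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(view, to_target))))
    return math.acos(dot)


def head_action_satisfied(head_pos, view_dir, look_at,
                          tol_rad=math.radians(15.0)):
    """Hypothetical relaxed check: only gaze direction is constrained.

    Head position and roll about the viewing axis are left free, so a
    controller could trade them off against other constraints.
    """
    return gaze_error_rad(head_pos, view_dir, look_at) <= tol_rad
```

Under this reading, a full 6-DoF head pose target is replaced by a single inequality on gaze error, which is one way a head command could leave slack for robot-specific limits.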
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting key areas where additional evidence would strengthen the manuscript. We address the two major comments point by point below and commit to incorporating the requested quantitative analysis in a revised version.
- Referee: [§4] §4 (Whole-Body Controller): The controller is described as realizing hand-eye trajectories under robot-specific physical constraints, yet the manuscript provides no quantitative metrics (success rate, failure modes, or horizon length) on how often the controller rejects or deviates from policy outputs in long-horizon active-perception scenarios. This is load-bearing for the central claim because the other two design choices only generate target trajectories; transfer success ultimately depends on reliable execution over extended sequences.
  Authors: We agree that quantitative characterization of the whole-body controller is necessary to substantiate reliable long-horizon execution. In the revised manuscript we will add a dedicated subsection to §4 that reports (i) success rate of trajectory realization, (ii) frequency and types of rejections or deviations (e.g., kinematic, dynamic, or collision constraints), and (iii) performance as a function of horizon length on the active-perception tasks. These metrics will be obtained from the same evaluation rollouts already collected for the policy results.
  Revision: yes
- Referee: [Evaluation] Evaluation section: The abstract and results point to a project page for demonstrations, but the main text lacks reported quantitative comparisons against baselines, ablations of the three design choices, or robot-specific demonstration data. Without these numbers, it is difficult to assess whether the embodiment gap is closed sufficiently or whether performance degrades on tasks with tight navigation-manipulation coupling.
  Authors: We acknowledge that the current main-text evaluation relies primarily on qualitative results hosted on the project page. In the revision we will move quantitative results into the Evaluation section, adding (a) success-rate tables comparing HoMMI against relevant baselines, (b) ablations that isolate the contribution of each of the three design choices (embodiment-agnostic visual representation, relaxed head actions, and whole-body controller), and (c) where feasible, a comparison against policies trained on robot-specific demonstrations. These additions will directly address questions of embodiment-gap closure and performance under tight navigation-manipulation coupling.
  Revision: yes
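The metrics promised in the two responses amount to simple aggregations over logged rollouts. The sketch below shows one way such a summary could be computed. The `Rollout` record, its fields, the rejection labels, and the sample values are all hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One evaluation episode (hypothetical log format)."""
    task: str
    horizon_steps: int  # number of control steps attempted
    success: bool
    # One label per controller rejection or deviation, e.g. "kinematic".
    rejections: list = field(default_factory=list)


def success_rate(rollouts):
    """Fraction of rollouts marked successful."""
    if not rollouts:
        raise ValueError("no rollouts")
    return sum(r.success for r in rollouts) / len(rollouts)


def rejection_counts(rollouts):
    """How often each rejection type occurred across all rollouts."""
    return Counter(label for r in rollouts for label in r.rejections)


def success_by_horizon(rollouts, bin_size):
    """Success rate per horizon bin, keyed by the bin's lower edge in steps."""
    bins = {}
    for r in rollouts:
        key = (r.horizon_steps // bin_size) * bin_size
        bins.setdefault(key, []).append(r)
    return {k: success_rate(v) for k, v in sorted(bins.items())}


# Made-up records, for illustration only.
demo = [
    Rollout("task_a", 180, True),
    Rollout("task_a", 220, False, ["kinematic"]),
    Rollout("task_b", 450, True, ["collision"]),
    Rollout("task_b", 480, False, ["collision", "dynamic"]),
]
```

Running the same three functions once per ablation condition would give the per-design-choice comparison the second response describes.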
Circularity Check
No circularity: engineering design framework with independent experimental validation
The paper describes HoMMI as a practical data-collection and policy-learning framework that augments UMI interfaces with egocentric sensing and bridges the human-to-robot gap via three explicit design choices (embodiment-agnostic visual representation, relaxed head action representation, and a whole-body controller). No equations, fitted parameters, or predictions are presented that reduce the claimed transfer success to a self-referential quantity or to a self-citation chain. The central claim is supported by experimental results on long-horizon tasks rather than by internal re-derivation of inputs, satisfying the criterion for a self-contained engineering contribution against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: augmenting UMI interfaces with egocentric sensing captures the global context required for mobile manipulation.
invented entities (1)
- Cross-embodiment hand-eye policy: no independent evidence
Forward citations
Cited by 6 Pith papers
- EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices
  EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
- BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
  BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.
- GazeVLA: Learning Human Intention for Robotic Manipulation
  GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
  UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
- ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
  ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Appendix I. Policy training details
Observations and actions. We use a short observation history of T_o = 2 steps and predict an action horizon of T_p = 32 steps at 20 Hz (downsampled from 60...
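To make these numbers concrete: 32 action steps at 20 Hz span 1.6 s. If the truncated source rate is 60 Hz (an assumption, since the sentence is cut off), downsampling to 20 Hz keeps every third frame. The sketch below shows that windowing with hypothetical function names; aligning the first predicted action with the latest observation step is also an assumption, not something the excerpt states.

```python
def downsample(frames, src_hz, dst_hz):
    """Keep every (src_hz // dst_hz)-th frame; rates must divide evenly."""
    if src_hz % dst_hz != 0:
        raise ValueError("src_hz must be an integer multiple of dst_hz")
    return frames[:: src_hz // dst_hz]


def make_sample(frames, t, obs_steps=2, pred_steps=32):
    """Slice one training sample at index t of a downsampled trajectory.

    Returns (observation history ending at t, action targets starting at t).
    The alignment convention here is an illustrative assumption.
    """
    if t - obs_steps + 1 < 0 or t + pred_steps > len(frames):
        raise IndexError("window falls outside the trajectory")
    obs = frames[t - obs_steps + 1 : t + 1]
    actions = frames[t : t + pred_steps]
    return obs, actions


# Duration covered by one predicted action chunk.
ACTION_HORIZON_SECONDS = 32 / 20  # 1.6 s
```

For example, with ten seconds of frames indexed 0 to 599 at the assumed 60 Hz, `downsample(frames, 60, 20)` yields 200 frames, and each sample pairs 2 observation steps with 32 action steps.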