HumanoidUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
Pith reviewed 2026-06-26 04:53 UTC · model grok-4.3
The pith
A VR-based system collects whole-body humanoid robot demonstrations without using any robot hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HumanoidUMI shows that sparse human keypoint trajectories collected via VR devices can be retargeted to robot-native whole-body references after policy prediction, producing demonstrations that train effective whole-body manipulation policies without requiring robot teleoperation during data collection.
What carries the argument
The HumanoidUMI data-collection pipeline, which records human keypoints and actions with VR then uses a high-level policy to predict and retarget them into whole-body robot references.
If this is right
- Demonstration collection for humanoid robots becomes portable and independent of robot hardware availability.
- High-level policies trained on the collected data produce retargeted actions that a whole-body controller can execute in real scenarios.
- Whole-body skills involving coordinated locomotion and manipulation become learnable from human demonstrations rather than robot teleoperation.
- The same pipeline supports data collection for multiple humanoid platforms through retargeting.
Where Pith is reading between the lines
- The approach could lower the barrier for researchers without access to expensive humanoid hardware to contribute training data.
- If retargeting generalizes, similar VR pipelines might apply to collecting data for other robot types beyond humanoids.
- Faster iteration on whole-body behaviors becomes possible by separating the data-gathering step from robot execution.
Load-bearing premise
Sparse human keypoint trajectories from VR can be retargeted to robot references and executed by a whole-body controller without major performance loss.
What would settle it
A direct comparison in which policies trained on HumanoidUMI data achieve substantially lower success rates than policies trained on equivalent robot-teleoperated data across the same five scenarios.
Figures
read the original abstract
High-quality demonstration data are essential for humanoid robot skill learning, especially for whole-body behaviors that require coordinated perception, locomotion, and manipulation. Existing data-collection methods largely rely on robot teleoperation, which is constrained by hardware accessibility, operator expertise, and limited efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose HumanoidUMI, a portable and robot-free framework for humanoid whole-body data collection. HumanoidUMI uses lightweight VR devices and UMI-inspired grippers to collect sparse human keypoint trajectories, wrist-view observations, and gripper actions. These demonstrations train a high-level policy to predict future keypoints, which are retargeted to robot-native whole-body references and executed by a whole-body controller. Experiments in five real-world scenarios demonstrate the effectiveness of the proposed framework and validate the collected demonstrations for transferable humanoid whole-body skill learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HumanoidUMI, a portable VR-based framework for collecting robot-free demonstrations of humanoid whole-body manipulation. Sparse human keypoint trajectories, wrist-view images, and gripper actions are captured with lightweight VR devices and UMI-inspired grippers; a high-level policy is trained to predict future keypoints; these are retargeted to robot-native whole-body references and executed by a whole-body controller. The central claim is that experiments across five real-world scenarios demonstrate the framework's effectiveness and confirm that the collected demonstrations support transferable whole-body skill learning.
Significance. If the retargeting step preserves task performance, the method would provide a scalable, hardware-light alternative to teleoperation for whole-body humanoid data collection, addressing a key bottleneck in learning coordinated locomotion-manipulation behaviors.
major comments (2)
- [Abstract (pipeline description) and Experiments section] The central claim that sparse VR keypoints can be retargeted and executed 'without significant performance loss' is load-bearing, yet the manuscript supplies no quantitative validation of retargeting fidelity (e.g., trajectory RMSE, joint-angle error, or success-rate delta relative to teleoperation baselines). Without these metrics it is impossible to determine whether the five scenarios succeed because the retargeting works or because the tasks are tolerant of kinematic mismatch.
- [Method (retargeting and control pipeline)] The whole-body controller is described only at the level of 'executed by a whole-body controller'; no formulation (IK objective, balance constraints, kinematic scaling, or contact handling) is given, making it impossible to assess whether the retargeting step is the performance-limiting factor or is masked by controller robustness.
minor comments (1)
- [Abstract and Experiments] The abstract states 'Experiments in five real-world scenarios demonstrate...' but provides no task names, success criteria, number of trials, or statistical reporting; these details should appear in the Experiments section with accompanying tables or figures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract (pipeline description) and Experiments section] The central claim that sparse VR keypoints can be retargeted and executed 'without significant performance loss' is load-bearing, yet the manuscript supplies no quantitative validation of retargeting fidelity (e.g., trajectory RMSE, joint-angle error, or success-rate delta relative to teleoperation baselines). Without these metrics it is impossible to determine whether the five scenarios succeed because the retargeting works or because the tasks are tolerant of kinematic mismatch.
Authors: We acknowledge that the manuscript does not provide explicit quantitative metrics on retargeting fidelity such as trajectory RMSE or direct comparisons to teleoperation. Our evaluation centers on end-to-end task success rates in five real-world scenarios to demonstrate practical effectiveness. In the revised manuscript we will add a dedicated subsection in Experiments reporting retargeting accuracy (e.g., keypoint trajectory RMSE and joint-angle deviations post-retargeting) using the collected data. Direct teleoperation baselines are outside the robot-free scope of the work, but we will discuss potential kinematic mismatch effects on the reported success rates. revision: yes
-
Referee: [Method (retargeting and control pipeline)] The whole-body controller is described only at the level of 'executed by a whole-body controller'; no formulation (IK objective, balance constraints, kinematic scaling, or contact handling) is given, making it impossible to assess whether the retargeting step is the performance-limiting factor or is masked by controller robustness.
Authors: We agree the controller description is high-level and insufficient for evaluating its contribution. The manuscript prioritizes the data-collection pipeline, but we will revise the Method section to include the full formulation: the IK objective function, balance and CoM constraints, kinematic scaling between human and robot references, and contact-force handling. This will clarify the interface between retargeted references and low-level execution. revision: yes
Circularity Check
No circularity: pipeline relies on external data collection, retargeting, and experimental validation
full rationale
The paper presents a data-collection pipeline (VR keypoints + grippers), a high-level policy trained on collected trajectories, retargeting to robot references, and a whole-body controller. These steps are described as sequential engineering components validated by five real-world scenarios. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or method outline. The central claim (transferable skill learning from robot-free demos) is supported by external empirical results rather than reducing to its own inputs by construction. This matches the expected non-circular case for a systems paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human keypoint trajectories collected via VR can be retargeted to robot whole-body references.
Reference graph
Works this paper leans on
-
[1]
Droid: A large-scale in-the-wild robot manipulation dataset,
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karam- cheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis,et al., “Droid: A large-scale in-the-wild robot manipulation dataset,” inProceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024
2024
-
[2]
Twist2: Scalable, portable, and holistic humanoid data collection system,
Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu, “Twist2: Scalable, portable, and holistic humanoid data collection system,” in2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026
2026
-
[3]
Humanplus: Humanoid shadowing and imitation from humans,
Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “Humanplus: Humanoid shadowing and imitation from humans,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 2828–2844
2025
-
[4]
Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks,
Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang, “Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks,” 2025. [Online]. Available: https://arxiv.org/abs/2506.08931
arXiv 2025
-
[5]
Mobile-television: Predictive motion priors for humanoid whole-body control,
C. Lu, X. Cheng, J. Li, S. Yang, M. Ji, C. Yuan, G. Yang, S. Yi, and X. Wang, “Mobile-television: Predictive motion priors for humanoid whole-body control,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025
2025
-
[6]
Learning versatile humanoid manipulation with touch dreaming,
Y . Niu, Z. Fang, B. Chen, S. Zhou, R. Senthilkumaran, H. Zhang, B. Chen, C. Qiu, H. E. Tseng, J. Francis, and D. Zhao, “Learning versatile humanoid manipulation with touch dreaming,” 2026. [Online]. Available: https://arxiv.org/abs/2604.13015
Pith/arXiv arXiv 2026
-
[7]
Learning human-to-humanoid real-time whole-body teleoperation,
T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi, “Learning human-to-humanoid real-time whole-body teleoperation,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 8944–8951
2024
-
[8]
Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning,
T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. M. Kitani, C. Liu, and G. Shi, “Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 1516–1540
2025
-
[9]
Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,
C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” inProceedings of Robotics: Science and Systems (RSS), 2024
2024
-
[10]
Omniumi: Towards physically grounded robot learning via human-aligned multimodal interaction,
S. Luo, Y . Li, Y . Hu, C. Yu, C. Xu, J. Zhang, G. Yao, T. Huang, R. He, and Z. Wang, “Omniumi: Towards physically grounded robot learning via human-aligned multimodal interaction,”arXiv preprint arXiv:2604.10647, 2026
Pith/arXiv arXiv 2026
-
[11]
Data scaling laws in imitation learning for robotic manipulation,
F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao, “Data scaling laws in imitation learning for robotic manipulation,” inInternational Conference on Learning Representations (ICLR), 2025
2025
-
[12]
In-the-wild compliant manipulation with umi-ft,
H. Choi, Y . Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song, “In-the-wild compliant manipulation with umi-ft,” in 2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026
2026
-
[13]
Activeumi: Robotic manipulation with active perception from robot-free human demonstrations,
Q. Zeng, C. Li, J. St. John, Z. Zhou, J. Wen, G. Feng, Y . Zhu, and Y . Xu, “Activeumi: Robotic manipulation with active perception from robot-free human demonstrations,” 2025. [Online]. Available: https://arxiv.org/abs/2510.01607
arXiv 2025
-
[14]
Umi-on-legs: Making manipulation policies mobile with manipulation-centric whole-body controllers,
H. Ha, Y . Gao, Z. Fu, J. Tan, and S. Song, “Umi-on-legs: Making manipulation policies mobile with manipulation-centric whole-body controllers,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 5254–5270
2025
-
[15]
Umi-on-air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies,
H. Gupta, X. Guo, H. Ha, C. Pan, M. Cao, D. Lee, S. Scherer, S. Song, and G. Shi, “Umi-on-air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies,” in2026 IEEE Interna- tional Conference on Robotics and Automation (ICRA). IEEE, 2026
2026
-
[16]
Hommi: Learning whole-body mobile manipulation from human demonstrations,
X. Xu, J. Park, H. Zhang, E. Cousineau, A. Bhat, J. Barreiros, D. Wang, J. Bohg, and S. Song, “Hommi: Learning whole-body mobile manipulation from human demonstrations,” inProceedings of Robotics: Science and Systems (RSS), 2026
2026
-
[17]
Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations,
R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen,et al., “Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations,” arXiv preprint arXiv:2602.06643, 2026
arXiv 2026
-
[18]
Learning fine-grained bimanual manipulation with low-cost hardware,
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” inProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023
2023
-
[19]
Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation,
Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation,” in Proceedings of The 8th Conference on Robot Learning, ser. Proceed- ings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 4066–4083
2025
-
[20]
Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration,
M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen, “Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10106
Pith/arXiv arXiv 2026
-
[21]
Xrobotoolkit: A cross-platform framework for robot teleoperation,
Z. Zhao, L. Yu, K. Jing, and N. Yang, “Xrobotoolkit: A cross-platform framework for robot teleoperation,” in2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026, pp. 15–20
2026
-
[22]
Diffusion policy: Visuomotor policy learning via action diffusion,
C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Proceedings of Robotics: Science and Systems (RSS), 2023
2023
-
[23]
On the continuity of rotation representations in neural networks,
Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5745–5753
2019
-
[24]
mjlab: A Lightweight Framework for GPU-Accelerated Robot Learn- ing,
K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel, “mjlab: A Lightweight Framework for GPU-Accelerated Robot Learn- ing,” 2026
2026
-
[25]
Retargeting matters: General motion retargeting for humanoid motion tracking,
J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu, “Retargeting matters: General motion retargeting for humanoid motion tracking,” in2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.