pith. sign in

arxiv: 2606.27239 · v1 · pith:5YK7W5VPnew · submitted 2026-06-25 · 💻 cs.RO

HumanoidUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

Pith reviewed 2026-06-26 04:53 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid robotwhole-body manipulationdemonstration collectionVR interfaceimitation learningretargetingrobot-free data
0
0 comments X

The pith

A VR-based system collects whole-body humanoid robot demonstrations without using any robot hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HumanoidUMI as a robot-free method to gather demonstration data for humanoid whole-body manipulation. It relies on lightweight VR devices and UMI-inspired grippers to record sparse human keypoint trajectories, wrist-view images, and gripper actions. A high-level policy is then trained on these recordings to predict future keypoints, which are retargeted into robot-native whole-body references and executed by a controller. Experiments across five real-world scenarios test whether this data supports transferable skill learning on actual humanoid platforms.

Core claim

HumanoidUMI shows that sparse human keypoint trajectories collected via VR devices can be retargeted to robot-native whole-body references after policy prediction, producing demonstrations that train effective whole-body manipulation policies without requiring robot teleoperation during data collection.

What carries the argument

The HumanoidUMI data-collection pipeline, which records human keypoints and actions with VR then uses a high-level policy to predict and retarget them into whole-body robot references.

If this is right

  • Demonstration collection for humanoid robots becomes portable and independent of robot hardware availability.
  • High-level policies trained on the collected data produce retargeted actions that a whole-body controller can execute in real scenarios.
  • Whole-body skills involving coordinated locomotion and manipulation become learnable from human demonstrations rather than robot teleoperation.
  • The same pipeline supports data collection for multiple humanoid platforms through retargeting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower the barrier for researchers without access to expensive humanoid hardware to contribute training data.
  • If retargeting generalizes, similar VR pipelines might apply to collecting data for other robot types beyond humanoids.
  • Faster iteration on whole-body behaviors becomes possible by separating the data-gathering step from robot execution.

Load-bearing premise

Sparse human keypoint trajectories from VR can be retargeted to robot references and executed by a whole-body controller without major performance loss.

What would settle it

A direct comparison in which policies trained on HumanoidUMI data achieve substantially lower success rates than policies trained on equivalent robot-teleoperated data across the same five scenarios.

Figures

Figures reproduced from arXiv: 2606.27239 by Chenhao Yu, Hongwu Wang, Jiachen Zhang, Shaqi Luo, Youhao Hu, Yuanyuan Li.

Figure 1
Figure 1. Figure 1: Overview of HumanoidUMI. HumanoidUMI provides a robot-free data collection and learning framework for humanoid whole-body skills. Human demonstrations are collected with portable VR–UMI devices and represented as sparse task-relevant spatial keypoints. A high-level policy predicts future keypoint motions and gripper actions, a Spatial Keypoint Retargeting module converts them into feasible humanoid whole-b… view at source ↗
Figure 2
Figure 2. Figure 2: HumanoidUMI Data Acquisition System. The data acquisition platform consists of a PICO-based motion capture setup, including two foot￾mounted trackers and one waist-mounted tracker, together with two instrumented grippers, each equipped with a fisheye camera. The system synchronously records multimodal observations, including wrist-view images from the fisheye cameras, human keypoint states obtained via the… view at source ↗
Figure 3
Figure 3. Figure 3: HumanoidUMI Hierarchical Visuomotor Control. HumanoidUMI formulates humanoid visuomotor control as a three-stage hierarchy. A diffusion-based high-level policy predicts task-space keypoint trajectories and gripper commands from wrist-view images and lower-body proprioception. The spatial keypoint retargeting bridge maps these commands into a 36-dimensional robot-native motion representation, including the … view at source ↗
Figure 4
Figure 4. Figure 4: Conditional diffusion policy architecture. Wrist-view RGB images are encoded by DINOv2 and fused with lower-body proprioception and the diffusion step as the global condition. The diffusion model predicts action trajectories for TCPs and body-support keypoints. Action space. Each action aτ ∈ R 47 consists of the 6- DoF poses of five keypoints, represented by a 3-D translation and a continuous 6-D rotation … view at source ↗
Figure 5
Figure 5. Figure 5: Spatial Keypoint Retargeting (SKR). SKR bridges high-level keypoint prediction and low-level whole-body control by converting five task-space keypoints, including the pelvis, two TCPs, and two feet, into robot-native whole-body references. Unlike global motion rescaling, SKR preserves metric spatial relationships among the keypoints and only scales the vertical pelvis-to-foot distance to compensate for hum… view at source ↗
Figure 6
Figure 6. Figure 6: Real-world evaluation and ablation analysis of HumanoidUMI on three humanoid manipulation tasks. (a) Cluttered tabletop pick-and-place: the robot grasps bread and places it onto a target plate. (b) Bimanual vegetable collection: the robot uses the right hand to place an eggplant and the left hand to place a cucumber. (c) Dynamic ball throwing: the robot performs a wind-up, releases the ball, and hits the t… view at source ↗
Figure 7
Figure 7. Figure 7: Real-world evaluation of the augmented seven-keypoint variant of HumanoidUMI on two humanoid whole-body manipulation tasks with a Unitree G1 robot. (a) Under-table waste disposal: the robot grasps a crumpled paper ball from the tabletop, steps backward, bends its knees and torso, reaches toward a waste bin underneath the table, and releases the object into the bin. (b) Walking coffee delivery: the robot wa… view at source ↗
read the original abstract

High-quality demonstration data are essential for humanoid robot skill learning, especially for whole-body behaviors that require coordinated perception, locomotion, and manipulation. Existing data-collection methods largely rely on robot teleoperation, which is constrained by hardware accessibility, operator expertise, and limited efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose HumanoidUMI, a portable and robot-free framework for humanoid whole-body data collection. HumanoidUMI uses lightweight VR devices and UMI-inspired grippers to collect sparse human keypoint trajectories, wrist-view observations, and gripper actions. These demonstrations train a high-level policy to predict future keypoints, which are retargeted to robot-native whole-body references and executed by a whole-body controller. Experiments in five real-world scenarios demonstrate the effectiveness of the proposed framework and validate the collected demonstrations for transferable humanoid whole-body skill learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes HumanoidUMI, a portable VR-based framework for collecting robot-free demonstrations of humanoid whole-body manipulation. Sparse human keypoint trajectories, wrist-view images, and gripper actions are captured with lightweight VR devices and UMI-inspired grippers; a high-level policy is trained to predict future keypoints; these are retargeted to robot-native whole-body references and executed by a whole-body controller. The central claim is that experiments across five real-world scenarios demonstrate the framework's effectiveness and confirm that the collected demonstrations support transferable whole-body skill learning.

Significance. If the retargeting step preserves task performance, the method would provide a scalable, hardware-light alternative to teleoperation for whole-body humanoid data collection, addressing a key bottleneck in learning coordinated locomotion-manipulation behaviors.

major comments (2)
  1. [Abstract (pipeline description) and Experiments section] The central claim that sparse VR keypoints can be retargeted and executed 'without significant performance loss' is load-bearing, yet the manuscript supplies no quantitative validation of retargeting fidelity (e.g., trajectory RMSE, joint-angle error, or success-rate delta relative to teleoperation baselines). Without these metrics it is impossible to determine whether the five scenarios succeed because the retargeting works or because the tasks are tolerant of kinematic mismatch.
  2. [Method (retargeting and control pipeline)] The whole-body controller is described only at the level of 'executed by a whole-body controller'; no formulation (IK objective, balance constraints, kinematic scaling, or contact handling) is given, making it impossible to assess whether the retargeting step is the performance-limiting factor or is masked by controller robustness.
minor comments (1)
  1. [Abstract and Experiments] The abstract states 'Experiments in five real-world scenarios demonstrate...' but provides no task names, success criteria, number of trials, or statistical reporting; these details should appear in the Experiments section with accompanying tables or figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract (pipeline description) and Experiments section] The central claim that sparse VR keypoints can be retargeted and executed 'without significant performance loss' is load-bearing, yet the manuscript supplies no quantitative validation of retargeting fidelity (e.g., trajectory RMSE, joint-angle error, or success-rate delta relative to teleoperation baselines). Without these metrics it is impossible to determine whether the five scenarios succeed because the retargeting works or because the tasks are tolerant of kinematic mismatch.

    Authors: We acknowledge that the manuscript does not provide explicit quantitative metrics on retargeting fidelity such as trajectory RMSE or direct comparisons to teleoperation. Our evaluation centers on end-to-end task success rates in five real-world scenarios to demonstrate practical effectiveness. In the revised manuscript we will add a dedicated subsection in Experiments reporting retargeting accuracy (e.g., keypoint trajectory RMSE and joint-angle deviations post-retargeting) using the collected data. Direct teleoperation baselines are outside the robot-free scope of the work, but we will discuss potential kinematic mismatch effects on the reported success rates. revision: yes

  2. Referee: [Method (retargeting and control pipeline)] The whole-body controller is described only at the level of 'executed by a whole-body controller'; no formulation (IK objective, balance constraints, kinematic scaling, or contact handling) is given, making it impossible to assess whether the retargeting step is the performance-limiting factor or is masked by controller robustness.

    Authors: We agree the controller description is high-level and insufficient for evaluating its contribution. The manuscript prioritizes the data-collection pipeline, but we will revise the Method section to include the full formulation: the IK objective function, balance and CoM constraints, kinematic scaling between human and robot references, and contact-force handling. This will clarify the interface between retargeted references and low-level execution. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline relies on external data collection, retargeting, and experimental validation

full rationale

The paper presents a data-collection pipeline (VR keypoints + grippers), a high-level policy trained on collected trajectories, retargeting to robot references, and a whole-body controller. These steps are described as sequential engineering components validated by five real-world scenarios. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or method outline. The central claim (transferable skill learning from robot-free demos) is supported by external empirical results rather than reducing to its own inputs by construction. This matches the expected non-circular case for a systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full paper may introduce additional free parameters or assumptions not visible here. The central claim rests on the transferability of human demonstrations to robot execution via retargeting.

axioms (1)
  • domain assumption Human keypoint trajectories collected via VR can be retargeted to robot whole-body references.
    The framework relies on this conversion step to turn human data into executable robot commands.

pith-pipeline@v0.9.1-grok · 5688 in / 1263 out tokens · 34604 ms · 2026-06-26T04:53:31.263562+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 3 linked inside Pith

  1. [1]

    Droid: A large-scale in-the-wild robot manipulation dataset,

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karam- cheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis,et al., “Droid: A large-scale in-the-wild robot manipulation dataset,” inProceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024

  2. [2]

    Twist2: Scalable, portable, and holistic humanoid data collection system,

    Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu, “Twist2: Scalable, portable, and holistic humanoid data collection system,” in2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026

  3. [3]

    Humanplus: Humanoid shadowing and imitation from humans,

    Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “Humanplus: Humanoid shadowing and imitation from humans,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 2828–2844

  4. [4]

    Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks,

    Y . Li, Y . Lin, J. Cui, T. Liu, W. Liang, Y . Zhu, and S. Huang, “Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks,” 2025. [Online]. Available: https://arxiv.org/abs/2506.08931

  5. [5]

    Mobile-television: Predictive motion priors for humanoid whole-body control,

    C. Lu, X. Cheng, J. Li, S. Yang, M. Ji, C. Yuan, G. Yang, S. Yi, and X. Wang, “Mobile-television: Predictive motion priors for humanoid whole-body control,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025

  6. [6]

    Learning versatile humanoid manipulation with touch dreaming,

    Y . Niu, Z. Fang, B. Chen, S. Zhou, R. Senthilkumaran, H. Zhang, B. Chen, C. Qiu, H. E. Tseng, J. Francis, and D. Zhao, “Learning versatile humanoid manipulation with touch dreaming,” 2026. [Online]. Available: https://arxiv.org/abs/2604.13015

  7. [7]

    Learning human-to-humanoid real-time whole-body teleoperation,

    T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi, “Learning human-to-humanoid real-time whole-body teleoperation,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 8944–8951

  8. [8]

    Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning,

    T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. M. Kitani, C. Liu, and G. Shi, “Omnih2o: Universal and dexterous human-to- humanoid whole-body teleoperation and learning,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 1516–1540

  9. [9]

    Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” inProceedings of Robotics: Science and Systems (RSS), 2024

  10. [10]

    Omniumi: Towards physically grounded robot learning via human-aligned multimodal interaction,

    S. Luo, Y . Li, Y . Hu, C. Yu, C. Xu, J. Zhang, G. Yao, T. Huang, R. He, and Z. Wang, “Omniumi: Towards physically grounded robot learning via human-aligned multimodal interaction,”arXiv preprint arXiv:2604.10647, 2026

  11. [11]

    Data scaling laws in imitation learning for robotic manipulation,

    F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao, “Data scaling laws in imitation learning for robotic manipulation,” inInternational Conference on Learning Representations (ICLR), 2025

  12. [12]

    In-the-wild compliant manipulation with umi-ft,

    H. Choi, Y . Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song, “In-the-wild compliant manipulation with umi-ft,” in 2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026

  13. [13]

    Activeumi: Robotic manipulation with active perception from robot-free human demonstrations,

    Q. Zeng, C. Li, J. St. John, Z. Zhou, J. Wen, G. Feng, Y . Zhu, and Y . Xu, “Activeumi: Robotic manipulation with active perception from robot-free human demonstrations,” 2025. [Online]. Available: https://arxiv.org/abs/2510.01607

  14. [14]

    Umi-on-legs: Making manipulation policies mobile with manipulation-centric whole-body controllers,

    H. Ha, Y . Gao, Z. Fu, J. Tan, and S. Song, “Umi-on-legs: Making manipulation policies mobile with manipulation-centric whole-body controllers,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 5254–5270

  15. [15]

    Umi-on-air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies,

    H. Gupta, X. Guo, H. Ha, C. Pan, M. Cao, D. Lee, S. Scherer, S. Song, and G. Shi, “Umi-on-air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies,” in2026 IEEE Interna- tional Conference on Robotics and Automation (ICRA). IEEE, 2026

  16. [16]

    Hommi: Learning whole-body mobile manipulation from human demonstrations,

    X. Xu, J. Park, H. Zhang, E. Cousineau, A. Bhat, J. Barreiros, D. Wang, J. Bohg, and S. Song, “Hommi: Learning whole-body mobile manipulation from human demonstrations,” inProceedings of Robotics: Science and Systems (RSS), 2026

  17. [17]

    Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations,

    R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen,et al., “Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations,” arXiv preprint arXiv:2602.06643, 2026

  18. [18]

    Learning fine-grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” inProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023

  19. [19]

    Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation,

    Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation,” in Proceedings of The 8th Conference on Robot Learning, ser. Proceed- ings of Machine Learning Research, vol. 270. PMLR, 2025, pp. 4066–4083

  20. [20]

    Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration,

    M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen, “Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10106

  21. [21]

    Xrobotoolkit: A cross-platform framework for robot teleoperation,

    Z. Zhao, L. Yu, K. Jing, and N. Yang, “Xrobotoolkit: A cross-platform framework for robot teleoperation,” in2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026, pp. 15–20

  22. [22]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Proceedings of Robotics: Science and Systems (RSS), 2023

  23. [23]

    On the continuity of rotation representations in neural networks,

    Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5745–5753

  24. [24]

    mjlab: A Lightweight Framework for GPU-Accelerated Robot Learn- ing,

    K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel, “mjlab: A Lightweight Framework for GPU-Accelerated Robot Learn- ing,” 2026

  25. [25]

    Retargeting matters: General motion retargeting for humanoid motion tracking,

    J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu, “Retargeting matters: General motion retargeting for humanoid motion tracking,” in2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026