BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
Pith reviewed 2026-05-07 15:45 UTC · model grok-4.3
The pith
BifrostUMI uses VR-captured human keypoint trajectories and wrist-camera video to train a policy that predicts future keypoints, which are then retargeted to a humanoid and executed by a whole-body controller for whole-body manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces BifrostUMI, a framework that captures multimodal data from human demonstrations using VR devices, consisting of sparse keypoint trajectories and wrist-mounted visual recordings. These are used to train a high-level policy that predicts future keypoint trajectories conditioned on the visual features. Through a keypoint retargeting pipeline, the trajectories are mapped onto the humanoid robot's morphology and executed via a whole-body controller, enabling the seamless transfer of diverse and agile behaviors from natural human demonstrations to humanoid embodiments.
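To make the data flow concrete, here is a minimal sketch of the high-level policy interface described above; the module name, feature dimension, number of keypoints, horizon, and MLP backbone are all illustrative assumptions, since the manuscript does not specify an architecture.

# Hypothetical sketch of the high-level policy interface; dimensions and
# backbone are assumptions, not the authors' architecture.
import torch
import torch.nn as nn

class KeypointPolicy(nn.Module):
    """Predicts a horizon of future sparse keypoints from wrist-view features."""

    def __init__(self, feat_dim=512, num_keypoints=8, horizon=16, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.num_keypoints = num_keypoints
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * num_keypoints * 3),  # 3D position per keypoint per step
        )

    def forward(self, wrist_features):
        # wrist_features: (batch, feat_dim) pooled embedding of the wrist camera view
        out = self.net(wrist_features)
        return out.view(-1, self.horizon, self.num_keypoints, 3)

policy = KeypointPolicy()
feats = torch.randn(2, 512)          # stand-in for wrist visual features
future_keypoints = policy(feats)     # (2, 16, 8, 3) trajectory handed to retargeting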
What carries the argument
The high-level policy network for predicting future keypoint trajectories from wrist visual features, integrated with a robust keypoint retargeting pipeline to the humanoid morphology.
Load-bearing premise
The sparse keypoint trajectories from VR can be retargeted to the humanoid robot while preserving agility and task success, and the policy generalizes from wrist visual features to accurate future keypoint predictions in practice.
What would settle it
Deploying the policy on the humanoid robot in a novel task and checking if the executed motions complete the task successfully at rates similar to or better than those from robot teleoperation data; significantly lower performance would show the transfer does not hold.
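A minimal sketch of how such a check could be scored, assuming per-task binary success outcomes; the trial counts and the pooled two-proportion z-score are placeholders, not figures or methodology from the paper.

# Hypothetical success-rate comparison; every number below is a placeholder.
from math import sqrt

def two_proportion_z(s1, n1, s2, n2):
    """Pooled z-score for the difference between two success rates."""
    p1, p2 = s1 / n1, s2 / n2
    p = (s1 + s2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

# Placeholder trial counts for one novel task.
bifrost_successes, bifrost_trials = 17, 25   # policy trained on VR demonstrations
teleop_successes, teleop_trials = 19, 25     # policy trained on teleoperation data

z = two_proportion_z(bifrost_successes, bifrost_trials, teleop_successes, teleop_trials)
print(f"BifrostUMI: {bifrost_successes / bifrost_trials:.2f}, "
      f"teleop: {teleop_successes / teleop_trials:.2f}, z = {z:.2f}")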
Original abstract
High-quality data collection is a fundamental cornerstone for training humanoid whole-body visuomotor policies. Current data acquisition paradigms predominantly rely on robot teleoperation, which is often hindered by limited hardware accessibility and low operational efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose BifrostUMI, a portable, efficient, and robot-free data collection framework tailored for humanoid robots. BifrostUMI leverages lightweight VR devices to capture human demonstrations as sparse keypoint trajectories while simultaneously recording wrist-mounted visual data. These multimodal data are subsequently utilized to train a high-level policy network that predicts future keypoint trajectories conditioned on the captured visual features. Through a robust keypoint retargeting pipeline, keypoint trajectories are precisely mapped onto the robot's morphology and executed via a whole-body controller. This approach enables the seamless transfer of diverse and agile behaviors from natural human demonstrations to humanoid embodiments. We demonstrate the efficacy and versatility of the proposed framework across two distinct experimental scenarios.
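The abstract names a keypoint retargeting step but does not give its formulation. Below is a minimal toy sketch of one common approach: limb-length scaling of the human wrist keypoint followed by damped least-squares inverse kinematics, here on a planar two-link arm. The link lengths, gains, and planar simplification are assumptions for illustration, not the authors' pipeline.

# Toy retargeting sketch: scale a human keypoint to the robot's limb lengths,
# then solve IK so a planar 2-link arm tracks the scaled wrist target.
# All lengths and gains are illustrative placeholders.
import numpy as np

def scale_keypoint(human_wrist, human_arm_len, robot_arm_len, shoulder):
    """Map a human wrist position into the robot's reach, anchored at the shoulder."""
    ratio = robot_arm_len / human_arm_len
    return shoulder + ratio * (human_wrist - shoulder)

def fk(q, l1=0.3, l2=0.25):
    """Forward kinematics of a planar 2-link arm."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q, l1=0.3, l2=0.25):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik_step(q, target, damping=1e-2):
    """One damped least-squares update toward the target position."""
    err = target - fk(q)
    J = jacobian(q)
    JJt = J @ J.T + damping * np.eye(2)
    return q + J.T @ np.linalg.solve(JJt, err)

# Human wrist keypoint (relative to the shoulder at the origin), scaled to robot reach.
target = scale_keypoint(np.array([0.45, 0.20]), human_arm_len=0.6,
                        robot_arm_len=0.55, shoulder=np.zeros(2))
q = np.array([0.3, 0.3])
for _ in range(50):
    q = ik_step(q, target)
print("joint angles:", q, "reached:", fk(q), "target:", target)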
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BifrostUMI, a robot-free data collection framework for training humanoid whole-body visuomotor policies. It uses lightweight VR devices to record sparse keypoint trajectories from natural human demonstrations alongside wrist-mounted visual observations. These data train a high-level policy that predicts future keypoints conditioned on visual features; the predicted trajectories are then retargeted onto the humanoid morphology and executed by a whole-body controller. Efficacy is claimed across two experimental scenarios, with the central assertion being seamless transfer of diverse, agile behaviors without teleoperation.
Significance. If the retargeting fidelity and policy generalization claims hold with supporting evidence, the framework would offer a meaningful advance in scalable, accessible data acquisition for humanoid manipulation by removing hardware and efficiency bottlenecks of teleoperation. It extends the UMI paradigm to whole-body humanoid settings and could facilitate broader collection of agile behaviors. The current manuscript, however, supplies no quantitative results, so the practical significance cannot yet be assessed.
Major comments (2)
- [Abstract] The claim that the pipeline 'enables the seamless transfer of diverse and agile behaviors' and demonstrates 'efficacy and versatility' across two scenarios is unsupported by any reported success rates, keypoint prediction errors, retargeting deviation metrics, or task-completion statistics, rendering the central claim unverifiable.
- [Method / Experiments] The pipeline description gives no architecture details, loss functions, or training procedure for the high-level policy network, nor any quantitative measures of retargeting accuracy (e.g., end-effector or joint-angle deviation) or policy generalization error; these omissions directly undermine evaluation of the two load-bearing premises identified above.
Minor comments (2)
- [Abstract] The abstract would be clearer if it briefly named the two experimental scenarios and the specific tasks performed.
- Notation for the keypoint representation and retargeting mapping is introduced without an accompanying diagram or formal definition, which could be added for readability.
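As one possible response to this comment, a formal definition could be as compact as the sketch below; the symbols, the keypoint set, and the form of the retargeting map are assumptions for illustration, not notation taken from the paper.

% Hypothetical notation; the paper does not define these symbols.
\[
  K_t = \{\, k_t^{(i)} \in \mathbb{R}^3 \,\}_{i=1}^{N}, \qquad
  \pi_\theta : o_{t-h:t} \;\mapsto\; (\hat{K}_{t+1}, \dots, \hat{K}_{t+H}),
\]
\[
  \mathcal{R} : \hat{K}_{t+j} \;\mapsto\; q_{t+j} \in \mathcal{Q}_{\mathrm{robot}},
  \qquad j = 1, \dots, H,
\]
% where $o_{t-h:t}$ are wrist-camera observations, $\pi_\theta$ the high-level
% policy, $\mathcal{R}$ the retargeting map, and $q_{t+j}$ the joint targets
% passed to the whole-body controller.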
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current manuscript lacks the quantitative results, architectural details, and performance metrics necessary to fully substantiate the claims. We will revise the paper to include these elements, which will allow proper evaluation of the framework.
Point-by-point responses
- Referee: [Abstract] The claim that the pipeline 'enables the seamless transfer of diverse and agile behaviors' and demonstrates 'efficacy and versatility' across two scenarios is unsupported by any reported success rates, keypoint prediction errors, retargeting deviation metrics, or task-completion statistics, rendering the central claim unverifiable.
  Authors: We acknowledge that the abstract advances strong claims without accompanying quantitative evidence in the submitted manuscript. In the revision we will make the abstract language more precise and add a concise summary of quantitative results (success rates, keypoint prediction errors, retargeting deviations, and task-completion statistics) drawn from the two experimental scenarios, making the central claims verifiable. Revision: yes.
- Referee: [Method / Experiments] The pipeline description gives no architecture details, loss functions, or training procedure for the high-level policy network, nor any quantitative measures of retargeting accuracy (e.g., end-effector or joint-angle deviation) or policy generalization error; these omissions directly undermine evaluation of the two load-bearing premises identified above.
  Authors: We agree that the manuscript omits the requested details on the high-level policy. The revised version will specify the network architecture, loss functions (including the objective used for keypoint trajectory prediction), and the full training procedure, and will report quantitative retargeting accuracy (end-effector and joint-angle deviations) and policy generalization error. These additions directly address the retargeting-fidelity and generalization assumptions the referee highlights. Revision: yes.
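The deviations and errors promised above have standard forms; a minimal sketch under the assumption of time-aligned logged arrays, with shapes, names, and random placeholder data chosen purely for illustration.

# Illustrative definitions of the metrics promised in the rebuttal; the array
# shapes and placeholder data are assumptions, not logs from the paper.
import numpy as np

def keypoint_prediction_error(pred_kps, true_kps):
    """Mean Euclidean error between predicted and demonstrated keypoints (meters)."""
    return float(np.linalg.norm(pred_kps - true_kps, axis=-1).mean())

def end_effector_deviation(executed_ee, retargeted_ee):
    """Mean deviation of the executed end-effector path from the retargeted path."""
    return float(np.linalg.norm(executed_ee - retargeted_ee, axis=-1).mean())

def joint_angle_deviation(executed_q, target_q):
    """Mean absolute joint-angle tracking error (radians)."""
    return float(np.abs(executed_q - target_q).mean())

# Placeholder arrays standing in for logged trajectories: T steps, K keypoints, J joints.
T, K, J = 100, 8, 29
rng = np.random.default_rng(0)
print(keypoint_prediction_error(rng.normal(size=(T, K, 3)), rng.normal(size=(T, K, 3))))
print(end_effector_deviation(rng.normal(size=(T, 3)), rng.normal(size=(T, 3))))
print(joint_angle_deviation(rng.normal(size=(T, J)), rng.normal(size=(T, J))))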
Circularity Check
No circularity: linear pipeline from VR capture to retargeted execution
Full rationale
The manuscript presents BifrostUMI as a sequential framework: VR capture of sparse keypoints plus wrist visuals, training of a high-level policy to predict future keypoints from visuals, followed by retargeting and whole-body control. No equations, fitted parameters, or self-citations are described that would reduce any claimed prediction or transfer result to its own inputs by construction. The central claims rest on the empirical efficacy of the pipeline rather than on tautological redefinitions or load-bearing self-references. This is consistent with the assessed score of 2.0, with no reduction for circularity.
Reference graph
Works this paper leans on
- [1] A. Khazatsky and K. Pertsch, "DROID: A large-scale in-the-wild robot manipulation dataset," arXiv:2403.12945, 2025.
- [2] Y. Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu, "Twist2: Scalable, portable, and holistic humanoid data collection system," arXiv:2511.02832, 2025.
- [3] Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, "Humanplus: Humanoid shadowing and imitation from humans," arXiv:2406.10454, 2024.
- [4] Y. Li, Y. Lin, J. Cui, T. Liu, W. Liang, Y. Zhu, and S. Huang, "Clone: Closed-loop whole-body humanoid teleoperation for long-horizon tasks," arXiv:2506.08931, 2025.
- [5] J. Li, X. Cheng, T. Huang, S. Yang, R.-Z. Qiu, and X. Wang, "Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control," arXiv:2505.03738, 2025.
- [6] C. Lu, X. Cheng, J. Li, S. Yang, M. Ji, C. Yuan, G. Yang, S. Yi, and X. Wang, "Mobile-television: Predictive motion priors for humanoid whole-body control," arXiv:2412.07773, 2025.
- [7] Y. Niu, Z. Fang, B. Chen, S. Zhou, R. Senthilkumaran, H. Zhang, B. Chen, C. Qiu, H. E. Tseng, J. Francis, and D. Zhao, "Learning versatile humanoid manipulation with touch dreaming," arXiv:2604.13015, 2026.
- [8] T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi, "Learning human-to-humanoid real-time whole-body teleoperation," arXiv:2403.04436, 2024.
- [10] T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi, "Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning," arXiv:2406.08858, 2024.
- [11] C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar, "Mimicplay: Long-horizon imitation learning by watching human play," arXiv:2302.12422, 2023.
- [12] R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y. Hu, Y. Hu, T. Zhang, C. Wen, et al., "Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations," arXiv:2602.06643, 2026.
- [13] H. Weng, Y. Li, N. Sobanbabu, Z. Wang, Z. Luo, T. He, D. Ramanan, and G. Shi, "Hdmi: Learning interactive humanoid whole-body control from human videos," arXiv:2509.16757, 2025.
- [14] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, "Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots," in Proceedings of Robotics: Science and Systems (RSS), 2024.
- [15] S. Luo, Y. Li, Y. Hu, C. Yu, C. Xu, J. Zhang, G. Yao, T. Huang, R. He, and Z. Wang, "OmniUMI: Towards physically grounded robot learning via human-aligned multimodal interaction," arXiv:2604.10647, 2026.
- [16] F. Lin, Y. Hu, P. Sheng, C. Wen, J. You, and Y. Gao, "Data scaling laws in imitation learning for robotic manipulation," arXiv:2410.18647, 2024.
- [17] H. Choi, Y. Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song, "In-the-wild compliant manipulation with umi-ft," arXiv:2601.09988, 2026.
- [19] Q. Zeng, C. Li, J. S. John, Z. Zhou, J. Wen, G. Feng, Y. Zhu, and Y. Xu, "Activeumi: Robotic manipulation with active perception from robot-free human demonstrations," arXiv:2510.01607, 2025.
- [20] H. Ha, Y. Gao, Z. Fu, J. Tan, and S. Song, "Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers," arXiv:2407.10353, 2024.
- [21] H. Gupta, X. Guo, H. Ha, C. Pan, M. Cao, D. Lee, S. Scherer, S. Song, and G. Shi, "Umi-on-air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies," arXiv:2510.02614, 2026.
- [22] X. Xu, J. Park, H. Zhang, E. Cousineau, A. Bhat, J. Barreiros, D. Wang, and S. Song, "Hommi: Learning whole-body mobile manipulation from human demonstrations," arXiv:2603.03243, 2026.
- [23] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, "Learning fine-grained bimanual manipulation with low-cost hardware," arXiv:2304.13705, 2023.
- [24] Z. Fu, T. Z. Zhao, and C. Finn, "Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation," arXiv:2401.02117, 2024.
- [25] M. Shi, S. Peng, J. Chen, H. Jiang, Y. Li, D. Huang, P. Luo, H. Li, and L. Chen, "Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration," arXiv:2602.10106, 2026.
- [26] Z. Zhao, L. Yu, K. Jing, and N. Yang, "Xrobotoolkit: A cross-platform framework for robot teleoperation," in 2026 IEEE/SICE International Symposium on System Integration (SII), IEEE, 2026, pp. 15–20.
- [27] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, "Diffusion policy: Visuomotor policy learning via action diffusion," in Proceedings of Robotics: Science and Systems (RSS), 2023.
- [28] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, "On the continuity of rotation representations in neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5745–5753.
- [30] K. Zakka, "Mink: Python inverse kinematics based on MuJoCo," https://github.com/kevinzakka/mink, version 1.1.0, Feb. 2026.
- [31] Q. Liao, T. E. Truong, X. Huang, Y. Gao, G. Tevet, K. Sreenath, and C. K. Liu, "Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion," arXiv:2508.08241, 2025.
- [32] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, Z. Yi, G. Qu, K. Kitani, J. Hodgins, L. J. Fan, Y. Zhu, C. Liu, and G. Shi, "Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills," arXiv:2502.01143, 2025.
- [34] Q. Lu, Y. Feng, B. Shi, M. Piseno, Z. Bao, and C. K. Liu, "Gentlehumanoid: Learning upper-body compliance for contact-rich human and object interaction," arXiv:2511.04679, 2025.
- [35] G. Li, Q. Shi, Y. Hu, J. Hu, Z. Wang, X. Wang, and S. Luo, "Thor: Towards human-level whole-body reactions for intense contact-rich environments," arXiv:2510.26280, 2025.
- [36] K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel, "mjlab: A lightweight framework for GPU-accelerated robot learning," 2026.