pith. sign in

arxiv: 2606.22174 · v1 · pith:TDTYHG4Dnew · submitted 2026-06-20 · 💻 cs.RO

OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation

Pith reviewed 2026-06-26 11:36 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid loco-manipulationwhole-body controlvision-language-actionteleoperation interfaceheterogeneous co-trainingempirical roadmappolicy generalization
0
0 comments X

The pith

A phased empirical roadmap yields a whole-body humanoid VLA that outperforms prior models while using less than half the demonstration data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine the minimal set of design choices needed to train a vision-language-action model that directly maps pixels and language to every degree of freedom on a humanoid robot. It organizes the inquiry as three sequential phases of controlled experiments: selecting a teleoperation interface that exposes the full kinematic chain, pretraining the model on data from static and wheeled platforms, and co-training on additional humanoid demonstrations collected with a simpler interface. If these choices prove sufficient, the resulting policy can handle long-horizon loco-manipulation tasks that require coordinated upper- and lower-body motion across a wide vertical range, without collecting new whole-body demonstrations for every new object or instruction. A reader would care because the approach replaces the common practice of decoupling upper and lower bodies with a single native controller trained on far less target-specific data.

Core claim

The authors show that a joint-based whole-body teleoperation interface, combined with pretraining on static and wheeled dual-arm data and co-training on humanoid-specific demonstrations, produces a policy that coordinates the entire kinematic chain. In a challenging long-horizon task, this policy exceeds the performance of two existing humanoid VLA baselines while requiring less than half the total demonstration time and without any additional whole-body teleoperation on the evaluation objects and instructions.

What carries the argument

The three-phase empirical roadmap of one-variable-at-a-time experiments that isolates the contribution of teleoperation interface, pretraining sources, and heterogeneous co-training to whole-body policy performance.

If this is right

  • Joint-based teleoperation that exposes all degrees of freedom produces higher-quality demonstrations than interfaces that hide part of the kinematic chain.
  • A model pretrained on static and wheeled platforms transfers directly to a humanoid's full action space without architecture changes.
  • Co-training with humanoid data collected via a simpler interface extends the policy to new objects and language instructions without new whole-body demonstrations.
  • The resulting policy achieves higher success rates than prior humanoid VLAs on long-horizon tasks that require coordinated motion across a wide vertical range.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phased testing sequence could be applied to test whether other robot morphologies benefit from mixed-platform pretraining.
  • If the co-training benefit holds under stricter distribution-shift controls, it would lower the data-collection barrier for deploying humanoids in new environments.
  • The roadmap's emphasis on isolating one variable at a time offers a template for future empirical studies of whole-body control that avoid confounding multiple design changes.

Load-bearing premise

That co-training on data collected from static, wheeled, and humanoid platforms will extend policy performance to new objects and instructions without any further whole-body teleoperation on those targets.

What would settle it

A side-by-side comparison in which the co-trained model shows no improvement, or a decline, in success rate on new objects and instructions relative to a model trained only on the original humanoid demonstrations when total data volume and task difficulty are held constant.

Figures

Figures reproduced from arXiv: 2606.22174 by Boyuan Zheng, Haodong Zhu, Junming Zhao, Ruiqian Nai, Tong Zhang, Yang Gao, Yihang Hu, Yingdong Hu, Zunhao Chen.

Figure 1
Figure 1. Figure 1: Overview of OpenHLM. A roadmap of controlled experiments in three phases: (I) we compare teleop interfaces for a low-level whole-body controller and adopt a joint-based interface; (II) we adapt a manipulation VLA to the humanoid’s full action space along several design axes; (III) we extend the policy to new objects and instructions by co-training the full loco-manipulation data with stationary teleop or H… view at source ↗
Figure 2
Figure 2. Figure 2: The HLM-12 Benchmark. Tasks fall into four capability families targeting different aspects of whole-body behavior, with one representative per family shown. Full task specifications are in Appendix A. 2 Design Goals and Task Suite Before launching into the roadmap of §3, we first lay out the design goals the system aims to meet, introduce the HLM-12 benchmark, and describe the evaluation protocol. Design g… view at source ↗
Figure 3
Figure 3. Figure 3: Joint-based vs. SMPL-based whole-body teleop. Joint-space training data reaches 88% average task progress against 75% for SMPL. Two failure modes account for most of the gap. On Bottle Disposal, the SMPL-trained policy lifts the heel without sufficiently lift￾ing the toes, leaving inadequate clear￾ance to depress the pedal. On Cola Placement, it occasionally walks too close to the table and knocks the can … view at source ↗
Figure 4
Figure 4. Figure 4: Future-frame preview latency sweep. Future-frame preview latency: 0.2 s bal￾ances locomotion and manipulation. A whole-body controller trained via motion track￾ing exposes a tunable preview latency ∆t, con￾trolling how far into the future it sees the refer￾ence motion. Longer preview yields smoother motion but adds delay between the operator’s command and its enactment. We sweep ∆t ∈ {0, 0.2, 0.4, 0.6} s o… view at source ↗
Figure 5
Figure 5. Figure 5: VLA design ablations on the 4-task subset. Amber: interface ablations (one choice flipped per bar); drops are minor and no single choice is the bottleneck. Rose: pretraining ablations; robot pretraining (π0.5) dominates, with PaliGemma and from-scratch collapsing sharply. Sage: one-step action generation; both underperform the 10-step baseline by ∼20 points despite lower validation action MSE. humanoid-spe… view at source ↗
Figure 6
Figure 6. Figure 6: Whole-body teleop data scaling. At this point we have a humanoid-adapted VLA: a π0.5-initialized backbone with weight￾surgery action projection, the pretrained biman￾ual ordering, absolute joint targets, propriocep￾tion as input, and multi-step flow matching in￾ference. Carrying it to the 8 training tasks (40 demonstrations each), the system reaches 89% average task progress ( [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 7
Figure 7. Figure 7: Heterogeneous co-training results. Per-task progress on the 4 held-out tasks and aggregate aver￾ages over the 8 training tasks (Tasks 1–8 (avg)), the 3 motion-reuse held-out tasks (Tasks 9–11 (avg)), and all 4 held-out tasks (Tasks 9–12 (avg)). Per-task breakdown for all 12 tasks is in Appendix E.1. Stationary co-training delivers both new motions and new semantic understanding. Stationary teleoperation op… view at source ↗
Figure 8
Figure 8. Figure 8: Long-horizon language-conditioned task. Task and data. The hu￾manoid performs a long-horizon language-conditioned task span￾ning its large vertical workspace ( [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Humanoid robot hardware. The Unitree G1 is equipped with wrist-mounted grippers and onboard cameras. UMI grippers with GoPro cameras HTC VIVE Ultimate trackers [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: Whole-body teleoperation scene and HMD snapshot, of the same frame. Left: the PICO4U kit provides the HMD, two handheld controllers, and two leg trackers for live teleoperation. Right: an in-headset egocentric camera view streamed to the operator during teleoperation. The fisheye head view is placed in the center, with the 2 wrist views placed on its top-left and top-right corners. C Implementation Detail… view at source ↗
Figure 12
Figure 12. Figure 12: Per-task breakdown of heterogeneous co-training results. Task progress for all 12 tasks and three aggregates (rightmost). Four conditions: 8-task baseline, stationary co-training, HuMI co-training, and 12-task oracle. Co-training does not regress training tasks. On held-out tasks, both methods reach near-oracle on motion-reuse tasks (9–11); stationary succeeds on the new-motion task (12), HuMI does not. O… view at source ↗
Figure 13
Figure 13. Figure 13: examines how HuMI demonstration count affects performance on the three motion-reuse held-out tasks (Tasks 9–11), with the 8-task whole-body teleop set fixed. 5 10 20 40 HuMI Demos per Task 50% 75% 100% Task Progress (%) 42% 67% 76% 84% [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
read the original abstract

Whole-body humanoid loco-manipulation requires coordinating the robot's entire kinematic chain. However, most existing systems typically decouple the upper and lower bodies into separate controllers, limiting such coordination and yielding behaviors similar to those of a wheeled dual-arm platform. In this paper, we ask what it takes to build a whole-body native vision-language-action (VLA) model that maps language and pixels directly to all of the humanoid's degrees of freedom. We conduct a systematic empirical study organized as a roadmap of one-variable-at-a-time experiments across three phases: whole-body teleoperation, VLA model design, and heterogeneous co-training. Our study yields several intriguing findings: a joint-based whole-body teleoperation interface outperforms alternatives that only partially expose the humanoid's degrees of freedom; a VLA pretrained on static and wheeled dual-arm platforms transfers surprisingly well to a humanoid's full action space; and co-training with HuMI, the humanoid analog of UMI, extends the policy to new objects and instructions without additional whole-body teleoperation on those targets. Following this roadmap yields OpenHLM, an open-source recipe for whole-body humanoid loco-manipulation. In a challenging long-horizon task that spans a wide vertical range of the humanoid, OpenHLM outperforms two state-of-the-art humanoid VLA baselines (GR00T N1.6 and $\Psi_0$) using less than half the total demonstration time. Our code, training data, and model checkpoints are available at [https://openhlm-project.github.io/].

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper conducts an empirical study organized as a one-variable-at-a-time roadmap across whole-body teleoperation, VLA model design, and heterogeneous co-training phases. It reports that a joint-based teleoperation interface, pretraining on static/wheeled platforms, and co-training with HuMI data together produce OpenHLM, an open-source whole-body VLA that outperforms GR00T N1.6 and Ψ0 on a long-horizon loco-manipulation task spanning wide vertical range while using less than half the total demonstration time. Code, training data, and checkpoints are released.

Significance. If the central efficiency result holds under controlled conditions, the work would supply a practical, reproducible recipe for scaling whole-body humanoid VLA policies by leveraging heterogeneous data sources, thereby lowering the barrier of whole-body teleoperation for new objects and instructions. The explicit release of code, training data, and model checkpoints is a clear strength that supports direct replication and extension by the community.

major comments (1)
  1. [Abstract] Abstract: the central claim that heterogeneous co-training on static, wheeled, and HuMI data extends policy performance to new objects/instructions without any additional whole-body teleoperation demonstrations underpins the reported <half demonstration-time advantage over GR00T N1.6 and Ψ0. No ablation that removes the co-training data while holding humanoid teleoperation fixed, no distribution-shift metrics (e.g., feature-space distance or held-out source-task success gap), and no explicit task-difficulty matching (vertical range, horizon length, object variability) between co-training and evaluation are described. This omission leaves open whether observed gains arise from the co-training mechanism or from model architecture, teleoperation interface, or evaluation conditions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide a point-by-point response to the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that heterogeneous co-training on static, wheeled, and HuMI data extends policy performance to new objects/instructions without any additional whole-body teleoperation demonstrations underpins the reported <half demonstration-time advantage over GR00T N1.6 and Ψ0. No ablation that removes the co-training data while holding humanoid teleoperation fixed, no distribution-shift metrics (e.g., feature-space distance or held-out source-task success gap), and no explicit task-difficulty matching (vertical range, horizon length, object variability) between co-training and evaluation are described. This omission leaves open whether observed gains arise from the co-training mechanism or from model architecture, teleoperation interface, or evaluation conditions.

    Authors: We agree that the manuscript would benefit from a more explicit isolation of the heterogeneous co-training effect. Our empirical study is organized as a one-variable-at-a-time roadmap, with the VLA model design phase incorporating pretraining on static and wheeled platforms, followed by the co-training phase with HuMI data. The performance advantage is demonstrated relative to GR00T N1.6 and Ψ0 baselines. However, we did not include an ablation that trains a model using only the humanoid teleoperation demonstrations without the heterogeneous data sources, nor did we report distribution-shift metrics or detailed task-difficulty matching. We will incorporate these analyses in the revised version to more rigorously support the contribution of the co-training mechanism. revision: yes

Circularity Check

0 steps flagged

Empirical study with no derivations or fitted predictions; results are direct experimental outcomes.

full rationale

The paper conducts a systematic empirical study organized as one-variable-at-a-time experiments across teleoperation, VLA model design, and heterogeneous co-training phases. All reported findings, including the outperformance of OpenHLM over baselines with less demonstration time, are presented as direct results from data collection and training runs. No equations, parameter fits, or derivations appear that could reduce a claimed prediction to its inputs by construction. Self-citations, if present, are not load-bearing for any central result, and the work is self-contained against external benchmarks without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical robotics paper with no mathematical derivations, free parameters, or axioms in the abstract; HuMI is referenced as an analog dataset but no new physical entities are postulated.

pith-pipeline@v0.9.1-grok · 5834 in / 1270 out tokens · 30032 ms · 2026-06-26T11:36:34.382488+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 1 canonical work pages

  1. [1]

    R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen, et al. Hu- manoid manipulation interface: Humanoid whole-body manipulation from robot-free demon- strations.arXiv preprint arXiv:2602.06643, 2026

  2. [2]

    Cheng, Y

    X. Cheng, Y . Ji, J. Chen, R. Yang, G. Yang, and X. Wang. Expressive whole-body control for humanoid robots.arXiv preprint arXiv:2402.16796, 2024

  3. [3]

    M. Liu, Z. Chen, X. Cheng, Y . Ji, R.-Z. Qiu, R. Yang, and X. Wang. Visual whole-body control for legged loco-manipulation.arXiv preprint arXiv:2403.16967, 2024

  4. [4]

    C. Lu, X. Cheng, J. Li, S. Yang, M. Ji, C. Yuan, G. Yang, S. Yi, and X. Wang. Mobile- television: Predictive motion priors for humanoid whole-body control. In2025 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 5364–5371. IEEE, 2025

  5. [5]

    Q. Ben, F. Jia, J. Zeng, J. Dong, D. Lin, and J. Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025. 12

  6. [6]

    J. Li, X. Cheng, T. Huang, S. Yang, R.-Z. Qiu, and X. Wang. Amo: Adaptive motion optimiza- tion for hyper-dexterous humanoid whole-body control.arXiv preprint arXiv:2505.03738, 2025

  7. [7]

    Gr00t n1.6: An improved open foundation model for generalist hu- manoid robots.https://research.nvidia.com/labs/gear/gr00t-n1_6/, Dec

    NVIDIA GEAR Team. Gr00t n1.6: An improved open foundation model for generalist hu- manoid robots.https://research.nvidia.com/labs/gear/gr00t-n1_6/, Dec. 2025. NVIDIA Research Blog, Accessed: 2026-05-06

  8. [8]

    S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.Ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026

  9. [9]

    Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025

  10. [10]

    Introducing helix 02: Full-body autonomy.https://www.figure.ai/news/ helix-02, Jan

    Figure AI. Introducing helix 02: Full-body autonomy.https://www.figure.ai/news/ helix-02, Jan. 2026. Figure AI Blog, Accessed: 2026-05-06

  11. [11]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  12. [12]

    T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

  13. [13]

    Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

  14. [14]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  15. [15]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  16. [16]

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024

  17. [17]

    M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

  18. [18]

    Decoupled wbc.https://nvlabs.github.io/ GR00T-WholeBodyControl/references/decoupled_wbc.html, 2026

    NVIDIA GEAR Team. Decoupled wbc.https://nvlabs.github.io/ GR00T-WholeBodyControl/references/decoupled_wbc.html, 2026. GR00T- WholeBodyControl Documentation. Last updated: 2026-05-07. Accessed: 2026-05-14

  19. [19]

    PICO Immersive Pte. Ltd. PICO 4 Ultra.https://www.picoxr.com/global/products/ pico4-ultra, 2024. Product webpage. Accessed: 2026-05-14

  20. [20]

    J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025

  21. [21]

    ACM Trans

    M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6), Oct. 2015. doi:10.1145/ 2816795.2818013. URLhttps://doi.org/10.1145/2816795.2818013. 13

  22. [22]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  23. [23]

    Beyer, A

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  24. [24]

    Lipman, R

    Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  25. [25]

    M. Deng, H. Li, T. Li, Y . Du, and K. He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026

  26. [26]

    VIVE Ultimate Tracker.https://www.vive.com/eu/accessory/ vive-ultimate-tracker/

    HTC VIVE. VIVE Ultimate Tracker.https://www.vive.com/eu/accessory/ vive-ultimate-tracker/. Accessed: 2026-05-16

  27. [27]

    Cosmos-reason2: Physical ai common sense and embodied reasoning models

    NVIDIA. Cosmos-reason2: Physical ai common sense and embodied reasoning models. https://huggingface.co/nvidia/Cosmos-Reason2-8B, 2025. Accessed: 2026-05-17

  28. [28]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  29. [29]

    Y . Ze, Z. Chen, J. P. Ara´ujo, Z.-a. Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

  30. [30]

    Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

  31. [31]

    Zhang, B

    T. Zhang, B. Zheng, R. Nai, Y . Hu, Y .-J. Wang, G. Chen, F. Lin, J. Li, C. Hong, K. Sreenath, and Y . Gao. Hub: Learning extreme humanoid balance.arXiv preprint arXiv:2505.07294, 2025

  32. [32]

    Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

  33. [33]

    W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

  34. [34]

    Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

  35. [35]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  36. [36]

    Intelligence, A

    P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al.π ∗ 0.6: A VLA That Learns From Experience. arXiv preprint arXiv:2511.14759, 2025

  37. [37]

    Intelligence, A

    P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al.π 0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities.arXiv preprint arXiv:2604.15483, 2026

  38. [38]

    Hoque, P

    R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

  39. [39]

    K. Liu, C. Guan, Z. Jia, Z. Wu, X. Liu, T. Wang, S. Liang, P. Chen, P. Zhang, H. Song, et al. Fastumi: A scalable and hardware-independent universal manipulation interface with dataset. arXiv preprint arXiv:2409.19499, 2024. 14

  40. [40]

    F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao. Data scaling laws in imitation learning for robotic manipulation. InInternational Conference on Learning Representations, volume 2025, pages 54877–54910, 2025

  41. [41]

    Engel, K

    J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023

  42. [42]

    Apple vision pro technical specifications.https://www.apple.com/sg/ apple-vision-pro/specs/, 2026

    Apple Inc. Apple vision pro technical specifications.https://www.apple.com/sg/ apple-vision-pro/specs/, 2026. Accessed: 2026-05-18

  43. [43]

    Meta quest 3.https://www.meta.com/quest/quest-3/, 2026

    Meta Platforms, Inc. Meta quest 3.https://www.meta.com/quest/quest-3/, 2026. Ac- cessed: 2026-05-18

  44. [44]

    Grauman, A

    K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

  45. [45]

    Grauman, A

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  46. [46]

    S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

  47. [47]

    Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022

  48. [48]

    R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

  49. [49]

    S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos. InInternational Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

  50. [50]

    Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  51. [51]

    C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies.arXiv preprint arXiv:2509.17759, 2025

  52. [52]

    R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy˜ human policy.arXiv preprint arXiv:2503.13441, 2025

  53. [53]

    CTAG2F90-D Electric Parallel Gripper

    ChangingTek Robotics Technology (Suzhou) Co., Ltd. CTAG2F90-D Electric Parallel Gripper. https://en.changingtek.com/diandong/147. Accessed: 2026-05-29

  54. [54]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 15 Appendices A The HLM-12 Benchmark 17 A.1 The 12 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2 Overall Evaluation Protocol . . . . . . . . . . . . . . . . ...

  55. [55]

    Four conditions: 8-task baseline, stationary co-training, HuMI co-training, and 12-task oracle

    Pouring Tasks 1 8 (avg) Tasks 9 11 (avg) Tasks 9 12 (avg) 0 25 50 75 100T ask Progress (%) 100 93 87 87 93 87 80 93 85 100 95 85 87 100100100 92 92 72 88 85 70 95 95 90 75 85 80 80 80 80 67 20 100 80 100 47 93 100100 40 73 73 87 25 80 15 90 89 87 87 87 36 89 84 96 33 87 67 94 8-task baseline Stationary co-training HuMI co-training 12-task oracle Figure 12...