
arxiv: 2606.17200 · v1 · pith:SGQTYOWKnew · submitted 2026-06-15 · 💻 cs.RO

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Pith reviewed 2026-06-27 03:21 UTC · model grok-4.3

classification 💻 cs.RO
keywords VLA pretraining · egocentric human videos · pseudo-action trajectories · reliability-aware weighting · unified action representation · robotic manipulation · human-robot data unification · bimanual manipulation

The pith

A unified VLA framework turns egocentric human videos into pseudo robot actions and uses reliability-aware weighting to improve pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that abundant egocentric human videos can supply useful training signals for vision-language-action models once converted into robot-compatible pseudo trajectories. It does this by building a scalable video-to-action pipeline and aligning the data through camera-space actions, morphology conditioning, and chunked timing. A reliability-aware objective then down-weights noisy human labels while adding an auxiliary loss on the human data. If successful, this approach lets models scale pretraining with cheap human footage rather than only costly robot demonstrations, producing better results on manipulation benchmarks after both joint pretraining and fine-tuning.

Core claim

ACE-EGO-0's core claim is that joint pretraining on 4.53K hours of robot and simulation data plus 1.48K hours of pseudo-action-labeled egocentric human data consistently improves both unified joint pretraining and supervised fine-tuning. The paper attributes the gain to two components: a unified camera-space action representation with morphology conditioning and time-aligned chunking, and a reliability-aware training objective with a human auxiliary loss. The resulting model is reported to reach state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0 and to transfer to real-world bimanual manipulation.
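
As a quick check on the data mixture, the hours quoted above imply that human video makes up roughly a quarter of the pretraining corpus by duration. The Python sketch below only reproduces that arithmetic; the function and constant names are illustrative and do not come from the paper.

```python
# Hours quoted in the paper's abstract, in thousands of hours.
ROBOT_SIM_KHOURS = 4.53
HUMAN_KHOURS = 1.48


def mixture_shares(robot_sim: float, human: float) -> tuple[float, float, float]:
    """Return (total, robot_share, human_share) for a two-source mixture."""
    total = robot_sim + human
    return total, robot_sim / total, human / total


total, robot_share, human_share = mixture_shares(ROBOT_SIM_KHOURS, HUMAN_KHOURS)
print(f"total: {total:.2f}K hours")            # total: 6.01K hours
print(f"robot/sim share: {robot_share:.1%}")   # robot/sim share: 75.4%
print(f"human share: {human_share:.1%}")       # human share: 24.6%
```

These shares are by duration only. The abstract does not say how the sources are sampled during training, so the effective mixing ratio may differ.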

What carries the argument

The argument rests on the reliability-aware training objective and its human auxiliary loss, which concentrate supervision on the reliable pseudo-action signals that the egocentric video-to-action pipeline extracts from human videos.
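
The paper's exact objective is not given on this page, so the following is only a sketch of what a reliability-weighted loss of this kind could look like. It assumes each human sample carries a hypothetical reliability score in [0, 1], that robot samples are trusted fully, and that the human term enters as an auxiliary loss scaled by a hypothetical coefficient `human_coef`. None of these names or choices are confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    loss: float                # per-sample action loss (e.g. a regression error)
    is_human: bool             # True for pseudo-labeled human video, False for robot data
    reliability: float = 1.0   # hypothetical confidence score in [0, 1]


def weighted_objective(batch: list[Sample], human_coef: float = 0.5) -> float:
    """Mean robot loss plus a reliability-weighted human auxiliary loss.

    Assumed form (not the paper's): L = L_robot + human_coef * L_human,
    where L_human is a reliability-weighted mean over human samples.
    """
    robot = [s.loss for s in batch if not s.is_human]
    human = [s for s in batch if s.is_human]
    robot_term = sum(robot) / len(robot) if robot else 0.0
    weight_sum = sum(s.reliability for s in human)
    human_term = (
        sum(s.reliability * s.loss for s in human) / weight_sum
        if weight_sum > 0
        else 0.0
    )
    return robot_term + human_coef * human_term


# Synthetic batch for illustration only: one noisy human label with low reliability.
batch = [
    Sample(loss=0.2, is_human=False),
    Sample(loss=0.4, is_human=False),
    Sample(loss=1.0, is_human=True, reliability=0.9),
    Sample(loss=5.0, is_human=True, reliability=0.1),
]
print(weighted_objective(batch))
```

In this toy batch the low-reliability human sample contributes little to the objective. If both human samples had equal reliability, the large loss on the noisy label would dominate the human term.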

If this is right

  • Joint pretraining that adds the weighted human data improves performance over robot-only baselines.
  • The same weighted human signals also improve results after supervised fine-tuning.
  • The resulting model reaches state-of-the-art scores on RoboCasa GR1 TableTop.
  • The resulting model reaches state-of-the-art scores on RoboTwin 2.0.
  • The pretrained model transfers effectively to real-world bimanual manipulation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method points toward training pipelines that can draw on far larger volumes of everyday human video instead of depending primarily on robot-collected trajectories.
  • Similar conversion and weighting steps could be tested on other sources of human movement data such as third-person videos or motion-capture archives.
  • If the reliability weighting proves robust, it may allow incremental addition of new noisy data sources without retraining the entire model from scratch.

Load-bearing premise

The pseudo-action trajectories extracted from egocentric human videos remain useful and comparable to real robot demonstrations once placed in the unified camera-space representation with morphology conditioning.
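
To make the premise concrete, here is a sketch of one common reading of "camera-space" actions: end-effector or wrist positions are expressed in the frame of the egocentric camera instead of a world or robot-base frame. The paper's actual definition is not given on this page. The function below assumes the camera pose is known as a rotation `R_wc` and position `t_wc` in the world frame, and both names are hypothetical.

```python
Vec3 = tuple[float, float, float]
Mat3 = tuple[Vec3, Vec3, Vec3]


def world_to_camera(p_world: Vec3, R_wc: Mat3, t_wc: Vec3) -> Vec3:
    """Map a world-frame point into the camera frame.

    R_wc and t_wc are the camera's rotation and position in the world frame,
    so p_cam = R_wc^T (p_world - t_wc).
    """
    d = tuple(p - t for p, t in zip(p_world, t_wc))
    # Multiplying by R_wc^T means taking the dot product of d with each column of R_wc.
    return tuple(sum(R_wc[r][c] * d[r] for r in range(3)) for c in range(3))


# Camera at (1, 0, 0), rotated 90 degrees about the world z-axis,
# so the camera's x-axis points along world y.
R_wc = ((0.0, -1.0, 0.0),
        (1.0, 0.0, 0.0),
        (0.0, 0.0, 1.0))
t_wc = (1.0, 0.0, 0.0)

print(world_to_camera((1.0, 2.0, 0.0), R_wc, t_wc))  # (2.0, 0.0, 0.0)
```

Expressing human and robot trajectories in the same camera frame removes the dependence on each dataset's world or base frame. It does not address differences in hand and gripper morphology, which the paper handles separately through morphology conditioning.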

What would settle it

An ablation that trains the same VLA architecture on the 4.53K hours of robot and simulation data alone. If the version that adds the 1.48K hours of weighted human pseudo-actions shows no improvement, or a drop, on RoboCasa GR1 TableTop and RoboTwin 2.0 relative to that baseline, the central claim fails.
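
One way to judge whether such an ablation shows a real difference is to compare success rates together with their sampling error. The sketch below uses a binomial standard error and an unpooled z-score. The counts are placeholders chosen for illustration and are NOT results from the paper; the function names are likewise hypothetical.

```python
import math


def success_rate_with_se(successes: int, trials: int) -> tuple[float, float]:
    """Success rate and its binomial standard error."""
    p = successes / trials
    return p, math.sqrt(p * (1 - p) / trials)


def difference_z(s_a: int, n_a: int, s_b: int, n_b: int) -> float:
    """Unpooled z-score for the difference in success rate between runs A and B.

    Undefined (division by zero) when both runs have a rate of exactly 0 or 1.
    """
    p_a, se_a = success_rate_with_se(s_a, n_a)
    p_b, se_b = success_rate_with_se(s_b, n_b)
    return (p_a - p_b) / math.sqrt(se_a**2 + se_b**2)


# Placeholder counts for illustration only; these are not numbers from the paper.
# A = robot + human pretraining, B = robot-only baseline.
z = difference_z(s_a=70, n_a=100, s_b=60, n_b=100)
print(f"z = {z:.2f}")
```

With 100 trials per arm, a ten-point gap in success rate gives a z-score of about 1.5, below the conventional 1.96 threshold. This is why the number of evaluation episodes and the reported error bars matter when reading the benchmark comparison.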

Original abstract

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
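
The abstract names "time-aligned action chunking" without defining it here. One plausible reading, sketched below under that assumption, is that trajectories recorded at different rates are resampled to a shared control rate so that a chunk of a fixed number of steps spans the same wall-clock duration for every source. The function names, rates, and chunk horizon are hypothetical.

```python
def resample(values: list[float], src_hz: float, dst_hz: float) -> list[float]:
    """Linearly interpolate a 1-D trajectory from src_hz to dst_hz.

    Requires at least two input samples.
    """
    n_out = int((len(values) - 1) * dst_hz / src_hz + 1e-9) + 1
    out = []
    for i in range(n_out):
        x = i * src_hz / dst_hz             # fractional index into the source
        lo = min(int(x), len(values) - 2)   # clamp so lo + 1 stays in range
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def chunk(values: list[float], horizon: int) -> list[list[float]]:
    """Split a trajectory into non-overlapping chunks of `horizon` steps.

    A trailing remainder shorter than `horizon` is dropped.
    """
    return [values[i:i + horizon] for i in range(0, len(values) - horizon + 1, horizon)]


# Synthetic 30 Hz trajectory (13 samples, 0.4 s) resampled to a 10 Hz control rate.
video_traj = [float(i) for i in range(13)]
aligned = resample(video_traj, src_hz=30.0, dst_hz=10.0)
print(aligned)                     # [0.0, 3.0, 6.0, 9.0, 12.0]
print(chunk(aligned, horizon=2))   # [[0.0, 3.0], [6.0, 9.0]]
```

Real action vectors are multi-dimensional and rotations need their own interpolation scheme, so this one-dimensional version only illustrates the timing logic.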

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ACE-EGO-0, a unified VLA pretraining framework for jointly leveraging robot demonstrations and egocentric human videos. It describes a scalable egocentric video-to-action pipeline that converts human videos into robot-format pseudo-action trajectories, a unified action representation based on camera-space actions, morphology conditioning, and time-aligned chunking, and a reliability-aware training objective with a human auxiliary loss to focus on reliable signals. The model is instantiated on 4.53K hours of robot/simulation data plus 1.48K hours of pseudo-labeled human data, with claims that human supervision under this weighting consistently improves joint pretraining and fine-tuning, achieving SOTA on RoboCasa GR1 TableTop and RoboTwin 2.0 while showing real-world bimanual transfer.

Significance. If the empirical results hold and the pseudo-action signals are validated as useful, the work could meaningfully advance scalable VLA pretraining by demonstrating how to incorporate abundant human video data alongside scarce robot trajectories. The reliability-aware objective and unified representation address practical challenges in heterogeneous embodiment and noisy supervision, with potential to reduce dependence on expensive robot data collection.

major comments (3)
  1. [Abstract] The claims of 'consistent improvements' and 'state-of-the-art performance' on RoboCasa GR1 TableTop and RoboTwin 2.0 are presented without any quantitative metrics, baseline comparisons, ablation results, or error bars, which is load-bearing for evaluating whether gains arise from the human data rather than architecture or data volume alone.
  2. [Egocentric video-to-action pipeline] No quantitative checks on pseudo-action fidelity are described (e.g., robot execution success rates of the converted trajectories or correlation with ground-truth robot actions), which is required to substantiate that the human supervision survives conversion to unified camera-space representation and morphology conditioning and contributes under the reliability-aware objective.
  3. [Reliability-aware training objective] The formulation of the human auxiliary loss and weighting mechanism lacks ablations isolating its contribution versus the joint architecture or extra data volume; without these, it is unclear whether the reported gains on the two benchmarks are attributable to selective amplification of reliable human signals.
minor comments (2)
  1. [Unified action representation] The exact definitions of camera-space actions and the morphology conditioning mechanism would benefit from explicit equations or pseudocode for reproducibility.
  2. [Data] Data sources and collection details for the 4.53K hours of robot/simulation data and 1.48K hours of human data should be expanded to support replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results, validation of the pipeline, and ablations for the training objective.

Point-by-point responses
  1. Referee: [Abstract] The claims of 'consistent improvements' and 'state-of-the-art performance' on RoboCasa GR1 TableTop and RoboTwin 2.0 are presented without any quantitative metrics, baseline comparisons, ablation results, or error bars, which is load-bearing for evaluating whether gains arise from the human data rather than architecture or data volume alone.

    Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised manuscript, we will add key metrics (e.g., success rates with error bars), baseline comparisons, and references to ablations to better substantiate the claims of consistent improvements from human data and SOTA performance.

    Revision: yes

  2. Referee: [Egocentric video-to-action pipeline] No quantitative checks on pseudo-action fidelity are described (e.g., robot execution success rates of the converted trajectories or correlation with ground-truth robot actions), which is required to substantiate that the human supervision survives conversion to unified camera-space representation and morphology conditioning and contributes under the reliability-aware objective.

    Authors: The manuscript validates the pipeline through downstream task improvements when including the pseudo-labeled human data. However, we acknowledge the value of direct fidelity checks. We will add quantitative evaluations, including execution success rates on held-out converted trajectories and correlation analyses with available ground-truth where possible, in a new subsection of the revised version.

    Revision: yes

  3. Referee: [Reliability-aware training objective] The formulation of the human auxiliary loss and weighting mechanism lacks ablations isolating its contribution versus the joint architecture or extra data volume; without these, it is unclear whether the reported gains on the two benchmarks are attributable to selective amplification of reliable human signals.

    Authors: We will expand the experiments section with targeted ablations that isolate the reliability-aware objective and human auxiliary loss. These will include comparisons to joint training without the weighting mechanism and to data-volume-matched baselines, to clarify the contribution of selective amplification of reliable signals.

    Revision: yes
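
The correlation analysis discussed in the second response could, in its simplest form, be a per-dimension Pearson correlation between pseudo-actions and reference actions for the same motion. The sketch below assumes paired, time-aligned one-dimensional sequences. The sample values are synthetic and the function name is hypothetical; nothing here is taken from the paper's evaluation.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic values for illustration only: one action dimension over four timesteps.
pseudo_actions = [0.00, 0.10, 0.25, 0.30]
reference_actions = [0.00, 0.12, 0.20, 0.33]
print(f"r = {pearson(pseudo_actions, reference_actions):.3f}")
```

A high correlation shows that the pseudo-actions track the shape of the reference motion. It is insensitive to constant offsets and scale errors, so an absolute error metric would be a useful complement.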

Circularity Check

0 steps flagged

No significant circularity; empirical framework relies on external data and benchmarks.

Full rationale

The paper presents an empirical VLA pretraining approach that converts egocentric videos to pseudo-actions, applies a unified camera-space representation with morphology conditioning, and uses a reliability-aware loss to weight human supervision. No equations, derivations, or 'predictions' are described that reduce by construction to fitted inputs or self-citations. The claimed gains on RoboCasa and RoboTwin benchmarks are evaluated against external test sets rather than internal redefinitions, so the chain is self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5845 in / 1105 out tokens · 41568 ms · 2026-06-27T03:21:32.612848+00:00 · methodology

