HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Pith reviewed 2026-06-26 17:50 UTC · model grok-4.3
The pith
Egocentric human video, once filtered and labeled, can outperform teleoperated robot trajectories for embodied model pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment.
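The percentages above are relative to a baseline that this summary does not state, so the practical size of the gap depends on how they are read. The following is a minimal sketch of the two common readings (relative change versus percentage points). All baseline values are hypothetical and chosen only for illustration; none come from the paper, and the function names are illustrative.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


def absolute_change_points(baseline_rate: float, new_rate: float) -> float:
    """Difference between two rates, in percentage points."""
    return (new_rate - baseline_rate) * 100.0


# Hypothetical values for illustration only; NOT taken from the paper.
robot_loss, ego_loss = 0.50, 0.38
robot_success, ego_success = 0.40, 0.61

loss_rel = relative_change(robot_loss, ego_loss)
succ_rel = relative_change(robot_success, ego_success)
succ_pts = absolute_change_points(robot_success, ego_success)

print(f"loss change (relative): {loss_rel:+.1%}")
print(f"success change (relative): {succ_rel:+.1%}")
print(f"success change (points): {succ_pts:+.1f}")
```

Under these assumed baselines, one relative gain in success rate corresponds to a much smaller number of percentage points, which is why the underlying absolute rates matter when judging the claim.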
What carries the argument
The filtering and labeling pipeline that converts raw egocentric human video into pretraining data aligned for embodied action prediction.
If this is right
- Egocentric pretraining learns more diverse world representations than robot-trajectory pretraining.
- A small amount of labeled real-robot data suffices afterward to align the action space.
- The paradigm reduces dependence on high-cost, low-diversity robot data collection.
- The study supplies guidance for assessing data quality before committing to robot data gathering.
Where Pith is reading between the lines
- The out-of-distribution gains imply that human behavioral variety transfers better to novel robot environments than robot trajectories do.
- If the pipeline proves reusable, the same human-video source could support pretraining across multiple robot embodiments without new collection.
- Scaling the human-video volume further while keeping robot adaptation data fixed could widen the observed gap.
Load-bearing premise
The filtering and labeling pipeline applied to egocentric human video is neutral with respect to the downstream evaluation tasks and does not confer an unfair advantage relative to the raw teleoperated robot trajectories.
What would settle it
Running the identical downstream evaluation on models pretrained with the same human videos but without the described filtering and labeling pipeline, and finding no performance advantage over the robot-data baseline, would falsify the central claim.
Original abstract
Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, under fixed post-training and validation protocols, embodied foundation models pretrained on egocentric human video (after a filtering and labeling pipeline) outperform models pretrained on the same volume of raw teleoperated robot trajectories: a 24% lower validation loss on real-robot action prediction, and 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot tasks, respectively. It concludes that human video is not merely a substitute but can be superior for learning diverse world representations before action-space adaptation with limited robot data.
Significance. If the reported gains are shown to arise from the data distribution rather than pipeline-induced selection effects, the result would be significant for embodied AI scaling: it would support a cheaper, higher-diversity pretraining paradigm that reduces reliance on costly robot teleoperation while still enabling strong downstream robot performance. The fixed-protocol design and quantitative comparisons are strengths that make the finding falsifiable and reproducible in principle.
major comments (3)
- [Abstract] Abstract: The central quantitative claims (24% lower loss, 52.5% and 90% higher success rates) are presented without error bars, number of runs, or statistical significance tests. Because these numbers are the primary evidence for the superiority claim, the absence of variance estimates leaves the magnitude and reliability of the gains difficult to assess.
- [Abstract] Abstract and methods: The human-video pipeline is described only as 'carefully designed filtering and labeling' with no enumeration of criteria (action mapping, quality thresholds, scene selection), no ablation of each step, and no statement that an identical pipeline was applied to the robot trajectories. This detail is load-bearing for the claim that gains derive from the egocentric distribution itself rather than post-hoc selection bias relative to raw robot data.
- [Results] Results (assumed §4 or equivalent): The comparison is between processed human data and raw robot data; without a control that applies the same filtering/labeling steps symmetrically to robot trajectories or an ablation isolating the pipeline's contribution, the reported outperformance cannot be unambiguously attributed to data source rather than processing asymmetry.
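The control requested in the comments above amounts to crossing data source with processing. The following is a minimal sketch of that design grid. The condition names are hypothetical, and quality filtering is treated as the only step that could be shared across sources, since labeling may not apply to robot data that already carries actions.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical factor levels; the paper's own naming may differ.
SOURCES = ("egocentric_human", "teleop_robot")


@dataclass(frozen=True)
class Condition:
    source: str     # which pretraining data source is used
    filtered: bool  # whether quality filtering was applied

    @property
    def name(self) -> str:
        return f"{self.source}/{'filtered' if self.filtered else 'raw'}"


def build_grid() -> list[Condition]:
    """All source x filtering combinations (a 2x2 design)."""
    return [Condition(s, f) for s, f in product(SOURCES, (False, True))]


def single_factor_contrasts(grid: list[Condition]) -> list[tuple[str, str]]:
    """Pairs of conditions that differ in exactly one factor."""
    pairs = []
    for i, a in enumerate(grid):
        for b in grid[i + 1:]:
            differing = (a.source != b.source) + (a.filtered != b.filtered)
            if differing == 1:
                pairs.append((a.name, b.name))
    return pairs


grid = build_grid()
pairs = single_factor_contrasts(grid)
for left, right in pairs:
    print(f"{left}  vs  {right}")
```

The comparison described in the abstract (filtered human video against raw robot trajectories) changes both factors at once, so it does not appear among the single-factor contrasts this sketch lists.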
minor comments (2)
- Notation for action spaces and loss functions should be defined explicitly on first use to aid readers comparing the two data regimes.
- Figure captions for success-rate plots should state the number of evaluation episodes and whether error bars represent standard deviation or standard error.
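To make the second minor comment concrete, the following is a minimal sketch of how a success rate could be reported with its episode count, standard error, and a Wilson score interval. The episode counts are hypothetical, the function name is illustrative, and the Wilson interval is one reasonable choice rather than anything the paper is known to use.

```python
import math


def success_rate_summary(successes: int, episodes: int, z: float = 1.96) -> dict:
    """Success rate with binomial standard error and Wilson score interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if episodes <= 0 or not 0 <= successes <= episodes:
        raise ValueError("need 0 <= successes <= episodes and episodes > 0")
    p = successes / episodes
    std_error = math.sqrt(p * (1 - p) / episodes)
    denom = 1 + z ** 2 / episodes
    center = (p + z ** 2 / (2 * episodes)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / episodes + z ** 2 / (4 * episodes ** 2)) / denom
    )
    return {
        "episodes": episodes,
        "rate": p,
        "std_error": std_error,
        "wilson_low": center - half_width,
        "wilson_high": center + half_width,
    }


# Hypothetical example: 12 successes in 20 evaluation episodes.
summary = success_rate_summary(12, 20)
print(summary)
```

With small episode counts the interval is wide, which is the reason the caption should state the number of episodes alongside the kind of error bar.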
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive suggestions. The comments correctly identify areas where additional transparency and statistical rigor would strengthen the manuscript. We respond to each point below and will incorporate revisions to address the concerns about reporting and methodological clarity.
Point-by-point responses
- Referee: [Abstract] Abstract: The central quantitative claims (24% lower loss, 52.5% and 90% higher success rates) are presented without error bars, number of runs, or statistical significance tests. Because these numbers are the primary evidence for the superiority claim, the absence of variance estimates leaves the magnitude and reliability of the gains difficult to assess.
  Authors: We agree that variance estimates are important for assessing reliability. In the revised manuscript we will report means and standard deviations across multiple random seeds for both validation loss and success rates, and include statistical significance tests (e.g., paired t-tests) comparing the two pretraining conditions.
  Revision: yes
- Referee: [Abstract] Abstract and methods: The human-video pipeline is described only as 'carefully designed filtering and labeling' with no enumeration of criteria (action mapping, quality thresholds, scene selection), no ablation of each step, and no statement that an identical pipeline was applied to the robot trajectories. This detail is load-bearing for the claim that gains derive from the egocentric distribution itself rather than post-hoc selection bias relative to raw robot data.
  Authors: The methods section already enumerates the pipeline criteria, but the abstract is concise. We will expand the abstract to list the main filtering steps and add an explicit clarification that the pipeline is applied exclusively to human video because it requires action inference from visual observations; robot trajectories already contain direct action labels, so the same steps are neither applicable nor necessary.
  Revision: yes
- Referee: [Results] Results (assumed §4 or equivalent): The comparison is between processed human data and raw robot data; without a control that applies the same filtering/labeling steps symmetrically to robot trajectories or an ablation isolating the pipeline's contribution, the reported outperformance cannot be unambiguously attributed to data source rather than processing asymmetry.
  Authors: We will add a dedicated paragraph in the discussion section explaining why symmetric application of the full pipeline is not meaningful (robot data already supplies precise actions). To further isolate effects we will include, where data permits, a quality-filtering ablation on the robot trajectories and report whether performance changes materially.
  Revision: partial
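The first response commits to per-seed means, standard deviations, and paired t-tests. The following is a minimal sketch of that computation using only the Python standard library. The per-seed values are hypothetical placeholders, not results from the paper, and the function name is illustrative.

```python
import math
import statistics


def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples.

    a[i] and b[i] must come from the same seed under the two conditions.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical validation losses per seed; NOT taken from the paper.
ego_loss = [0.40, 0.38, 0.41]
robot_loss = [0.50, 0.52, 0.49]

for label, values in (("egocentric", ego_loss), ("robot", robot_loss)):
    print(f"{label}: mean={statistics.mean(values):.3f} "
          f"sd={statistics.stdev(values):.3f}")

t_stat, dof = paired_t_statistic(ego_loss, robot_loss)
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; a statistics package would normally handle that step.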
Circularity Check
No circularity; empirical comparison is self-contained
Full rationale
The paper reports an empirical head-to-head comparison of pretraining data sources under fixed post-training and validation protocols. No equations, fitted parameters, or derivations are present that would reduce the reported loss reductions or success-rate gains to the input data by construction. The filtering/labeling pipeline is described as part of the human-video processing step, but the central claim is an observed performance difference rather than a mathematical identity or self-referential prediction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. This is the expected non-finding for a data-comparison study whose results remain externally falsifiable.