Recognition: 2 theorem links
HumanNet: Scaling Human-centric Video Learning to One Million Hours
Pith reviewed 2026-05-11 00:45 UTC · model grok-4.3
The pith
A one-million-hour human video dataset lets vision-language models learn physical interactions better than training on real robot data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HumanNet is a one-million-hour human-centric video corpus spanning first- and third-person views, fine-grained activities, human-object interactions, tool use, and long-horizon behaviors, accompanied by interaction-centric annotations including captions, motion descriptions, and hand-body signals. The authors treat human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment as first-class design principles that convert unstructured internet video into a substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. In a controlled vision-language-action ablation under a fixed validation set, continued training from the Qwen VLM with 1000 hours of egocentric video drawn from HumanNet surpasses continued training with 100 hours of real-robot data from Magic Cobot.
What carries the argument
HumanNet's systematic data curation paradigm that applies human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment to internet video.
If this is right
- Embodied models can be scaled using abundant human video rather than limited robot recordings.
- Human-to-robot transfer becomes feasible at larger data volumes without proportional hardware costs.
- Representation learning, activity understanding, and motion generation all benefit from the same interaction-centric annotations.
- Unstructured internet video can be systematically turned into training data once the four curation principles are applied.
- Vision-language-action training benefits from viewpoint diversity across first- and third-person footage.
Where Pith is reading between the lines
- Data-collection budgets in robotics could shift from building robot fleets to mining and annotating existing human video archives.
- The same curation principles might be tested on other modalities such as audio or force signals to further reduce reliance on physical hardware.
- Downstream robotic manipulation benchmarks could be used to measure whether the observed VLM gains translate to actual policy improvement.
- If the performance edge holds at even larger scales, training runs that currently require robot time could instead run on cloud video corpora.
Load-bearing premise
The 1000-hour egocentric human subset and the 100-hour robot subset are comparable in task distribution, interaction complexity, and annotation quality so that performance differences can be attributed to the data source rather than other factors.
What would settle it
Re-running the exact continued-training experiment after matching the human and robot subsets for task distribution, interaction complexity, and annotation quality; if the human-video advantage disappears, the claim that egocentric human data is a scalable substitute would be falsified.
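One way to operationalize the matching step above is stratified subsampling: cap the human subset's per-task hours at the robot subset's budget before re-running the continued-training experiment. The clip schema (`task` and `hours` fields) and the function below are illustrative assumptions, not the paper's actual pipeline:

```python
import random
from collections import Counter, defaultdict

def match_subset(human_clips, robot_clips, seed=0):
    """Subsample human clips so per-task hours match the robot subset's budget.

    Each clip is a dict with hypothetical "task" and "hours" fields; the real
    HumanNet / Magic Cobot metadata schemas are assumptions here.
    """
    rng = random.Random(seed)
    # Per-task hour budget defined by the robot subset.
    budget = Counter()
    for clip in robot_clips:
        budget[clip["task"]] += clip["hours"]
    # Group human clips by task, then greedily fill each task's budget.
    by_task = defaultdict(list)
    for clip in human_clips:
        by_task[clip["task"]].append(clip)
    matched = []
    for task, hours in budget.items():
        pool = by_task[task][:]
        rng.shuffle(pool)
        total = 0.0
        for clip in pool:
            if total >= hours:
                break
            matched.append(clip)
            total += clip["hours"]
    return matched
```

A matched subset built this way controls task distribution only; interaction complexity and annotation quality would still need separate auditing before the causal claim is settled.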
read the original abstract
Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation on the value of this design through controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HumanNet, a 1-million-hour human-centric video dataset spanning first- and third-person views with annotations for captions, motion descriptions, and hand/body signals. It outlines a systematic curation paradigm treating human-centric filtering, temporal structuring, and viewpoint diversity as core principles for embodied representation learning. The central empirical validation shows that continued training of the Qwen VLM on 1000 hours of egocentric HumanNet video outperforms the identical procedure on 100 hours of real-robot data from Magic Cobot under a fixed validation set, suggesting human video as a scalable substitute for robot data.
Significance. If the ablation comparison holds after addressing volume and distribution controls, the result would indicate that large-scale human-centric video can serve as a cost-effective proxy for scarce robot interaction data, lowering barriers to training embodied vision-language-action models. The dataset scale and curation framework represent a substantial infrastructure contribution to the field.
major comments (1)
- [Abstract] The claim that continued training from Qwen VLM with 1000 hours of HumanNet egocentric video surpasses the same procedure with 100 hours of Magic Cobot robot data is load-bearing for the substitution conclusion, yet the tenfold volume mismatch is unaddressed. No equal-volume HumanNet baseline (e.g., 100 hours) or explicit matching of task distributions, interaction complexity, or annotation quality between subsets is reported, so performance differences cannot be causally attributed to data source rather than quantity.
minor comments (2)
- [Abstract] The abstract would be strengthened by specifying the exact metrics, validation set composition (e.g., robot-specific vs. generic actions), and any statistical significance tests supporting the performance comparison.
- Quantitative statistics on corpus diversity (number of environments, activity categories, or viewpoint balance) would better substantiate the claims of broad coverage and systematic curation.
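On the significance-testing point, a paired percentile-bootstrap confidence interval over per-episode scores on the shared validation set is one standard option. The per-episode score representation is an assumption here; the paper does not report its evaluation format:

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean per-episode score difference
    between two models evaluated on the same fixed validation episodes.

    scores_a and scores_b are aligned per-episode metrics (an assumed
    representation of the fixed validation set).
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_boot):
        # Resample paired differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

An interval excluding zero would show the observed advantage is unlikely to be resampling noise; it would not address the volume confound, only measurement uncertainty.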
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The point raised about the volume mismatch in the ablation study is well-taken and directly impacts the strength of our substitution claim. We address it below and will revise the paper accordingly.
read point-by-point responses
- Referee: [Abstract] The claim that continued training from Qwen VLM with 1000 hours of HumanNet egocentric video surpasses the same procedure with 100 hours of Magic Cobot robot data is load-bearing for the substitution conclusion, yet the tenfold volume mismatch is unaddressed. No equal-volume HumanNet baseline (e.g., 100 hours) or explicit matching of task distributions, interaction complexity, or annotation quality between subsets is reported, so performance differences cannot be causally attributed to data source rather than quantity.
  Authors: We agree that the tenfold volume difference (1000 hours HumanNet vs. 100 hours Magic Cobot) prevents strong causal attribution to data source alone and that the current comparison is insufficient for the substitution conclusion. In the revised manuscript we will add a controlled equal-volume baseline using 100 hours of egocentric HumanNet video under the same training protocol and validation set. We will also add a section comparing task distributions, interaction complexity, and annotation characteristics between the HumanNet subset and the Magic Cobot data to clarify what is and is not matched. These additions will be placed in the experiments section and referenced from the abstract. Revision: yes.
Circularity Check
No circularity: empirical comparison stands on direct experimental outcome
full rationale
The paper presents HumanNet as a dataset and validates its utility via a controlled ablation experiment comparing continued training from Qwen VLM on 1000 hours of egocentric HumanNet video versus 100 hours of Magic Cobot robot data, under fixed validation. No derivations, equations, or first-principles results are claimed. No parameters are fitted and then renamed as predictions. No self-citations are invoked to establish uniqueness theorems or load-bearing premises. The reported performance delta is a direct empirical measurement rather than a quantity that reduces to its own inputs by construction. Potential confounding from unequal data volumes affects causal interpretation but does not create circularity in any derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human-centric video captures transferable interaction patterns usable for robot learning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction: reality_from_one_distinction (unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Passage: "under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot"
- IndisputableMonolith/Cost/FunctionalEquation: washburn_uniqueness_aczel (unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Passage: "HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report, 2025.
- [2] Anthony Brohan, Noah Brown, Justice Carbajal, et al. RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- [3] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-970, 2015.
- [4] Yu-Wei Chao, Wei Yang, Yu Xiang, et al. DexYCB: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9044-9053, 2021.
- [5] Zhe Chen, Weiyun Wang, Yue Cao, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2024.
- [6] Embodiment Collaboration, Abby O'Neill, Abdul Rehman, et al. arXiv, 2025.
- [7] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1):33-55, 2022.
- [8] DeepSeek-AI et al. DeepSeek-V3 technical report, 2024.
- [9] Yufan Deng, Zilin Pan, Hongyu Zhang, et al. Rethinking video generation model for the embodied world. arXiv preprint arXiv:2601.15282, 2026.
- [10] Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, et al. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023. URL https://arxiv.org/abs/2307.00595
- [11] Gemma Team. Gemma 3 technical report, 2025.
- [12] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al. The "something something" video database for learning and evaluating visual common sense, 2017.
- [13] Kristen Grauman, Andrew Westbury, Eugene Byrne, et al.
- [14] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, et al.
- [15] Chunhui Gu, Chen Sun, David A. Ross, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions, 2018. URL https://arxiv.org/abs/1705.08421
- [16] Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video, 2026. URL https://arxiv.org/abs/2505.11709
- [17] Ahad Jawaid and Yu Xiang. OpenEgo: A large-scale multimodal egocentric dataset for dexterous manipulation.
- [18]
- [19] Simar Kareer, Dhruv Patel, Ryan Punamiya, et al. EgoMimic: Scaling imitation learning via egocentric video, 2024. URL https://arxiv.org/abs/2410.24221
- [20] Will Kay, Joao Carreira, Karen Simonyan, et al. The Kinetics human action video dataset, 2017. URL https://arxiv.org/abs/1705.06950
- [21] Alexander Khazatsky, Karl Pertsch, Suraj Nair, et al.
- [22] URL https://arxiv.org/abs/2403.12945
- [23] Yunze Liu, Yun Liu, Che Jiang, et al. HOI4D: A 4D egocentric dataset for category-level human-object interaction, 2024. URL https://arxiv.org/abs/2203.01577
- [24] Hao Luo, Yicheng Feng, Wanpeng Zhang, et al. Being-H0: Vision-language-action pretraining from large-scale human videos, 2025. URL https://arxiv.org/abs/2507.15597
- [25] Hao Luo, Ye Wang, Wanpeng Zhang, et al. Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization, 2026. URL https://arxiv.org/abs/2601.12993
- [26] Hao Luo, Wanpeng Zhang, Yicheng Feng, et al. Being-H0.7: A latent world-action model from egocentric videos, 2026. URL https://arxiv.org/abs/2605.00078
- [27] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding, 2023. URL https://arxiv.org/abs/2308.09126
- [28] Microsoft et al. Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs, 2025.
- [29] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, 2019. URL https://arxiv.org/abs/1906.03327
- [30] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
- [31] NVIDIA et al. GR00T N1: An open foundation model for generalist humanoid robots, 2025.
- [32] Ryan Punamiya, Simar Kareer, Zeyi Liu, et al. arXiv, 2026.
- [33] Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, et al. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096-21106, 2022.
- [34] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding, 2020. URL https://arxiv.org/abs/2004.06704
- [35] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding, 2016. URL https://arxiv.org/abs/1604.01753
- [36] Wei Wu, Fan Lu, Yunnan Wang, et al. A pragmatic VLA foundation model. arXiv preprint arXiv:2601.18692, 2026.
- [37] Sicheng Xie, Haidong Cao, Zejia Weng, et al. Human2Robot: Learning robot actions from paired human-robot videos, 2025. URL https://arxiv.org/abs/2502.16587
- [38] An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025.
- [39] Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. HACS: Human action clips and segments dataset for recognition and temporal localization, 2019. URL https://arxiv.org/abs/1712.09374
- [40] Ruijie Zheng, Dantong Niu, Yuqi Xie, et al. EgoScale: Scaling dexterous manipulation with diverse egocentric human data, 2026. URL https://arxiv.org/abs/2602.16710
- [41] Brianna Zitkovich, Tianhe Yu, Sichun Xu, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165-2183. PMLR, 2023.