GazeVLA: Learning Human Intention for Robotic Manipulation
Pith reviewed 2026-05-08 11:33 UTC · model grok-4.3
The pith
GazeVLA models human intention via gaze to transfer knowledge from human videos to robotic manipulation with limited robot data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents GazeVLA as a framework that treats gaze as a direct proxy for human intention. The model is pretrained on large-scale egocentric human video to capture how gaze precedes and guides physical actions, then finetuned on small robot and human datasets. During inference it follows a Chain-of-Thought process that first outputs an intention prediction and only then generates the action sequence, yielding consistent gains over baselines in simulation, real-world trials, long-horizon tasks, and few-shot settings.
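The intention-then-action inference loop described above can be sketched in a few lines. Everything here is a hypothetical illustration of the two-stage structure, not the authors' actual API: `predict_intention`, `predict_actions`, and the toy observation format are invented for this sketch.

```python
# Minimal sketch of intention-then-action Chain-of-Thought inference.
# All function names and data shapes are hypothetical stand-ins, not
# the paper's real model interface.

def predict_intention(observation, instruction):
    """Stand-in for the intention head: returns a 2D gaze-like target
    point in normalized image coordinates."""
    # Toy heuristic: attend to the object named last in the instruction.
    return observation["objects"].get(instruction.split()[-1], (0.5, 0.5))

def predict_actions(observation, instruction, intention, horizon=3):
    """Stand-in for the action head, conditioned on the predicted intention."""
    x, y = intention
    # Toy policy: move the end-effector toward the intended point.
    return [{"dx": x - 0.5, "dy": y - 0.5, "step": t} for t in range(horizon)]

def cot_rollout(observation, instruction):
    # Stage 1: predict intention first (the explicit CoT step).
    intention = predict_intention(observation, instruction)
    # Stage 2: generate actions conditioned on that intention.
    return intention, predict_actions(observation, instruction, intention)

obs = {"objects": {"cup": (0.8, 0.3)}}
intention, actions = cot_rollout(obs, "pick up the cup")
print(intention)     # (0.8, 0.3)
print(len(actions))  # 3
```

The point of the sketch is only the ordering: the action head never sees the scene without an intention estimate already in hand.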
What carries the argument
Gaze as an observable proxy for human intention, used both in pretraining on egocentric human data and in a Chain-of-Thought inference step that predicts intention before action.
If this is right
- Robotic systems can be trained with far fewer robot demonstrations by first learning from human gaze data.
- Performance improves on both long-horizon tasks and fine-grained manipulation under few-shot conditions.
- The same model achieves state-of-the-art results across simulation and physical robot evaluations.
- Better robustness is observed when test conditions differ from training distributions.
Where Pith is reading between the lines
- Large collections of everyday human videos could become a primary training resource for robot policies without requiring matched robot recordings.
- Adding other observable cues such as hand trajectories might further strengthen the intention signal across embodiment differences.
- The explicit intention step could make robot decisions easier to inspect or correct during deployment.
Load-bearing premise
That gaze patterns reliably encode the intention behind human actions in a form that transfers to robot bodies without losing essential task information.
What would settle it
An experiment in which a version of the model trained without the gaze-intention pretraining stage matches or exceeds GazeVLA performance on the same long-horizon and fine-grained benchmarks would show that gaze is not required to bridge the embodiment gap.
Original abstract
Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance. Project page: https://gazevla.github.io .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GazeVLA, a framework for robotic manipulation that models human intention via gaze as an observable proxy. It pretrains on large-scale egocentric human datasets to capture intention-action synergy, finetunes on limited robot and human data, and employs a Chain-of-Thought inference process that first predicts intention before generating actions. The authors claim this bridges the embodiment gap and yields consistent outperformance over baselines with SOTA results across simulation, real-world, long-horizon, few-shot, and robustness benchmarks.
Significance. If the transfer of gaze-based intention holds, the work could meaningfully reduce dependence on large-scale robot demonstrations by leveraging abundant human egocentric data, advancing scalable embodied learning. The project page offers a positive signal for reproducibility.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts 'consistent outperformance,' 'better generalization,' and 'state-of-the-art performance' across multiple settings, yet supplies no quantitative metrics, baseline names, ablation tables, or error bars. This absence directly undermines verification of the central empirical claim.
- [§3] §3 (Method, Pretraining stage): The load-bearing assumption that gaze learned from human oculomotor data produces an embodiment-agnostic intention representation whose action synergy survives the domain shift to robot kinematics, camera intrinsics, and proprioception is stated but not tested. No alignment loss, domain-adaptation module, or cross-embodiment analysis is described, leaving the transfer mechanism ungrounded.
- [§3.3] §3.3 (Inference, Chain-of-Thought): The sequential intention-then-action prediction is presented as essential, yet no ablation compares it against direct action prediction or alternative intermediate representations, making it impossible to isolate whether CoT contributes to the claimed gains.
minor comments (2)
- [Abstract] The abstract would be strengthened by naming the specific human datasets, robot platforms, and task suites used in the evaluations.
- [§3.1] Notation for gaze representation (e.g., as 2D heatmaps, 3D rays, or tokens) should be defined explicitly in §3.1 to aid reproducibility.
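One common convention the minor comment asks the authors to pin down is rendering the 2D gaze fixation as a Gaussian heatmap over the image grid. The sketch below shows that encoding; it is an illustrative convention, not necessarily the representation GazeVLA actually uses.

```python
# Encode a 2D gaze point as a Gaussian heatmap, one of the candidate
# representations (heatmap / ray / token) the comment asks to be defined.
import math

def gaze_heatmap(gx, gy, size=8, sigma=1.0):
    """Return a size x size heatmap peaked at pixel (gx, gy)."""
    return [[math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)]

hm = gaze_heatmap(3, 5)
peak = max((v, (x, y)) for y, row in enumerate(hm) for x, v in enumerate(row))
print(peak[1])  # (3, 5) — the heatmap peaks at the gaze point
```

A heatmap keeps the gaze signal in image space, which makes it directly compatible with a shared visual backbone; rays or discrete tokens trade that spatial alignment for compactness.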
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve the presentation of quantitative results and to provide additional justification and analysis for the methodological choices.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts 'consistent outperformance,' 'better generalization,' and 'state-of-the-art performance' across multiple settings, yet supplies no quantitative metrics, baseline names, ablation tables, or error bars. This absence directly undermines verification of the central empirical claim.
Authors: We agree that the abstract and §4 would benefit from more explicit quantitative details to support the claims. In the revised manuscript we will update the abstract with key numerical results and expand §4 to include tables that name all baselines, report specific metrics (success rates, generalization scores, etc.), present ablation results, and include error bars or standard deviations computed over multiple runs. revision: yes
-
Referee: [§3] §3 (Method, Pretraining stage): The load-bearing assumption that gaze learned from human oculomotor data produces an embodiment-agnostic intention representation whose action synergy survives the domain shift to robot kinematics, camera intrinsics, and proprioception is stated but not tested. No alignment loss, domain-adaptation module, or cross-embodiment analysis is described, leaving the transfer mechanism ungrounded.
Authors: The transfer mechanism relies on pretraining the model on large-scale human egocentric data to learn gaze-based intention-action synergy, followed by finetuning on limited robot data; the shared visual backbone and intention prediction head are intended to produce representations that are largely embodiment-agnostic. While we did not introduce an explicit alignment loss or domain-adaptation module, the consistent gains observed on both simulated and real-robot tasks provide supporting evidence. To strengthen the grounding, we will add a dedicated paragraph in §3 discussing the transfer assumptions and include a qualitative cross-embodiment analysis (e.g., gaze-prediction visualizations on human and robot images) in the revision. revision: partial
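The pretrain-then-finetune recipe the authors describe can be made concrete with a schematic: a shared backbone and intention head are updated in both stages, while the robot-specific action head is trained only during finetuning. The class and its "training" counters below are toy placeholders, not the actual architecture.

```python
# Schematic of the two-stage recipe: large-scale human pretraining
# followed by small-scale robot finetuning. Counters stand in for
# gradient updates; the real model would be a neural network.

class SharedModel:
    def __init__(self):
        self.backbone_steps = 0   # shared visual backbone, both stages
        self.intention_steps = 0  # gaze/intention head, both stages
        self.action_steps = 0     # robot action head, finetuning only

    def pretrain_on_human(self, n_batches):
        self.backbone_steps += n_batches
        self.intention_steps += n_batches

    def finetune_on_robot(self, n_batches):
        self.backbone_steps += n_batches
        self.intention_steps += n_batches
        self.action_steps += n_batches

model = SharedModel()
model.pretrain_on_human(1000)  # abundant human egocentric data
model.finetune_on_robot(50)    # small robot dataset
print(model.backbone_steps, model.action_steps)  # 1050 50
```

The asymmetry in the counters is the whole argument: most of the representation is shaped by human data, and only a thin action-specific layer must be learned from robot demonstrations.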
-
Referee: [§3.3] §3.3 (Inference, Chain-of-Thought): The sequential intention-then-action prediction is presented as essential, yet no ablation compares it against direct action prediction or alternative intermediate representations, making it impossible to isolate whether CoT contributes to the claimed gains.
Authors: We appreciate the request for an ablation isolating the Chain-of-Thought component. In the revised manuscript we will add an ablation study in §4 that directly compares the full model (intention prediction followed by action generation) against a variant that predicts actions without the intermediate intention step, thereby quantifying the contribution of the sequential reasoning process. revision: yes
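The promised ablation has a simple shape: evaluate the full intention-then-action model and a no-intention variant on the same episodes across several seeds, then compare mean success with spread. The sketch below shows only that protocol; `evaluate` is a placeholder returning deterministic toy scores, not real benchmark results.

```python
# Sketch of the CoT ablation protocol: same seeds, two model variants,
# mean and standard deviation over runs. Scores are illustrative stubs.
import statistics

def evaluate(use_intention, seed):
    # Placeholder for a real policy rollout in the benchmark environments.
    base = 0.7 if use_intention else 0.6
    return base + 0.01 * (seed % 3)

def ablate(seeds=range(5)):
    full = [evaluate(True, s) for s in seeds]
    no_cot = [evaluate(False, s) for s in seeds]
    return (statistics.mean(full), statistics.stdev(full),
            statistics.mean(no_cot), statistics.stdev(no_cot))

m_full, s_full, m_ablated, s_ablated = ablate()
print(round(m_full - m_ablated, 3))  # mean gap attributable to the CoT step
```

Reporting the gap with per-seed spread, rather than a single success rate, is what lets the reader decide whether the intention step carries the claimed gains.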
Circularity Check
No circularity: standard pretrain-finetune pipeline with external data
Full rationale
The paper presents an empirical ML framework: pretrain on large-scale external egocentric human datasets to learn gaze-based intention, then finetune on small robot+human data, followed by CoT inference that predicts intention before action. No equations, derivations, or fitted parameters are shown that reduce claimed performance or intention-action synergy to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central assumption (gaze as transferable proxy) is stated as an argument rather than derived from the model's own outputs. This is self-contained against external benchmarks and receives the default non-circular finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gaze naturally precedes physical actions and serves as an observable proxy for human intent.