
arXiv: 2507.12440 · v3 · pith:APTM2RGTnew · submitted 2025-07-16 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Pith reviewed 2026-05-21 04:28 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords Vision-Language-Action models · Egocentric videos · Robot manipulation · Imitation learning · Bimanual tasks · Humanoid robots · Action retargeting · Data efficiency

The pith

A VLA trained on egocentric human videos predicts wrist and hand actions that retarget to robots and improve bimanual tasks after light fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Vision-Language-Action models can learn from large sets of egocentric human videos instead of scarce robot data. The model first learns to output human wrist and hand movements from video and language. These outputs are then converted to robot joint commands through inverse kinematics and retargeting, after which the system receives a small number of real robot demonstrations for fine-tuning. A sympathetic reader would care because this route could let robot policies draw on the scale and variety of everyday human activity without requiring proportional robot hardware time.

Core claim

A Vision-Language-Action model is trained on egocentric human videos to predict human wrist and hand actions. These predictions are converted to robot actions by inverse kinematics and retargeting, and the model is then fine-tuned on a few robot demonstrations. The resulting policy, EgoVLA, records significant gains over baselines across diverse bimanual manipulation tasks in the Ego Humanoid Manipulation Benchmark.

What carries the argument

Vision-Language-Action model that outputs predicted human wrist and hand poses from egocentric video, followed by inverse-kinematics retargeting to robot commands.
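
To make the inverse-kinematics step concrete, here is a minimal sketch for a planar two-link arm. It is an illustration only and not the paper's solver: the link lengths, the 2D simplification, and the function names `two_link_ik` and `two_link_fk` are all assumptions introduced here. A predicted wrist position plays the role of the target, and the solver returns joint angles that place the arm's end point there.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Analytic IK for a planar two-link arm (illustrative assumption).

    Returns (shoulder, elbow) in radians for the positive-elbow solution,
    or None when the target (x, y) is outside the reachable workspace.
    """
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_elbow) > 1.0 + 1e-9:
        return None  # target is too far away or too close to the base
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # absorb rounding error
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def two_link_fk(shoulder, elbow, l1, l2):
    """Forward kinematics, used here to check an IK solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y
```

A real humanoid arm has more joints and a 6-DoF wrist pose target, so a numerical solver would normally replace the closed form above; the sketch only shows what "wrist target in, joint angles out" means.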

If this is right

  • Human video pre-training supplies the scale and scene diversity that robot-only data collection cannot match.
  • Retargeting plus brief fine-tuning closes most of the human-to-robot gap for bimanual tasks.
  • Ablation results indicate that removing the human video stage reduces final task performance.
  • The same pipeline can be applied to additional bimanual tasks once the benchmark demonstrations are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to real-world robot deployment if human videos are recorded in matching environments.
  • It suggests a route to leverage existing large human video corpora without new robot data collection for every new task.
  • Direct prediction of robot actions from human footage might eventually remove the retargeting stage altogether.

Load-bearing premise

Human wrist and hand actions predicted by the model can be mapped to usable robot actions through inverse kinematics and retargeting with only small performance loss and without task-by-task recalibration.

What would settle it

The claim would be undermined if robot success rates on the benchmark tasks stayed the same or dropped when the model initialized from human video replaced a robot-only baseline, with both receiving identical fine-tuning.
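
The comparison described above can be sketched as a two-proportion test on episode outcomes. This is a generic statistical sketch, not an analysis from the paper; the function name `success_rate_gap` is an assumption, and any counts passed to it would be the reader's own.

```python
import math


def success_rate_gap(succ_a, n_a, succ_b, n_b):
    """Compare two success rates with a pooled two-proportion z statistic.

    Arm A could be the human-video-initialized policy and arm B the
    robot-only baseline (labels are illustrative). Returns (difference, z),
    where difference = rate_a - rate_b.
    """
    p_a = succ_a / n_a
    p_b = succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    return p_a - p_b, z
```

A difference near zero or below it, with a small |z|, would be the outcome that counts against the pre-training claim; a clearly positive difference would support it.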

Original abstract

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EgoVLA, a Vision-Language-Action model trained on egocentric human videos to predict human wrist and hand actions. These actions are converted to robot actions via inverse kinematics and retargeting, followed by fine-tuning on a small number of robot demonstrations. The authors propose the Ego Humanoid Manipulation Benchmark with diverse bimanual tasks and claim significant improvements over baselines along with ablations showing the importance of human data.

Significance. If the empirical results hold with robust quantitative support, this work could meaningfully advance data-efficient robot learning by exploiting the scale and scene richness of human egocentric videos for VLA pretraining. The new simulation benchmark for bimanual humanoid manipulation is a constructive contribution to the field. The pipeline's reliance on human-to-robot action conversion is conceptually promising for reducing robot data needs, but its practical impact hinges on demonstrating that retargeting incurs only minor degradation.

major comments (2)
  1. [Experiments / Results] The central data-efficiency claim depends on IK retargeting of predicted human wrist/hand actions producing usable robot actions with only minor loss, so that fine-tuning mainly adapts rather than compensates for embodiment mismatch. No ablation or quantitative metrics are reported for retargeted-only performance (prior to fine-tuning) or for bimanual coordination errors on the benchmark. This is load-bearing for attributing gains to human pretraining rather than the robot demonstrations.
  2. [Method] The method description provides only a high-level account of the retargeting step. It does not specify whether retargeting is task-agnostic, how differences in arm reach, hand DOF, and grasp kinematics are resolved, or whether per-task calibration is required. These details are necessary to evaluate whether the approach truly supports the 'few robot demonstrations' regime.
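
One way to quantify the bimanual coordination errors raised in major comment 1 is to compare the distance between the two end effectors in an executed rollout against a reference trajectory. The sketch below is a hypothetical metric definition, not one taken from the paper; the function name and the choice of mean absolute deviation are assumptions.

```python
import math


def coordination_error(exec_left, exec_right, ref_left, ref_right):
    """Mean absolute deviation of inter-hand distance (hypothetical metric).

    Each argument is a sequence of (x, y, z) end-effector positions, one per
    timestep. The executed inter-hand distance is compared with the
    reference inter-hand distance at the same timestep.
    """
    n = len(exec_left)
    if n == 0 or not (n == len(exec_right) == len(ref_left) == len(ref_right)):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = 0.0
    for el, er, rl, rr in zip(exec_left, exec_right, ref_left, ref_right):
        total += abs(math.dist(el, er) - math.dist(rl, rr))
    return total / n
```

Reporting this value per task, before and after fine-tuning, would show whether retargeted actions already keep the hands coordinated or whether fine-tuning is doing that work.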
minor comments (2)
  1. [Abstract] The abstract states 'significant improvements' and 'ablations' but supplies no concrete metrics, effect sizes, or error bars. Including these would allow readers to assess result strength immediately.
  2. [Benchmark] The benchmark description would benefit from additional detail on task selection criteria, success metrics, and how the small set of robot demonstrations was collected to support reproducibility.
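
The error bars requested in minor comment 1 could be reported as a confidence interval on each success rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions with few trials; it is a suggestion made here, not a procedure from the paper, and the function name is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(
        p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return center - half, center + half
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains informative when a task has 0% or 100% observed success.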

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our data-efficiency claims and methodological clarity. We address each major comment below and indicate where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [Experiments / Results] The central data-efficiency claim depends on IK retargeting of predicted human wrist/hand actions producing usable robot actions with only minor loss, so that fine-tuning mainly adapts rather than compensates for embodiment mismatch. No ablation or quantitative metrics are reported for retargeted-only performance (prior to fine-tuning) or for bimanual coordination errors on the benchmark. This is load-bearing for attributing gains to human pretraining rather than the robot demonstrations.

    Authors: We agree that reporting performance metrics for the retargeted actions prior to fine-tuning, as well as explicit measures of bimanual coordination errors, would provide stronger evidence for the contribution of human pretraining. In the revised manuscript, we will add a new ablation table showing success rates using only the retargeted outputs (without robot fine-tuning) across the benchmark tasks. We will also include quantitative metrics for bimanual coordination, such as average end-effector distance errors between the two arms during coordinated actions and per-task breakdowns of coordination failures. These additions will help isolate the effect of the human video pretraining from the adaptation provided by the few robot demonstrations.
    revision: yes

  2. Referee: [Method] The method description provides only a high-level account of the retargeting step. It does not specify whether retargeting is task-agnostic, how differences in arm reach, hand DOF, and grasp kinematics are resolved, or whether per-task calibration is required. These details are necessary to evaluate whether the approach truly supports the 'few robot demonstrations' regime.

    Authors: We acknowledge that the retargeting procedure was described at a high level in the current manuscript. In the revision, we will expand Section 3 (Method) with a dedicated subsection on action retargeting. This will specify that the process is task-agnostic and uses a fixed inverse kinematics solver combined with a general hand-pose mapping. Differences in arm reach are handled via proportional scaling of joint targets, hand DOF mismatches are resolved through a predefined joint correspondence table, and grasp kinematics are mapped from human finger poses to robot gripper commands using a constant offset without requiring per-task calibration. These details will clarify how the pipeline enables effective transfer with minimal robot data.
    revision: yes
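
The three mechanisms named in the simulated response to comment 2 (proportional scaling, a joint correspondence table, and a constant offset) can be sketched as follows. This illustrates the simulated rebuttal's wording only and is not confirmed by the paper. Every joint name, the table contents, the joint limits, and the choice to apply the scaling to the wrist position target are assumptions made for this example.

```python
# Hypothetical table: robot hand joint -> human finger joint that drives it.
JOINT_TABLE = {
    "robot_thumb": "human_thumb_mcp",
    "robot_index": "human_index_mcp",
    "robot_middle": "human_middle_mcp",
}


def scale_wrist_target(wrist_xyz, human_reach, robot_reach):
    """Proportionally scale a shoulder-relative wrist position to robot reach."""
    s = robot_reach / human_reach
    return tuple(s * c for c in wrist_xyz)


def map_hand_joints(human_angles, table=JOINT_TABLE, offset=0.0,
                    limits=(0.0, 1.57)):
    """Map human finger angles (radians) to robot joints via a fixed table.

    A constant offset is added and the result is clamped to assumed joint
    limits; nothing here is calibrated per task.
    """
    lo, hi = limits
    commands = {}
    for robot_joint, human_joint in table.items():
        angle = human_angles[human_joint] + offset
        commands[robot_joint] = min(max(angle, lo), hi)
    return commands
```

Because the table and offset are fixed, the same mapping applies to every task, which is the property the referee asked the authors to document.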

Circularity Check

0 steps flagged

No significant circularity; pipeline uses independent human video data, IK conversion, and external benchmark

full rationale

The derivation begins with external egocentric human videos as input, trains a VLA to predict human wrist/hand actions, applies separate IK and retargeting steps, fine-tunes on distinct robot demonstrations, and evaluates on a newly proposed simulation benchmark with ablations. No step reduces by construction to a fitted parameter or self-citation that defines the claimed performance gains; the central result is an empirical comparison against baselines rather than a definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that human-to-robot action retargeting preserves task-relevant information and on the availability of large-scale egocentric human video datasets.

axioms (1)
  • domain assumption: Human wrist and hand actions predicted from video can be accurately mapped to robot actions via inverse kinematics and retargeting without substantial task-specific loss.
    This premise is required for the conversion step that turns human-video predictions into robot-executable actions.

pith-pipeline@v0.9.0 · 5774 in / 1330 out tokens · 41995 ms · 2026-05-21T04:28:40.754492+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dexora: Open-source VLA for High-DoF Bimanual Dexterity

    cs.RO 2026-05 unverdicted novelty 7.0

    Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7%...

  2. StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

    cs.CV 2026-05 unverdicted novelty 7.0

    StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA res...

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  5. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

    cs.RO 2026-02 unverdicted novelty 7.0

    DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

  6. EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

    cs.CV 2026-05 unverdicted novelty 6.0

    EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.

  7. SCAR: Self-Supervised Continuous Action Representation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.

  8. OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    cs.CV 2026-05 unverdicted novelty 6.0

    OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

  9. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  10. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  11. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  12. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  13. ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

    cs.RO 2026-04 unverdicted novelty 6.0

    ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

  14. HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.

  15. HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...

  16. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  17. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  18. Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

    cs.RO 2026-05 unverdicted novelty 5.0

    A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.

  19. LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

    cs.RO 2026-04 unverdicted novelty 5.0

    LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-dist...

  20. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  21. EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks

    cs.RO 2026-04 unverdicted novelty 4.0

    EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstra...

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 20 Pith papers · 5 internal anchors

  1. [1]

    Vuong, S

    Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, and A. S. et al. Open x-embodiment: Robotic learning datasets and RT-x models. In CoRL, 2023. 1, 3

  2. [2]

    Khazatsky, K

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

  3. [3]

    A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto. Open teach: A versatile teleoperation system for robotic manipulation. arXiv preprint arXiv:2403.07870, 2024. 1

  4. [4]

    S. Dass, W. Ai, Y . Jiang, S. Singh, J. Hu, R. Zhang, P. Stone, B. Abbatematteo, and R. Martin- Martin. Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869, 2024. 1

  5. [5]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. RSS, 2023. 1, 6, 7, 18

  6. [6]

    L. Zhao, T. Yang, Y . Yang, and P. Yu. A wearable upper limb exoskeleton for intuitive teleop- eration of anthropomorphic manipulators. Machines, 11(4):441, 2023. 1

  7. [7]

    Fang, H.-S

    H. Fang, H.-S. Fang, Y . Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu. Airexo: Low- cost exoskeletons for learning whole-arm manipulation in the wild. In2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 15031–15038. IEEE, 2024. 1

  8. [8]

    S. Yang, M. Liu, Y . Qin, D. Runyu, L. Jialong, X. Cheng, R. Yang, S. Yi, and X. Wang. Ace: A cross-platfrom visual-exoskeletons for low-cost dexterous teleoperation.arXiv preprint arXiv:240, 2024. 1

  9. [9]

    Naceri, D

    A. Naceri, D. Mazzanti, J. Bimbo, Y . T. Tefera, D. Prattichizzo, D. G. Caldwell, L. S. Mattos, and N. Deshpande. The vicarios virtual reality interface for remote robotic teleoperation: Teleporting for intuitive tele-manipulation. Journal of Intelligent & Robotic Systems , 101: 1–16, 2021. 1

  10. [10]

    Cheng, J

    X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback. arXiv preprint arXiv:2407.01512, 2024. 1, 5 9

  11. [11]

    R. Ding, Y . Qin, J. Zhu, C. Jia, S. Yang, R. Yang, X. Qi, and X. Wang. Bunny-visionpro: Real- time bimanual dexterous teleoperation for imitation learning. 2024. URL https://arxiv. org/abs/2407.03162. 1

  12. [12]

    J. Tian, L. Yang, R. Ji, Y . Ma, L. Xu, J. Yu, Y . Shi, and J. Wang. Gaze-guided hand-object interaction synthesis: Benchmark and method. arXiv preprint arXiv:2403.16169, 2024. 2

  13. [13]

    Ghosh, H

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In RSS, 2024. 2, 3

  14. [14]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In CoRL,

  15. [15]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint, 2024. 2, 3

  16. [16]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint, 2023. 2, 3

  17. [17]

    Isaac sim: Advanced simulation for robotics development, 2024

    NVIDIA. Isaac sim: Advanced simulation for robotics development, 2024. Accessed: Jun

  18. [18]

    Romero, D

    J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) , 36(6), Nov. 2017. 2, 4, 15, 18

  19. [19]

    Rodriguez, M

    A. Rodriguez, M. T. Mason, and S. Ferry. From caging to grasping. IJRR, 2012. 2

  20. [20]

    Rosales, R

    C. Rosales, R. Su ´arez, M. Gabiccini, and A. Bicchi. On the synthesis of feasible and prehensile robotic grasps. In ICRA, 2012. 2

  21. [21]

    Prattichizzo, M

    D. Prattichizzo, M. Malvezzi, M. Gabiccini, and A. Bicchi. On the manipulability ellipsoids of underactuated robotic hands with compliance. RAS, 2012. 2

  22. [22]

    Ponce, S

    J. Ponce, S. Sullivan, J.-D. Boissonnat, and J.-P. Merlet. On characterizing and computing three-and four-finger force-closure grasps of polyhedral objects. In ICRA, 1993. 2

  23. [23]

    Ponce, S

    J. Ponce, S. Sullivan, A. Sudsang, J.-D. Boissonnat, and J.-P. Merlet. On computing four-finger equilibrium and force-closure grasps of polyhedral objects. IJRR, 1997. 2

  24. [24]

    Zheng and C.-M

    Y . Zheng and C.-M. Chew. Distance between a point and a convex cone in n-dimensional space: Computation and applications. T-RO, 2009. 2

  25. [25]

    H. Dai, A. Majumdar, and R. Tedrake. Synthesis and optimization of force closure grasps via sequential semidefinite programming. ISRR, 2018. 2

  26. [26]

    O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. IJRR, 2020. 2

  27. [27]

    Nagabandi, K

    A. Nagabandi, K. Konolige, S. Levine, and V . Kumar. Deep dynamics models for learning dexterous manipulation. In CoRL, 2020. 2

  28. [28]

    Jiang, S

    H. Jiang, S. Liu, J. Wang, and X. Wang. Hand-object contact consistency reasoning for human grasps generation. In ICCV, 2021. 2 10

  29. [29]

    Corona, A

    E. Corona, A. Pumarola, G. Alenya, F. Moreno-Noguer, and G. Rogez. Ganhand: Predicting human grasp affordances in multi-object scenes. In CVPR, 2020. 2

  30. [30]

    L. Yang, X. Zhan, K. Li, W. Xu, J. Li, and C. Lu. Cpf: Learning a contact potential field to model the hand-object interaction. In ICCV, 2021. 2

  31. [31]

    L. Shao, F. Ferreira, M. Jorda, V . Nambiar, J. Luo, E. Solowjow, J. A. Ojea, O. Khatib, and J. Bohg. Unigrasp: Learning a unified model to grasp with multifingered robotic hands. RA-L,

  32. [32]

    A. Wu, M. Guo, and C. K. Liu. Learning diverse and physically feasible dexterous grasps with generative model and bilevel optimization. CoRL, 2022. 2

  33. [33]

    Brahmbhatt, A

    S. Brahmbhatt, A. Handa, J. Hays, and D. Fox. Contactgrasp: Functional multi-finger grasp synthesis from contact. In IROS, 2019. 2

  34. [34]

    Turpin, L

    D. Turpin, L. Wang, E. Heiden, Y .-C. Chen, M. Macklin, S. Tsogkas, S. Dickinson, and A. Garg. Grasp’d: Differentiable contact-rich grasp synthesis for multi-fingered hands. In ECCV, 2022. 2

  35. [35]

    R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy human policy,

  36. [36]

    URL https://arxiv.org/abs/2503.13441. 2

  37. [37]

    Kareer, D

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. URL https://arxiv. org/abs/2410.24221. 2

  38. [38]

    Achiam, S

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Al- tenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint, 2023. 2

  39. [39]

    J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han. Vila: On pre-training for visual language models. In CVPR, 2024. 2, 6, 7

  40. [40]

    S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint, 2024. 2

  41. [41]

    Pratt, I

    S. Pratt, I. Covert, R. Liu, and A. Farhadi. What does a platypus look like? generating cus- tomized prompts for zero-shot image classification. In ICCV, 2023. 2

  42. [42]

    Alaluf, E

    Y . Alaluf, E. Richardson, S. Tulyakov, K. Aberman, and D. Cohen-Or. Myvlm: Personalizing vlms for user-specific queries. In ECCV, 2025. 2

  43. [43]

    W. Kuo, Y . Cui, X. Gu, A. Piergiovanni, and A. Angelova. Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023. 2

  44. [44]

    Huang, S

    D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P. Molchanov, Z. Yu, and J. Kautz. Lita: Language instructed temporal-localization assistant. In ECCV, 2025. 2

  45. [45]

    T. Lv, Y . Huang, J. Chen, Y . Zhao, Y . Jia, L. Cui, S. Ma, Y . Chang, S. Huang, W. Wang, et al. Kosmos-2.5: A multimodal literate model. arXiv preprint, 2023. 2

  46. [46]

    J. Wen, Y . Zhu, J. Li, M. Zhu, K. Wu, Z. Xu, R. Cheng, C. Shen, Y . Peng, F. Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint, 2024. 3 11

  47. [47]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  48. [48]

    Mandlekar, Y

    A. Mandlekar, Y . Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imita- tion. In CoRL, 2018. 3

  49. [49]

    Mandlekar, S

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In CoRL, 2023. 3

  50. [50]

    Dasari, F

    S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. In CoRL, 2019. 3

  51. [51]

    Kalashnikov, A

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, 2018. 3

  52. [52]

    Damen, H

    D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. IJCV, 2022. 3

  53. [53]

    G. A. Sigurdsson, A. K. Gupta, C. Schmid, A. Farhadi, and A. Karteek. Actor and observer: Joint modeling of first and third-person videos. In CVPR, 2018. 3

  54. [54]

    Y . Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In CVPR, 2015. 3

  55. [55]

    Grauman, A

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR,

  56. [56]

    Grauman, A

    K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024. 3

  57. [57]

    Mahdisoltani, G

    F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. On the effectiveness of task granularity for transfer learning. arXiv preprint, 2018. 3

  58. [58]

    Damen, H

    D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018. 3

  59. [59]

    S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. 2023. 3

  60. [60]

    C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point trajectory modeling for policy learning, 2023. 3

  61. [61]

    S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601. 3

  62. [62]

    Majumdar, K

    A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelli- gence? In Thirty-seventh Conference on Neural Information Processing Systems , 2023. UR...

  63. [63]

    Karamcheti, S

    S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language- driven representation learning for robotics. In Robotics: Science and Systems (RSS) , 2023. 3

  64. [64]

    J. Yang, B. Liu, J. Fu, B. Pan, G. Wu, and L. Wang. Spatiotemporal predictive pre-training for robotic motor control, 2024. URL https://arxiv.org/abs/2403.05304.

  65. [65]

    J. Zeng, Q. Bu, B. Wang, W. Xia, L. Chen, H. Dong, H. Song, D. Wang, D. Hu, P. Luo, et al. Learning manipulation by predicting interaction. arXiv preprint arXiv:2406.00439, 2024.

  66. [66]

    S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos, 2024. URL https://arxiv.org/abs/2410.11758.

  67. [67]

    L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In NeurIPS, 2024.

  68. [68]

    Z. Liu, L. Zhu, B. Shi, Z. Zhang, Y. Lou, S. Yang, H. Xi, S. Cao, Y. Gu, D. Li, X. Li, Y. Fang, Y. Chen, C.-Y. Hsieh, D.-A. Huang, A.-C. Cheng, V. Nath, J. Hu, S. Liu, R. Krishna, D. Xu, X. Wang, P. Molchanov, J. Kautz, H. Yin, S. Han, and Y. Lu. Nvila: Efficient frontier visual language models, 2025. URL https://arxiv.org/abs/2412.04468.

  69. [69]

    Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks, 2020. URL https://arxiv.org/abs/1812.07035.

  70. [70]

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation, 2024. URL https://arxiv.org/abs/2405.05941.

  71. [71]

    M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar, A. Mandlekar, B. Babich, G. State, M. Hutter, and A. Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023. doi:10.1109/LRA.2023.3270034.

  72. [72]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310.

  73. [73]

    U. Robotics. H1, 2024. Accessed: Sep 2024.

  74. [74]

    I. Robots. The dexterous hands, 2024. Accessed: Jul 2024.

  75. [75]

    S. Liu, S. Tripathi, S. Majumdar, and X. Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  76. [76]

    X. Wang, T. Kwon, M. Rad, B. Pan, I. Chakraborty, S. Andrist, D. Bohus, A. Feniello, B. Tekin, F. V. Frujeri, N. Joshi, and M. Pollefeys. Holoassist: An egocentric human interaction dataset for interactive AI assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20270–20281, October 2023.

  77. [77]

    Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022.

  78. [78]

    Y. Liu, H. Yang, X. Si, L. Liu, Z. Li, Y. Zhang, Y. Liu, and L. Yi. Taco: Benchmarking generalizable bimanual tool-action-object understanding. arXiv preprint arXiv:2401.08399,

  79. [79]

    P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, F. Zhang, J. Fountain, E. Miller, S. Basol, R. Newcombe, R. Wang, J. J. Engel, and T. Hodan. Introducing hot3d: An egocentric dataset for 3d hand and object tracking. arXiv preprint arXiv:2406.09598, 2024.

  80. [80]

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without s...
