EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

An-Chieh Cheng; BoRui Li; Hongxu Yin; Qinxi Yu; Ri-Zhao Qiu; Ruihan Yang; Rui Yan; Sifei Liu; Song Han; Xiaolong Wang

arXiv: 2507.12440 · v3 · pith:APTM2RGTnew · submitted 2025-07-16 · cs.RO · cs.AI · cs.CV · cs.LG

Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang

Pith reviewed 2026-05-21 04:28 UTC · model grok-4.3

keywords: Vision-Language-Action models · Egocentric videos · Robot manipulation · Imitation learning · Bimanual tasks · Humanoid robots · Action retargeting · Data efficiency

The pith

A VLA trained on egocentric human videos predicts wrist and hand actions that retarget to robots and improve bimanual tasks after light fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Vision-Language-Action models can learn from large sets of egocentric human videos instead of scarce robot data. The model first learns to output human wrist and hand movements from video and language. These outputs are then converted to robot joint commands through inverse kinematics and retargeting, after which the system receives a small number of real robot demonstrations for fine-tuning. A sympathetic reader would care because this route could let robot policies draw on the scale and variety of everyday human activity without requiring proportional robot hardware time.

Core claim

A Vision-Language-Action model is trained on egocentric human videos to predict human wrist and hand actions. Those predictions are converted to robot actions by inverse kinematics and retargeting, and the model is then fine-tuned on a few robot demonstrations. The resulting policy, EgoVLA, shows significant gains over baselines across diverse bimanual manipulation tasks in the Ego Humanoid Manipulation Benchmark.

What carries the argument

Vision-Language-Action model that outputs predicted human wrist and hand poses from egocentric video, followed by inverse-kinematics retargeting to robot commands.
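The conversion from a predicted wrist pose to joint commands can be illustrated with a toy example. The sketch below is not the paper's solver: it uses closed-form inverse kinematics for a planar two-link arm, and it assumes the human wrist position is scaled by the ratio of robot reach to human reach before solving. The link lengths, the scaling rule, and the function names `two_link_ik`, `forward`, and `retarget_wrist` are all hypothetical.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Closed-form IK for a planar two-link arm (one of the two elbow branches).

    Returns (shoulder, elbow) angles in radians that place the arm tip at (x, y).
    """
    # Law of cosines gives the cosine of the elbow angle.
    c = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c) > 1.0 + 1e-9:
        raise ValueError("target is outside the reachable workspace")
    c = max(-1.0, min(1.0, c))  # clamp away rounding error
    elbow = math.acos(c)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow


def forward(shoulder, elbow, l1, l2):
    """Forward kinematics, used here only to check the IK solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


def retarget_wrist(human_xy, human_reach, l1, l2):
    """Scale a human wrist position into the robot workspace, then solve IK.

    Assumption: a single reach ratio is enough to bridge the embodiment gap.
    """
    scale = (l1 + l2) / human_reach
    return two_link_ik(human_xy[0] * scale, human_xy[1] * scale, l1, l2)


if __name__ == "__main__":
    shoulder, elbow = retarget_wrist((0.30, 0.20), human_reach=0.70, l1=0.30, l2=0.25)
    print(shoulder, elbow, forward(shoulder, elbow, 0.30, 0.25))
```

A real humanoid arm has more joints and needs a numerical solver, but the structure is the same: map the human target into the robot workspace, then solve for joint angles.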

If this is right

Human video pre-training supplies the scale and scene diversity that robot-only data collection cannot match.
Retargeting plus brief fine-tuning closes most of the human-to-robot gap for bimanual tasks.
Ablation results indicate that removing the human video stage reduces final task performance.
The same pipeline can be applied to additional bimanual tasks once the benchmark demonstrations are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to real-world robot deployment if human videos are recorded in matching environments.
It suggests a route to leverage existing large human video corpora without new robot data collection for every new task.
Direct prediction of robot actions from human footage might eventually remove the retargeting stage altogether.

Load-bearing premise

Human wrist and hand actions predicted by the model can be mapped to usable robot actions through inverse kinematics and retargeting with only small performance loss and without task-by-task recalibration.

What would settle it

The claim would be undermined if robot success rates on the benchmark tasks stay the same or drop when the human-video model is used for initialization instead of a robot-only baseline, under identical fine-tuning steps.
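One way to run this check, sketched under assumptions: treat each evaluation episode as a success or failure and compare the two initializations with a one-sided two-proportion z-test. The function name `two_proportion_z` and the episode counts in the usage lines are placeholders, not results from the paper.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Compare success rates of policy A and policy B.

    Returns (rate difference, z statistic, one-sided p-value for A > B).
    Uses the pooled-variance normal approximation, so it is only
    reasonable when both episode counts are not tiny.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # All episodes succeeded or all failed: no evidence either way.
        return p_a - p_b, 0.0, 0.5
    z = (p_a - p_b) / se
    # Upper tail of the standard normal, written with the error function.
    p_value = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))
    return p_a - p_b, z, p_value


if __name__ == "__main__":
    # Placeholder counts: A = human-video initialization, B = robot-only baseline.
    diff, z, p = two_proportion_z(success_a=45, n_a=50, success_b=25, n_b=50)
    print(diff, z, p)
```

A difference near zero, or a large p-value, would be the outcome that counts against the paper's claim.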

Original abstract

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EgoVLA, a Vision-Language-Action model trained on egocentric human videos to predict human wrist and hand actions. These actions are converted to robot actions via inverse kinematics and retargeting, followed by fine-tuning on a small number of robot demonstrations. The authors propose the Ego Humanoid Manipulation Benchmark with diverse bimanual tasks and claim significant improvements over baselines along with ablations showing the importance of human data.

Significance. If the empirical results hold with robust quantitative support, this work could meaningfully advance data-efficient robot learning by exploiting the scale and scene richness of human egocentric videos for VLA pretraining. The new simulation benchmark for bimanual humanoid manipulation is a constructive contribution to the field. The pipeline's reliance on human-to-robot action conversion is conceptually promising for reducing robot data needs, but its practical impact hinges on demonstrating that retargeting incurs only minor degradation.

major comments (2)

[Experiments / Results] The central data-efficiency claim depends on IK retargeting of predicted human wrist/hand actions producing usable robot actions with only minor loss, so that fine-tuning mainly adapts rather than compensates for embodiment mismatch. No ablation or quantitative metrics are reported for retargeted-only performance (prior to fine-tuning) or for bimanual coordination errors on the benchmark. This is load-bearing for attributing gains to human pretraining rather than the robot demonstrations.
[Method] The method description provides only a high-level account of the retargeting step. It does not specify whether retargeting is task-agnostic, how differences in arm reach, hand DOF, and grasp kinematics are resolved, or whether per-task calibration is required. These details are necessary to evaluate whether the approach truly supports the 'few robot demonstrations' regime.

minor comments (2)

[Abstract] The abstract states 'significant improvements' and 'ablations' but supplies no concrete metrics, effect sizes, or error bars. Including these would allow readers to assess result strength immediately.
[Benchmark] The benchmark description would benefit from additional detail on task selection criteria, success metrics, and how the small set of robot demonstrations was collected to support reproducibility.
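Error bars of the kind requested in the first minor comment are cheap to compute from success counts. As a sketch, the Wilson score interval below gives an approximate 95% confidence range for a success rate. The function name `wilson_interval` and the example counts are illustrative and not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z=1.96 corresponds to roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    # Clip to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Illustrative counts only: 8 successes in 10 evaluation episodes.
    print(wilson_interval(8, 10))
```

With only a handful of episodes per task the interval is wide, which is why reporting counts alongside percentages helps readers judge result strength.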

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our data-efficiency claims and methodological clarity. We address each major comment below and indicate where revisions will be made to strengthen the paper.

Referee: [Experiments / Results] The central data-efficiency claim depends on IK retargeting of predicted human wrist/hand actions producing usable robot actions with only minor loss, so that fine-tuning mainly adapts rather than compensates for embodiment mismatch. No ablation or quantitative metrics are reported for retargeted-only performance (prior to fine-tuning) or for bimanual coordination errors on the benchmark. This is load-bearing for attributing gains to human pretraining rather than the robot demonstrations.

Authors: We agree that reporting performance metrics for the retargeted actions prior to fine-tuning, as well as explicit measures of bimanual coordination errors, would provide stronger evidence for the contribution of human pretraining. In the revised manuscript, we will add a new ablation table showing success rates using only the retargeted outputs (without robot fine-tuning) across the benchmark tasks. We will also include quantitative metrics for bimanual coordination, such as average end-effector distance errors between the two arms during coordinated actions and per-task breakdowns of coordination failures. These additions will help isolate the effect of the human video pretraining from the adaptation provided by the few robot demonstrations.

Revision: yes

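The coordination metric proposed in this response is not defined precisely, so the sketch below picks one plausible reading: the mean absolute difference between the inter-hand distance in a policy rollout and in a reference demonstration, averaged over timesteps. The definition, the function name `coordination_error`, and the data layout are assumptions.

```python
import math


def coordination_error(rollout, reference):
    """Mean absolute inter-hand distance error over a trajectory.

    Both arguments are equal-length lists of (left_xyz, right_xyz) pairs,
    one pair per timestep, with positions in the same units (e.g. metres).
    """
    if not rollout or len(rollout) != len(reference):
        raise ValueError("trajectories must be non-empty and equal in length")
    total = 0.0
    for (left, right), (left_ref, right_ref) in zip(rollout, reference):
        # Compare how far apart the hands are, not where each hand is.
        total += abs(math.dist(left, right) - math.dist(left_ref, right_ref))
    return total / len(rollout)


if __name__ == "__main__":
    reference = [((0.0, 0.0, 0.0), (0.4, 0.0, 0.0)),
                 ((0.0, 0.0, 0.1), (0.4, 0.0, 0.1))]
    rollout = [((0.0, 0.0, 0.0), (0.5, 0.0, 0.0)),
               ((0.0, 0.0, 0.1), (0.4, 0.0, 0.1))]
    print(coordination_error(rollout, reference))
```

This version ignores timing mismatches between the rollout and the reference; aligning the two trajectories first would be a natural refinement.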
Referee: [Method] The method description provides only a high-level account of the retargeting step. It does not specify whether retargeting is task-agnostic, how differences in arm reach, hand DOF, and grasp kinematics are resolved, or whether per-task calibration is required. These details are necessary to evaluate whether the approach truly supports the 'few robot demonstrations' regime.

Authors: We acknowledge that the retargeting procedure was described at a high level in the current manuscript. In the revision, we will expand Section 3 (Method) with a dedicated subsection on action retargeting. This will specify that the process is task-agnostic and uses a fixed inverse kinematics solver combined with a general hand-pose mapping. Differences in arm reach are handled via proportional scaling of joint targets, hand DOF mismatches are resolved through a predefined joint correspondence table, and grasp kinematics are mapped from human finger poses to robot gripper commands using a constant offset without requiring per-task calibration. These details will clarify how the pipeline enables effective transfer with minimal robot data.

Revision: yes
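The mapping described in this simulated response can be sketched as follows. Everything concrete here is hypothetical: the joint names in `JOINT_MAP`, the 0.1 rad offset, the joint limits, and the choice to apply the reach ratio to the wrist position. The paper itself does not confirm any of these values.

```python
# Hypothetical correspondence table: robot hand joint -> human joint it follows.
JOINT_MAP = {
    "thumb_bend": "thumb_mcp",
    "index_bend": "index_pip",
    "middle_bend": "middle_pip",
}
GRASP_OFFSET = 0.1          # rad, assumed constant offset
JOINT_LIMITS = (0.0, 1.6)   # rad, assumed robot joint range


def retarget_hand(human_angles, joint_map=JOINT_MAP,
                  offset=GRASP_OFFSET, limits=JOINT_LIMITS):
    """Map human finger angles to robot hand joints via a fixed table."""
    low, high = limits
    robot = {}
    for robot_joint, human_joint in joint_map.items():
        angle = human_angles[human_joint] + offset
        robot[robot_joint] = min(high, max(low, angle))  # respect joint limits
    return robot


def scale_wrist(wrist_xyz, human_arm_length, robot_arm_length):
    """Proportionally scale a wrist position to the robot's reach."""
    ratio = robot_arm_length / human_arm_length
    return tuple(ratio * c for c in wrist_xyz)


if __name__ == "__main__":
    human = {"thumb_mcp": 0.5, "index_pip": 1.55, "middle_pip": 0.0}
    print(retarget_hand(human))
    print(scale_wrist((0.4, 0.0, 0.2), human_arm_length=0.7, robot_arm_length=0.35))
```

Because the table, offset, and ratio are fixed, nothing in this sketch depends on the task, which is the property the response claims for the real pipeline.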

Circularity Check

0 steps flagged

No significant circularity; the pipeline uses independent human video data, a separate IK conversion step, and a newly proposed simulation benchmark.

The derivation begins with external egocentric human videos as input, trains a VLA to predict human wrist/hand actions, applies separate IK and retargeting steps, fine-tunes on distinct robot demonstrations, and evaluates on a newly proposed simulation benchmark with ablations. No step reduces by construction to a fitted parameter or self-citation that defines the claimed performance gains; the central result is an empirical comparison against baselines rather than a definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that human-to-robot action retargeting preserves task-relevant information and on the availability of large-scale egocentric human video datasets.

axioms (1)

Domain assumption: Human wrist and hand actions predicted from video can be accurately mapped to robot actions via inverse kinematics and retargeting without substantial task-specific loss.
This premise is required for the conversion step that turns human-video predictions into robot-executable actions.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dexora: Open-source VLA for High-DoF Bimanual Dexterity
cs.RO 2026-05 unverdicted novelty 7.0

Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7%...

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
cs.CV 2026-05 unverdicted novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA res...

Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
cs.RO 2026-02 unverdicted novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices
cs.CV 2026-05 unverdicted novelty 6.0

EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.

SCAR: Self-Supervised Continuous Action Representation Learning
cs.RO 2026-05 unverdicted novelty 6.0

SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
cs.CV 2026-05 unverdicted novelty 6.0

OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
cs.RO 2026-04 unverdicted novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO 2026-04 unverdicted novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
cs.RO 2026-04 unverdicted novelty 6.0

ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

HEX is a new framework with humanoid-aligned state representation, mixture-of-experts proprioceptive predictor, history tokens, and residual-gated fusion that achieves state-of-the-art success and generalization on re...

EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
cs.RO 2026-04 unverdicted novelty 6.0

EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
cs.RO 2026-04 unverdicted novelty 6.0

GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum
cs.RO 2026-05 unverdicted novelty 5.0

A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.

LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment
cs.RO 2026-04 unverdicted novelty 5.0

LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-dist...

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
cs.RO 2026-04 unverdicted novelty 4.0

EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstra...

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 20 Pith papers · 5 internal anchors

[1]

Vuong, S

Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, and A. S. et al. Open x-embodiment: Robotic learning datasets and RT-x models. In CoRL, 2023. 1, 3

work page 2023
[2]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

work page 2024
[3]

A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto. Open teach: A versatile teleoperation system for robotic manipulation. arXiv preprint arXiv:2403.07870, 2024. 1

work page arXiv 2024
[4]

S. Dass, W. Ai, Y . Jiang, S. Singh, J. Hu, R. Zhang, P. Stone, B. Abbatematteo, and R. Martin- Martin. Telemoma: A modular and versatile teleoperation system for mobile manipulation. arXiv preprint arXiv:2403.07869, 2024. 1

work page arXiv 2024
[5]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. RSS, 2023. 1, 6, 7, 18

work page 2023
[6]

L. Zhao, T. Yang, Y . Yang, and P. Yu. A wearable upper limb exoskeleton for intuitive teleop- eration of anthropomorphic manipulators. Machines, 11(4):441, 2023. 1

work page 2023
[7]

Fang, H.-S

H. Fang, H.-S. Fang, Y . Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu. Airexo: Low- cost exoskeletons for learning whole-arm manipulation in the wild. In2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 15031–15038. IEEE, 2024. 1

work page 2024
[8]

S. Yang, M. Liu, Y . Qin, D. Runyu, L. Jialong, X. Cheng, R. Yang, S. Yi, and X. Wang. Ace: A cross-platfrom visual-exoskeletons for low-cost dexterous teleoperation.arXiv preprint arXiv:240, 2024. 1

work page 2024
[9]

Naceri, D

A. Naceri, D. Mazzanti, J. Bimbo, Y . T. Tefera, D. Prattichizzo, D. G. Caldwell, L. S. Mattos, and N. Deshpande. The vicarios virtual reality interface for remote robotic teleoperation: Teleporting for intuitive tele-manipulation. Journal of Intelligent & Robotic Systems , 101: 1–16, 2021. 1

work page 2021
[10]

Cheng, J

X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback. arXiv preprint arXiv:2407.01512, 2024. 1, 5 9

work page arXiv 2024
[11]

R. Ding, Y . Qin, J. Zhu, C. Jia, S. Yang, R. Yang, X. Qi, and X. Wang. Bunny-visionpro: Real- time bimanual dexterous teleoperation for imitation learning. 2024. URL https://arxiv. org/abs/2407.03162. 1

work page arXiv 2024
[12]

J. Tian, L. Yang, R. Ji, Y . Ma, L. Xu, J. Yu, Y . Shi, and J. Wang. Gaze-guided hand-object interaction synthesis: Benchmark and method. arXiv preprint arXiv:2403.16169, 2024. 2

work page arXiv 2024
[13]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In RSS, 2024. 2, 3

work page 2024
[14]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In CoRL,

work page
[15]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint, 2024. 2, 3

work page 2024
[16]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint, 2023. 2, 3

work page 2023
[17]

Isaac sim: Advanced simulation for robotics development, 2024

NVIDIA. Isaac sim: Advanced simulation for robotics development, 2024. Accessed: Jun

work page 2024
[18]

Romero, D

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) , 36(6), Nov. 2017. 2, 4, 15, 18

work page 2017
[19]

Rodriguez, M

A. Rodriguez, M. T. Mason, and S. Ferry. From caging to grasping. IJRR, 2012. 2

work page 2012
[20]

Rosales, R

C. Rosales, R. Su ´arez, M. Gabiccini, and A. Bicchi. On the synthesis of feasible and prehensile robotic grasps. In ICRA, 2012. 2

work page 2012
[21]

Prattichizzo, M

D. Prattichizzo, M. Malvezzi, M. Gabiccini, and A. Bicchi. On the manipulability ellipsoids of underactuated robotic hands with compliance. RAS, 2012. 2

work page 2012
[22]

Ponce, S

J. Ponce, S. Sullivan, J.-D. Boissonnat, and J.-P. Merlet. On characterizing and computing three-and four-finger force-closure grasps of polyhedral objects. In ICRA, 1993. 2

work page 1993
[23]

Ponce, S

J. Ponce, S. Sullivan, A. Sudsang, J.-D. Boissonnat, and J.-P. Merlet. On computing four-finger equilibrium and force-closure grasps of polyhedral objects. IJRR, 1997. 2

work page 1997
[24]

Zheng and C.-M

Y . Zheng and C.-M. Chew. Distance between a point and a convex cone in n-dimensional space: Computation and applications. T-RO, 2009. 2

work page 2009
[25]

H. Dai, A. Majumdar, and R. Tedrake. Synthesis and optimization of force closure grasps via sequential semidefinite programming. ISRR, 2018. 2

work page 2018
[26]

O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. IJRR, 2020. 2

work page 2020
[27]

Nagabandi, K

A. Nagabandi, K. Konolige, S. Levine, and V . Kumar. Deep dynamics models for learning dexterous manipulation. In CoRL, 2020. 2

work page 2020
[28]

Jiang, S

H. Jiang, S. Liu, J. Wang, and X. Wang. Hand-object contact consistency reasoning for human grasps generation. In ICCV, 2021. 2 10

work page 2021
[29]

Corona, A

E. Corona, A. Pumarola, G. Alenya, F. Moreno-Noguer, and G. Rogez. Ganhand: Predicting human grasp affordances in multi-object scenes. In CVPR, 2020. 2

work page 2020
[30]

L. Yang, X. Zhan, K. Li, W. Xu, J. Li, and C. Lu. Cpf: Learning a contact potential field to model the hand-object interaction. In ICCV, 2021. 2

work page 2021
[31]

L. Shao, F. Ferreira, M. Jorda, V . Nambiar, J. Luo, E. Solowjow, J. A. Ojea, O. Khatib, and J. Bohg. Unigrasp: Learning a unified model to grasp with multifingered robotic hands. RA-L,

work page
[32]

A. Wu, M. Guo, and C. K. Liu. Learning diverse and physically feasible dexterous grasps with generative model and bilevel optimization. CoRL, 2022. 2

work page 2022
[33]

Brahmbhatt, A

S. Brahmbhatt, A. Handa, J. Hays, and D. Fox. Contactgrasp: Functional multi-finger grasp synthesis from contact. In IROS, 2019. 2

work page 2019
[34]

Turpin, L

D. Turpin, L. Wang, E. Heiden, Y .-C. Chen, M. Macklin, S. Tsogkas, S. Dickinson, and A. Garg. Grasp’d: Differentiable contact-rich grasp synthesis for multi-fingered hands. In ECCV, 2022. 2

work page 2022
[35]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy human policy,

work page
[36]

URL https://arxiv.org/abs/2503.13441. 2

work page arXiv
[37]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. URL https://arxiv. org/abs/2410.24221. 2

work page arXiv 2024
[38]

Achiam, S

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Al- tenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint, 2023. 2

work page 2023
[39]

J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han. Vila: On pre-training for visual language models. In CVPR, 2024. 2, 6, 7

work page 2024
[40]

S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint, 2024. 2

work page 2024
[41]

Pratt, I

S. Pratt, I. Covert, R. Liu, and A. Farhadi. What does a platypus look like? generating cus- tomized prompts for zero-shot image classification. In ICCV, 2023. 2

work page 2023
[42]

Alaluf, E

Y . Alaluf, E. Richardson, S. Tulyakov, K. Aberman, and D. Cohen-Or. Myvlm: Personalizing vlms for user-specific queries. In ECCV, 2025. 2

work page 2025
[43]

W. Kuo, Y . Cui, X. Gu, A. Piergiovanni, and A. Angelova. Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023. 2

work page 2023
[44]

Huang, S

D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P. Molchanov, Z. Yu, and J. Kautz. Lita: Language instructed temporal-localization assistant. In ECCV, 2025. 2

work page 2025
[45]

T. Lv, Y . Huang, J. Chen, Y . Zhao, Y . Jia, L. Cui, S. Ma, Y . Chang, S. Huang, W. Wang, et al. Kosmos-2.5: A multimodal literate model. arXiv preprint, 2023. 2

work page 2023
[46]

J. Wen, Y . Zhu, J. Li, M. Zhu, K. Wu, Z. Xu, R. Cheng, C. Shen, Y . Peng, F. Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint, 2024. 3 11

work page 2024
[47]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Mandlekar, Y

A. Mandlekar, Y . Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imita- tion. In CoRL, 2018. 3

work page 2018
[49]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In CoRL, 2023. 3

work page 2023
[50]

Dasari, F

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. In CoRL, 2019. 3

work page 2019
[51]

Kalashnikov, A

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, 2018. 3

work page 2018
[52]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. IJCV, 2022. 3

work page 2022
[53]

G. A. Sigurdsson, A. K. Gupta, C. Schmid, A. Farhadi, and A. Karteek. Actor and observer: Joint modeling of first and third-person videos. In CVPR, 2018. 3

work page 2018
[54]

Y . Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In CVPR, 2015. 3

work page 2015
[55]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR,

work page
[56]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In CVPR, 2024. 3

work page 2024
[57]

Mahdisoltani, G

F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. On the effectiveness of task granularity for transfer learning. arXiv preprint, 2018. 3

work page 2018
[58]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018. 3

work page 2018
[59]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. 2023. 3

work page 2023
[60]

C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point trajectory modeling for policy learning, 2023. 3

work page 2023
[61]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[62]

Majumdar, K

A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelli- gence? In Thirty-seventh Conference on Neural Information Processing Systems , 2023. UR...

[63]

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems (RSS), 2023. 3

[64]

J. Yang, B. Liu, J. Fu, B. Pan, G. Wu, and L. Wang. Spatiotemporal predictive pre-training for robotic motor control, 2024. URL https://arxiv.org/abs/2403.05304. 3

[65]

J. Zeng, Q. Bu, B. Wang, W. Xia, L. Chen, H. Dong, H. Song, D. Wang, D. Hu, P. Luo, et al. Learning manipulation by predicting interaction. arXiv preprint arXiv:2406.00439, 2024. 3

[66]

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos, 2024. URL https://arxiv.org/abs/2410.11758. 3

[67]

L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In NeurIPS, 2024. 3

[68]

Z. Liu, L. Zhu, B. Shi, Z. Zhang, Y. Lou, S. Yang, H. Xi, S. Cao, Y. Gu, D. Li, X. Li, Y. Fang, Y. Chen, C.-Y. Hsieh, D.-A. Huang, A.-C. Cheng, V. Nath, J. Hu, S. Liu, R. Krishna, D. Xu, X. Wang, P. Molchanov, J. Kautz, H. Yin, S. Han, and Y. Lu. Nvila: Efficient frontier visual language models, 2025. URL https://arxiv.org/abs/2412.04468. 3

[69]

Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks, 2020. URL https://arxiv.org/abs/1812.07035. 4, 17

[70]

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation, 2024. URL https://arxiv.org/abs/2405.05941. 5

[71]

M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar, A. Mandlekar, B. Babich, G. State, M. Hutter, and A. Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023. doi:10.1109/LRA.2023.3270034. 5

[72]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310. 5

[73]

U. Robotics. H1, 2024. Accessed: Sep 2024. 5

[74]

I. Robots. The dexterous hands, 2024. Accessed: Jul 2024. 5

[75]

S. Liu, S. Tripathi, S. Majumdar, and X. Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 6

[76]

X. Wang, T. Kwon, M. Rad, B. Pan, I. Chakraborty, S. Andrist, D. Bohus, A. Feniello, B. Tekin, F. V. Frujeri, N. Joshi, and M. Pollefeys. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20270–20281, October 2023. 15, 16

[77]

Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022. 15, 16, 18, 19, 22

[78]

Y. Liu, H. Yang, X. Si, L. Liu, Z. Li, Y. Zhang, Y. Liu, and L. Yi. Taco: Benchmarking generalizable bimanual tool-action-object understanding. arXiv preprint arXiv:2401.08399,

[79]

P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, F. Zhang, J. Fountain, E. Miller, S. Basol, R. Newcombe, R. Wang, J. J. Engel, and T. Hodan. Introducing hot3d: An egocentric dataset for 3d hand and object tracking. arXiv preprint arXiv:2406.09598, 2024. 15, 18

[80]

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without s...

