LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

Guanya Shi; Harsh Gupta; Wenzhen Yuan

arxiv: 2606.11628 · v1 · pith:67XJ5Y5Onew · submitted 2026-06-10 · 💻 cs.RO · cs.AI


Pith reviewed 2026-06-27 10:00 UTC · model grok-4.3


keywords dexterous manipulation · human video · embodiment-agnostic · intent model · zero-shot transfer · simulation training · robot learning


The pith

An intent model trained on unstructured human videos transfers across robot embodiments for zero-shot real-world tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LUCID, a two-stage method that first learns a task intent model from unstructured human videos drawn from internet-scale sources, then trains embodiment-specific sensorimotor policies in simulation. The intent model runs in closed loop, predicting short-horizon changes in the observed scene from the current view, while the policy converts those predictions into actions. Because the intent interface is shared, the same video-trained model works for a dexterous hand or a parallel-jaw gripper. The paper reports real-robot results on stirring, wiping, binning, push-T, and cable routing, using only video supervision and no robot demonstrations; the first three tasks are supervised by internet video alone and are tested zero-shot on novel scenes and object instances.

Core claim

LUCID shows that an embodiment-agnostic intent model learned from unstructured human videos can be paired with simulation-trained, embodiment-specific policies to produce stable robot actions, enabling the same intent model to drive both dexterous hands and parallel-jaw grippers on real manipulation tasks with zero-shot transfer to novel scenes and object instances.

What carries the argument

The shared short-horizon intent prediction interface that decouples video-based intent from embodiment-specific control policies trained in simulation.
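The decoupling described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the `Intent` layout, the class names, the stand-in models, and the action sizes (22 and 7) are all hypothetical. The point it shows is that one intent model is queried at every step while only the policy changes per embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence

Vec = List[float]


@dataclass
class Intent:
    """Short-horizon intent shared by every embodiment (hypothetical layout)."""
    object_flow: List[List[Vec]]  # [horizon][num_points] -> predicted xyz of each query point
    palm_poses: List[Vec]         # [horizon] -> reference palm pose (xyz + quaternion)


class IntentModel(Protocol):
    def predict(self, observation: Sequence[float]) -> Intent: ...


class SensorimotorPolicy(Protocol):
    action_dim: int

    def act(self, intent: Intent, proprioception: Sequence[float]) -> Vec: ...


class ConstantIntentModel:
    """Stand-in for the video-trained model: zero flow and a fixed palm pose."""

    def __init__(self, horizon: int = 4, num_points: int = 8) -> None:
        self.horizon, self.num_points = horizon, num_points

    def predict(self, observation: Sequence[float]) -> Intent:
        flow = [[[0.0, 0.0, 0.0] for _ in range(self.num_points)]
                for _ in range(self.horizon)]
        poses = [[0.0, 0.0, 0.3, 1.0, 0.0, 0.0, 0.0] for _ in range(self.horizon)]
        return Intent(flow, poses)


class ZeroPolicy:
    """Stand-in for a simulation-trained policy; only the action size differs."""

    def __init__(self, action_dim: int) -> None:
        self.action_dim = action_dim

    def act(self, intent: Intent, proprioception: Sequence[float]) -> Vec:
        return [0.0] * self.action_dim


def run_closed_loop(
    intent_model: IntentModel,
    policy: SensorimotorPolicy,
    read_observation: Callable[[], Sequence[float]],
    read_proprioception: Callable[[], Sequence[float]],
    send_action: Callable[[Vec], None],
    num_steps: int,
) -> None:
    """Re-query the intent model at every step so the intent tracks the live scene."""
    for _ in range(num_steps):
        intent = intent_model.predict(read_observation())
        send_action(policy.act(intent, read_proprioception()))


if __name__ == "__main__":
    shared_intent = ConstantIntentModel()  # one intent model, reused unchanged
    embodiments = [("dexterous hand", ZeroPolicy(22)), ("parallel-jaw gripper", ZeroPolicy(7))]
    for name, policy in embodiments:
        sent: List[Vec] = []
        run_closed_loop(shared_intent, policy, lambda: [0.0], lambda: [0.0], sent.append, 3)
        print(name, "received", len(sent), "actions of size", len(sent[0]))
```

Swapping the embodiment here means swapping only the object passed as `policy`; nothing about `shared_intent` changes, which is the property the argument rests on.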

If this is right

The identical intent model can be reused on both a dexterous hand and a parallel-jaw gripper without retraining.
Tasks including stirring, wiping, and binning can be acquired from internet video alone.
Push-T and cable routing can be acquired from one hour of smartphone video each.
Zero-shot transfer to novel scenes and object instances is claimed for the three internet-video tasks (stirring, wiping, binning).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Collecting robot-specific demonstrations could be reduced or eliminated for new hardware platforms that reuse the same intent model.
Larger public video datasets could be substituted for the current sources to test further gains in generalization.
The simulation-trained policies might be extended to additional robot morphologies if the intent predictions remain consistent.

Load-bearing premise

Short-horizon intent extracted from human video observations can be converted into stable robot actions by an embodiment-specific sensorimotor policy trained entirely in simulation, without embodiment-specific real-world data or fine-tuning.

What would settle it

If the dexterous hand or gripper fails to complete the real-world tasks when guided by the video-trained intent model but succeeds when guided by policies trained on robot demonstrations, the transfer claim would be falsified.
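The comparison above turns on success rates measured over a limited number of real-robot trials. A minimal sketch of one way to compare them follows, assuming Wilson score intervals (a choice made here for illustration, not something the paper specifies). The trial counts are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical counts: successes out of trials for two ways of driving the robot.
    video_intent = wilson_interval(successes=8, trials=10)
    robot_demos = wilson_interval(successes=3, trials=10)
    print("video-trained intent:", video_intent)
    print("robot-demo policy:   ", robot_demos)
    print("intervals overlap:   ", intervals_overlap(video_intent, robot_demos))
```

With small trial counts the intervals are wide, so a gap between two methods only counts as evidence for or against the transfer claim when the intervals separate clearly.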

Figures

Figures reproduced from arXiv: 2606.11628 by Guanya Shi, Harsh Gupta, Wenzhen Yuan.

**Figure 1.** LUCID. We learn a manipulation intent model from human video (left) and a robot controller policy from simulation (right), and pair them in real-world deployment on a dexterous hand and a parallel-jaw gripper (center).

**Figure 2.** Intent model. From the recent observation history, current query points on the object, and the current palm pose, the intent model predicts short-horizon object flow and a reference palm-pose trajectory.
Adjacent paper text (truncated in source): Architecture. The intent model f_θ adapts CoTracker3 [55] as a point-token transformer for short-horizon prediction. We make three changes. (1) We condition the transformer on frozen DINOv3 [56] patch tok…

**Figure 3.** Sensorimotor policy training. The teacher π_T is first trained with PPO on a privileged sampling of the object-flow component of R (drawn from the full object surface), the palm-pose reference, and proprioception. The student π_S is then distilled from π_T with a hybrid PPO + distillation objective, replacing the privileged sampling with the external camera-visible subset of the object flow plus a wrist-m…

**Figure 4.** Real-world tasks we evaluated. (A) Three web-scraped tasks (stirring, wiping, binning), each evaluated under three scenarios. The third wiping panel shows the model depositing the used tissue in a bin without explicit binning supervision. (B) Two self-collected tasks (push-T, cable routing), extended to a parallel-jaw gripper setup.
Adjacent paper text (truncated in source): …(for π_S) depth patches mix locally. The cross-attention output passes th…

**Figure 5.** Real-world success rates. Per-task success across five real-world tasks, evaluated against task-appropriate baselines. (A) LUCID (dex hand) versus an open-loop video-generation planner (dex hand) [73] on web-scale tasks. (B) LUCID (dex hand) versus LUCID (parallel-jaw) on self-collected tasks. Failure-mode breakdowns appear in App. C.2.
Adjacent paper text (truncated in source): 4 Experimental Results. We investigate four questions about LUCID: (Q1,…

**Figure 6.** Intent data scaling. Sweeping intent-model training data from 1k to 20k human-video clips on the binning task, real-world success rises and held-out intent loss falls.
Adjacent paper text (truncated in source): To test intent transfer across embodiments, we evaluate (1) push-T [74]: the robot pushes a T-shaped block to a target pose, and (2) cable routing: the robot threads a cable through two fixtures (Fig. 4B). For each task, the intent model is…

**Figure 7.** Sensorimotor policy ablations. Episode reward against environment steps for the teacher training (A) and the student distillation (B). (A): Ours versus an MLP encoder concatenating all inputs and per-joint actions without the eigen-grasp basis. (B): Ours versus DAgger-BC distillation and no wrist camera.
Adjacent paper text (truncated in source): …configuration and fine contact. Even when intent is predicted accurately, it omits cues the policy woul…

**Figure 8.** Supervision extraction pipeline. Each video window is processed by four stages: (a) ViPE [61] for camera intrinsics, extrinsics, and metric depth; (b) SAM 3.1 [62] for object and human masks; (c) DenseTrack3Dv2 [63] for 3D object-flow tracks; and (d) WiLoR [64] with a rigid fit (Eq. 1) for the palm pose. See App. A.1.2 for full details.

**Figure 9.** Procedural shape pool. 32 random samples from the ∼1k-shape pool used to train π, drawn to scale beside the LEAP hand. Generation details in App. B.1.

**Figure 10.** Wrist-camera depth. RGB and three depth streams from the wrist-mounted Gemini 305: the manufacturer's onboard depth, Fast-FoundationStereo [79] on the IR pair (5 ms inference), and the Isaac Lab simulation depth used at training. LUCID deploys with the FFS stream.

**Figure 11.** Open-loop video-generation planner. Veo 3.1 generates a human video plan from the initial scene; object flow and palm pose are extracted and executed by the sensorimotor policy. The plan is fixed, so execution can diverge.
Adjacent paper text (truncated in source): …the wrong surface. Without a valid mask or 3D track, the intent model has no query to predict from and the policy stalls. • Incorrect behavior: the rollout neither succeeds nor enters a…

**Figure 12.** Failure-mode breakdown: web-scale tasks. Per-trial outcomes for Stirring, Wiping, and Binning across the three evaluation scenarios from §4.1.

**Figure 13.** Failure-mode breakdown: self-collected tasks. Per-trial outcomes for Push-T and Cable-routing. Each row compares execution with the dexterous hand policy and the parallel-jaw gripper policy.

**Figure 14.** Power-law extrapolations of intent loss. Fixed-exponent fits to the held-out intent-loss points from…
Recoverable plot labels: held-out intent loss versus number of clips, fit as L(M) = c + aM^(−α) for α = 0.05, 0.10, 0.15, 0.20, 0.25; the observed range ends at 20k clips and the curves are extrapolated 1000×.

**Figure 15.** Query-points ablation. Episode reward vs environment steps as we sweep the number of camera-visible query points N the student policy receives.
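The fixed-exponent fits behind Figure 14 use the form L(M) = c + aM^(−α). A minimal sketch of such a fit follows, assuming ordinary least squares (the paper's actual fitting procedure is not given in this page). With α held fixed the model is linear in x = M^(−α), so c and a come from simple linear regression. The data below are synthetic, generated from made-up constants purely to exercise the code, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def predict_loss(c: float, a: float, alpha: float, num_clips: float) -> float:
    """Evaluate L(M) = c + a * M**(-alpha)."""
    return c + a * num_clips ** (-alpha)


def fit_fixed_exponent(
    num_clips: Sequence[float], losses: Sequence[float], alpha: float
) -> Tuple[float, float]:
    """Least-squares estimate of (c, a) in L(M) = c + a * M**(-alpha), alpha fixed."""
    if len(num_clips) != len(losses) or len(num_clips) < 2:
        raise ValueError("need at least two (clips, loss) pairs of equal length")
    xs = [m ** (-alpha) for m in num_clips]  # the model is linear in this feature
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(losses) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, losses))
    a = cov_xl / var_x
    c = mean_l - a * mean_x
    return c, a


if __name__ == "__main__":
    # Synthetic points from made-up constants (c=2.0, a=3.0, alpha=0.15); not paper data.
    clips: List[float] = [1e3, 2e3, 5e3, 1e4, 2e4]
    losses = [predict_loss(2.0, 3.0, 0.15, m) for m in clips]
    target = 1000 * clips[-1]  # assumed reading of "1000x extrapolation" from 20k clips
    for alpha in (0.05, 0.10, 0.15, 0.20, 0.25):
        c, a = fit_fixed_exponent(clips, losses, alpha)
        print(f"alpha={alpha:.2f}  c={c:.3f}  a={a:.3f}  "
              f"L({target:.0e})={predict_loss(c, a, alpha, target):.3f}")
```

Sweeping α rather than fitting it reflects how little a 1k to 20k range constrains the exponent: each fixed α can match the observed points closely while giving a different extrapolated loss.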

Original abstract

The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: https://lucid-robot.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LUCID splits intent prediction from embodiment-specific control to use human videos at scale, but the sim-trained policy must carry the full load of real dexterous execution without adaptation.


The paper's core move is a clean two-stage split: train an intent model on unstructured internet or smartphone videos to predict short-horizon scene changes, then feed those predictions to a separate sensorimotor policy trained in simulation. The intent layer stays the same across embodiments, so one model can drive both a dexterous hand and a parallel-jaw gripper.

This separation is the actual novelty. Prior video-imitation work often stays tied to one robot or requires structured data; here the claim is that the intent model transfers zero-shot to new scenes and objects on real tasks like stirring, wiping, binning, push-T, and cable routing.

The approach does address a real bottleneck—collecting robot demonstrations is expensive and embodiment-specific—so the framing around scalable human video is useful.

The soft spot is the intent-to-action interface. The sensorimotor policy, trained only in simulation, has to turn intent signals into stable contact-rich actions on real hardware with no fine-tuning. Tasks like cable routing are sensitive to dynamics gaps, and any mismatch in observation or actuation could break the zero-shot transfer. The abstract supplies no quantitative results, baselines, or error analysis, so the summary alone cannot show how well the method works.

The work is aimed at robot-learning groups focused on imitation and sim-to-real. It has enough of a concrete pipeline and real-robot claims to deserve referee time, though the results section will need careful checking on the sim-to-real evidence.

Referee Report

2 major / 0 minor

Summary. The paper introduces LUCID, a two-stage framework that first learns an embodiment-agnostic task intent model from unstructured human videos (internet-scale or 1-hour smartphone collections) to predict short-horizon scene changes in closed loop from current observations, then trains an embodiment-specific sensorimotor policy in massively parallel simulation to map those intent predictions into robot actions. The shared intent interface enables the same intent model to transfer zero-shot across embodiments (dexterous hand to parallel-jaw gripper) and to novel scenes/object instances on five real-world tasks: stirring, wiping, and binning (internet video only) plus push-T and cable routing (smartphone video).

Significance. If the quantitative results support the claims, the work would be significant for scalable robot learning: it demonstrates a practical route to leverage abundant unstructured video data instead of embodiment-tied demonstrations, separates intent from control to achieve cross-embodiment reuse, and shows zero-shot real-world generalization on contact-rich tasks. The explicit use of simulation for the policy stage and the multi-embodiment evaluation are concrete strengths that could be built upon.

major comments (2)

[Abstract] The central zero-shot transfer claim across embodiments and to novel scenes/objects rests on the sensorimotor policy (trained only in simulation) reliably converting short-horizon intent into stable real actions for contact-rich tasks without any real-world adaptation or fine-tuning. No quantitative sim-to-real gap analysis, domain-randomization ablations, or real-vs-sim success-rate comparisons are referenced in the provided abstract or evaluation summary, leaving the load-bearing interface unverified.
[Abstract] Evaluation on real-world tasks (stirring, cable routing): success on these tasks is presented as evidence for the full pipeline, yet the manuscript supplies no baselines, training details, or error analysis in the abstract, making it impossible to determine whether the reported performance actually exceeds what embodiment-specific methods achieve or whether failures are attributable to intent prediction versus the sim-trained policy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments focus on the abstract's presentation of results; we address them point-by-point below and will revise the abstract accordingly while noting that supporting details appear in the full manuscript.


Referee: [Abstract] The central zero-shot transfer claim across embodiments and to novel scenes/objects rests on the sensorimotor policy (trained only in simulation) reliably converting short-horizon intent into stable real actions for contact-rich tasks without any real-world adaptation or fine-tuning. No quantitative sim-to-real gap analysis, domain-randomization ablations, or real-vs-sim success-rate comparisons are referenced in the provided abstract or evaluation summary, leaving the load-bearing interface unverified.

Authors: The abstract is length-constrained and prioritizes high-level claims. The full manuscript details quantitative sim-to-real gap analysis, domain-randomization ablations, and real-vs-sim success-rate comparisons in Sections 4.3 and 5.2 to support the zero-shot transfer. We agree the abstract should better signal these elements and will revise it to include a brief reference to the sim-to-real validation.
Revision: partial

Referee: [Abstract] Evaluation on real-world tasks (stirring, cable routing): success on these tasks is presented as evidence for the full pipeline, yet the manuscript supplies no baselines, training details, or error analysis in the abstract, making it impossible to determine whether the reported performance actually exceeds what embodiment-specific methods achieve or whether failures are attributable to intent prediction versus the sim-trained policy.

Authors: Space limits in the abstract preclude full details. The manuscript provides baselines (Table 2), training details (Section 4.1), and error analysis (Section 5.3) showing outperformance over embodiment-specific methods with failures often linked to intent prediction. We will revise the abstract to note the comparative results and direct readers to the full evaluation.
Revision: partial

Circularity Check

0 steps flagged

No circularity: two-stage framework with independent learning components


The paper presents LUCID as a two-stage pipeline in which an intent model is trained on unstructured human videos and an embodiment-specific sensorimotor policy is trained separately in simulation; the intent interface is described as shared but no equations, fitted parameters, or self-citations are shown that would make any claimed prediction or transfer result equivalent to its inputs by construction. The zero-shot transfer claims rest on empirical evaluation across tasks rather than a closed mathematical derivation that reduces to the training data or prior self-citations. This is the most common honest finding for papers whose central contribution is an empirical pipeline without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the untested premise that scene-level intent is sufficiently embodiment-independent to be learned from human video alone and then realized by any robot body via simulation.

axioms (2)

domain assumption: Short-horizon intent can be predicted from the current observation independently of the acting body.
This is the explicit premise that allows the same intent model to serve multiple embodiments.
domain assumption: Simulation-trained policies can reliably map predicted intent to real-robot actions without real-world embodiment data.
Required for the zero-shot transfer claim on physical hardware.

pith-pipeline@v0.9.1-grok · 5749 in / 1311 out tokens · 31752 ms · 2026-06-27T10:00:41.974399+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 13 canonical work pages

[1] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems, 2023. doi:10.15607/RSS.2023.XIX.016
[2] Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024. doi:10.1109/ICRA57147.2024.10611477
[3] C. Wang, H. Shi, W. Wang, R. Zhang, L. Fei-Fei, and C. K. Liu. DexCap: Scalable and portable mocap data collection system for dexterous manipulation. In Robotics: Science and Systems (RSS), 2024.
[4] I. Guzey, H. Qi, J. Urain, C. Wang, J. Yin, K. Bodduluri, M. Lambeta, L. Pinto, A. Rai, J. Malik, T. Wu, A. Sharma, and H. Bharadhwaj. Dexterity from smart lenses: Multi-fingered robot manipulation with in-the-wild human demonstrations. arXiv preprint arXiv:2511.16661, 2025.
[5] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems, 2024. doi:10.15607/RSS.2024.XX.045
[6] H. Gupta, X. Guo, H. Ha, C. Pan, M. Cao, D. Lee, S. Scherer, S. Song, and G. Shi. UMI-on-Air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies. arXiv preprint arXiv:2510.02614, 2025.
[7] M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song. Flow as the cross-domain manipulation interface. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2475–2499. PMLR, 2025. URL https://proceedings.mlr.press/v270/xu25a.html
[8] H. Li, L. Sun, Y. Hu, D. Ta, J. Barry, G. Konidaris, and J. Fu. NovaFlow: Zero-shot manipulation via actionable flow from generated videos. arXiv preprint arXiv:2510.08568, 2025.
[9] K. Kedia, T. G. W. Lum, J. Bohg, and C. K. Liu. SimToolReal: An object-centric policy for zero-shot dexterous tool manipulation. arXiv preprint arXiv:2602.16863, 2026.
[10] R. Singh, A. Allshire, A. Handa, N. Ratliff, and K. Van Wyk. DextrAH-RGB: Visuomotor policies to grasp anything with dexterous hands. arXiv preprint arXiv:2412.01791, 2024.
[11] Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. DexMV: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision (ECCV), 2022.
[12] H. Gupta, M. A. Mirzaee, and W. Yuan. Grasp to act: Dexterous grasping for tool use in dynamic settings. arXiv preprint arXiv:2602.20466, 2026.
[13] Y. Kuang, S. Park, K. Fragkiadaki, and S. Tulsiani. Dex4D: Task-agnostic point track policy for sim-to-real dexterous manipulation. arXiv preprint arXiv:2602.15828, 2026.
[14] S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y. Xie, R. Zheng, D. Niu, Y. L. Tan, K. R. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y. Liu, Y. Zhu, J. Jang, and L. Fan. DreamDojo: A generalist robot world model from large-scale ...
[15] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. Fan, and J. Jang. World action mode...
[16] X. Liu, J. Adalibieke, Q. Han, Y. Qin, and L. Yi. DexTrack: Towards generalizable neural tracking control for dexterous manipulation from human references. In International Conference on Learning Representations (ICLR), 2025.
[17] S. Xu, Y.-W. Chao, L. Bian, A. Mousavian, Y.-X. Wang, L. Gui, and W. Yang. Dexplore: Scalable neural control for dexterous manipulation from reference scoped exploration. In Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 2184–2199. PMLR, 2025. URL https://proceedings.mlr.press/v305/xu25d.html
[18] K. Shaw, A. Agarwal, and D. Pathak. LEAP Hand: Low-cost, efficient, and anthropomorphic hand for robot learning. In Robotics: Science and Systems (RSS), 2023.
[19] Z. Chen, S. Chen, E. Arlaud, I. Laptev, and C. Schmid. ViViDex: Learning vision-based dexterous manipulation from human videos. In IEEE International Conference on Robotics and Automation (ICRA), 2025.
[20] J. Hsieh, K.-H. Tu, K.-H. Hung, and T.-W. Ke. DexMan: Learning bimanual dexterous manipulation from human and generated videos. arXiv preprint arXiv:2510.08475, 2025.
[21] J. Mu, S. Yang, Y. Bao, H. Bae, T. Wei, L. Xu, B. Li, H. Xu, and J. Pang. DexImit: Learning bimanual dexterous manipulation from monocular human videos. arXiv preprint arXiv:2602.10105, 2026.
[22] H. Chen, T. Dong, T. Wu, L. Wang, Y. Jangir, Y. Niu, Y. Ye, H. Bharadhwaj, Z. Erickson, and J. Ichnowski. Dexterous manipulation policies from RGB human videos via 3D hand-object trajectory reconstruction. arXiv preprint arXiv:2602.09013, 2026.
[23] T. G. W. Lum, O. Y. Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real RL using one human demonstration. arXiv preprint arXiv:2504.12609, 2025.
[24] C. Pan, C. Wang, H. Qi, Z. Liu, H. Bharadhwaj, A. Sharma, T. Wu, G. Shi, J. Malik, and F. Hogan. SPIDER: Scalable physics-informed dexterous retargeting. arXiv preprint arXiv:2511.09484, 2025.
[25] J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. ZeroMimic: Distilling robotic manipulation skills from web videos. In IEEE International Conference on Robotics and Automation (ICRA), pages 16939–16947, 2025. doi:10.1109/ICRA55743.2025.11128283
[26] M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. In Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 4545–4565. PMLR, 2025. URL https://proceedings.mlr.press/v305/lepert25a.html
[27] Z. Wang, B. He, K. Yu, S. Lee, R. Gao, F. Huang, and Y. Aloimonos. HumanEgo: Zero-shot robot learning from minutes of human egocentric videos. arXiv preprint arXiv:2605.24934, 2026.
[28] S. Haldar and L. Pinto. Point policy: Unifying observations and actions with key points for robot manipulation. arXiv preprint arXiv:2502.20391, 2025.
[29] I. Guzey, Y. Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351, 2025. doi:10.1109/ICRA55743.2025.11128690
[30] H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation. In Computer Vision – ECCV 2024, volume 15134 of Lecture Notes in Computer Science, pages 306–324, 2024. doi:10.1007/978-3-031-73116-7_18
[31] J. Liang, R. Liu, E. Ozguroglu, S. Sudhakar, A. Dave, P. Tokmakov, S. Song, and C. Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 3943–3960. PMLR, 2025. URL https://proceedings.mlr.press/v270/liang25b.html
[32] J. Pai, L. Achenbach, V. Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video-action models for generalizable robot control beyond VLAs. arXiv preprint arXiv:2512.15692, 2025.
[33] R. G. Goswami, A. Bar, D. Fan, T.-Y. Yang, G. Zhou, P. Krishnamurthy, M. Rabbat, F. Khorrami, and Y. LeCun. World models for learning dexterous hand-object interactions from human videos. arXiv preprint arXiv:2512.13644, 2025.
[34] S. Routray, H. Pan, U. Jain, S. Bahl, and D. Pathak. ViPRA: Video prediction for robot actions. In International Conference on Learning Representations (ICLR), 2026.
[35] H. Luo, Y. Feng, W. Zhang, S. Zheng, Y. Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu. Being-H0: Vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025.
[36] R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy ∼ human policy. arXiv preprint arXiv:2503.13441, 2025.
[37] R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y. Lu, and X. Wang. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.
[38] S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. EgoMimic: Scaling imitation learning via egocentric video. In IEEE International Conference on Robotics and Automation (ICRA), 2025.
[39] Q. Li, Y. Deng, Y. Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, Y. Zhang, X. Chen, H. Chen, L. Sun, D. Chen, J. Yang, and B. Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. arXiv preprint arXiv:2510.21571, 2025.
[40] M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976, 2025.
[41] J. Ren, P. Sundaresan, D. Sadigh, S. Choudhury, and J. Bohg. Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 8802–8810, 2025. doi:10.1109/ICRA55743.2025.11128834
[42] J. A. Collins, L. Cheng, K. Aneja, A. Wilcox, B. Joffe, and A. Garg. AMPLIFY: Actionless motion priors for robot learning from videos. arXiv preprint arXiv:2506.14198, 2025.
[43] X. Liu, K. Lyu, J. Zhang, T. Du, and L. Yi. Parameterized quasi-physical simulators for dexterous manipulations transfer. In Computer Vision – ECCV 2024, volume 15136 of Lecture Notes in Computer Science, pages 164–182, 2024. doi:10.1007/978-3-031-73229-4_10
[44] S. Dasari, A. Gupta, and V. Kumar. Learning dexterous manipulation from exemplar object trajectories and pre-grasps. In IEEE International Conference on Robotics and Automation (ICRA), 2023.
[45] K. Li, P. Li, T. Liu, Y. Li, and S. Huang. ManipTrans: Efficient dexterous bimanual manipulation transfer via residual learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[46] Z. Mandi, Y. Hou, D. Fox, Y. Narang, A. Mandlekar, and S. Song. DexMachina: Functional retargeting for bimanual dexterous manipulation. arXiv preprint arXiv:2505.24853, 2025.
[47] Z.-H. Yin, C. Wang, L. Pineda, F. Hogan, K. Bodduluri, A. Sharma, P. Lancaster, I. Prasad, M. Kalakrishnan, J. Malik, M. Lambeta, T. Wu, P. Abbeel, and M. Mukadam. DexterityGen: Foundation controller for unprecedented dexterity. arXiv preprint arXiv:2502.04307, 2025.
[48] C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel. Any-point trajectory modeling for policy learning. In Robotics: Science and Systems (RSS), 2024.
[49] C. Yuan, C. Wen, T. Zhang, and Y. Gao. General flow as foundation affordance for scalable robot learning. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 1541–1566. PMLR, 2025. URL https://proceedings.mlr.press/v270/yuan25a.html
[50] D. Seita, Y. Wang, S. J. Shetty, E. Y. Li, Z. Erickson, and D. Held. ToolFlowNet: Robotic manipulation with tools via predicting tool flow from point clouds. In Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 1038–1049. PMLR, 2023. URL https://proceedings.mlr.press/v205/seita23a.html
[51]

Zheng, Y

R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daum ´e III, A. Kolobov, F. Huang, and J. Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations (ICLR), 2025

[52]

W. Huang, Y.-W. Chao, A. Mousavian, M.-Y. Liu, D. Fox, K. Mo, and L. Fei-Fei. PointWorld: Scaling 3D world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2026.

[53]

P. Mandikal and K. Grauman. DexVIP: Learning dexterous grasping with human hand pose priors from video. In Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 651–661. PMLR, 2022. URL https://proceedings.mlr.press/v164/mandikal22a.html.

[54]

B. Chen, T. Zhang, H. Geng, K. Song, C. Zhang, P. Li, W. T. Freeman, J. Malik, P. Abbeel, R. Tedrake, V. Sitzmann, and Y. Du. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025.

[55]

N. Karaev, Y. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6013–6022, 2025.

[56]

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[57]

T.-S. Chen, A. Siarohin, W. Menapace, E. Deyneka, H.-w. Chao, B. E. Jeon, Y. Fang, H.-Y. Lee, J. Ren, M.-H. Yang, and S. Tulyakov. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[58]

D. Chen, T. Kasarla, Y. Bang, M. Shukor, W. Chung, J. Yu, A. Bolourchi, T. Moutakanni, and P. Fung. Action100M: A large-scale video action dataset. arXiv preprint arXiv:2601.10592, 2026.

[59]

R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The “something something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), 2017.

[60]

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022. doi:10.1007/s11263-021-01531-2.

[61]

J. Huang, Q. Zhou, H. Rabeti, A. Korovko, H. Ling, X. Ren, T. Shen, J. Gao, D. Slepichev, C.-H. Lin, J. Ren, K. Xie, J. Biswas, L. Leal-Taixé, and S. Fidler. ViPE: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.

[62]

N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. Rädle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y. Zhou, L. Momeni, R. Hazra, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollár, N. Ravi, K. ...

[63]

T. D. Ngo, A. Mirzaei, G. Qian, H. Liang, C. Gan, E. Kalogerakis, P. Wonka, and C. Wang. DELTAv2: Accelerating dense 3D tracking. arXiv preprint arXiv:2508.01170, 2025.

[64]

R. A. Potamias, J. Zhang, J. Deng, and S. Zafeiriou. WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[65]

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):245:1–245:17, 2017. doi:10.1145/3130800.3130883.

[66]

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y. Feng, A. Garg, R. Gasoto, L. Gulich, Y. Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V. Makoviychuk, G. Malczyk, H...

[67]

M. Ciocarlie, C. Goldfeder, and P. Allen. Dimensionality reduction for hand-independent dexterous robotic grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3270–3275, 2007. doi:10.1109/IROS.2007.4399227.

[68]

J. He, C. Zhang, F. Jenelten, R. Grandia, M. Bächer, and M. Hutter. Attention-based map encoding for learning generalized legged locomotion. Science Robotics, 10(105):eadv3604,
[69]

doi:10.1126/scirobotics.adv3604

[70]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[71]

I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang. Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.

[72]

Z. Wu, X. Huang, L. Yang, Y. Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, G. Shi, and C. K. Liu. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026.

[73]

A. Handa, T. Whelan, J. McDonald, and A. J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE International Conference on Robotics and Automation (ICRA), pages 1524–1531, 2014. doi:10.1109/ICRA.2014.6907054.

[74]

Google DeepMind. Veo 3 model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Veo-3-Model-Card.pdf, 2026. Published May 23, 2025; updated January 13, 2026. Accessed: 2026-06-05.

[75]

C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. C. M. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems, 2023. doi:10.15607/RSS.2023.XIX.026.

[76]

T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y. Yuan, X. Da, F. Castañeda, S. Sastry, C. Liu, G. Shi, L. Fan, and Y. Zhu. VIRAL: Visual sim-to-real at scale for humanoid loco-manipulation. arXiv preprint arXiv:2511.15200, 2025.

[77]

R. S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019.

[78]

D. Makoviichuk and V. Makoviychuk. RL Games: High performance RL library. https://github.com/Denys88/rl_games, 2021.

[79]

N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. doi:10.1162/106365601750190398.

[80]

B. Wen, S. Dewan, and S. Birchfield. Fast-FoundationStereo: Real-time zero-shot stereo matching. arXiv preprint arXiv:2512.11130, 2025.

A Intent Model Details

A.1 Supervision Pipeline

This appendix details how each raw video clip is processed into the per-window supervision targets consumed by the intent-model loss (Sec. 3.1).

A.1.1 Dataset Mix and Clip...

[1]

T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems, 2023. doi:10.15607/RSS.2023.XIX.016.

[2]

Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024. doi:10.1109/ICRA57147.2024.10611477.

[3]

C. Wang, H. Shi, W. Wang, R. Zhang, L. Fei-Fei, and C. K. Liu. DexCap: Scalable and portable mocap data collection system for dexterous manipulation. In Robotics: Science and Systems (RSS), 2024.

[4]

I. Guzey, H. Qi, J. Urain, C. Wang, J. Yin, K. Bodduluri, M. Lambeta, L. Pinto, A. Rai, J. Malik, T. Wu, A. Sharma, and H. Bharadhwaj. Dexterity from smart lenses: Multi-fingered robot manipulation with in-the-wild human demonstrations. arXiv preprint arXiv:2511.16661, 2025.

[5]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems, 2024. doi:10.15607/RSS.2024.XX.045.

[6]

H. Gupta, X. Guo, H. Ha, C. Pan, M. Cao, D. Lee, S. Scherer, S. Song, and G. Shi. UMI-on-Air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies. arXiv preprint arXiv:2510.02614, 2025.

[7]

M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song. Flow as the cross-domain manipulation interface. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2475–2499. PMLR, 2025. URL https://proceedings.mlr.press/v270/xu25a.html.

[8]

H. Li, L. Sun, Y. Hu, D. Ta, J. Barry, G. Konidaris, and J. Fu. NovaFlow: Zero-shot manipulation via actionable flow from generated videos. arXiv preprint arXiv:2510.08568, 2025.

[9]

K. Kedia, T. G. W. Lum, J. Bohg, and C. K. Liu. SimToolReal: An object-centric policy for zero-shot dexterous tool manipulation. arXiv preprint arXiv:2602.16863, 2026.

[10]

R. Singh, A. Allshire, A. Handa, N. Ratliff, and K. Van Wyk. DextrAH-RGB: Visuomotor policies to grasp anything with dexterous hands. arXiv preprint arXiv:2412.01791, 2024.

[11]

Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. DexMV: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision (ECCV), 2022.

[12]

H. Gupta, M. A. Mirzaee, and W. Yuan. Grasp to act: Dexterous grasping for tool use in dynamic settings. arXiv preprint arXiv:2602.20466, 2026.

[13]

Y. Kuang, S. Park, K. Fragkiadaki, and S. Tulsiani. Dex4D: Task-agnostic point track policy for sim-to-real dexterous manipulation. arXiv preprint arXiv:2602.15828, 2026.

[14]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y. Xie, R. Zheng, D. Niu, Y. L. Tan, K. R. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y. Liu, Y. Zhu, J. Jang, and L. Fan. DreamDojo: A generalist robot world model from large-scale ...

[15]

S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. Fan, and J. Jang. World action mode...

[16]

X. Liu, J. Adalibieke, Q. Han, Y. Qin, and L. Yi. DexTrack: Towards generalizable neural tracking control for dexterous manipulation from human references. In International Conference on Learning Representations (ICLR), 2025.

[17]

S. Xu, Y.-W. Chao, L. Bian, A. Mousavian, Y.-X. Wang, L. Gui, and W. Yang. Dexplore: Scalable neural control for dexterous manipulation from reference scoped exploration. In Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 2184–2199. PMLR, 2025. URL https://proceedings.mlr.press/v305/xu25d.html.

[18]

K. Shaw, A. Agarwal, and D. Pathak. LEAP Hand: Low-cost, efficient, and anthropomorphic hand for robot learning. In Robotics: Science and Systems (RSS), 2023.

[19]

Z. Chen, S. Chen, E. Arlaud, I. Laptev, and C. Schmid. ViViDex: Learning vision-based dexterous manipulation from human videos. In IEEE International Conference on Robotics and Automation (ICRA), 2025.

[20]

J. Hsieh, K.-H. Tu, K.-H. Hung, and T.-W. Ke. DexMan: Learning bimanual dexterous manipulation from human and generated videos. arXiv preprint arXiv:2510.08475, 2025.

[21]

J. Mu, S. Yang, Y. Bao, H. Bae, T. Wei, L. Xu, B. Li, H. Xu, and J. Pang. DexImit: Learning bimanual dexterous manipulation from monocular human videos. arXiv preprint arXiv:2602.10105, 2026.

[22]

H. Chen, T. Dong, T. Wu, L. Wang, Y. Jangir, Y. Niu, Y. Ye, H. Bharadhwaj, Z. Erickson, and J. Ichnowski. Dexterous manipulation policies from RGB human videos via 3D hand-object trajectory reconstruction. arXiv preprint arXiv:2602.09013, 2026.

[23]

T. G. W. Lum, O. Y. Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real RL using one human demonstration. arXiv preprint arXiv:2504.12609, 2025.

[24]

C. Pan, C. Wang, H. Qi, Z. Liu, H. Bharadhwaj, A. Sharma, T. Wu, G. Shi, J. Malik, and F. Hogan. SPIDER: Scalable physics-informed dexterous retargeting. arXiv preprint arXiv:2511.09484, 2025.

[25]

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. ZeroMimic: Distilling robotic manipulation skills from web videos. In IEEE International Conference on Robotics and Automation (ICRA), pages 16939–16947, 2025. doi:10.1109/ICRA55743.2025.11128283.

[26]

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. In Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 4545–4565. PMLR, 2025. URL https://proceedings.mlr.press/v305/lepert25a.html.

[27]

Z. Wang, B. He, K. Yu, S. Lee, R. Gao, F. Huang, and Y. Aloimonos. HumanEgo: Zero-shot robot learning from minutes of human egocentric videos. arXiv preprint arXiv:2605.24934, 2026.

[28]

S. Haldar and L. Pinto. Point policy: Unifying observations and actions with key points for robot manipulation. arXiv preprint arXiv:2502.20391, 2025.

[29]

I. Guzey, Y. Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351, 2025. doi:10.1109/ICRA55743.2025.11128690.

[30]

H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation. In Computer Vision – ECCV 2024, volume 15134 of Lecture Notes in Computer Science, pages 306–324, 2024. doi:10.1007/978-3-031-73116-7_18.

[31]

J. Liang, R. Liu, E. Ozguroglu, S. Sudhakar, A. Dave, P. Tokmakov, S. Song, and C. Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 3943–3960. PMLR, 2025. URL https://proceedings.mlr.press/v270/liang25b.html.

[32]

J. Pai, L. Achenbach, V. Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video-action models for generalizable robot control beyond VLAs. arXiv preprint arXiv:2512.15692, 2025.

[33]

R. G. Goswami, A. Bar, D. Fan, T.-Y. Yang, G. Zhou, P. Krishnamurthy, M. Rabbat, F. Khorrami, and Y. LeCun. World models for learning dexterous hand-object interactions from human videos. arXiv preprint arXiv:2512.13644, 2025.

[34]

S. Routray, H. Pan, U. Jain, S. Bahl, and D. Pathak. ViPRA: Video prediction for robot actions. In International Conference on Learning Representations (ICLR), 2026.

[35]

H. Luo, Y. Feng, W. Zhang, S. Zheng, Y. Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu. Being-H0: Vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025.

[36]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy ∼ human policy. arXiv preprint arXiv:2503.13441, 2025.

[37]

R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y. Lu, and X. Wang. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.

[38]

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. EgoMimic: Scaling imitation learning via egocentric video. In IEEE International Conference on Robotics and Automation (ICRA), 2025.

[39]

Q. Li, Y. Deng, Y. Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, Y. Zhang, X. Chen, H. Chen, L. Sun, D. Chen, J. Yang, and B. Guo. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. arXiv preprint arXiv:2510.21571, 2025.

[40]

M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976, 2025.

[41]

J. Ren, P. Sundaresan, D. Sadigh, S. Choudhury, and J. Bohg. Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 8802–8810, 2025. doi:10.1109/ICRA55743.2025.11128834.

[42]

J. A. Collins, L. Cheng, K. Aneja, A. Wilcox, B. Joffe, and A. Garg. AMPLIFY: Actionless motion priors for robot learning from videos. arXiv preprint arXiv:2506.14198, 2025.

[43]

X. Liu, K. Lyu, J. Zhang, T. Du, and L. Yi. Parameterized quasi-physical simulators for dexterous manipulations transfer. In Computer Vision – ECCV 2024, volume 15136 of Lecture Notes in Computer Science, pages 164–182, 2024. doi:10.1007/978-3-031-73229-4_10.
