EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Alfred Cueva; Chuye Zhang; Danfei Xu; Shuo Cheng; Woo Chul Shin; Xinchen Yin; Yangcen Liu; Yiran Yang; Zhenyang Chen

arxiv: 2606.12604 · v1 · pith:DLGNKD2Bnew · submitted 2026-06-10 · 💻 cs.RO

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Yangcen Liu , Shuo Cheng , Xinchen Yin , Woo Chul Shin , Alfred Cueva , Yiran Yang , Zhenyang Chen , Chuye Zhang

show 1 more author

Danfei Xu

This is my paper

Pith reviewed 2026-06-27 09:26 UTC · model grok-4.3

classification 💻 cs.RO

keywords egocentric videosdexterous manipulationrobot data generationvisuomotor policieshuman-to-robot transferzero-shot learningaction trajectory synthesis

0 comments

The pith

EgoEngine converts egocentric human manipulation videos into high-fidelity robot videos and action trajectories, allowing zero-shot visuomotor policy learning on real robots without any real-robot demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that egocentric human videos can supply the data needed for dexterous robot policies. EgoEngine replaces the human hand in each video with a robot hand while keeping the surrounding scene and timing intact, then outputs an executable sequence of robot joint commands that respects physical limits. If the conversion is reliable, large collections of everyday human videos become usable training material, bypassing the expense of recording equivalent robot demonstrations. A reader would care because current dexterous skills are bottlenecked by the small size of robot datasets. The method therefore aims to scale data generation from an existing, abundant source.

Core claim

Given an egocentric RGB video, EgoEngine produces a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, together with a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots demonstrate that policies trained exclusively on this generated data transfer successfully, constituting the first reported zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations.

What carries the argument

EgoEngine, the framework that generates robot observation videos and feasible action trajectories from egocentric human videos.

Load-bearing premise

The robot videos and action trajectories produced from human videos are accurate and physically feasible enough that policies trained only on them succeed when deployed on real robots.

What would settle it

Train a policy on EgoEngine-generated data for a given task and observe that it fails on the real robot, while an otherwise identical policy trained on real-robot demonstrations for the same task succeeds.

Figures

Figures reproduced from arXiv: 2606.12604 by Alfred Cueva, Chuye Zhang, Danfei Xu, Shuo Cheng, Woo Chul Shin, Xinchen Yin, Yangcen Liu, Yiran Yang, Zhenyang Chen.

**Figure 1.** Figure 1: EgoEngine is a scalable data engine that converts egocentric human videos into robot demonstrations. Given an egocentric human video, EgoEngine creates a digital twin and jointly produces (1) a high-fidelity, temporally consistent robot observation video and (2) executable action trajectories aligned with the video. The generated demonstrations serve as training data that can be distilled into downstream … view at source ↗

**Figure 2.** Figure 2: Adaptive MCTS-style switching optimization. EgoEngine progressively selects among Replay, MPC, and RL to solve each trajectory chunk with the cheapest feasible solver. After retargeting, EgoEngine faces a longhorizon, high-DoF trajectory optimization problem: the reference trajectory preserves the human motion prior, but must be converted into executable robot actions that reproduce the demonstrated ob… view at source ↗

**Figure 4.** Figure 4: KDE [61] of features extracted from ResNet18, VGG16, and DINOv2. Method (FD↓) ResNet18 VGG16 DINOv2 Human Video 764.5 670.2 602.9 EgoMimic 830.5 812.1 579.6 VACE 713.6 745.3 488.0 Phantom 620.0 650.8 470.6 EgoEngine 614.7 644.2 473.1 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Throughput comparison: Successful demo number generated over simulation steps on 20 Aria demos across 4 tasks. Replay MPC RL (c) Brush Bowl (Right) (b) Mustard (d) Brush Bowl (Left) (a) Drawer [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of Adaptive MCTS-style mode switching across trajectory chunks on four demonstrations from the Aria and TACO datasets. Method Mustard Drawer Flower Hammer Human Video 0.00 0.10 0.00 0.00 Phantom [17] 0.00 0.05 0.00 0.00 Real Robot 0.80 0.80 0.70 0.25 EgoEngine 0.40 0.35 0.70 0.60 [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Behavioral strengths and failure modes of Ego [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video replacing human with robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EgoEngine turns egocentric human videos into robot demos and claims the first zero-shot dexterous policy transfer without real-robot data, but the abstract gives no numbers or method details to check the transfer.

read the letter

The paper's core contribution is a framework that takes an egocentric RGB video of a human doing a manipulation task and outputs both a robot-rendered observation video and a feasible robot action trajectory. The stated goal is to remove the need for costly robot demonstrations by scaling from existing human video sources.

What is new is the explicit claim of zero-shot visuomotor policy learning on dexterous tasks using only the generated data. The work identifies the visual gap and the action gap clearly and proposes to close both in one pipeline while preserving scene context and temporal alignment.

The approach is practical on paper because human videos are far more abundant than robot ones. If the generation step produces trajectories that are both executable and distributionally close to real robot behavior, this could reduce the data collection burden for complex hands.

The main limitation visible from the abstract is the lack of any quantitative support. No success rates, no comparison to real-demonstration baselines, and no ablation on how well the generated videos and actions match real robot dynamics are provided. The zero-shot claim rests entirely on the unshown experiments in simulation and on hardware. Without those numbers or the generation method, it is impossible to tell whether the feasibility constraints actually produce usable data or whether the visual replacement step introduces artifacts that break policy transfer.

This paper is aimed at researchers working on imitation learning and data scaling for dexterous manipulation. A reader in that area would get value from the problem framing and the high-level pipeline even if they end up modifying the generation components.

It deserves peer review because the data-bottleneck problem is real and the proposed direction is concrete, though any referee would need to see the full methods and results before accepting the zero-shot claim.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes EgoEngine, a scalable framework that takes an egocentric RGB human manipulation video as input and outputs (i) a high-fidelity robot observation video in which the human is replaced by a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory subject to feasibility constraints. The central claim is that policies trained exclusively on the resulting synthetic robot data achieve zero-shot visuomotor dexterous manipulation on real robots, constituting the first such demonstration without any real-robot demonstrations; supporting experiments are stated to have been performed in simulation and on physical hardware.

Significance. If the zero-shot transfer result holds with the reported fidelity and without hidden real-robot data, the work would materially advance scalable dexterous manipulation learning by converting abundant egocentric human video into robot-executable data, directly addressing the data-collection bottleneck highlighted in the abstract.

major comments (1)

[Abstract] Abstract: the claim that EgoEngine 'demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations' is load-bearing for the entire contribution, yet the provided text contains no quantitative results (success rates, number of trials, comparison to real-demonstration baselines, or ablations on generation fidelity). Without these data it is impossible to verify that the generated trajectories are distributionally close enough for real-robot transfer, which is the weakest assumption identified in the stress-test note.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for quantitative support in the abstract for our central claim. We address this point directly below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that EgoEngine 'demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations' is load-bearing for the entire contribution, yet the provided text contains no quantitative results (success rates, number of trials, comparison to real-demonstration baselines, or ablations on generation fidelity). Without these data it is impossible to verify that the generated trajectories are distributionally close enough for real-robot transfer, which is the weakest assumption identified in the stress-test note.

Authors: We agree that the abstract should include key quantitative results to make the zero-shot transfer claim verifiable at a glance. The full manuscript reports these metrics in the experiments section (e.g., real-robot success rates across multiple tasks and trials, with comparisons to real-demonstration baselines). We will revise the abstract to incorporate representative numbers such as average success rate, number of trials, and a brief note on fidelity ablations. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents EgoEngine as a framework that converts egocentric human videos into robot observations and action trajectories, with the central claim supported by experiments in simulation and on real robots rather than by any internal derivation, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems are referenced in the abstract or description that reduce outputs to inputs by construction. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5726 in / 894 out tokens · 17846 ms · 2026-06-27T09:26:13.635871+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments
cs.RO 2026-06 unverdicted novelty 6.0

Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.

Reference graph

Works this paper leans on

68 extracted references · 1 canonical work pages · cited by 1 Pith paper

[1]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024
[2]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[3]

Z.-H. Yin, C. Wang, L. Pineda, F. Hogan, K. Bodduluri, A. Sharma, P. Lancaster, I. Prasad, M. Kalakrishnan, J. Malik, M. Lambeta, T. Wu, P. Abbeel, and M. Mukadam. Dexterity- gen: Foundation controller for unprecedented dexterity, 2025. URLhttps://arxiv. org/abs/2502.04307

arXiv 2025
[4]

M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song. Dexumi: Using hu- man hand as the universal manipulation interface for dexterous manipulation.arXiv preprint arXiv:2505.21864, 2025

arXiv 2025
[5]

Punamiya, S

R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Li- conti, L. Y . Zhu, P. Aphiwetsa, B. Li, A. Cheluva, P. Kuppili, Y . Liu, D. Patel, A. Gao, H.-Y . Chung, R. Co, R. Zbizika, J. Liu, X. Xu, H. Xiong, G. Chen, S. Oliani, C. Yang, X. Wang, J. Fort, R. Newcombe, J. Gao, J. Chong, G. Matsuda, A. Doriwala, M. Polle- fe...

Pith/arXiv arXiv 2026
[6]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022
[7]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. URLhttps://arxiv.org/abs/ 2505.11709

Pith/arXiv arXiv 2026
[8]

Zheng, D

R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Casta ˜neda, F. Hu, Y . L. Tan, L. Fu, T. Darrell, F. Huang, Y . Zhu, D. Xu, and L. Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data, 2026. URLhttps://arxiv.org/abs/2602.16710

arXiv 2026
[9]

Engel, K

J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith, C. Peng, C. Sweeney, C. Wilson, D. Barnes, D. DeTone, D. Caruso, D. Valleroy, D. Ginjupalli, D. Frost, E. Miller, E. Mueggler, E. Oleinik, F. Zhang, G. Soma- sundaram, G. Solaira, H. Lanaras, H. Howard-Jenkins, H. Tang, H. J. Kim, J. Rivera, J...

Pith/arXiv arXiv 2023
[10]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video.arXiv preprint arXiv:2410.24221, 2024

arXiv 2024
[11]

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos.arXiv preprint arXiv:2503.23877, 2025

arXiv 2025
[12]

Y . Liu, W. C. Shin, Y . Han, Z. Chen, H. Ravichandar, and D. Xu. Immimic: Cross-domain imitation from human videos via mapping and interpolation, 2025. URLhttps://arxiv. org/abs/2509.10952

arXiv 2025
[13]

Punamiya, D

R. Punamiya, D. Patel, P. Aphiwetsa, P. Kuppili, L. Y . Zhu, S. Kareer, J. Hoffman, and D. Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data, 2025. URLhttps://arxiv.org/abs/2509.19626

arXiv 2025
[14]

L. Y . Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. Emma: Scaling mobile manipulation via egocentric human data, 2025. URLhttps:// arxiv.org/abs/2509.04443

arXiv 2025
[15]

C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play, 2023. URLhttps://arxiv. org/abs/2302.12422

arXiv 2023
[16]

V . Liu, A. Adeniji, H. Zhan, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses.arXiv preprint arXiv:2505.20290, 2025

arXiv 2025
[17]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. InConference on Robot Learning (CoRL), Seoul, Korea, 2025

2025
[18]

Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. InRobotics: Science and Systems, 2023

2023
[19]

C. Kong, Y . Cho, W. Jung, I. Wibowo, P. Shinde, S. Vinodh-Sangeetha, L. K. Chung, Z. Chen, A. Mattei, A. Nidumukkala, A. Elias, D. Xu, T. Higgins, and S. Kousik. A closed- form geometric retargeting solver for upper body humanoid robot teleoperation, 2026. URL https://arxiv.org/abs/2602.01632

arXiv 2026
[20]

Cheng, J

X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback, 2024. URLhttps://arxiv.org/abs/2407.01512

arXiv 2024
[21]

Z.-H. Yin, C. Wang, L. Pineda, K. Bodduluri, T. Wu, P. Abbeel, and M. Mukadam. Geometric retargeting: A principled, ultrafast neural hand retargeting algorithm, 2025. URLhttps: //arxiv.org/abs/2503.07541

arXiv 2025
[22]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy human policy,
[23]

URLhttps://arxiv.org/abs/2503.13441

arXiv
[24]

J. Yin, H. Qi, Y . Wi, S. Kundu, M. Lambeta, W. Yang, C. Wang, T. Wu, J. Malik, and T. Helle- brekers. Osmo: Open-source tactile glove for human-to-robot skill transfer, 2025. URL https://arxiv.org/abs/2512.08920

arXiv 2025
[25]

Jiang, Y

Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning, 2025. URLhttps://arxiv.org/abs/2410.24185

arXiv 2025
[26]

J. Mu, S. Yang, Y . Bao, H. Bae, T. Wei, L. Xu, B. Li, H. Xu, and J. Pang. Deximit: Learning bimanual dexterous manipulation from monocular human videos, 2026. URLhttps:// arxiv.org/abs/2602.10105

arXiv 2026
[27]

W. Wan, J. Fu, X. Yuan, Y . Zhu, and H. Su. Lodestar: Long-horizon dexterity via synthetic data augmentation from human demonstrations, 2025. URLhttps://arxiv.org/abs/ 2508.17547

arXiv 2025
[28]

S. Zhao, X. Zhu, Y . Chen, C. Li, X. Zhang, M. Ding, and M. Tomizuka. Dexh2r: Task-oriented dexterous manipulation from human to robots.arXiv preprint arXiv:2411.04428, 2024

arXiv 2024
[29]

T. Lin, K. Sachdev, L. Fan, J. Malik, and Y . Zhu. Sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids, 2025. URLhttps://arxiv.org/ abs/2502.20396

arXiv 2025
[30]

Mandi, Y

Z. Mandi, Y . Hou, D. Fox, Y . Narang, A. Mandlekar, and S. Song. Dexmachina: Functional retargeting for bimanual dexterous manipulation, 2025. URLhttps://arxiv.org/abs/ 2505.24853

arXiv 2025
[31]

T. G. W. Lum, O. Y . Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration.arXiv preprint arXiv:2504.12609, 2025

arXiv 2025
[32]

Z. Gu, Y . Chen, Z. Chai, A. Cueva, T. Nguyen, Y . Wu, H. Xue, M. Kim, I. Legene, F. Liu, M. Kim, A. Barula, Y . Chen, and Y . Zhao. Refine-dp: Diffusion policy fine-tuning for hu- manoid loco-manipulation via reinforcement learning, 2026. URLhttps://arxiv.org/ abs/2603.13707

arXiv 2026
[33]

C. Pan, C. Wang, H. Qi, Z. Liu, H. Bharadhwaj, A. Sharma, T. Wu, G. Shi, J. Malik, and F. Hogan. Spider: Scalable physics-informed dexterous retargeting, 2026. URLhttps: //arxiv.org/abs/2511.09484

arXiv 2026
[34]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction, 2025. URLhttps://arxiv.org/abs/2509. 26633

2025
[35]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. In2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026

2026
[36]

C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies, 2025. URL https://arxiv.org/abs/2509.17759

arXiv 2025
[37]

G. Li, Y . Lyu, Z. Liu, C. Hou, Y . Xu, J. Zhang, and S. Zhang. H2r: A human-to-robot data augmentation for robot pre-training from videos.arXiv preprint arXiv:2505.11920, 2025

arXiv 2025
[38]

T. Xu, Z. Chen, L. Wu, H. Lu, Y . Chen, L. Jiang, B. Liu, and Y . Chen. Motion dreamer: Realizing physically coherent video generation through scene-aware motion reasoning.arXiv preprint arXiv:2412.00547, 2024

arXiv 2024
[39]

D. Geng, C. Herrmann, J. Hur, F. Cole, S. Zhang, T. Pfaff, T. Lopez-Guevara, C. Doersch, Y . Aytar, M. Rubinstein, et al. Motion prompting: Controlling video generation with motion trajectories.arXiv preprint arXiv:2412.02700, 2024

arXiv 2024
[40]

Zhang, H.-X

T. Zhang, H.-X. Yu, R. Wu, B. Y . Feng, C. Zheng, N. Snavely, J. Wu, and W. T. Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. InEuropean Conference on Computer Vision, pages 388–406. Springer, 2024

2024
[41]

Zhang, C

K. Zhang, C. Xiao, Y . Mei, J. Xu, and V . M. Patel. Think before you diffuse: Llms-guided physics-aware video generation.arXiv preprint arXiv:2505.21653, 2025

arXiv 2025
[42]

P. Yang, H. Ci, Y . Song, and M. Z. Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale, 2025. URLhttps://arxiv.org/abs/2512.04537

arXiv 2025
[43]

Jiang, Z

Z. Jiang, Z. Han, C. Mao, J. Zhang, Y . Pan, and Y . Liu. Vace: All-in-one video creation and editing, 2025. URLhttps://arxiv.org/abs/2503.07598

Pith/arXiv arXiv 2025
[44]

Patel, S

S. Patel, S. Mohan, H. Mai, U. Jain, S. Lazebnik, and Y . Li. Robotic manipulation by imitat- ing generated videos without physical demonstrations, 2025. URLhttps://arxiv.org/ abs/2507.00990

Pith/arXiv arXiv 2025
[45]

T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human in- teractions for in-the-wild robot policies, 2025. URLhttps://arxiv.org/abs/2505. 07813

2025
[46]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y . Lu, and X. Wang. Egovla: Learning vision-language-action models from egocentric human videos, 2025. URLhttps://arxiv.org/abs/2507.12440

Pith/arXiv arXiv 2025
[47]

B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero- shot stereo matching, 2025. URLhttps://arxiv.org/abs/2501.09898

arXiv 2025
[48]

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C.-Y . Wu, R. Girshick, P. Doll ´ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations (ICLR), 2025

2025
[49]

B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17868–17879, 2024

2024
[50]

K. Zakka. Mink: Python inverse kinematics based on MuJoCo, Feb. 2026. URLhttps: //github.com/kevinzakka/mink

2026
[51]

Schulman, F

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017
[52]

Schwarke, M

C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research, 2025. URLhttps://arxiv.org/abs/2509.10771

arXiv 2025
[53]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement – residual rl for precise assembly, 2024. URLhttps://arxiv.org/abs/2407.16677

arXiv 2024
[54]

Y . Chen, C. Wang, L. Fei-Fei, and C. K. Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation, 2023. URLhttps://arxiv.org/abs/2309.00987

arXiv 2023
[55]

T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y . Yuan, X. Da, F. Casta ˜neda, S. Sastry, C. Liu, G. Shi, L. Fan, and Y . Zhu. Viral: Visual sim-to-real at scale for humanoid loco- manipulation, 2025. URLhttps://arxiv.org/abs/2511.15200

arXiv 2025
[56]

Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models, 2024. URLhttps://arxiv.org/abs/2310.12931

Pith/arXiv arXiv 2024
[57]

P. Yin, T. Westenbroek, Z. Zhang, J. Tran, I. Dagnino, E. Shilamkar, N. Mbiziwo-Tiapo, S. Bagaria, X. Liu, G. Mullins, A. Kolobov, and A. Gupta. Emergent dexterity via diverse resets and large-scale reinforcement learning, 2026. URLhttps://arxiv.org/abs/ 2603.15789

arXiv 2026
[58]

Cheng and D

S. Cheng and D. Xu. League: Guided skill learning and abstraction for long-horizon manipu- lation, 2023. URLhttps://arxiv.org/abs/2210.12631

arXiv 2023
[59]

T. Yu, R. Feng, R. Feng, J. Liu, X. Jin, W. Zeng, and Z. Chen. Inpaint anything: Segment anything meets image inpainting, 2023. URLhttps://arxiv.org/abs/2304.06790

arXiv 2023
[60]

L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heteroge- neous pre-trained transformers, 2024. URLhttps://arxiv.org/abs/2409.20537

arXiv 2024
[61]

Y . Liu, H. Yang, X. Si, L. Liu, Z. Li, Y . Zhang, Y . Liu, and L. Yi. Taco: Benchmarking generalizable bimanual tool-action-object understanding.arXiv preprint arXiv:2401.08399, 2024

arXiv 2024
[62]

E. Parzen. On estimation of a probability density function and mode.The annals of mathemat- ical statistics, 33(3):1065–1076, 1962

1962
[63]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

Pith/arXiv arXiv 2025
[64]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InPro- ceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016

2016
[65]

Simonyan and A

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recog- nition. InInternational Conference on Learning Representations (ICLR), 2015

2015
[66]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research (TMLR), 2024

2024
[67]

S. D. Team, X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, A. Lin, J. Liu, Z. Ma, A. Sagar, B. Song, X. Wang, J. Yang, B. Zhang, P. Doll´ar, G. Gkioxari, M. Feiszli, and J. Malik. Sam 3d: 3dfy anything in images, 2025. URL https://arxiv.org/abs/2511.16624

Pith/arXiv arXiv 2025
[68]

On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12(1):112, 2015

Balasubramanian, Sivakumar and Melendez-Calderon, Alejandro and Roby-Brami, Agnes and Burdet, Etienne. On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12(1):112, 2015. doi:10.1186/s12984-015-0090-9. URLhttps://doi. org/10.1186/s12984-015-0090-9. Appendix A Hardware and Data Preprocess The overall data collection and ...

work page doi:10.1186/s12984-015-0090-9 2015

[1] [1]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024

[2] [2]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024

[3] [3]

Z.-H. Yin, C. Wang, L. Pineda, F. Hogan, K. Bodduluri, A. Sharma, P. Lancaster, I. Prasad, M. Kalakrishnan, J. Malik, M. Lambeta, T. Wu, P. Abbeel, and M. Mukadam. Dexterity- gen: Foundation controller for unprecedented dexterity, 2025. URLhttps://arxiv. org/abs/2502.04307

arXiv 2025

[4] [4]

M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song. Dexumi: Using hu- man hand as the universal manipulation interface for dexterous manipulation.arXiv preprint arXiv:2505.21864, 2025

arXiv 2025

[5] [5]

Punamiya, S

R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Li- conti, L. Y . Zhu, P. Aphiwetsa, B. Li, A. Cheluva, P. Kuppili, Y . Liu, D. Patel, A. Gao, H.-Y . Chung, R. Co, R. Zbizika, J. Liu, X. Xu, H. Xiong, G. Chen, S. Oliani, C. Yang, X. Wang, J. Fort, R. Newcombe, J. Gao, J. Chong, G. Matsuda, A. Doriwala, M. Polle- fe...

Pith/arXiv arXiv 2026

[6] [6]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022

[7] [7]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. URLhttps://arxiv.org/abs/ 2505.11709

Pith/arXiv arXiv 2026

[8] [8]

Zheng, D

R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Casta ˜neda, F. Hu, Y . L. Tan, L. Fu, T. Darrell, F. Huang, Y . Zhu, D. Xu, and L. Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data, 2026. URLhttps://arxiv.org/abs/2602.16710

arXiv 2026

[9] [9]

Engel, K

J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith, C. Peng, C. Sweeney, C. Wilson, D. Barnes, D. DeTone, D. Caruso, D. Valleroy, D. Ginjupalli, D. Frost, E. Miller, E. Mueggler, E. Oleinik, F. Zhang, G. Soma- sundaram, G. Solaira, H. Lanaras, H. Howard-Jenkins, H. Tang, H. J. Kim, J. Rivera, J...

Pith/arXiv arXiv 2023

[10] [10]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video.arXiv preprint arXiv:2410.24221, 2024

arXiv 2024

[11] [11]

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos.arXiv preprint arXiv:2503.23877, 2025

arXiv 2025

[12] [12]

Y . Liu, W. C. Shin, Y . Han, Z. Chen, H. Ravichandar, and D. Xu. Immimic: Cross-domain imitation from human videos via mapping and interpolation, 2025. URLhttps://arxiv. org/abs/2509.10952

arXiv 2025

[13] [13]

Punamiya, D

R. Punamiya, D. Patel, P. Aphiwetsa, P. Kuppili, L. Y . Zhu, S. Kareer, J. Hoffman, and D. Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data, 2025. URLhttps://arxiv.org/abs/2509.19626

arXiv 2025

[14] [14]

L. Y . Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. Emma: Scaling mobile manipulation via egocentric human data, 2025. URLhttps:// arxiv.org/abs/2509.04443

arXiv 2025

[15] [15]

C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play, 2023. URLhttps://arxiv. org/abs/2302.12422

arXiv 2023

[16] [16]

V . Liu, A. Adeniji, H. Zhan, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses.arXiv preprint arXiv:2505.20290, 2025

arXiv 2025

[17] [17]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. InConference on Robot Learning (CoRL), Seoul, Korea, 2025

2025

[18] [18]

Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. InRobotics: Science and Systems, 2023

2023

[19] [19]

C. Kong, Y . Cho, W. Jung, I. Wibowo, P. Shinde, S. Vinodh-Sangeetha, L. K. Chung, Z. Chen, A. Mattei, A. Nidumukkala, A. Elias, D. Xu, T. Higgins, and S. Kousik. A closed- form geometric retargeting solver for upper body humanoid robot teleoperation, 2026. URL https://arxiv.org/abs/2602.01632

arXiv 2026

[20] [20]

Cheng, J

X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback, 2024. URLhttps://arxiv.org/abs/2407.01512

arXiv 2024

[21] [21]

Z.-H. Yin, C. Wang, L. Pineda, K. Bodduluri, T. Wu, P. Abbeel, and M. Mukadam. Geometric retargeting: A principled, ultrafast neural hand retargeting algorithm, 2025. URLhttps: //arxiv.org/abs/2503.07541

arXiv 2025

[22] [22]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy human policy,

[23] [23]

URLhttps://arxiv.org/abs/2503.13441

arXiv

[24] [24]

J. Yin, H. Qi, Y . Wi, S. Kundu, M. Lambeta, W. Yang, C. Wang, T. Wu, J. Malik, and T. Helle- brekers. Osmo: Open-source tactile glove for human-to-robot skill transfer, 2025. URL https://arxiv.org/abs/2512.08920

arXiv 2025

[25] [25]

Jiang, Y

Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning, 2025. URLhttps://arxiv.org/abs/2410.24185

arXiv 2025

[26] [26]

J. Mu, S. Yang, Y . Bao, H. Bae, T. Wei, L. Xu, B. Li, H. Xu, and J. Pang. Deximit: Learning bimanual dexterous manipulation from monocular human videos, 2026. URLhttps:// arxiv.org/abs/2602.10105

arXiv 2026

[27] [27]

W. Wan, J. Fu, X. Yuan, Y . Zhu, and H. Su. Lodestar: Long-horizon dexterity via synthetic data augmentation from human demonstrations, 2025. URLhttps://arxiv.org/abs/ 2508.17547

arXiv 2025

[28] [28]

S. Zhao, X. Zhu, Y . Chen, C. Li, X. Zhang, M. Ding, and M. Tomizuka. Dexh2r: Task-oriented dexterous manipulation from human to robots.arXiv preprint arXiv:2411.04428, 2024

arXiv 2024

[29] [29]

T. Lin, K. Sachdev, L. Fan, J. Malik, and Y . Zhu. Sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids, 2025. URLhttps://arxiv.org/ abs/2502.20396

arXiv 2025

[30] [30]

Mandi, Y

Z. Mandi, Y . Hou, D. Fox, Y . Narang, A. Mandlekar, and S. Song. Dexmachina: Functional retargeting for bimanual dexterous manipulation, 2025. URLhttps://arxiv.org/abs/ 2505.24853

arXiv 2025

[31] [31]

T. G. W. Lum, O. Y . Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration.arXiv preprint arXiv:2504.12609, 2025

arXiv 2025

[32] [32]

Z. Gu, Y . Chen, Z. Chai, A. Cueva, T. Nguyen, Y . Wu, H. Xue, M. Kim, I. Legene, F. Liu, M. Kim, A. Barula, Y . Chen, and Y . Zhao. Refine-dp: Diffusion policy fine-tuning for hu- manoid loco-manipulation via reinforcement learning, 2026. URLhttps://arxiv.org/ abs/2603.13707

arXiv 2026

[33] [33]

C. Pan, C. Wang, H. Qi, Z. Liu, H. Bharadhwaj, A. Sharma, T. Wu, G. Shi, J. Malik, and F. Hogan. Spider: Scalable physics-informed dexterous retargeting, 2026. URLhttps: //arxiv.org/abs/2511.09484

arXiv 2026

[34] [34]

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction, 2025. URLhttps://arxiv.org/abs/2509. 26633

2025

[35] [35]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. In2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026

2026

[36] [36]

C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies, 2025. URL https://arxiv.org/abs/2509.17759

arXiv 2025

[37] [37]

G. Li, Y . Lyu, Z. Liu, C. Hou, Y . Xu, J. Zhang, and S. Zhang. H2r: A human-to-robot data augmentation for robot pre-training from videos.arXiv preprint arXiv:2505.11920, 2025

arXiv 2025

[38] [38]

T. Xu, Z. Chen, L. Wu, H. Lu, Y . Chen, L. Jiang, B. Liu, and Y . Chen. Motion dreamer: Realizing physically coherent video generation through scene-aware motion reasoning.arXiv preprint arXiv:2412.00547, 2024

arXiv 2024

[39] [39]

D. Geng, C. Herrmann, J. Hur, F. Cole, S. Zhang, T. Pfaff, T. Lopez-Guevara, C. Doersch, Y . Aytar, M. Rubinstein, et al. Motion prompting: Controlling video generation with motion trajectories.arXiv preprint arXiv:2412.02700, 2024

arXiv 2024

[40] [40]

Zhang, H.-X

T. Zhang, H.-X. Yu, R. Wu, B. Y . Feng, C. Zheng, N. Snavely, J. Wu, and W. T. Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. InEuropean Conference on Computer Vision, pages 388–406. Springer, 2024

2024

[41] [41]

Zhang, C

K. Zhang, C. Xiao, Y . Mei, J. Xu, and V . M. Patel. Think before you diffuse: Llms-guided physics-aware video generation.arXiv preprint arXiv:2505.21653, 2025

arXiv 2025

[42] [42]

P. Yang, H. Ci, Y . Song, and M. Z. Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale, 2025. URLhttps://arxiv.org/abs/2512.04537

arXiv 2025

[43] [43]

Jiang, Z

Z. Jiang, Z. Han, C. Mao, J. Zhang, Y . Pan, and Y . Liu. Vace: All-in-one video creation and editing, 2025. URLhttps://arxiv.org/abs/2503.07598

Pith/arXiv arXiv 2025

[44] [44]

Patel, S

S. Patel, S. Mohan, H. Mai, U. Jain, S. Lazebnik, and Y . Li. Robotic manipulation by imitat- ing generated videos without physical demonstrations, 2025. URLhttps://arxiv.org/ abs/2507.00990

Pith/arXiv arXiv 2025

[45] [45]

T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human in- teractions for in-the-wild robot policies, 2025. URLhttps://arxiv.org/abs/2505. 07813

2025

[46] [46]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y . Lu, and X. Wang. Egovla: Learning vision-language-action models from egocentric human videos, 2025. URLhttps://arxiv.org/abs/2507.12440

Pith/arXiv arXiv 2025

[47] [47]

B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero- shot stereo matching, 2025. URLhttps://arxiv.org/abs/2501.09898

arXiv 2025

[48] [48]

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C.-Y . Wu, R. Girshick, P. Doll ´ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations (ICLR), 2025

2025

[49] [49]

B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17868–17879, 2024

2024

[50] [50]

K. Zakka. Mink: Python inverse kinematics based on MuJoCo, Feb. 2026. URLhttps: //github.com/kevinzakka/mink

2026

[51] [51]

Schulman, F

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017

[52] [52]

Schwarke, M

C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research, 2025. URLhttps://arxiv.org/abs/2509.10771

arXiv 2025

[53] [53]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement – residual rl for precise assembly, 2024. URLhttps://arxiv.org/abs/2407.16677

arXiv 2024

[54] [54]

Y . Chen, C. Wang, L. Fei-Fei, and C. K. Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation, 2023. URLhttps://arxiv.org/abs/2309.00987

arXiv 2023

[55] [55]

T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y . Yuan, X. Da, F. Casta ˜neda, S. Sastry, C. Liu, G. Shi, L. Fan, and Y . Zhu. Viral: Visual sim-to-real at scale for humanoid loco- manipulation, 2025. URLhttps://arxiv.org/abs/2511.15200

arXiv 2025

[56] [56]

Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models, 2024. URLhttps://arxiv.org/abs/2310.12931

Pith/arXiv arXiv 2024

[57] [57]

P. Yin, T. Westenbroek, Z. Zhang, J. Tran, I. Dagnino, E. Shilamkar, N. Mbiziwo-Tiapo, S. Bagaria, X. Liu, G. Mullins, A. Kolobov, and A. Gupta. Emergent dexterity via diverse resets and large-scale reinforcement learning, 2026. URLhttps://arxiv.org/abs/ 2603.15789

arXiv 2026

[58] [58]

Cheng and D

S. Cheng and D. Xu. League: Guided skill learning and abstraction for long-horizon manipu- lation, 2023. URLhttps://arxiv.org/abs/2210.12631

arXiv 2023

[59] [59]

T. Yu, R. Feng, R. Feng, J. Liu, X. Jin, W. Zeng, and Z. Chen. Inpaint anything: Segment anything meets image inpainting, 2023. URLhttps://arxiv.org/abs/2304.06790

arXiv 2023

[60] [60]

L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heteroge- neous pre-trained transformers, 2024. URLhttps://arxiv.org/abs/2409.20537

arXiv 2024

[61] [61]

Y . Liu, H. Yang, X. Si, L. Liu, Z. Li, Y . Zhang, Y . Liu, and L. Yi. Taco: Benchmarking generalizable bimanual tool-action-object understanding.arXiv preprint arXiv:2401.08399, 2024

arXiv 2024

[62] [62]

E. Parzen. On estimation of a probability density function and mode.The annals of mathemat- ical statistics, 33(3):1065–1076, 1962

1962

[63] [63]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

Pith/arXiv arXiv 2025

[64] [64]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InPro- ceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016

2016

[65] [65]

Simonyan and A

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recog- nition. InInternational Conference on Learning Representations (ICLR), 2015

2015

[66] [66]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research (TMLR), 2024

2024

[67] [67]

S. D. Team, X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, A. Lin, J. Liu, Z. Ma, A. Sagar, B. Song, X. Wang, J. Yang, B. Zhang, P. Doll´ar, G. Gkioxari, M. Feiszli, and J. Malik. Sam 3d: 3dfy anything in images, 2025. URL https://arxiv.org/abs/2511.16624

Pith/arXiv arXiv 2025

[68] [68]

On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12(1):112, 2015

Balasubramanian, Sivakumar and Melendez-Calderon, Alejandro and Roby-Brami, Agnes and Burdet, Etienne. On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12(1):112, 2015. doi:10.1186/s12984-015-0090-9. URLhttps://doi. org/10.1186/s12984-015-0090-9. Appendix A Hardware and Data Preprocess The overall data collection and ...

work page doi:10.1186/s12984-015-0090-9 2015