pith. sign in

arxiv: 2606.12604 · v1 · pith:DLGNKD2Bnew · submitted 2026-06-10 · 💻 cs.RO

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Pith reviewed 2026-06-27 09:26 UTC · model grok-4.3

classification 💻 cs.RO
keywords egocentric videosdexterous manipulationrobot data generationvisuomotor policieshuman-to-robot transferzero-shot learningaction trajectory synthesis
0
0 comments X

The pith

EgoEngine converts egocentric human manipulation videos into high-fidelity robot videos and action trajectories, allowing zero-shot visuomotor policy learning on real robots without any real-robot demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that egocentric human videos can supply the data needed for dexterous robot policies. EgoEngine replaces the human hand in each video with a robot hand while keeping the surrounding scene and timing intact, then outputs an executable sequence of robot joint commands that respects physical limits. If the conversion is reliable, large collections of everyday human videos become usable training material, bypassing the expense of recording equivalent robot demonstrations. A reader would care because current dexterous skills are bottlenecked by the small size of robot datasets. The method therefore aims to scale data generation from an existing, abundant source.

Core claim

Given an egocentric RGB video, EgoEngine produces a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, together with a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots demonstrate that policies trained exclusively on this generated data transfer successfully, constituting the first reported zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations.

What carries the argument

EgoEngine, the framework that generates robot observation videos and feasible action trajectories from egocentric human videos.

Load-bearing premise

The robot videos and action trajectories produced from human videos are accurate and physically feasible enough that policies trained only on them succeed when deployed on real robots.

What would settle it

Train a policy on EgoEngine-generated data for a given task and observe that it fails on the real robot, while an otherwise identical policy trained on real-robot demonstrations for the same task succeeds.

Figures

Figures reproduced from arXiv: 2606.12604 by Alfred Cueva, Chuye Zhang, Danfei Xu, Shuo Cheng, Woo Chul Shin, Xinchen Yin, Yangcen Liu, Yiran Yang, Zhenyang Chen.

Figure 1
Figure 1. Figure 1: EgoEngine is a scalable data engine that converts egocentric human videos into robot demonstrations. Given an egocentric human video, EgoEngine creates a digital twin and jointly pro￾duces (1) a high-fidelity, temporally consistent robot observation video and (2) executable action trajectories aligned with the video. The generated demonstrations serve as training data that can be distilled into downstream … view at source ↗
Figure 2
Figure 2. Figure 2: Adaptive MCTS-style switching opti￾mization. EgoEngine progressively selects among Replay, MPC, and RL to solve each trajectory chunk with the cheapest feasible solver. After retargeting, EgoEngine faces a long￾horizon, high-DoF trajectory optimization problem: the reference trajectory preserves the human motion prior, but must be con￾verted into executable robot actions that re￾produce the demonstrated ob… view at source ↗
Figure 4
Figure 4. Figure 4: KDE [61] of features extracted from ResNet18, VGG16, and DINOv2. Method (FD↓) ResNet18 VGG16 DINOv2 Human Video 764.5 670.2 602.9 EgoMimic 830.5 812.1 579.6 VACE 713.6 745.3 488.0 Phantom 620.0 650.8 470.6 EgoEngine 614.7 644.2 473.1 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Throughput compar￾ison: Successful demo number generated over simulation steps on 20 Aria demos across 4 tasks. Replay MPC RL (c) Brush Bowl (Right) (b) Mustard (d) Brush Bowl (Left) (a) Drawer [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of Adaptive MCTS-style mode switching across trajectory chunks on four demonstrations from the Aria and TACO datasets. Method Mustard Drawer Flower Hammer Human Video 0.00 0.10 0.00 0.00 Phantom [17] 0.00 0.05 0.00 0.00 Real Robot 0.80 0.80 0.70 0.25 EgoEngine 0.40 0.35 0.70 0.60 [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Behavioral strengths and failure modes of Ego [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video replacing human with robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes EgoEngine, a scalable framework that takes an egocentric RGB human manipulation video as input and outputs (i) a high-fidelity robot observation video in which the human is replaced by a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory subject to feasibility constraints. The central claim is that policies trained exclusively on the resulting synthetic robot data achieve zero-shot visuomotor dexterous manipulation on real robots, constituting the first such demonstration without any real-robot demonstrations; supporting experiments are stated to have been performed in simulation and on physical hardware.

Significance. If the zero-shot transfer result holds with the reported fidelity and without hidden real-robot data, the work would materially advance scalable dexterous manipulation learning by converting abundant egocentric human video into robot-executable data, directly addressing the data-collection bottleneck highlighted in the abstract.

major comments (1)
  1. [Abstract] Abstract: the claim that EgoEngine 'demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations' is load-bearing for the entire contribution, yet the provided text contains no quantitative results (success rates, number of trials, comparison to real-demonstration baselines, or ablations on generation fidelity). Without these data it is impossible to verify that the generated trajectories are distributionally close enough for real-robot transfer, which is the weakest assumption identified in the stress-test note.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for quantitative support in the abstract for our central claim. We address this point directly below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that EgoEngine 'demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations' is load-bearing for the entire contribution, yet the provided text contains no quantitative results (success rates, number of trials, comparison to real-demonstration baselines, or ablations on generation fidelity). Without these data it is impossible to verify that the generated trajectories are distributionally close enough for real-robot transfer, which is the weakest assumption identified in the stress-test note.

    Authors: We agree that the abstract should include key quantitative results to make the zero-shot transfer claim verifiable at a glance. The full manuscript reports these metrics in the experiments section (e.g., real-robot success rates across multiple tasks and trials, with comparisons to real-demonstration baselines). We will revise the abstract to incorporate representative numbers such as average success rate, number of trials, and a brief note on fidelity ablations. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents EgoEngine as a framework that converts egocentric human videos into robot observations and action trajectories, with the central claim supported by experiments in simulation and on real robots rather than by any internal derivation, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems are referenced in the abstract or description that reduce outputs to inputs by construction. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5726 in / 894 out tokens · 17846 ms · 2026-06-27T09:26:13.635871+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments

    cs.RO 2026-06 unverdicted novelty 6.0

    Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.

Reference graph

Works this paper leans on

68 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    O’Neill, A

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  2. [2]

    Khazatsky, K

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

  3. [3]

    Z.-H. Yin, C. Wang, L. Pineda, F. Hogan, K. Bodduluri, A. Sharma, P. Lancaster, I. Prasad, M. Kalakrishnan, J. Malik, M. Lambeta, T. Wu, P. Abbeel, and M. Mukadam. Dexterity- gen: Foundation controller for unprecedented dexterity, 2025. URLhttps://arxiv. org/abs/2502.04307

  4. [4]

    M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song. Dexumi: Using hu- man hand as the universal manipulation interface for dexterous manipulation.arXiv preprint arXiv:2505.21864, 2025

  5. [5]

    Punamiya, S

    R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Li- conti, L. Y . Zhu, P. Aphiwetsa, B. Li, A. Cheluva, P. Kuppili, Y . Liu, D. Patel, A. Gao, H.-Y . Chung, R. Co, R. Zbizika, J. Liu, X. Xu, H. Xiong, G. Chen, S. Oliani, C. Yang, X. Wang, J. Fort, R. Newcombe, J. Gao, J. Chong, G. Matsuda, A. Doriwala, M. Polle- fe...

  6. [6]

    Grauman, A

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  7. [7]

    Hoque, P

    R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. URLhttps://arxiv.org/abs/ 2505.11709

  8. [8]

    Zheng, D

    R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Casta ˜neda, F. Hu, Y . L. Tan, L. Fu, T. Darrell, F. Huang, Y . Zhu, D. Xu, and L. Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data, 2026. URLhttps://arxiv.org/abs/2602.16710

  9. [9]

    Engel, K

    J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith, C. Peng, C. Sweeney, C. Wilson, D. Barnes, D. DeTone, D. Caruso, D. Valleroy, D. Ginjupalli, D. Frost, E. Miller, E. Mueggler, E. Oleinik, F. Zhang, G. Soma- sundaram, G. Solaira, H. Lanaras, H. Howard-Jenkins, H. Tang, H. J. Kim, J. Rivera, J...

  10. [10]

    Kareer, D

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video.arXiv preprint arXiv:2410.24221, 2024

  11. [11]

    J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos.arXiv preprint arXiv:2503.23877, 2025

  12. [12]

    Y . Liu, W. C. Shin, Y . Han, Z. Chen, H. Ravichandar, and D. Xu. Immimic: Cross-domain imitation from human videos via mapping and interpolation, 2025. URLhttps://arxiv. org/abs/2509.10952

  13. [13]

    Punamiya, D

    R. Punamiya, D. Patel, P. Aphiwetsa, P. Kuppili, L. Y . Zhu, S. Kareer, J. Hoffman, and D. Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data, 2025. URLhttps://arxiv.org/abs/2509.19626

  14. [14]

    L. Y . Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. Emma: Scaling mobile manipulation via egocentric human data, 2025. URLhttps:// arxiv.org/abs/2509.04443

  15. [15]

    C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play, 2023. URLhttps://arxiv. org/abs/2302.12422

  16. [16]

    V . Liu, A. Adeniji, H. Zhan, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses.arXiv preprint arXiv:2505.20290, 2025

  17. [17]

    Lepert, J

    M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. InConference on Robot Learning (CoRL), Seoul, Korea, 2025

  18. [18]

    Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. InRobotics: Science and Systems, 2023

  19. [19]

    C. Kong, Y . Cho, W. Jung, I. Wibowo, P. Shinde, S. Vinodh-Sangeetha, L. K. Chung, Z. Chen, A. Mattei, A. Nidumukkala, A. Elias, D. Xu, T. Higgins, and S. Kousik. A closed- form geometric retargeting solver for upper body humanoid robot teleoperation, 2026. URL https://arxiv.org/abs/2602.01632

  20. [20]

    Cheng, J

    X. Cheng, J. Li, S. Yang, G. Yang, and X. Wang. Open-television: Teleoperation with immer- sive active visual feedback, 2024. URLhttps://arxiv.org/abs/2407.01512

  21. [21]

    Z.-H. Yin, C. Wang, L. Pineda, K. Bodduluri, T. Wu, P. Abbeel, and M. Mukadam. Geometric retargeting: A principled, ultrafast neural hand retargeting algorithm, 2025. URLhttps: //arxiv.org/abs/2503.07541

  22. [22]

    R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, G. Yang, J. Zhang, S. Yi, G. Shi, and X. Wang. Humanoid policy human policy,

  23. [23]

    URLhttps://arxiv.org/abs/2503.13441

  24. [24]

    J. Yin, H. Qi, Y . Wi, S. Kundu, M. Lambeta, W. Yang, C. Wang, T. Wu, J. Malik, and T. Helle- brekers. Osmo: Open-source tactile glove for human-to-robot skill transfer, 2025. URL https://arxiv.org/abs/2512.08920

  25. [25]

    Jiang, Y

    Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning, 2025. URLhttps://arxiv.org/abs/2410.24185

  26. [26]

    J. Mu, S. Yang, Y . Bao, H. Bae, T. Wei, L. Xu, B. Li, H. Xu, and J. Pang. Deximit: Learning bimanual dexterous manipulation from monocular human videos, 2026. URLhttps:// arxiv.org/abs/2602.10105

  27. [27]

    W. Wan, J. Fu, X. Yuan, Y . Zhu, and H. Su. Lodestar: Long-horizon dexterity via synthetic data augmentation from human demonstrations, 2025. URLhttps://arxiv.org/abs/ 2508.17547

  28. [28]

    S. Zhao, X. Zhu, Y . Chen, C. Li, X. Zhang, M. Ding, and M. Tomizuka. Dexh2r: Task-oriented dexterous manipulation from human to robots.arXiv preprint arXiv:2411.04428, 2024

  29. [29]

    T. Lin, K. Sachdev, L. Fan, J. Malik, and Y . Zhu. Sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids, 2025. URLhttps://arxiv.org/ abs/2502.20396

  30. [30]

    Mandi, Y

    Z. Mandi, Y . Hou, D. Fox, Y . Narang, A. Mandlekar, and S. Song. Dexmachina: Functional retargeting for bimanual dexterous manipulation, 2025. URLhttps://arxiv.org/abs/ 2505.24853

  31. [31]

    T. G. W. Lum, O. Y . Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration.arXiv preprint arXiv:2504.12609, 2025

  32. [32]

    Z. Gu, Y . Chen, Z. Chai, A. Cueva, T. Nguyen, Y . Wu, H. Xue, M. Kim, I. Legene, F. Liu, M. Kim, A. Barula, Y . Chen, and Y . Zhao. Refine-dp: Diffusion policy fine-tuning for hu- manoid loco-manipulation via reinforcement learning, 2026. URLhttps://arxiv.org/ abs/2603.13707

  33. [33]

    C. Pan, C. Wang, H. Qi, Z. Liu, H. Bharadhwaj, A. Sharma, T. Wu, G. Shi, J. Malik, and F. Hogan. Spider: Scalable physics-informed dexterous retargeting, 2026. URLhttps: //arxiv.org/abs/2511.09484

  34. [34]

    L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction, 2025. URLhttps://arxiv.org/abs/2509. 26633

  35. [35]

    Lepert, J

    M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. In2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026

  36. [36]

    C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies, 2025. URL https://arxiv.org/abs/2509.17759

  37. [37]

    G. Li, Y . Lyu, Z. Liu, C. Hou, Y . Xu, J. Zhang, and S. Zhang. H2r: A human-to-robot data augmentation for robot pre-training from videos.arXiv preprint arXiv:2505.11920, 2025

  38. [38]

    T. Xu, Z. Chen, L. Wu, H. Lu, Y . Chen, L. Jiang, B. Liu, and Y . Chen. Motion dreamer: Realizing physically coherent video generation through scene-aware motion reasoning.arXiv preprint arXiv:2412.00547, 2024

  39. [39]

    D. Geng, C. Herrmann, J. Hur, F. Cole, S. Zhang, T. Pfaff, T. Lopez-Guevara, C. Doersch, Y . Aytar, M. Rubinstein, et al. Motion prompting: Controlling video generation with motion trajectories.arXiv preprint arXiv:2412.02700, 2024

  40. [40]

    Zhang, H.-X

    T. Zhang, H.-X. Yu, R. Wu, B. Y . Feng, C. Zheng, N. Snavely, J. Wu, and W. T. Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. InEuropean Conference on Computer Vision, pages 388–406. Springer, 2024

  41. [41]

    Zhang, C

    K. Zhang, C. Xiao, Y . Mei, J. Xu, and V . M. Patel. Think before you diffuse: Llms-guided physics-aware video generation.arXiv preprint arXiv:2505.21653, 2025

  42. [42]

    P. Yang, H. Ci, Y . Song, and M. Z. Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale, 2025. URLhttps://arxiv.org/abs/2512.04537

  43. [43]

    Jiang, Z

    Z. Jiang, Z. Han, C. Mao, J. Zhang, Y . Pan, and Y . Liu. Vace: All-in-one video creation and editing, 2025. URLhttps://arxiv.org/abs/2503.07598

  44. [44]

    Patel, S

    S. Patel, S. Mohan, H. Mai, U. Jain, S. Lazebnik, and Y . Li. Robotic manipulation by imitat- ing generated videos without physical demonstrations, 2025. URLhttps://arxiv.org/ abs/2507.00990

  45. [45]

    T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human in- teractions for in-the-wild robot policies, 2025. URLhttps://arxiv.org/abs/2505. 07813

  46. [46]

    R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y . Lu, and X. Wang. Egovla: Learning vision-language-action models from egocentric human videos, 2025. URLhttps://arxiv.org/abs/2507.12440

  47. [47]

    B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero- shot stereo matching, 2025. URLhttps://arxiv.org/abs/2501.09898

  48. [48]

    N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C.-Y . Wu, R. Girshick, P. Doll ´ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations (ICLR), 2025

  49. [49]

    B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17868–17879, 2024

  50. [50]

    K. Zakka. Mink: Python inverse kinematics based on MuJoCo, Feb. 2026. URLhttps: //github.com/kevinzakka/mink

  51. [51]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  52. [52]

    Schwarke, M

    C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research, 2025. URLhttps://arxiv.org/abs/2509.10771

  53. [53]

    Ankile, A

    L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement – residual rl for precise assembly, 2024. URLhttps://arxiv.org/abs/2407.16677

  54. [54]

    Y . Chen, C. Wang, L. Fei-Fei, and C. K. Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation, 2023. URLhttps://arxiv.org/abs/2309.00987

  55. [55]

    T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y . Yuan, X. Da, F. Casta ˜neda, S. Sastry, C. Liu, G. Shi, L. Fan, and Y . Zhu. Viral: Visual sim-to-real at scale for humanoid loco- manipulation, 2025. URLhttps://arxiv.org/abs/2511.15200

  56. [56]

    Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models, 2024. URLhttps://arxiv.org/abs/2310.12931

  57. [57]

    P. Yin, T. Westenbroek, Z. Zhang, J. Tran, I. Dagnino, E. Shilamkar, N. Mbiziwo-Tiapo, S. Bagaria, X. Liu, G. Mullins, A. Kolobov, and A. Gupta. Emergent dexterity via diverse resets and large-scale reinforcement learning, 2026. URLhttps://arxiv.org/abs/ 2603.15789

  58. [58]

    Cheng and D

    S. Cheng and D. Xu. League: Guided skill learning and abstraction for long-horizon manipu- lation, 2023. URLhttps://arxiv.org/abs/2210.12631

  59. [59]

    T. Yu, R. Feng, R. Feng, J. Liu, X. Jin, W. Zeng, and Z. Chen. Inpaint anything: Segment anything meets image inpainting, 2023. URLhttps://arxiv.org/abs/2304.06790

  60. [60]

    L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heteroge- neous pre-trained transformers, 2024. URLhttps://arxiv.org/abs/2409.20537

  61. [61]

    Y . Liu, H. Yang, X. Si, L. Liu, Z. Li, Y . Zhang, Y . Liu, and L. Yi. Taco: Benchmarking generalizable bimanual tool-action-object understanding.arXiv preprint arXiv:2401.08399, 2024

  62. [62]

    E. Parzen. On estimation of a probability density function and mode.The annals of mathemat- ical statistics, 33(3):1065–1076, 1962

  63. [63]

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

  64. [64]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InPro- ceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016

  65. [65]

    Simonyan and A

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recog- nition. InInternational Conference on Learning Representations (ICLR), 2015

  66. [66]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research (TMLR), 2024

  67. [67]

    S. D. Team, X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, A. Lin, J. Liu, Z. Ma, A. Sagar, B. Song, X. Wang, J. Yang, B. Zhang, P. Doll´ar, G. Gkioxari, M. Feiszli, and J. Malik. Sam 3d: 3dfy anything in images, 2025. URL https://arxiv.org/abs/2511.16624

  68. [68]

    On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12(1):112, 2015

    Balasubramanian, Sivakumar and Melendez-Calderon, Alejandro and Roby-Brami, Agnes and Burdet, Etienne. On the analysis of movement smoothness.Journal of NeuroEngineering and Rehabilitation, 12(1):112, 2015. doi:10.1186/s12984-015-0090-9. URLhttps://doi. org/10.1186/s12984-015-0090-9. Appendix A Hardware and Data Preprocess The overall data collection and ...