
arxiv: 2605.16257 · v1 · pith:JSMCGZ5Hnew · submitted 2026-05-15 · 💻 cs.RO

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Pith reviewed 2026-05-20 17:44 UTC · model grok-4.3

classification 💻 cs.RO
keywords dexterous manipulation · benchmark · MuJoCo · robotic hands · tool use · bimanual coordination · policy evaluation · long-horizon tasks

The pith

DexJoCo introduces 11 tasks to benchmark dexterous hands on tool use and coordination that current benchmarks overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish a standardized evaluation framework for dexterous robotic manipulation by creating DexJoCo, a MuJoCo-based benchmark with 11 tasks focused on tool-use, bimanual coordination, long-horizon execution, and reasoning. Existing benchmarks do not sufficiently test capabilities that set dexterous hands apart from parallel grippers, slowing systematic progress in robot learning. The work supplies 1.1K trajectories, domain randomization support, and pipelines for testing modern policies under visual, dynamics, and multi-task conditions. If the benchmark succeeds in its design, it would expose recurring policy weaknesses and steer research toward more capable dexterous systems.

Core claim

We present DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation on MuJoCo, comprising 11 functionally grounded tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning. We develop a low-cost data collection system and collect 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. We benchmark modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation, and identify several important insights and common limitations of current policies in dexterous manipulation.

What carries the argument

The DexJoCo benchmark itself: 11 tasks, 1.1K trajectories, and evaluation pipelines that target tool-use, bimanual coordination, long-horizon execution, and reasoning, with support for domain randomization.

If this is right

  • Policies trained or evaluated with multi-task and action-head adaptation settings can be compared directly on the provided tasks.
  • Domain randomization in visual and dynamics parameters serves as a test for robustness before real-world transfer.
  • Identified limitations in long-horizon execution and reasoning become concrete targets for new algorithm development.
  • The low-cost data collection system enables scalable expansion of trajectory datasets for further training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to measure sim-to-real gaps by deploying the same tasks on physical dexterous hands.
  • Insights on common policy failures might inform new hierarchical or planning-based architectures for manipulation.
  • Similar task suites could be developed for other robot embodiments to enable cross-platform comparisons.

Load-bearing premise

The 11 tasks and collected trajectories sufficiently capture the unique manipulation capabilities of dexterous hands compared to parallel grippers and provide a representative basis for identifying policy limitations.

What would settle it

An experiment in which current state-of-the-art dexterous policies achieve near-perfect success rates on all 11 tasks without task-specific adaptations, or in which parallel-gripper policies match dexterous performance, would undermine the benchmark's claim to reveal distinctive challenges.

Figures

Figures reproduced from arXiv: 2605.16257 by Boyuan Zheng, Gang Wang, Hanwen Wang, He Lin, He Wang, Hongsheng Li, Lue Fan, Rongtao Xu, Siyuan Huang, Tieniu Tan, Weizhi Zhao, Xiangyu Wang, Yao Mu, Zhaoxiang Zhang.

Figure 1
Overview of DexJoCo. DexJoCo is a dexterous manipulation benchmark with a toolkit for data collection and policy evaluation, covering tool-use, bimanual coordination, long-horizon execution, and reasoning. It includes 11 tasks, 1.1K human demonstration trajectories, and supports trajectory replay under domain randomization for robustness evaluation.
Figure 2
DexJoCo pipeline. 3D assets are first imported into MuJoCo, where structured success conditions are defined based on object poses, articulated joint states, contact conditions, and temporal constraints. Human demonstrations are collected through the teleoperation system, with actions directly recorded as robot position control commands. Replay-based visual augmentation can optionally be applied to the coll…
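The structured success conditions named in this caption (object poses, articulated joint states, contact conditions, temporal constraints) can be pictured as a conjunction of predicates that must hold for several consecutive steps. The sketch below is only an illustration under stated assumptions, not DexJoCo's actual API: the `Snapshot` fields, the thresholds, and the `HoldChecker` class are hypothetical names invented for this example.

```python
from dataclasses import dataclass
import math


@dataclass
class Snapshot:
    """Hypothetical per-step summary of simulator state (not a DexJoCo type)."""
    object_pos: tuple   # (x, y, z) of the manipulated object, in metres
    joint_angle: float  # state of one articulated joint, in radians
    in_contact: bool    # whether the required contact pair is touching


def pose_ok(s, target, tol=0.02):
    # Object-pose condition: within `tol` metres of the target position.
    return math.dist(s.object_pos, target) <= tol


def joint_ok(s, min_angle=1.0):
    # Articulated-joint condition: e.g. a lid opened past `min_angle`.
    return s.joint_angle >= min_angle


class HoldChecker:
    """Temporal constraint: every condition must hold for `hold_steps` consecutive steps."""

    def __init__(self, target, hold_steps=3):
        self.target = target
        self.hold_steps = hold_steps
        self.streak = 0

    def update(self, s):
        ok = pose_ok(s, self.target) and joint_ok(s) and s.in_contact
        # Any violated condition resets the streak.
        self.streak = self.streak + 1 if ok else 0
        return self.streak >= self.hold_steps


checker = HoldChecker(target=(0.5, 0.0, 0.1), hold_steps=3)
good = Snapshot((0.5, 0.0, 0.11), 1.2, True)
bad = Snapshot((0.5, 0.0, 0.30), 1.2, True)
results = [checker.update(s) for s in [good, good, bad, good, good, good]]
print(results)  # [False, False, False, False, False, True]
```

Resetting the streak on any violation means a brief lucky configuration does not count as success, which is one plausible reading of a "temporal constraint"; the benchmark's own definition may differ.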
Figure 3
Human demonstration data collection system. The left figure shows the overall teleoperation system. A Rokoko glove is used to capture hand poses, while an HTC Vive tracker is employed to track the wrist pose. The right figure shows that a retargeting mapping is trained to convert human fingertip poses into joint configurations of the Allegro hand.
Figure 4
Task design in DexJoCo. The top panel illustrates the task environment design, showing the initial state of each task. The bottom panel presents the visual and interactive properties of the task assets. Formulation: Each task in DexJoCo is defined by a set of interactive objects and task goals: $T = (O, G)$, where $O = \{o_1, o_2, \ldots, o_m\}$ denotes the set of interactive objects in the scene. The task goal is f…
Figure 5
Performance evaluation and failure mode analysis. DP denotes Diffusion Policy, with -T and -C representing Transformer and CNN-based architectures, respectively. (a) Comparison of average success rates across different baselines under the “rand-obj” (…
Figure 6
Visualization of failure cases in typical tasks.
Figure 7
Output distribution of π0.5 (trained on single digits 1-5) across instructions on the Unlock iPad.
Figure 8
Domain randomization settings. The left panel shows the default scene configuration, while the right panel illustrates the effects of domain randomization, including variations in table height, third-person camera viewpoints, lighting conditions, and tabletop textures. [Plot: Randomized Camera Poses on Spherical Shells, iso view; axes x, y, z; legend: Single-arm (n=50), Dual-arm (n=50), Scene center …]
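The randomized third-person viewpoints on spherical shells mentioned here can be approximated with a few lines of sampling code. This is a rough sketch under assumed values, not the toolkit's implementation: the function names, the radius and elevation ranges, and the scene centre are all hypothetical, and only the count n=50 is taken from the plot legend.

```python
import math
import random


def sample_camera_positions(center, r_min, r_max, n, seed=0,
                            elev_deg=(20.0, 70.0)):
    """Sample `n` camera positions on a spherical shell around `center`.

    The radius and elevation ranges are placeholders, not DexJoCo's values.
    """
    rng = random.Random(seed)
    cx, cy, cz = center
    positions = []
    for _ in range(n):
        r = rng.uniform(r_min, r_max)              # distance within the shell
        az = rng.uniform(0.0, 2.0 * math.pi)       # azimuth around the scene
        el = math.radians(rng.uniform(*elev_deg))  # elevation above the centre
        positions.append((
            cx + r * math.cos(el) * math.cos(az),
            cy + r * math.cos(el) * math.sin(az),
            cz + r * math.sin(el),
        ))
    return positions


def look_at_direction(cam_pos, center):
    # Unit vector pointing from the camera towards the scene centre.
    d = [c - p for p, c in zip(cam_pos, center)]
    norm = math.sqrt(sum(x * x for x in d))
    return tuple(x / norm for x in d)


center = (0.0, 0.0, 0.5)  # assumed scene centre
cams = sample_camera_positions(center, r_min=1.2, r_max=1.8, n=50)
dists = [math.dist(p, center) for p in cams]
print(len(cams), round(min(dists), 3), round(max(dists), 3))
```

Sampling the elevation angle uniformly does not give a uniform density over the sphere's surface; that simplification is deliberate here, since the caption only indicates that poses lie on shells around the scene centre.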
Figure 9
Preset third-person camera poses used for visual randomization.
Original abstract

Achieving human-level manipulation requires dexterous robotic hands capable of complex object interactions. Advancing such capabilities further demands standardized benchmarks for systematic evaluation. However, existing dexterous benchmarks lack tasks that reflect the unique manipulation capabilities of dexterous hands over parallel grippers, as well as comprehensive evaluation pipelines. In this paper, we present DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, comprising 11 functionally grounded tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning. We develop a low-cost data collection system and collect 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. We benchmark modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation. Through extensive empirical analysis, we identify several important insights and common limitations of current policies in dexterous manipulation, highlighting key challenges for future research in dexterous hand robot learning. Project page available at: https://dexjoco.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation on MuJoCo. It comprises 11 functionally grounded tasks evaluating tool-use, bimanual coordination, long-horizon execution, and reasoning. The authors describe a low-cost data collection system and the collection of 1.1K trajectories with domain randomization support. They benchmark modern models under settings including visual and dynamics randomization, multi-task training, and action-head adaptation, claiming to identify important insights and common limitations of current policies in dexterous manipulation.

Significance. If the tasks are well-specified and the empirical benchmarks are reproducible with clear quantitative results, DexJoCo could provide a valuable standardized platform for dexterous manipulation research, addressing gaps in prior benchmarks that fail to highlight capabilities unique to dexterous hands versus parallel grippers. The toolkit, trajectories, and randomization features add practical utility for the community.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Task Definitions): The abstract and task descriptions provide no quantitative results, error analysis, or full task definitions (e.g., success criteria, horizon lengths, or object properties), limiting independent verification of the claimed insights and the assertion that these tasks capture unique dexterous capabilities.
  2. [§4] §4 (Data Collection): The low-cost data collection system and 1.1K trajectories are described without metrics on collection quality, human demonstration fidelity, or direct comparisons to parallel-gripper baselines, undermining the claim that they form a representative basis for identifying policy limitations.
minor comments (2)
  1. [§5] §5 (Benchmarking): Tables or figures summarizing performance across randomization settings and multi-task training would benefit from explicit error bars and statistical significance tests to strengthen the empirical analysis.
  2. [Implementation Details] The project page link is provided but the manuscript should include a brief summary of available code, environment files, and trajectory formats to aid reproducibility.
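The request for explicit error bars on success rates can be made concrete with a binomial confidence interval per task and setting. The sketch below uses the Wilson score interval as one reasonable choice; it is not the paper's analysis method, and the rollout counts are made-up placeholders rather than reported results.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical rollout counts, not results from the paper.
lo, hi = wilson_interval(successes=15, trials=50)
print(f"success rate {15 / 50:.1%}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Compared with a plain normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when success rates are near 0% or 100%, which is common for hard manipulation tasks evaluated over a few dozen rollouts.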

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to improve clarity, completeness, and reproducibility where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Task Definitions): The abstract and task descriptions provide no quantitative results, error analysis, or full task definitions (e.g., success criteria, horizon lengths, or object properties), limiting independent verification of the claimed insights and the assertion that these tasks capture unique dexterous capabilities.

    Authors: We agree that expanded details in §3 would strengthen independent verification. In the revised manuscript, we have added explicit success criteria, horizon lengths, object properties (including masses, sizes, and friction coefficients), and initial error analysis for each of the 11 tasks. The abstract has been updated with a concise summary of key quantitative benchmark findings to better contextualize the identified insights. Full quantitative results, including policy performance tables and error breakdowns, remain in §§5–6 per standard practice for benchmark papers, but the additions to §3 now directly support the claim that these tasks highlight dexterous capabilities (e.g., in-hand reorientation and bimanual tool use) beyond parallel-gripper limits. revision: yes

  2. Referee: [§4] §4 (Data Collection): The low-cost data collection system and 1.1K trajectories are described without metrics on collection quality, human demonstration fidelity, or direct comparisons to parallel-gripper baselines, undermining the claim that they form a representative basis for identifying policy limitations.

    Authors: We acknowledge that additional quantitative metrics would improve the section. The revised §4 now includes metrics on collection quality, such as human demonstration success rates (averaged across tasks) and trajectory fidelity measures (e.g., joint-angle variance and contact consistency) with and without domain randomization. Direct side-by-side data collection comparisons to parallel-gripper baselines were not conducted, as the benchmark and toolkit are designed specifically for dexterous hands; however, our policy evaluations in §6 provide indirect evidence through performance gaps when adapting dexterous policies versus simpler gripper equivalents. We maintain that the 1.1K trajectories, collected with the described low-cost system and randomization support, form a representative basis for the observed policy limitations in dexterous settings. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or fitted predictions

full rationale

This is an empirical benchmark and toolkit paper that defines 11 tasks, collects 1.1K trajectories via a low-cost system, applies domain randomization, and evaluates modern policies under various settings. No equations, predictions, or first-principles derivations are claimed; the central contributions rest on external data collection and model benchmarking rather than any self-referential fitting or reduction of outputs to inputs. The work is self-contained against external benchmarks and does not invoke self-citations or ansatzes as load-bearing elements for any claimed result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that the chosen tasks reflect unique dexterous capabilities and that the collected trajectories enable meaningful robustness assessment. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Existing dexterous benchmarks lack tasks that reflect the unique manipulation capabilities of dexterous hands over parallel grippers.
    Directly stated in the abstract as the motivation for the new benchmark.

pith-pipeline@v0.9.0 · 5758 in / 1238 out tokens · 58077 ms · 2026-05-20T17:44:27.765629+00:00 · methodology


