
arxiv: 2606.09798 · v1 · pith:MGFHFCU5new · submitted 2026-06-08 · 💻 cs.RO

SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps

Pith reviewed 2026-06-27 16:20 UTC · model grok-4.3

classification 💻 cs.RO
keywords dexterous grasping · human-like manipulation · synthetic pre-grasps · grasp retargeting · force-closure optimization · bimanual robotics · prehensile tasks · robot simulation

The pith

SynManDex turns synthetic human pre-grasps into stable, human-like grasps on complex robotic hands by retargeting and contact optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline that samples object-conditioned digital human pre-grasps, retargets the poses to a dexterous robot embodiment, optimizes force-closure contacts, and filters trajectories that satisfy all intermediate checks. This produces grasp keyframes that support both simple lift tasks and more complex prehensile actions such as pouring tea or playing a flute. A sympathetic reader would care because direct copying of human hand poses usually breaks under differences in finger count, joint limits, and reachability, and the method claims to bridge that gap while keeping both stability and perceived naturalness high.

Core claim

SynManDex samples synthetic human pre-grasps as affordance-aware proposals, retargets them to robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits only trajectories that pass every step; the resulting grasps achieve 86.4 percent grasp stability and 4.67 out of 5 human-likeness on a 36-DOF bimanual platform, with 80.7 percent simulation success and 25 out of 30 real-robot successes.

What carries the argument

The four-stage pipeline of sampling object-conditioned human pre-grasps, retargeting to robot poses, force-closure contact optimization, and multi-stage trajectory filtering.
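The gating logic of the four stages can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `Candidate`, `run_pipeline`, and the three stand-in stage functions are hypothetical names, and the sampled human pre-grasps are simply passed in as `proposals`. The grounded details are the admission rule (a candidate survives only if every stage passes) and the 10 cm lift threshold mentioned in the Figure 5 caption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class Candidate:
    """One grasp proposal moving through the pipeline (hypothetical container)."""
    object_id: str
    data: dict = field(default_factory=dict)


# A stage returns the (possibly transformed) candidate, or None on failure.
Stage = Callable[[Candidate], Optional[Candidate]]


def run_pipeline(proposals: Iterable[Candidate],
                 stages: List[Tuple[str, Stage]]):
    """Admit only candidates that pass every stage; count rejections per stage."""
    admitted: List[Candidate] = []
    rejected: Dict[str, int] = {name: 0 for name, _ in stages}
    for cand in proposals:
        current: Optional[Candidate] = cand
        for name, stage in stages:
            current = stage(current)
            if current is None:
                rejected[name] += 1
                break
        else:  # no stage rejected the candidate
            admitted.append(current)
    return admitted, rejected


# Stand-in stages: the dictionary keys are assumptions for illustration only.
def retarget(c: Candidate) -> Optional[Candidate]:
    return c if c.data.get("reachable", True) else None


def optimize_contacts(c: Candidate) -> Optional[Candidate]:
    return c if c.data.get("force_closure", True) else None


def lift_check(c: Candidate) -> Optional[Candidate]:
    # 0.10 m mirrors the 10 cm lift test named in the Figure 5 caption.
    return c if c.data.get("lift_height_m", 0.0) >= 0.10 else None


STAGES = [("retarget", retarget),
          ("optimize", optimize_contacts),
          ("filter", lift_check)]
```

Keeping a per-stage rejection count is a design choice of this sketch: it makes it visible which gate discards the most proposals, which is the kind of failure-mode accounting the referee report asks for.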

If this is right

  • The generated keyframes directly support grasp-and-lift demonstrations on the 36-DOF platform.
  • VLM agents can compose the keyframes into multi-step tasks such as tea pouring, photo taking, and flute playing.
  • The method reports 80.7 percent success in simulation across tested objects.
  • Real-robot execution reaches 83.3 percent success on 30 trials with the bimanual dexterous system.
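The percentages quoted here follow directly from the reported counts, and a quick check also shows how much uncertainty 30 trials leave. The sketch below recomputes the two ratios and adds a Wilson score interval; the interval is a standard statistical tool applied here for illustration and is not a quantity the paper reports. The function names are hypothetical.

```python
import math


def percent(successes: int, trials: int) -> float:
    """Success rate as a percentage."""
    return 100.0 * successes / trials


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


print(round(percent(25, 30), 1))    # real-robot result: 25 of 30 trials
print(round(100 * 4.67 / 5, 1))     # human-likeness as a share of the scale maximum
low, high = wilson_interval(25, 30)
print(round(low, 3), round(high, 3))
```

With 25 successes in 30 trials the interval spans roughly 66 to 93 percent, so the real-robot figure is better read as a range than as a point estimate.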

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce reliance on expensive motion-capture datasets by substituting procedurally generated human pre-grasps.
  • Similar pipelines might transfer to other high-DOF embodiments if the retargeting and optimization stages are re-tuned for new joint limits.
  • Success on bimanual coordination tasks suggests the method implicitly handles inter-hand reachability constraints that single-hand methods often ignore.

Load-bearing premise

Synthetic human pre-grasps contain enough functional intent to remain useful after retargeting and robot-specific contact optimization without violating morphology or reachability limits.
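To make the premise concrete, here is a minimal sketch of what a purely geometric retargeting step could look like, and where morphology limits bite. It is an assumption-laden illustration, not the paper's retargeting method: the function names, the uniform `scale` factor, and the finger-name mapping are all hypothetical. Human fingertip offsets are scaled about the wrist, fingers with no robot counterpart are dropped, and joint angles are clamped to limits while recording which joints were clamped.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def retarget_keypoints(wrist: Vec3,
                       fingertips: Dict[str, Vec3],
                       scale: float,
                       finger_map: Dict[str, str]) -> Dict[str, Vec3]:
    """Scale human fingertip offsets about the wrist and rename to robot fingers.

    Human fingers missing from finger_map are dropped, which models a robot
    hand with fewer fingers than the human hand.
    """
    targets: Dict[str, Vec3] = {}
    for human_name, tip in fingertips.items():
        robot_name = finger_map.get(human_name)
        if robot_name is None:
            continue
        targets[robot_name] = tuple(w + scale * (t - w)
                                    for w, t in zip(wrist, tip))
    return targets


def clamp_to_limits(joints: Dict[str, float],
                    limits: Dict[str, Tuple[float, float]]):
    """Clamp each joint angle to its [lo, hi] range and list the clamped joints."""
    clamped: Dict[str, float] = {}
    violated: List[str] = []
    for name, q in joints.items():
        lo, hi = limits[name]
        q_clamped = min(max(q, lo), hi)
        if q_clamped != q:
            violated.append(name)
        clamped[name] = q_clamped
    return clamped, violated
```

A long `violated` list for a given proposal would be one sign that the human pre-grasp does not transfer cleanly to the target embodiment, which is the failure mode this premise has to survive.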

What would settle it

Real-robot success falling below 60 percent on the same set of manipulation tasks or average human-likeness ratings dropping below 4.0 out of 5 would falsify the central claim.
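The two thresholds above can be written as a simple check. This is only a restatement of the editorial criterion in code; the function name and argument names are hypothetical, and the thresholds (60 percent, 4.0 out of 5) are the ones stated in this section rather than values taken from the paper.

```python
def central_claim_survives(real_successes: int,
                           real_trials: int,
                           mean_human_likeness: float,
                           min_success_rate: float = 0.60,
                           min_rating: float = 4.0) -> bool:
    """Return False if either falsification threshold is crossed."""
    success_rate = real_successes / real_trials
    return (success_rate >= min_success_rate
            and mean_human_likeness >= min_rating)


# Reported values: 25 of 30 real-robot successes, 4.67 mean rating.
print(central_claim_survives(25, 30, 4.67))
```

The criterion is a conjunction: a replication that keeps human-likeness high but drops real-robot success below the threshold would still count against the claim.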

Figures

Figures reproduced from arXiv: 2606.09798 by Mingjie Zhou, Tianxing Chen, Wenwei Lin, Xiaokang Yang, Yanming Shao, Yao Mu, Yichen Chi, Zanxin Chen.

Figure 1: Human-prior-guided bimanual dexterous manipulation generated by SynManDex.
Figure 2: SynManDex pipeline. (A) An object-conditioned diffusion model samples digital human pre-grasps as affordance-aware pre-contact proposals. (B) Geometric retargeting maps human hand keypoints to robotic hand seeds, and robot-native optimization refines contacts under collision and force-closure objectives. (C) After validations, we collect dexterous manipulation trajectories for policy learning and real-robo…
Figure 3: Grasp generation stages. This figure illustrates the generated digital human pre-grasp, robotic pre-grasp, and refined grasps. Left: digital human pre-grasps encode human-like approach direction and coarse finger coordination. Middle: geometric retargeting preserves the pre-grasp intents but can leave wrist offsets or object penetration. Right: force-closure optimization resolves contact on the robotic han…
Figure 4: Qualitative coverage of generated bimanual candidates.
Figure 5: Generating trajectories. Rows show piggy-bank, rose, duck, cylinder, and donut examples. In each row, the left panel is the optimized goal keyframe, followed by the executed rollout sequence: start, approach, pre-grasp, grasp, and lift. A rollout enters the imitation dataset only if it passes physical checks and the 10 cm lift test in Equation (10); aggregate admission rates are summarized in …
Figure 6: Force-closure refinement corrects retargeted contact failures.
Figure 7: Grasping camera and binoculars. Rows (a) and (b) show camera and binoculars. In each row, the image left of the divider is the bimanual version of the BODex optimization baseline from UltraDexGrasp, while the three images to the right are accepted SynManDex grasps generated from human-prior seeds. The figure illustrates how human-prior basins preserve object-specific bimanual roles; aggregate matched-baseline …
Figure 8: Bottle grasping. (a) The baseline converges to a stable wrap that does not preserve the intended side-oriented grasp prior. (b) SynManDex samples retain side-approach directions while satisfying robot-native contact constraints. This qualitative stress case complements the task-match comparison in …
Figure 9: Fine-grained flute-holding contact modes.
Figure 10: Diverse dexterous grasps given diverse human priors.
Figure 11: In-grasp reconfiguration from validated keyframes.
Figure 12: Prehensile keyframes. Accepted SynManDex grasps on binoculars and cameras use dual-hand contact patterns conditioned on human priors and object geometry, rather than generic power grasps.
Figure 13: Handover bimanual grasps. The examples establish compatible dual contacts on objects to grasp and hand over from one hand to another.
Figure 14: Pick-and-place rollout sequence. Temporal frames show approach, grasp, lift, transport to the target region, and terminal release or stabilization.
Figure 15: Real-world robot platform. The bimanual system operates over a tabletop workspace observed by two Azure Kinect cameras. The fused point cloud and robot proprioception form the same policy interface used in simulation.
Figure 16: Real-world grasping. Rows show successful vase, apple, and spray-bottle trials from the three-object tabletop benchmark; columns are time-ordered execution frames. The corresponding 30-trial success rates are reported in …
Figure 17: Qualitative trials outside the 30-trial count (toy-camera lifting, pick-handover-place, and tilted pouring). These examples test transfer, release, terminal placement, and possession under a functional object pose.
Figure 18: Shadow hand grasps with SynManDex. MANO→Shadow seeds initialize the Shadow wrist and fingers near object-relevant contact regions before BODex refinement.
Figure 19: Diffusion model architecture. The object mesh is encoded with a point-based representation and, together with the diffusion timestep, conditions a U-Net denoising backbone that predicts digital human pre-grasp parameters. This figure expands Stage A of …
Figure 20: Manus pro glove for geometric retargeting.
Figure 21: Left-hand grasp-generation diagnostic. Each row shows three accepted left-hand grasps on a piggy bank, stapler, and rubber duck. The examples show that the same human-prior-to-robot-grounding mechanism can operate in the left-hand setting. We demonstrated bimanual grasping and right-hand grasping in the main sections, and here we add qualitative results of left-hand grasping.
Figure 22: Canonical flute-holding support pose. The all-finger-contact pose provides the stable two-hand root configuration from which the release variants in Figures 23 and 24 are organized.
Figure 23: Finger-release taxonomy for flute-holding poses.
Figure 24: Representative flute-holding release variants.
Figure 25: Shadow-hand failure cases without human-prior seeding.

Original abstract

Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits trajectories that pass checks from each step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4% grasp stability) with 4.67/5 human-likeness (93.4%). It achieves 80.7% successes in simulation and 25/30 (83.3%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.
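For readers unfamiliar with the term, force closure means the contacts can resist any external force and torque on the object. The sketch below checks the simplest textbook case, two frictional point contacts in the plane, using the standard antipodal condition often attributed to Nguyen: the segment joining the contacts must lie strictly inside both friction cones. This is an illustrative stand-in and not the paper's optimizer, which works on a multi-fingered hand in 3D; the function names are hypothetical.

```python
import math


def _angle_between(u, v) -> float:
    """Unsigned angle in radians between two non-zero 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def planar_two_contact_force_closure(p1, n1, p2, n2, mu: float) -> bool:
    """Planar force-closure test for two frictional point contacts.

    p1, p2: distinct contact points; n1, n2: inward-pointing contact normals;
    mu: Coulomb friction coefficient. Returns True when the line joining the
    contacts lies strictly inside both friction cones.
    """
    half_angle = math.atan(mu)              # friction cone half-angle
    d12 = (p2[0] - p1[0], p2[1] - p1[1])    # direction from contact 1 to 2
    d21 = (-d12[0], -d12[1])                # direction from contact 2 to 1
    return (_angle_between(n1, d12) < half_angle
            and _angle_between(n2, d21) < half_angle)


# Directly opposed contacts on a block with moderate friction.
print(planar_two_contact_force_closure((-1, 0), (1, 0), (1, 0), (-1, 0), 0.5))
```

The strict inequality means a frictionless pair (`mu = 0`) never passes, which matches the intuition that two frictionless point contacts cannot hold a planar object against arbitrary disturbances.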

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SynManDex, a pipeline that samples object-conditioned synthetic human pre-grasps, retargets them to a 36-DOF bimanual dexterous robot, optimizes force-closure contacts with robot-native methods, and validates the resulting keyframes on grasp-and-lift plus prehensile tasks (tea pouring, photo taking, flute playing) generated via VLM agents. It reports 86.4% grasp stability, 4.67/5 human-likeness (93.4%), 80.7% simulation success, and 25/30 (83.3%) real-robot success.

Significance. If the quantitative results and evaluation protocols hold under scrutiny, the work supplies a concrete, end-to-end demonstration that synthetic human pre-grasps can serve as affordance-aware proposals that survive retargeting and embodiment-specific optimization while preserving functional intent. The real-robot success rate on a high-DOF bimanual platform (25/30 grasping trials, which the Figure 16 caption attributes to a three-object tabletop benchmark) would constitute a useful data point for the community, although the multi-step manipulation examples in Figure 17 are qualitative and fall outside that 30-trial count.

major comments (2)
  1. [Abstract] The reported grasp stability (86.4%) and human-likeness (4.67/5) figures are presented without any definition of the underlying metric, rating protocol, number of evaluators, or baseline methods. Because these numbers are the primary quantitative support for the central claim that the pipeline “combines high grasp quality with human-likeness,” their definitions are load-bearing and must appear in the abstract or be cross-referenced to a clearly labeled section.
  2. [Pipeline description (assumed §3–4)] The weakest assumption identified in the pipeline—that generated synthetic pre-grasps remain effective after retargeting and robot-native optimization—is asserted but not accompanied by an ablation that isolates the contribution of the synthetic pre-grasp stage versus a direct robot-native sampler. A controlled comparison (e.g., success rate with vs. without the human pre-grasp proposal) would be required to substantiate that the synthetic proposals are the operative factor.
minor comments (2)
  1. The manuscript should include a single overview figure that shows the four stages (sampling, retargeting, force-closure optimization, trajectory validation) with explicit failure-mode annotations at each gate.
  2. Notation for the 36-DOF bimanual platform and the contact-force variables used in the optimization should be introduced once in a dedicated notation table or subsection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify opportunities to strengthen clarity and empirical support, which we address point by point below with planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The reported grasp stability (86.4%) and human-likeness (4.67/5) figures are presented without any definition of the underlying metric, rating protocol, number of evaluators, or baseline methods. Because these numbers are the primary quantitative support for the central claim that the pipeline “combines high grasp quality with human-likeness,” their definitions are load-bearing and must appear in the abstract or be cross-referenced to a clearly labeled section.

    Authors: We agree that the abstract would benefit from explicit pointers to the metric definitions. In the revised manuscript we will append a concise cross-reference in the abstract to Section 5.1, which defines grasp stability via the force-closure residual threshold after optimization and reports the human-likeness protocol (15 evaluators, 5-point Likert scale, 50 grasp samples per condition, inter-rater agreement statistics). This change preserves abstract length while satisfying the requirement. revision: yes

  2. Referee: [Pipeline description (assumed §3–4)] The weakest assumption identified in the pipeline—that generated synthetic pre-grasps remain effective after retargeting and robot-native optimization—is asserted but not accompanied by an ablation that isolates the contribution of the synthetic pre-grasp stage versus a direct robot-native sampler. A controlled comparison (e.g., success rate with vs. without the human pre-grasp proposal) would be required to substantiate that the synthetic proposals are the operative factor.

    Authors: The observation is correct: the manuscript does not contain a controlled ablation isolating the synthetic pre-grasp proposals from a pure robot-native sampler. While the end-to-end real-robot results on multi-step tasks provide supporting evidence, an explicit comparison would strengthen the central claim. We will therefore add a new ablation subsection (Section 6.4) that reports success rates for the full pipeline versus a baseline that initializes optimization from random or heuristic robot poses without human pre-grasp retargeting, using identical optimization budgets and task sets. revision: yes
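The with-versus-without comparison discussed in this exchange reduces to comparing two success counts. One conventional way to do that for small samples is a one-sided Fisher exact test, sketched below with the standard library only. This is an illustration of how such an ablation could be analysed, not an analysis from the paper; the function name is hypothetical, and any counts passed in would have to come from the promised ablation. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def fisher_one_sided(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """One-sided Fisher exact p-value for arm A outperforming arm B.

    Computes P(X >= succ_a) under the hypergeometric null in which both arms
    share a single success rate and the total number of successes is fixed.
    """
    total_succ = succ_a + succ_b
    total = n_a + n_b
    denom = comb(total, total_succ)
    upper = min(n_a, total_succ)
    p_value = 0.0
    for k in range(succ_a, upper + 1):
        # comb returns 0 when more successes are requested than arm B can hold.
        p_value += comb(n_a, k) * comb(n_b, total_succ - k) / denom
    return p_value


# Toy counts for illustration only (not results from the paper):
# arm A succeeds 3 of 3 times, arm B succeeds 0 of 3 times.
print(fisher_one_sided(3, 3, 0, 3))
```

Holding the optimization budget and task set identical across arms, as the rebuttal proposes, is what lets a difference in these counts be attributed to the human pre-grasp seeds rather than to extra compute.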

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes an empirical pipeline (sample synthetic pre-grasps, retarget to robot, optimize force-closure contacts, validate via simulation and hardware) whose reported metrics (80.7% sim success, 83.3% real-robot success, 86.4% stability, 4.67/5 human-likeness) are presented as measured outcomes of that pipeline rather than quantities derived from equations or fitted parameters. No self-definitional relations, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the supplied text; the central claim rests on external experimental validation, not internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the four-stage pipeline; the abstract invokes standard robotics concepts (force closure, retargeting, trajectory feasibility) without introducing new free parameters or entities.

axioms (1)
  • Domain assumption: human pre-grasps encode functional intent that is transferable via retargeting and optimization.
    The pipeline description states that generated human pre-grasps are used as affordance-aware proposals.

pith-pipeline@v0.9.1-grok · 5744 in / 1474 out tokens · 37382 ms · 2026-06-27T16:20:04.881541+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 3 canonical work pages

  1. [1]

    Trends and challenges in robot manipulation.Science, 364(6446): eaat8414, 2019

    Aude Billard and Danica Kragic. Trends and challenges in robot manipulation.Science, 364(6446): eaat8414, 2019

  2. [2]

    Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

    OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation.The International Journal of Robotics Research, 39(1):3–20, 2020

  3. [3]

    DexGraspNet: A large-scale robotic dexterous grasp dataset for general objects based on simulation

    Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. DexGraspNet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. InInternational Conference on Robotics and Automation (ICRA), pages 11359–11366. IEEE, 2023

  4. [4]

    UniDexGrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy

    Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, Tengyu Liu, Li Yi, and He Wang. UniDexGrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog...

  5. [5]

    Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization

    Jiayi Chen, Yubin Ke, and He Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization. InInternational Conference on Robotics and Automation (ICRA), 2025

  6. [6]

    GraspQP: Differentiable optimization of force closure for diverse and robust dexterous grasping

    René Zurbrügg, Andrei Cramariuc, and Marco Hutter. GraspQP: Differentiable optimization of force closure for diverse and robust dexterous grasping. InConference on Robot Learning (CoRL), 2025

  7. [7]

    Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter.arXiv preprint arXiv:2503.00778, 2025

    Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. Affordgrasp: In-context affordance reasoning for open-vocabulary task-oriented grasping in clutter.arXiv preprint arXiv:2503.00778, 2025

  8. [8]

    Afforddexgrasp: Open-set language-guided dexterous grasp with generalizable-instructive affor- dance.arXiv preprint arXiv:2503.07360, 2025

    Yi-Lin Wei, Mu Lin, Yuhao Lin, Jian-Jian Jiang, Xiao-Ming Wu, Ling-An Zeng, and Wei-Shi Zheng. Afforddexgrasp: Open-set language-guided dexterous grasp with generalizable-instructive affor- dance.arXiv preprint arXiv:2503.07360, 2025

  9. [9]

    Dollar, and Danica Kragic

    Thomas Feix, Javier Romero, Heinz-Bodo Schmiedmayer, Aaron M. Dollar, and Danica Kragic. The GRASP taxonomy of human grasp types.IEEE T ransactions on Human-Machine Systems, 46(1):66–77,

  10. [10]

    doi: 10.1109/THMS.2015.2470657

  11. [11]

    Dexonomy: Synthesizing all dexterous grasp types in a grasp taxonomy

    Jiayi Chen, Yubin Ke, Lin Peng, and He Wang. Dexonomy: Synthesizing all dexterous grasp types in a grasp taxonomy. InRobotics: Science and Systems (RSS), 2025

  12. [12]

    DexMV: Imitation learning for dexterous manipulation from human videos

    Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. DexMV: Imitation learning for dexterous manipulation from human videos. InEuropean Conference on Computer Vision (ECCV), pages 570–587. Springer, 2022

  13. [13]

    Ratliff, and Dieter Fox

    Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan D. Ratliff, and Dieter Fox. DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system. InInternational Conference on Robotics and Automation (ICRA), pages 9164–9170, 2020

  14. [14]

    DexVIP: Learning dexterous grasping with human hand pose priors from video

    Priyanka Mandikal and Kristen Grauman. DexVIP: Learning dexterous grasping with human hand pose priors from video. InConference on Robot Learning (CoRL), pages 651–661, 2022

  15. [15]

    DexH2R: Task-oriented dexterous manipulation from human to robots.arXiv preprint arXiv:2411.04428, 2024

    Shuqi Zhao, Xinghao Zhu, Yuxin Chen, Chenran Li, Xiang Zhang, Mingyu Ding, and Masayoshi Tomizuka. DexH2R: Task-oriented dexterous manipulation from human to robots.arXiv preprint arXiv:2411.04428, 2024

  16. [16]

    DexImit: Learning bimanual dexterous manipulation from monocular human videos.arXiv preprint arXiv:2602.10105, 2026

    Juncheng Mu, Sizhe Yang, Yiming Bao, Hojin Bae, Tianming Wei, Linning Xu, Boyi Li, Huazhe Xu, and Jiangmiao Pang. DexImit: Learning bimanual dexterous manipulation from monocular human videos.arXiv preprint arXiv:2602.10105, 2026

  17. [17]

    Dexmachina: Functional retargeting for bimanual dexterous manipulation.arXiv preprint arXiv:2505.24853, 2025

    Mandi Zhao, Yifan Hou, Dieter Fox, Yashraj Narang, Ajay Mandlekar, and Shuran Song. Dexmachina: Functional retargeting for bimanual dexterous manipulation.arXiv preprint arXiv:2505.24853, 2025. 18

  18. [18]

    Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning

    Kailin Li, Puhao Li, Tengyu Liu, Yuyang Li, and Siyuan Huang. Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6991–7003, 2025

  19. [19]

    Embodied hands: Modeling and capturing hands and bodies together.ACM T ransactions on Graphics (T oG), 36(6):1–17, 2017

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together.ACM T ransactions on Graphics (T oG), 36(6):1–17, 2017

  20. [20]

    Grab: A dataset of whole- body human grasping of objects

    Omid Taheri, Nima Ghorbani, Michael J Black, and Dimitrios Tzionas. Grab: A dataset of whole- body human grasping of objects. InEuropean Conference on Computer Vision (ECCV), pages 581–600. Springer, 2020

  21. [21]

    DiffH2O: Diffusion-based synthesis of hand-object interactions from textual descriptions

    Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, and Bugra Tekin. DiffH2O: Diffusion-based synthesis of hand-object interactions from textual descriptions. InACM SIGGRAPH Asia 2024 Conference Papers, 2024. doi: 10.1145/3680528. 3687563

  22. [22]

    Black, and Dimitrios Tzionas

    Omid Taheri, Vasileios Choutas, Michael J. Black, and Dimitrios Tzionas. GOAL: Generating 4D whole-body motion for hand-object grasping. InConference on Computer Vision and Pattern Recognition (CVPR), pages 13263–13273, 2022

  23. [23]

    Contactpose: A dataset of grasps with object contact and hand pose

    Samarth Brahmbhatt, Chengcheng Tang, Christopher D Twigg, Charles C Kemp, and James Hays. Contactpose: A dataset of grasps with object contact and hand pose. InEuropean Conference on Computer Vision (ECCV), pages 361–378. Springer, 2020

  24. [24]

    Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, and Dieter Fox

    Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S. Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, and Dieter Fox. DexYCB: A benchmark for capturing hand grasping of objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9044–9053, 2021

  25. [25]

    GraspXL: Generating grasping motions for diverse objects at scale

    Hui Zhang, Sammy Christen, Zicong Fan, Otmar Hilliges, and Jie Song. GraspXL: Generating grasping motions for diverse objects at scale. InEuropean Conference on Computer Vision (ECCV), 2024

  26. [26]

    DeXtreme: Transfer of agile in-hand manipulation from simulation to reality

    Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, Yashraj Narang, Jean-Francois Lafleche, Dieter Fox, and Gavriel State. DeXtreme: Transfer of agile in-hand manipulation from simulation to reality. InInternational Conference on Rob...

  27. [27]

    Geometric retargeting: A principled, ultrafast neural hand retargeting algorithm.arXiv preprint arXiv:2503.07541, 2025

    Zhao-Heng Yin, Changhao Wang, Luis Pineda, Krishna Bodduluri, Tingfan Wu, Pieter Abbeel, and Mustafa Mukadam. Geometric retargeting: A principled, ultrafast neural hand retargeting algorithm.arXiv preprint arXiv:2503.07541, 2025

  28. [28]

    UltraDexGrasp: Learning universal dexterous grasping for bimanual robots with synthetic data

    Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, and Jiangmiao Pang. UltraDexGrasp: Learning universal dexterous grasping for bimanual robots with synthetic data. arXiv preprint arXiv:2603.05312, 2026

  29. [29]

    Gen- DexGrasp: Generalizable dexterous grasping

    Puhao Li, Tengyu Liu, Yuyang Li, Yiran Geng, Yixin Zhu, Yaodong Yang, and Siyuan Huang. Gen- DexGrasp: Generalizable dexterous grasping. InInternational Conference on Robotics and Automation (ICRA), pages 8068–8074, 2023

  30. [30]

    DexGraspNet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes

    Jialiang Zhang, Haoran Liu, Danshi Li, Xinqiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. DexGraspNet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. InConference on Robot Learning (CoRL), 2024

  31. [31]

    Dexterous grasp transformer

    Guo-Hao Xu, Yi-Lin Wei, Dian Zheng, Xiao-Ming Wu, and Wei-Shi Zheng. Dexterous grasp transformer. InConference on Computer Vision and Pattern Recognition (CVPR), pages 17933–17942, 2024

  32. [32]

    Hand-object contact consistency reasoning for human grasps generation

    Hanwen Jiang, Shaowei Liu, Jiashun Wang, and Xiaolong Wang. Hand-object contact consistency reasoning for human grasps generation. InInternational Conference on Computer Vision (ICCV), pages 11107–11116, 2021. 19

  33. [33]

    Dhagrasp: Synthesizing affordance-aware dual-hand grasps with text instructions.arXiv preprint arXiv:2509.22175, 2025

    Quanzhou Li, Zhonghua Wu, Jingbo Wang, Chen Change Loy, and Bo Dai. Dhagrasp: Synthesizing affordance-aware dual-hand grasps with text instructions.arXiv preprint arXiv:2509.22175, 2025

  34. [34]

    Bimanual grasp synthesis for dexterous robot hands.IEEE Robotics and Automation Letters, 9(12):11377–11384, 2024

    Yanming Shao and Chenxi Xiao. Bimanual grasp synthesis for dexterous robot hands.IEEE Robotics and Automation Letters, 9(12):11377–11384, 2024

  35. [35]

    Bidexgrasp: Coordinated bimanual dexterous grasps across object geometries and sizes.arXiv preprint arXiv:2604.06589, 2026

    Mu Lin, Yi-Lin Wei, Jiaxuan Chen, Yuhao Lin, Shuoyu Chen, Jiangran Lyu, Jiayi Chen, Yansong Tang, He Wang, and Wei-Shi Zheng. Bidexgrasp: Coordinated bimanual dexterous grasps across object geometries and sizes.arXiv preprint arXiv:2604.06589, 2026

  36. [36]

    On computing three-finger force-closure grasps of 2-d and 3-d objects.IEEE T ransactions on Robotics and Automation, 19(1):155–161, 2003

    Jia-Wei Li, Hong Liu, and He-Gao Cai. On computing three-finger force-closure grasps of 2-d and 3-d objects.IEEE T ransactions on Robotics and Automation, 19(1):155–161, 2003

  37. [37]

    CRC press, 1994

    Richard M Murray, Zexiang Li, and S Shankar Sastry.A mathematical introduction to robotic manipula- tion. CRC press, 1994

  38. [38]

    Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano- Munoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

  39. [39]

    cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023

    Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. cuRobo: Parallelized collision-free minimum-jerk robot motion generation.arXiv preprint arXiv:2310.17274, 2023

  40. [40]

    Chang, Leonidas J

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11097–11107, 2020

  41. [41]

    Qi, Li Yi, Hao Su, and Leonidas J

    Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. InAdvances in Neural Information Processing Systems, 2017

  42. [42]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020

  43. [43]

    Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. InInternational Conference on Learning Representations (ICLR), 2023

  44. [44]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18000–18010, 2023

  45. [45]

    DexHiL: A human-in-the-loop framework for vision-language-action model post-training in dexterous manipulation. arXiv preprint arXiv:2603.09121, 2026

    Yifan Han, Zhongxi Chen, Yuxuan Zhao, Congsheng Xu, Yanming Shao, Yichuan Peng, Yao Mu, and Wenzhao Lian. DexHiL: A human-in-the-loop framework for vision-language-action model post-training in dexterous manipulation. arXiv preprint arXiv:2603.09121, 2026

  46. [46]

    Learning to transfer human hand skills for robot manipulations. arXiv preprint arXiv:2501.04169, 2025

    Sungjae Park, Seungho Lee, Mingi Choi, Jiye Lee, Jeonghwan Kim, Jisoo Kim, and Hanbyul Joo. Learning to transfer human hand skills for robot manipulations. arXiv preprint arXiv:2501.04169, 2025

  47. [47]

    A system for general in-hand object re-orientation

    Tao Chen, Jie Xu, and Pulkit Agrawal. A system for general in-hand object re-orientation. In Conference on Robot Learning (CoRL), pages 297–307, 2022

  48. [48]

    Rotating without seeing: Towards in-hand dexterity through touch

    Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, and Xiaolong Wang. Rotating without seeing: Towards in-hand dexterity through touch. In Robotics: Science and Systems (RSS), 2023

  49. [49]

    Towards human-level bimanual dexterous manipulation with reinforcement learning

    Yuanpei Chen, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen McAleer, Hao Dong, Song-Chun Zhu, and Yaodong Yang. Towards human-level bimanual dexterous manipulation with reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2022

  50. [50]

    Dexart: Benchmarking generalizable dexterous manipulation with articulated objects

    Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 21190–21200, 2023

  51. [51]

    CyberDemo: Augmenting simulated human demonstration for real-world dexterous manipulation

    Jun Wang, Yuzhe Qin, Kaiming Kuang, Yigit Korkmaz, Akhilan Gurumoorthy, Hao Su, and Xiaolong Wang. CyberDemo: Augmenting simulated human demonstration for real-world dexterous manipulation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  52. [52]

    Learning complex dexterous manipulation with deep reinforcement learning and demonstrations

    Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018. doi: 10.15607/RSS.2018.XIV.049

  53. [53]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), 2021

  54. [54]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023

  55. [55]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, volume 44, pages 1684–1704. Sage Publications Sage UK: London, England, 2025

  56. [56]

    Chain-of-action: Trajectory autoregressive modeling for robotic manipulation

    Wenbo Zhang, Tianrun Hu, Yanyuan Qiao, Hanbo Zhang, Yuchu Qin, Yang Li, Jiajun Liu, Tao Kong, Lingqiao Liu, and Xiao Ma. Chain-of-action: Trajectory autoregressive modeling for robotic manipulation. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  57. [57]

    Dextrack: Towards generalizable neural tracking control for dexterous manipulation from human references

    Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, and Li Yi. Dextrack: Towards generalizable neural tracking control for dexterous manipulation from human references. In International Conference on Learning Representations (ICLR), 2025

  58. [58]

    Learning diverse bimanual dexterous manipulation skills from human demonstrations. arXiv preprint arXiv:2410.02477, 2024

    Bohan Zhou, Haoqi Yuan, Yuhui Fu, and Zongqing Lu. Learning diverse bimanual dexterous manipulation skills from human demonstrations. arXiv preprint arXiv:2410.02477, 2024

  59. [59]

    UniDexGrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning

    Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. UniDexGrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3891–3902, 2023

  60. [60]

    6-dof graspnet: Variational grasp generation for object manipulation

    Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipulation. In International Conference on Computer Vision (ICCV), pages 2901–2910, 2019

  61. [61]

    GANHand: Predicting human grasp affordances in multi-object scenes

    Enric Corona, Albert Pumarola, Guillem Alenyà, Francesc Moreno-Noguer, and Gregory Rogez. GANHand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5031–5041, 2020

  62. [62]

    Grasping field: Learning implicit representations for human grasps

    Korrawe Karunratanakul, Jinlong Yang, Yan Zhang, Michael J. Black, Krikamol Muandet, and Siyu Tang. Grasping field: Learning implicit representations for human grasps. In International Conference on 3D Vision (3DV), pages 333–344, 2020

  63. [63]

    ContactOpt: Optimizing contact to improve grasps

    Patrick Grady, Chengcheng Tang, Christopher D. Twigg, Minh Vo, Samarth Brahmbhatt, and Charles C. Kemp. ContactOpt: Optimizing contact to improve grasps. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1471–1481, 2021

  64. [64]

    Grasping a handful: Sequential multi-object dexterous grasp generation. IEEE Robotics and Automation Letters, 2025

    Haofei Lu, Yifei Dong, Zehang Weng, Florian Pokorny, Jens Lundell, and Danica Kragic. Grasping a handful: Sequential multi-object dexterous grasp generation. IEEE Robotics and Automation Letters, 2025

  65. [65]

    Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  66. [66]

    DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024

    Algorithm 1: SynManDex as proposal, refinement, and executable filtering. Require: Object mesh...

  67. [67]

    Human grasp priors are therefore used as initialization rather than as executable labels

    In Eq. (14), h0 is a generated MANO pre-grasp, Rψ maps it to a robot seed, and the final optimization is performed in robot configuration space. Human grasp priors are therefore used as initialization rather than as executable labels. Compared with random initialization, a human-prior seed starts in a functional region of Qbi, after which contact, collision,...
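The difference between a human-prior seed and an arbitrary start can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 1-D `cost` is a made-up two-basin stand-in for the grasp objective, `retarget` is a hypothetical placeholder for Rψ, and plain gradient descent stands in for whatever optimizer the paper actually uses.

```python
def cost(q):
    # Hypothetical objective with two basins; the basin near q = +1 is lower.
    return (q * q - 1.0) ** 2 - 0.3 * q


def grad(q):
    # Analytic derivative of the toy cost above.
    return 4.0 * q * (q * q - 1.0) - 0.3


def retarget(h0):
    # Hypothetical stand-in for R_psi: maps a human pre-grasp to a robot seed.
    return 0.9 * h0


def refine(q0, step=0.05, iters=200):
    # Local refinement in (toy) robot configuration space by gradient descent.
    q = q0
    for _ in range(iters):
        q -= step * grad(q)
    return q


# The seed from the human prior starts inside the better basin.
q_seeded = refine(retarget(1.0))
# An arbitrary start on the other side settles in the worse local minimum.
q_far = refine(-1.2)

print("seeded:", q_seeded, cost(q_seeded))
print("far:   ", q_far, cost(q_far))
```

Both runs use the same local optimizer; only the starting point differs, which is the role the text assigns to the human prior.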

  68. [68]

    Use the grasp keyframe as the initial possession state

  69. [69]

    Assign explicit roles to the left and right hands

  70. [70]

    Propose only allowed primitives from the primitive library

  71. [71]

    Express motion as object-relative waypoints or bounded deltas, not as joint torques or raw robot commands

  72. [72]

    Preserve possession unless the release condition is explicitly satisfied

  73. [73]

    The executor will check IK, collision, possession, force-closure, and terminal task success

    Do not assume feasibility. The executor will check IK, collision, possession, force-closure, and terminal task success

  74. [74]

    keyframe_id

    If a task would require unmodeled fluid, buttons, articulation, or tactile sensing, phrase the goal as a geometric proxy, e.g., 'tilt the teapot by 35 degrees while maintaining possession' rather than 'pour liquid'.

    K.4 User Prompt Template

    You are given one SynManDex validated grasp keyframe.

    [VISUAL INPUT]
    - Multi-view images: front, left, right, top, w...
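The geometric-proxy rule ("tilt the teapot by 35 degrees while maintaining possession") lends itself to a simple check. The following is a minimal sketch, assuming the object orientation is available as a 3x3 rotation matrix whose third column is the object's upright axis expressed in the world frame. The function names, the 2-degree tolerance, and the `in_possession` flag are hypothetical and not taken from the paper.

```python
import math


def rot_x(deg):
    # Rotation about the world x-axis, used here only to build test poses.
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0],
            [0.0, c, -s],
            [0.0, s, c]]


def tilt_angle_deg(R):
    # R[2][2] is the dot product of the object's upright axis (third column)
    # with the world z-axis; clamp it to guard acos against rounding error.
    cos_tilt = max(-1.0, min(1.0, R[2][2]))
    return math.degrees(math.acos(cos_tilt))


def proxy_satisfied(R, in_possession, target_deg=35.0, tol_deg=2.0):
    # The proxy goal holds only if the object is still held and the tilt
    # is within the (assumed) tolerance of the target angle.
    return in_possession and abs(tilt_angle_deg(R) - target_deg) <= tol_deg


print(tilt_angle_deg(rot_x(35.0)))
print(proxy_satisfied(rot_x(35.0), in_possession=True))
```

A check of this kind only needs object pose and a possession flag, which is why the prompt asks for geometric goals instead of outcomes that depend on unmodeled fluid.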