
arxiv: 2506.15953 · v2 · pith:5YWQV6CO · submitted 2025-06-19 · 💻 cs.RO

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Pith reviewed 2026-05-19 09:42 UTC · model grok-4.3

classification 💻 cs.RO
keywords visuo-tactile fusion · dexterous manipulation · cross-attention · tactile prediction · imitation learning · anthropomorphic hand · robotics · cross-modal representation

The pith

ViTacFormer fuses vision and touch via cross-attention and autoregressive tactile prediction to reach roughly 50 percent higher success rates in real-world dexterous manipulation and complete 11-stage tasks lasting 2.5 minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViTacFormer as a way to learn joint representations from vision and high-resolution touch that support precise control by anthropomorphic hands. A cross-attention encoder combines the two modalities while an autoregressive head predicts upcoming tactile signals, and an easy-to-challenging curriculum progressively sharpens the shared latent space. The resulting representation is then used for imitation learning policies. If the approach works as described, robots could execute longer sequences of fine-grained actions even when vision is blocked or unreliable. The reported gains appear on real hardware across multiple benchmarks.

Core claim

ViTacFormer couples a cross-attention encoder that fuses high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. An easy-to-challenging curriculum steadily refines the visual-tactile latent space. The learned representation then drives imitation learning for multi-fingered hands, producing approximately 50 percent higher success rates than prior state-of-the-art systems on real-world benchmarks and enabling the first autonomous completion of long-horizon tasks that require up to 11 sequential stages and 2.5 minutes of continuous operation.
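The autoregressive part of this claim can be pictured with a small rollout loop. The sketch below is illustrative only: `rollout_tactile`, the window length, and the linear predictor are hypothetical stand-ins for whatever learned head the paper actually uses, chosen so the feedback structure (each prediction becomes input to the next step) is easy to see.

```python
import numpy as np


def rollout_tactile(history, weights, bias, horizon):
    """Autoregressively predict `horizon` future tactile frames.

    history: (k, d) array holding the k most recent tactile readings.
    weights: (k * d, d) hypothetical linear predictor (stand-in for a learned head).
    bias:    (d,) offset added to every prediction.
    """
    window = np.array(history, dtype=float)
    predictions = []
    for _ in range(horizon):
        # One-step prediction from the flattened window of recent frames.
        next_frame = window.reshape(-1) @ weights + bias
        predictions.append(next_frame)
        # Slide the window: drop the oldest frame, append the new prediction.
        window = np.vstack([window[1:], next_frame])
    return np.stack(predictions)


# Toy predictor (an assumption for illustration): next = 2 * last - previous,
# i.e. plain linear extrapolation of each tactile channel.
k, d = 2, 3
weights = np.zeros((k * d, d))
weights[:d] = -np.eye(d)       # previous frame contributes with weight -1
weights[d:] = 2.0 * np.eye(d)  # most recent frame contributes with weight +2
bias = np.zeros(d)

history = np.array([[0.0, 0.1, 0.2],
                    [0.1, 0.2, 0.3]])
future = rollout_tactile(history, weights, bias, horizon=3)
print(future)
```

Because predictions are fed back in, errors can compound over the horizon, which is one reason a prediction head of this kind is usually trained and evaluated over multi-step rollouts rather than single steps.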

What carries the argument

Cross-attention encoder fused with an autoregressive tactile prediction head, refined by an easy-to-challenging curriculum, that builds a shared visual-tactile latent space for imitation learning policies.
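For readers unfamiliar with cross-attention fusion, a minimal single-head sketch in NumPy follows. It assumes visual tokens act as queries and tactile tokens as keys and values; this page does not state which direction the paper uses, and the token counts, dimensions, and random projection matrices are hypothetical placeholders rather than the paper's configuration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d_model) tokens from one modality (assumed here: vision).
    context: (n_c, d_model) tokens from the other modality (assumed: touch).
    Returns the fused tokens (n_q, d_head) and the attention map (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # similarity of each query to each context token
    attn = softmax(scores, axis=-1)          # each row sums to 1 over the context tokens
    return attn @ v, attn


rng = np.random.default_rng(0)
n_vis, n_tac, d_model, d_head = 4, 6, 8, 5   # hypothetical sizes
visual_tokens = rng.normal(size=(n_vis, d_model))
tactile_tokens = rng.normal(size=(n_tac, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

fused, attn = cross_attention(visual_tokens, tactile_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each fused visual token is a weighted average of projected tactile tokens, so the attention map shows which contact readings a given image region draws on.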

If this is right

  • Imitation learning policies gain precision and adaptability for multi-fingered hands in contact-rich tasks.
  • Manipulation remains effective in visually occluded settings where tactile feedback becomes primary.
  • Continuous autonomous operation extends to sequences of up to 11 stages lasting about 2.5 minutes.
  • Success rates rise across a range of challenging real-world benchmarks compared with earlier approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion and prediction architecture could be tested on different robot hands or additional sensor types to check transfer.
  • The curriculum progression might shorten training time for other fine-control robotics problems that mix vision and force data.
  • Long-horizon stability suggests the representation could support more complex sequences in household or factory settings if generalization holds.
  • Evaluating performance under changed lighting or partial sensor failure would test robustness claims directly.

Load-bearing premise

The cross-modal representation learned from the reported benchmarks and hardware will generalize to new objects, sensor calibrations, and unstructured environments instead of overfitting to the specific training conditions.

What would settle it

Deploy the trained policy on a collection of previously unseen objects with novel shapes, sizes, and surface properties while keeping the same hand and sensors, then measure whether success rates fall below the claimed improvement over prior methods.
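The comparison described above reduces to counting successes on held-out trials. The sketch below uses made-up outcome lists, not results from the paper, and reports the gap both in percentage points and as a relative percentage, since "50 percent higher" can be read either way and this summary does not say which reading applies.

```python
def success_rate(outcomes):
    """Fraction of successful trials; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def compare(policy_outcomes, baseline_outcomes):
    """Report the success-rate gap in percentage points and in relative terms."""
    p = success_rate(policy_outcomes)
    b = success_rate(baseline_outcomes)
    return {
        "policy": p,
        "baseline": b,
        "absolute_gain_points": 100 * (p - b),
        "relative_gain_percent": 100 * (p - b) / b if b > 0 else float("inf"),
    }


# Hypothetical held-out results: 20 trials per method on unseen objects.
policy = [True] * 12 + [False] * 8     # 12 of 20 succeed
baseline = [True] * 8 + [False] * 12   # 8 of 20 succeed

report = compare(policy, baseline)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

With these invented numbers the same gap is about 20 percentage points but about 50 percent in relative terms, which is why a held-out evaluation should state the metric explicitly.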

Figures

Figures reproduced from arXiv: 2506.15953 by Haoran Geng, Jitendra Malik, Kaifeng Zhang, Liang Heng, Pieter Abbeel.

Figure 1: An overview of our system hardware and teleoperation setup. (a) Our hardware system setup.
Figure 2: The neural network architecture for ViTacFormer is a conditional variational auto-encoder.
Figure 3: Cross-attention-based multimodal integration.
Figure 4: Four short-horizon visuo-tactile tasks, from left to right, i.e., peg insertion, cap twist, vase …
Figure 5: Successful model rollout on long-horizon task, i.e., making hamburger. We show the …
Figure 6: Ablation Study. The comparison between several ablated algorithms. Each component of …
Figure 7: Failure Study. The first row is peg insertion failure, the second row is cap twist failure, and …
Figure 8: Four types of camera views. We use four synchronized camera views as visual input: a stereo pair (180×320) from top-mounted ZED Mini cameras (…
Figure 9: Short-horizon task setup. (a) All four short-horizon tasks share a common set of objects. (b) The tabletop workspace is marked with a grid; the top-left corner is defined as the origin (0, 0). Each object is positioned at a predefined grid point during training. The four short-horizon tasks share a standardized tabletop workspace and a common set of objects, as shown in …
Figure 10: Execution examples for short-horizon tasks. Representative keyframes from four tasks: peg insertion, cap twist, vase wipe, and book flip. Each task demonstrates a full execution sequence from perception to manipulation.
Figure 11: Representative failure cases across all tasks. Each row corresponds to one task, with two failure case sequences shown side by side. The robot uses its right hand to rotate a cap off a bottle and place it on the table. The cap is initially tightened at a clockwise offset of about 100 degrees from the open position. Representative execution frames are shown in the second row of …
Figure 12: Long-horizon task setup. Seven components are placed in predefined zones, circular (ingredients) or rectangular (tools). Objects are randomly initialized within these areas to test spatial generalization. Task description: The long-horizon task involves a full hamburger assembly sequence requiring precise tool use and multi-stage coordination. The robot begins by flipping a wooden card from “closed” to “…
Original abstract

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ViTacFormer, a cross-modal representation learning framework for visuo-tactile dexterous manipulation. It combines a cross-attention encoder that fuses vision and high-resolution touch with an autoregressive tactile prediction head, and trains both with an easy-to-challenging curriculum. The learned representation is used for imitation learning on multi-fingered anthropomorphic hands. The authors claim roughly 50% higher success rates than prior state-of-the-art systems on real-world benchmarks, and the first autonomous completion of long-horizon tasks of up to 11 stages lasting 2.5 minutes.

Significance. Should the empirical claims prove robust upon detailed verification, this work would represent a notable advance in robotic manipulation by demonstrating how cross-modal visuo-tactile representations can enable precise, adaptive, and extended autonomous behaviors in dexterous tasks. The proposed architecture and curriculum offer a concrete approach to addressing the challenges of integrating tactile sensing with vision for unstructured settings.

major comments (2)
  1. Abstract: The abstract asserts approximately 50% higher success rates and the first autonomous completion of 11-stage tasks, but provides no details on experimental protocol, baseline comparisons, number of trials, or statistical significance, which are essential to substantiate these central performance claims.
  2. Experimental Results: Results are presented only for the fixed set of benchmark tasks using the same anthropomorphic hand and sensor setup; the absence of tests on held-out objects, deliberate calibration variations, or unstructured scene changes leaves the generalization of the learned cross-modal representation unverified, directly impacting the long-horizon autonomy claim.
minor comments (2)
  1. Methods: The description of the curriculum progression could benefit from more explicit details on the schedule and hyperparameters to allow reproducibility.
  2. Abstract: Clarify the specific hardware (e.g., hand model and tactile sensor type) used in the experiments for better context.
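The first major comment turns on how many trials stand behind each success rate. As a rough illustration of why that matters, the sketch below computes a Wilson score interval for a binomial success rate; the trial counts are invented for the example and the 95% level (z = 1.96) is an assumption, not something reported by the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial success rate.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: the same 80% success rate measured with 10 and with 100 trials.
low_small, high_small = wilson_interval(8, 10)
low_large, high_large = wilson_interval(80, 100)
print(f"10 trials:  [{low_small:.3f}, {high_small:.3f}]")
print(f"100 trials: [{low_large:.3f}, {high_large:.3f}]")
```

The interval narrows considerably as the trial count grows, so a headline percentage is hard to interpret without the number of trials behind it.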

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and address concerns about substantiation and generalization.

  1. Referee: Abstract: The abstract asserts approximately 50% higher success rates and the first autonomous completion of 11-stage tasks, but provides no details on experimental protocol, baseline comparisons, number of trials, or statistical significance, which are essential to substantiate these central performance claims.

    Authors: We agree that the abstract would benefit from additional context to support the performance claims. We have revised the abstract to include a brief reference to the evaluation protocol, noting that results are drawn from real-world experiments with multiple trials per task and direct comparisons against prior state-of-the-art methods, with statistical significance reported in the main text. Full details on trial counts, baselines, and p-values remain in Sections 4 and 5 due to abstract length limits. revision: yes

  2. Referee: Experimental Results: Results are presented only for the fixed set of benchmark tasks using the same anthropomorphic hand and sensor setup; the absence of tests on held-out objects, deliberate calibration variations, or unstructured scene changes leaves the generalization of the learned cross-modal representation unverified, directly impacting the long-horizon autonomy claim.

    Authors: Our benchmark suite already incorporates object pose variations, lighting changes, and contact dynamics across the reported tasks to probe robustness. Nevertheless, we acknowledge that explicit tests on held-out objects and deliberate calibration shifts would provide stronger evidence for generalization. In the revised manuscript we have added a dedicated paragraph in the Discussion section that explicitly discusses this limitation, clarifies the scope of the current benchmarks, and outlines directions for future generalization experiments while preserving the claims supported by the existing long-horizon results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on measured robot experiments


The paper describes an architecture (cross-attention encoder plus autoregressive tactile prediction head) trained via an easy-to-challenging curriculum to produce a visuo-tactile latent space, then applies the resulting representation to imitation learning on an anthropomorphic hand. All headline numbers—approximately 50% higher success rates and successful execution of up to 11-stage, 2.5-minute tasks—are reported as outcomes of real-world benchmark trials after training. No equation, prediction, or uniqueness claim is shown to reduce by construction to a fitted parameter, self-citation, or input quantity; the central results are externally measured success rates on fixed hardware and task suites rather than quantities defined or forced inside the model itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

As an empirical deep-learning robotics paper the central claim rests on standard neural-network training assumptions plus the domain-specific premise that cross-modal attention plus future-touch prediction yields a useful latent space. No new physical entities are introduced.

free parameters (2)
  • Curriculum progression schedule
    The mapping from task difficulty to training stage is chosen by the authors and tuned on the target hardware.
  • Transformer and prediction-head hyperparameters
    Learning rate, number of attention heads, and loss weights are optimized on the collected manipulation data.
axioms (2)
  • domain assumption: Cross-attention can produce a fused representation that is more useful for control than separate modality encoders
    Invoked by the choice of encoder architecture in the abstract.
  • domain assumption: Autoregressive tactile prediction improves downstream manipulation policy performance
    Basis for adding the prediction head and for claiming the representation is refined.
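To make the "curriculum progression schedule" free parameter concrete, here is one way such a schedule could look. This is purely an illustrative assumption: the page does not describe the paper's actual schedule, and the function names, warm-up fraction, and linear ramp below are hypothetical. The sketch treats "easy" as conditioning on ground-truth tactile signals and "challenging" as conditioning on the model's own predictions, in the style of scheduled sampling.

```python
import random


def predicted_tactile_prob(step, total_steps, warmup_frac=0.2):
    """Probability of using predicted (harder) tactile input at a training step.

    Hypothetical schedule: 0 during warm-up, then a linear ramp up to 1.
    """
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return 0.0
    ramp = total_steps - warmup
    if ramp <= 0:
        return 1.0
    return min(1.0, (step - warmup) / ramp)


def choose_tactile_input(ground_truth, predicted, step, total_steps, rng):
    """Pick the easy (ground-truth) or hard (predicted) signal for this step."""
    if rng.random() < predicted_tactile_prob(step, total_steps):
        return predicted
    return ground_truth


rng = random.Random(0)
total_steps = 100
for step in (0, 20, 60, 100):
    prob = predicted_tactile_prob(step, total_steps)
    print(f"step {step:3d}: P(use predicted tactile) = {prob:.2f}")

early = choose_tactile_input("ground_truth", "predicted", 0, total_steps, rng)
late = choose_tactile_input("ground_truth", "predicted", total_steps, total_steps, rng)
```

The warm-up fraction and ramp shape are exactly the kind of hand-tuned choices this ledger counts as free parameters.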

pith-pipeline@v0.9.0 · 5733 in / 1641 out tokens · 51168 ms · 2026-05-19T09:42:55.151670+00:00 · methodology



Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multimodal Diffusion Forcing for Forceful Manipulation

    cs.RO 2025-11 unverdicted novelty 7.0

    Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.

  2. DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

    cs.RO 2026-05 conditional novelty 6.0

    DexJoCo is a benchmark and toolkit with 11 functionally grounded tasks, 1.1K trajectories, and empirical benchmarks for task-oriented dexterous manipulation on MuJoCo.

  3. DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...

  4. Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding

    cs.RO 2026-03 unverdicted novelty 6.0

    Contact-Grounded Policy predicts coupled robot-state and tactile trajectories with a diffusion model and maps them via a learned consistency function to executable targets for compliance controllers, outperforming sta...

  5. Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

    cs.RO 2025-12 unverdicted novelty 6.0

    DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.

  6. Learning Versatile Humanoid Manipulation with Touch Dreaming

    cs.RO 2026-04 conditional novelty 5.0

    HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...

  7. Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

    cs.RO 2026-05 unverdicted novelty 4.0

    A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.
