ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Haoran Geng; Jitendra Malik; Kaifeng Zhang; Liang Heng; Pieter Abbeel

arxiv: 2506.15953 · v2 · pith:5YWQV6COnew · submitted 2025-06-19 · 💻 cs.RO

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Liang Heng , Haoran Geng , Kaifeng Zhang , Pieter Abbeel , Jitendra Malik This is my paper

Pith reviewed 2026-05-19 09:42 UTC · model grok-4.3

classification 💻 cs.RO

keywords visuo-tactile fusiondexterous manipulationcross-attentiontactile predictionimitation learninganthropomorphic handroboticscross-modal representation

0 comments

The pith

ViTacFormer fuses vision and touch via cross-attention and autoregressive tactile prediction to reach roughly 50 percent higher success rates in real-world dexterous manipulation and complete 11-stage tasks lasting 2.5 minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViTacFormer as a way to learn joint representations from vision and high-resolution touch that support precise control by anthropomorphic hands. A cross-attention encoder combines the two modalities while an autoregressive head predicts upcoming tactile signals, and an easy-to-challenging curriculum progressively sharpens the shared latent space. The resulting representation is then used for imitation learning policies. If the approach works as described, robots could execute longer sequences of fine-grained actions even when vision is blocked or unreliable. The reported gains appear on real hardware across multiple benchmarks.

Core claim

ViTacFormer couples a cross-attention encoder that fuses high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. An easy-to-challenging curriculum steadily refines the visual-tactile latent space. The learned representation then drives imitation learning for multi-fingered hands, producing approximately 50 percent higher success rates than prior state-of-the-art systems on real-world benchmarks and enabling the first autonomous completion of long-horizon tasks that require up to 11 sequential stages and 2.5 minutes of continuous operation.

What carries the argument

Cross-attention encoder fused with an autoregressive tactile prediction head, refined by an easy-to-challenging curriculum, that builds a shared visual-tactile latent space for imitation learning policies.

If this is right

Imitation learning policies gain precision and adaptability for multi-fingered hands in contact-rich tasks.
Manipulation remains effective in visually occluded settings where tactile feedback becomes primary.
Continuous operation extends to sequences of 11 or more stages without external resets.
Success rates rise across a range of challenging real-world benchmarks compared with earlier approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion and prediction architecture could be tested on different robot hands or additional sensor types to check transfer.
The curriculum progression might shorten training time for other fine-control robotics problems that mix vision and force data.
Long-horizon stability suggests the representation could support more complex sequences in household or factory settings if generalization holds.
Evaluating performance under changed lighting or partial sensor failure would test robustness claims directly.

Load-bearing premise

The cross-modal representation learned from the reported benchmarks and hardware will generalize to new objects, sensor calibrations, and unstructured environments instead of overfitting to the specific training conditions.

What would settle it

Deploy the trained policy on a collection of previously unseen objects with novel shapes, sizes, and surface properties while keeping the same hand and sensors, then measure whether success rates fall below the claimed improvement over prior methods.

Figures

Figures reproduced from arXiv: 2506.15953 by Haoran Geng, Jitendra Malik, Kaifeng Zhang, Liang Heng, Pieter Abbeel.

**Figure 2.** Figure 2: The neural network architecture for ViTacFormer is a conditional variational auto-encoder. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Cross-attention-based multimodal integration [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Four short-horizon visuo-tactile tasks, from left to right, i.e., peg insertion, cap twist, vase [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Successful model rollout on long-horizon task, i.e., making hamburger. We show the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation Study. The comparison between several ablated algorithms. Each component of [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Failure Study. The first row is peg insertion failure, the second row is cap twist failure, and [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Four types of camera views We use four synchronized camera views as visual input: a stereo pair (180×320) from topmounted ZED Mini cameras ( [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Short-horizon task setup. (a) All four short-horizon tasks share a common set of objects. (b) The tabletop workspace is marked with a grid; the top-left corner is defined as the origin (0, 0). Each object is positioned at a predefined grid point during training. The four short-horizon tasks share a standardized tabletop workspace and a common set of objects, as shown in [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗

**Figure 10.** Figure 10: Execution examples for short-horizon tasks. Representative keyframes from four tasks: peg insertion, cap twist, vase wipe, and book flip. Each task demonstrates a full execution sequence from perception to manipulation. Inference Results [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Representative failure cases across all tasks. Each row corresponds to one task, with two failure case sequences shown side by side. The robot uses its right hand to rotate a cap off a bottle and place it on the table. The cap is initially tightened at a clockwise offset of about 100 degrees from the open position. Representative execution frames are shown in the second row of [PITH_FULL_IMAGE:figures/fu… view at source ↗

**Figure 12.** Figure 12: Long-horizon task setup. Seven components are placed in predefined zones—circular (ingredients) or rectangular (tools). Objects are randomly initialized within these areas to test spatial generalization. Task Description The long-horizon task involves a full hamburger assembly sequence requiring precise tool use and multi-stage coordination. The robot begins by flipping a wooden card from “closed” to “… view at source ↗

read the original abstract

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ViTacFormer pairs cross-attention vision-touch fusion with an autoregressive tactile predictor and easy-to-hard curriculum for imitation on a dexterous hand, claiming 50% gains and 11-stage runs, but the generalization story rests on untested assumptions.

read the letter

The paper's main move is to fuse high-resolution vision and touch through cross-attention, add an autoregressive head that predicts upcoming tactile signals, and train the whole thing with a staged curriculum before feeding the representation into imitation learning. That combination is not a routine extension of prior visuo-tactile work and gives a concrete way to anticipate contact changes during long sequences on an anthropomorphic hand. The curriculum itself is a practical choice for these tasks, since starting simple and increasing difficulty can stabilize learning when contact dynamics are hard to model directly. If the real-robot results hold, the reported lift in success rate and the ability to run 11-stage, multi-minute behaviors would be useful for anyone trying to move dexterous manipulation out of short scripted demos. The stress-test concern lands: the experiments stay inside the same object set, hand, and sensor configuration, with no held-out categories, calibration shifts, or scene clutter described. Without those checks the claim that the learned representation generalizes rather than fits the benchmark suite remains open. The abstract also gives no protocol details, trial counts, or ablation numbers, so it is difficult to judge how much the new components actually drive the numbers versus implementation tweaks. Readers working on multi-modal robot policies or tactile-rich control would find the architecture worth examining for ideas. The work is coherent enough on its own terms to merit peer review, mainly so the experimental gaps can be addressed with concrete additions rather than desk rejection.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ViTacFormer, a cross-modal representation learning framework for visuo-tactile dexterous manipulation. It combines a cross-attention encoder for fusing vision and high-resolution touch with an autoregressive tactile prediction head and trains it using an easy-to-challenging curriculum. The learned representation is used for imitation learning on multi-fingered anthropomorphic hands, with claims of ~50% improvement in success rates over prior SOTA and pioneering long-horizon tasks of up to 11 stages lasting 2.5 minutes on real-world benchmarks.

Significance. Should the empirical claims prove robust upon detailed verification, this work would represent a notable advance in robotic manipulation by demonstrating how cross-modal visuo-tactile representations can enable precise, adaptive, and extended autonomous behaviors in dexterous tasks. The proposed architecture and curriculum offer a concrete approach to addressing the challenges of integrating tactile sensing with vision for unstructured settings.

major comments (2)

Abstract: The abstract asserts approximately 50% higher success rates and the first autonomous completion of 11-stage tasks, but provides no details on experimental protocol, baseline comparisons, number of trials, or statistical significance, which are essential to substantiate these central performance claims.
Experimental Results: Results are presented only for the fixed set of benchmark tasks using the same anthropomorphic hand and sensor setup; the absence of tests on held-out objects, deliberate calibration variations, or unstructured scene changes leaves the generalization of the learned cross-modal representation unverified, directly impacting the long-horizon autonomy claim.

minor comments (2)

Methods: The description of the curriculum progression could benefit from more explicit details on the schedule and hyperparameters to allow reproducibility.
Abstract: Clarify the specific hardware (e.g., hand model and tactile sensor type) used in the experiments for better context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and address concerns about substantiation and generalization.

read point-by-point responses

Referee: Abstract: The abstract asserts approximately 50% higher success rates and the first autonomous completion of 11-stage tasks, but provides no details on experimental protocol, baseline comparisons, number of trials, or statistical significance, which are essential to substantiate these central performance claims.

Authors: We agree that the abstract would benefit from additional context to support the performance claims. We have revised the abstract to include a brief reference to the evaluation protocol, noting that results are drawn from real-world experiments with multiple trials per task and direct comparisons against prior state-of-the-art methods, with statistical significance reported in the main text. Full details on trial counts, baselines, and p-values remain in Sections 4 and 5 due to abstract length limits. revision: yes
Referee: Experimental Results: Results are presented only for the fixed set of benchmark tasks using the same anthropomorphic hand and sensor setup; the absence of tests on held-out objects, deliberate calibration variations, or unstructured scene changes leaves the generalization of the learned cross-modal representation unverified, directly impacting the long-horizon autonomy claim.

Authors: Our benchmark suite already incorporates object pose variations, lighting changes, and contact dynamics across the reported tasks to probe robustness. Nevertheless, we acknowledge that explicit tests on held-out objects and deliberate calibration shifts would provide stronger evidence for generalization. In the revised manuscript we have added a dedicated paragraph in the Discussion section that explicitly discusses this limitation, clarifies the scope of the current benchmarks, and outlines directions for future generalization experiments while preserving the claims supported by the existing long-horizon results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on measured robot experiments

full rationale

The paper describes an architecture (cross-attention encoder plus autoregressive tactile prediction head) trained via an easy-to-challenging curriculum to produce a visuo-tactile latent space, then applies the resulting representation to imitation learning on an anthropomorphic hand. All headline numbers—approximately 50% higher success rates and successful execution of up to 11-stage, 2.5-minute tasks—are reported as outcomes of real-world benchmark trials after training. No equation, prediction, or uniqueness claim is shown to reduce by construction to a fitted parameter, self-citation, or input quantity; the central results are externally measured success rates on fixed hardware and task suites rather than quantities defined or forced inside the model itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

As an empirical deep-learning robotics paper the central claim rests on standard neural-network training assumptions plus the domain-specific premise that cross-modal attention plus future-touch prediction yields a useful latent space. No new physical entities are introduced.

free parameters (2)

Curriculum progression schedule
The mapping from task difficulty to training stage is chosen by the authors and tuned on the target hardware.
Transformer and prediction-head hyperparameters
Learning rate, number of attention heads, and loss weights are optimized on the collected manipulation data.

axioms (2)

domain assumption Cross-attention can produce a fused representation that is more useful for control than separate modality encoders
Invoked by the choice of encoder architecture in the abstract.
domain assumption Autoregressive tactile prediction improves downstream manipulation policy performance
Basis for adding the prediction head and for claiming the representation is refined.

pith-pipeline@v0.9.0 · 5733 in / 1641 out tokens · 51168 ms · 2026-05-19T09:42:55.151670+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

easy-to-challenging curriculum that steadily refines the visual-tactile latent space
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

long-horizon dexterous manipulation tasks ... 11 sequential stages

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multimodal Diffusion Forcing for Forceful Manipulation
cs.RO 2025-11 unverdicted novelty 7.0

Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
cs.RO 2026-05 conditional novelty 6.0

DexJoCo is a benchmark and toolkit with 11 functionally grounded tasks, 1.1K trajectories, and empirical benchmarks for task-oriented dexterous manipulation on MuJoCo.
DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions
cs.RO 2026-05 unverdicted novelty 6.0

DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...
Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding
cs.RO 2026-03 unverdicted novelty 6.0

Contact-Grounded Policy predicts coupled robot-state and tactile trajectories with a diffusion model and maps them via a learned consistency function to executable targets for compliance controllers, outperforming sta...
Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
cs.RO 2025-12 unverdicted novelty 6.0

DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.
Learning Versatile Humanoid Manipulation with Touch Dreaming
cs.RO 2026-04 conditional novelty 5.0

HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms
cs.RO 2026-05 unverdicted novelty 4.0

A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 7 Pith papers · 5 internal anchors

[1]

Dexart: Benchmarking generalizable dexterous manipulation with articulated objects

Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 21190–21200, June 2023

work page 2023
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control, 2024. https://arxiv. org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

A system for general in-hand object re-orientation

Tao Chen, Jie Xu, and Pulkit Agrawal. A system for general in-hand object re-orientation. In Conference on Robot Learning, pages 297–307. PMLR, 2022

work page 2022
[4]

Bi-dexhands: Towards human-level bimanual dexterous manipulation

Yuanpei Chen, Yiran Geng, Fangwei Zhong, Jiaming Ji, Jiechuang Jiang, Zongqing Lu, Hao Dong, and Yaodong Yang. Bi-dexhands: Towards human-level bimanual dexterous manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence , 46(5):2804–2818, 2024. doi: 10.1109/TPAMI.2023.3339515

work page doi:10.1109/tpami.2023.3339515 2024
[5]

Karen Liu

Yuanpei Chen, Chen Wang, Yaodong Yang, and C. Karen Liu. Object-centric dexterous manipulation from human motion data, 2024. URL https://arxiv.org/abs/2411.04005

work page arXiv 2024
[6]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research , page 02783649241273668, 2023

work page 2023
[7]

Dreher, T

Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6dor: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7359–7366, 2024. doi: 10.1109/IROS58592.2024. 10802733

work page doi:10.1109/iros58592.2024 2024
[8]

A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232, 2024

work page arXiv 2024
[9]

Gapartnet: Cross- category domain-generalizable object perception and manip- ulation via generalizable and actionable parts.arXiv preprint arXiv:2211.05272, 2022

Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via gener- alizable and actionable parts. arXiv preprint arXiv:2211.05272, 2022

work page arXiv 2022
[10]

Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations

Haoran Geng, Ziming Li, Yiran Geng, Jiayi Chen, Hao Dong, and He Wang. Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations. arXiv preprint arXiv:2303.16958, 2023

work page arXiv 2023
[11]

Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023

Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023

work page 2023
[12]

Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

work page arXiv 2025
[13]

Dextreme: Transfer of agile in-hand manipulation from simulation to reality

Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, Yashraj Narang, Jean-Francois Lafleche, Dieter Fox, and Gavriel State. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. arXiv, 2022

work page 2022
[14]

Sparsh: Self-supervised touch representations for vision- based tactile sensing.arXiv preprint arXiv:2410.24090, 2024

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lan- caster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, et al. Sparsh: Self-supervised touch representations for vision-based tactile sensing. arXiv preprint arXiv:2410.24090, 2024

work page arXiv 2024
[15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020

work page 2020
[16]

Dexsim2real 2: Building explicit world model for precise articulated object dexterous manipulation, 2024

Taoran Jiang, Liqian Ma, Yixuan Guan, Jiaojiao Meng, Weihang Chen, Zecui Zeng, Lusong Li, Dan Wu, Jing Xu, and Rui Chen. Dexsim2real 2: Building explicit world model for precise articulated object dexterous manipulation, 2024. URL https://arxiv.org/abs/ 2409.08750

work page arXiv 2024
[17]

Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation

Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation. arXiv preprint arXiv:2407.04689, 2024

work page arXiv 2024
[18]

Skillblender: Towards versatile humanoid whole-body loco-manipulation via skill blending,

Yuxuan Kuang, Haoran Geng, Amine Elhafsi, Tan-Dzung Do, Pieter Abbeel, Jitendra Malik, Marco Pavone, and Yue Wang. Skillblender: Towards versatile humanoid whole-body loco- manipulation via skill blending. arXiv preprint arXiv:2506.09366, 2025

work page arXiv 2025
[19]

Making sense of vision and touch: Learning multimodal representations for contact-rich tasks

Michelle A Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Jeannette Bohg. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics , 36(3): 582–596, 2020

work page 2020
[20]

Tenenbaum, and Chuang Gan

Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B. Tenenbaum, and Chuang Gan. Dexdeform: Dexterous deformable object manipulation with human demonstrations and differentiable physics, 2023. URL https://arxiv.org/abs/2304.03223

work page arXiv 2023
[21]

Learning visuotactile skills with two multifingered hands

Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, and Jitendra Malik. Learning visuotactile skills with two multifingered hands. IEEE International Conference on Robotics & Automation (ICRA) , 2025

work page 2025
[22]

Physpart: Physically plausible part completion for interactable objects

Rundong Luo*, Haoran Geng*, Congyue Deng, Puhao Li, Zan Wang, Baoxiong Jia, Leonidas Guidbas, and Siyuan Huang. Physpart: Physically plausible part completion for interactable objects. International Conference on Robotics and Automation (ICRA) , 2025. URL https: //arxiv.org/abs/2408.13724

work page arXiv 2025
[23]

In-Hand Object Rotation via Rapid Motor Adaptation

Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-Hand Object Rotation via Rapid Motor Adaptation. In Conference on Robot Learning (CoRL) , 2022

work page 2022
[24]

General in-hand object rotation with vision and touch

Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, and Jitendra Malik. General in-hand object rotation with vision and touch. In Conference on Robot Learning, pages 2549–2564. PMLR, 2023

work page 2023
[25]

Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control, 2025. URL https://arxiv.org/abs/2505.22642

work page arXiv 2025
[26]

The power of the senses: Generalizable manipulation from vision and touch through masked multimodal learning

Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, and Pieter Abbeel. The power of the senses: Generalizable manipulation from vision and touch through masked multimodal learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 9698–9705. IEEE, 2024

work page 2024
[27]

Cliport: What and where pathways for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning (CoRL), pages 894–906. PMLR, 2022. 11

work page 2022
[28]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning (CoRL) , pages 785–799. PMLR, 2023

work page 2023
[29]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[30]

Octo: An open-source generalist robot policy, 2023

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy, 2023

work page 2023
[31]

Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning

Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. arXiv preprint arXiv:2304.00464, 2023

work page arXiv 2023
[32]

Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation,

Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simula- tion. arXiv preprint arXiv:2210.02697, 2022

work page arXiv 2022
[33]

Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy

Yuran Wang, Ruihai Wu, Yue Chen, Jiarui Wang, Jiaqi Liang, Ziyu Zhu, Haoran Geng, Jitendra Malik, Pieter Abbeel, and Hao Dong. Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy, 2025. URL https://arxiv.org/abs/2505.11032

work page arXiv 2025
[34]

Canonical representation and force-based pretraining of 3d tactile for dexterous visuo-tactile policy learning, 2024

Tianhao Wu, Jinzhou Li, Jiyao Zhang, Mingdong Wu, and Hao Dong. Canonical representation and force-based pretraining of 3d tactile for dexterous visuo-tactile policy learning, 2024. URL https://arxiv.org/abs/2409.17549

work page arXiv 2024
[35]

Learning to manipulate deformable objects without demonstrations, 2020

Yilin Wu, Wilson Yan, Thanard Kurutach, Lerrel Pinto, and Pieter Abbeel. Learning to manipulate deformable objects without demonstrations, 2020. URL https://arxiv.org/ abs/1910.13439

work page arXiv 2020
[36]

Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy

Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. arXiv preprint arXiv:2303.00938, 2023

work page arXiv 2023
[37]

Unit: Unified tactile representation for robot learning

Zhengtong Xu, Raghava Uppuluri, Xinwei Zhang, Cael Fitch, Philip Glen Crandall, Wan Shou, Dongyi Wang, and Yu She. Unit: Unified tactile representation for robot learning. arXiv preprint arXiv:2408.06481, 2024

work page arXiv 2024
[38]

Binding touch to everything: Learn- ing unified multimodal tactile representations

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, et al. Binding touch to everything: Learn- ing unified multimodal tactile representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26340–26353, 2024

work page 2024
[39]

Rotating without seeing: Towards in-hand dexterity through touch.arXiv preprint arXiv:2303.10880, 2023

Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, and Xiaolong Wang. Rotating without seeing: Towards in-hand dexterity through touch. arXiv preprint arXiv:2303.10880, 2023

work page arXiv 2023
[40]

Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation

Kelin Yu, Yunhai Han, Qixian Wang, Vaibhav Saxena, Danfei Xu, and Ye Zhao. Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation. arXiv preprint arXiv:2310.16917, 2023

work page arXiv 2023
[41]

Robot synesthesia: In-hand manipulation with visuotactile sensing

Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, and Xiaolong Wang. Robot synesthesia: In-hand manipulation with visuotactile sensing. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 6558–6565. IEEE, 2024

work page 2024
[42]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation

Hui Zhang, Sammy Christen, Zicong Fan, Luocheng Zheng, Jemin Hwangbo, Jie Song, and Otmar Hilliges. Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation. In 2024 International Conference on 3D Vision (3DV) , pages 235–246, 2024. doi: 10.1109/3DV62453.2024.00016. 12

work page doi:10.1109/3dv62453.2024.00016 2024
[44]

Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes

Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In 8th Annual Conference on Robot Learning

work page
[45]

Rodrigues Network for Learning Robot Actions

Jialiang Zhang, Haoran Geng, Yang You, Congyue Deng, Pieter Abbeel, Jitendra Malik, and Leonidas Guibas. Rodrigues network for learning robot actions, 2025. URL https: //arxiv.org/abs/2506.02618

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Aloha unleashed: A simple recipe for robot dexterity

Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024

work page arXiv 2024
[48]

10610948

Sun Zhaole, Jihong Zhu, and Robert B. Fisher. Dexdlo: Learning goal-conditioned dexterous policy for dynamic manipulation of deformable linear objects. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 16009–16015, 2024. doi: 10.1109/ ICRA57147.2024.10610754. 13 Appendix A. Implementation and training details, including senso...

work page arXiv 2024

[1] [1]

Dexart: Benchmarking generalizable dexterous manipulation with articulated objects

Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 21190–21200, June 2023

work page 2023

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control, 2024. https://arxiv. org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

A system for general in-hand object re-orientation

Tao Chen, Jie Xu, and Pulkit Agrawal. A system for general in-hand object re-orientation. In Conference on Robot Learning, pages 297–307. PMLR, 2022

work page 2022

[4] [4]

Bi-dexhands: Towards human-level bimanual dexterous manipulation

Yuanpei Chen, Yiran Geng, Fangwei Zhong, Jiaming Ji, Jiechuang Jiang, Zongqing Lu, Hao Dong, and Yaodong Yang. Bi-dexhands: Towards human-level bimanual dexterous manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence , 46(5):2804–2818, 2024. doi: 10.1109/TPAMI.2023.3339515

work page doi:10.1109/tpami.2023.3339515 2024

[5] [5]

Karen Liu

Yuanpei Chen, Chen Wang, Yaodong Yang, and C. Karen Liu. Object-centric dexterous manipulation from human motion data, 2024. URL https://arxiv.org/abs/2411.04005

work page arXiv 2024

[6] [6]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research , page 02783649241273668, 2023

work page 2023

[7] [7]

Dreher, T

Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6dor: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7359–7366, 2024. doi: 10.1109/IROS58592.2024. 10802733

work page doi:10.1109/iros58592.2024 2024

[8] [8]

A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024

Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232, 2024

work page arXiv 2024

[9] [9]

Gapartnet: Cross- category domain-generalizable object perception and manip- ulation via generalizable and actionable parts.arXiv preprint arXiv:2211.05272, 2022

Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via gener- alizable and actionable parts. arXiv preprint arXiv:2211.05272, 2022

work page arXiv 2022

[10] [10]

Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations

Haoran Geng, Ziming Li, Yiran Geng, Jiayi Chen, Hao Dong, and He Wang. Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations. arXiv preprint arXiv:2303.16958, 2023

work page arXiv 2023

[11] [11]

Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023

Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023

work page 2023

[12] [12]

Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

work page arXiv 2025

[13] [13]

Dextreme: Transfer of agile in-hand manipulation from simulation to reality

Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, Yashraj Narang, Jean-Francois Lafleche, Dieter Fox, and Gavriel State. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. arXiv, 2022

work page 2022

[14] [14]

Sparsh: Self-supervised touch representations for vision- based tactile sensing.arXiv preprint arXiv:2410.24090, 2024

Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lan- caster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, et al. Sparsh: Self-supervised touch representations for vision-based tactile sensing. arXiv preprint arXiv:2410.24090, 2024

work page arXiv 2024

[15] [15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020

work page 2020

[16] [16]

Dexsim2real 2: Building explicit world model for precise articulated object dexterous manipulation, 2024

Taoran Jiang, Liqian Ma, Yixuan Guan, Jiaojiao Meng, Weihang Chen, Zecui Zeng, Lusong Li, Dan Wu, Jing Xu, and Rui Chen. Dexsim2real 2: Building explicit world model for precise articulated object dexterous manipulation, 2024. URL https://arxiv.org/abs/ 2409.08750

work page arXiv 2024

[17] [17]

Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation

Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation. arXiv preprint arXiv:2407.04689, 2024

work page arXiv 2024

[18] [18]

Skillblender: Towards versatile humanoid whole-body loco-manipulation via skill blending,

Yuxuan Kuang, Haoran Geng, Amine Elhafsi, Tan-Dzung Do, Pieter Abbeel, Jitendra Malik, Marco Pavone, and Yue Wang. Skillblender: Towards versatile humanoid whole-body loco- manipulation via skill blending. arXiv preprint arXiv:2506.09366, 2025

work page arXiv 2025

[19] [19]

Making sense of vision and touch: Learning multimodal representations for contact-rich tasks

Michelle A Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Jeannette Bohg. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics , 36(3): 582–596, 2020

work page 2020

[20] [20]

Tenenbaum, and Chuang Gan

Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B. Tenenbaum, and Chuang Gan. Dexdeform: Dexterous deformable object manipulation with human demonstrations and differentiable physics, 2023. URL https://arxiv.org/abs/2304.03223

work page arXiv 2023

[21] [21]

Learning visuotactile skills with two multifingered hands

Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, and Jitendra Malik. Learning visuotactile skills with two multifingered hands. IEEE International Conference on Robotics & Automation (ICRA) , 2025

work page 2025

[22] [22]

Physpart: Physically plausible part completion for interactable objects

Rundong Luo*, Haoran Geng*, Congyue Deng, Puhao Li, Zan Wang, Baoxiong Jia, Leonidas Guidbas, and Siyuan Huang. Physpart: Physically plausible part completion for interactable objects. International Conference on Robotics and Automation (ICRA) , 2025. URL https: //arxiv.org/abs/2408.13724

work page arXiv 2025

[23] [23]

In-Hand Object Rotation via Rapid Motor Adaptation

Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-Hand Object Rotation via Rapid Motor Adaptation. In Conference on Robot Learning (CoRL) , 2022

work page 2022

[24] [24]

General in-hand object rotation with vision and touch

Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, and Jitendra Malik. General in-hand object rotation with vision and touch. In Conference on Robot Learning, pages 2549–2564. PMLR, 2023

work page 2023

[25] [25]

Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control.arXiv preprint arXiv:2505.22642, 2025

Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control, 2025. URL https://arxiv.org/abs/2505.22642

work page arXiv 2025

[26] [26]

The power of the senses: Generalizable manipulation from vision and touch through masked multimodal learning

Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, and Pieter Abbeel. The power of the senses: Generalizable manipulation from vision and touch through masked multimodal learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 9698–9705. IEEE, 2024

work page 2024

[27] [27]

Cliport: What and where pathways for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning (CoRL), pages 894–906. PMLR, 2022. 11

work page 2022

[28] [28]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning (CoRL) , pages 785–799. PMLR, 2023

work page 2023

[29] [29]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[30] [30]

Octo: An open-source generalist robot policy, 2023

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy, 2023

work page 2023

[31] [31]

Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning

Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. arXiv preprint arXiv:2304.00464, 2023

work page arXiv 2023

[32] [32]

Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation,

Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simula- tion. arXiv preprint arXiv:2210.02697, 2022

work page arXiv 2022

[33] [33]

Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy

Yuran Wang, Ruihai Wu, Yue Chen, Jiarui Wang, Jiaqi Liang, Ziyu Zhu, Haoran Geng, Jitendra Malik, Pieter Abbeel, and Hao Dong. Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy, 2025. URL https://arxiv.org/abs/2505.11032

work page arXiv 2025

[34] [34]

Canonical representation and force-based pretraining of 3d tactile for dexterous visuo-tactile policy learning, 2024

Tianhao Wu, Jinzhou Li, Jiyao Zhang, Mingdong Wu, and Hao Dong. Canonical representation and force-based pretraining of 3d tactile for dexterous visuo-tactile policy learning, 2024. URL https://arxiv.org/abs/2409.17549

work page arXiv 2024

[35] [35]

Learning to manipulate deformable objects without demonstrations, 2020

Yilin Wu, Wilson Yan, Thanard Kurutach, Lerrel Pinto, and Pieter Abbeel. Learning to manipulate deformable objects without demonstrations, 2020. URL https://arxiv.org/ abs/1910.13439

work page arXiv 2020

[36] [36]

Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy

Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. arXiv preprint arXiv:2303.00938, 2023

work page arXiv 2023

[37] [37]

Unit: Unified tactile representation for robot learning

Zhengtong Xu, Raghava Uppuluri, Xinwei Zhang, Cael Fitch, Philip Glen Crandall, Wan Shou, Dongyi Wang, and Yu She. Unit: Unified tactile representation for robot learning. arXiv preprint arXiv:2408.06481, 2024

work page arXiv 2024

[38] [38]

Binding touch to everything: Learn- ing unified multimodal tactile representations

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, et al. Binding touch to everything: Learn- ing unified multimodal tactile representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26340–26353, 2024

work page 2024

[39] [39]

Rotating without seeing: Towards in-hand dexterity through touch.arXiv preprint arXiv:2303.10880, 2023

Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, and Xiaolong Wang. Rotating without seeing: Towards in-hand dexterity through touch. arXiv preprint arXiv:2303.10880, 2023

work page arXiv 2023

[40] [40]

Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation

Kelin Yu, Yunhai Han, Qixian Wang, Vaibhav Saxena, Danfei Xu, and Ye Zhao. Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation. arXiv preprint arXiv:2310.16917, 2023

work page arXiv 2023

[41] [41]

Robot synesthesia: In-hand manipulation with visuotactile sensing

Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, and Xiaolong Wang. Robot synesthesia: In-hand manipulation with visuotactile sensing. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 6558–6565. IEEE, 2024

work page 2024

[42] [42]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation

Hui Zhang, Sammy Christen, Zicong Fan, Luocheng Zheng, Jemin Hwangbo, Jie Song, and Otmar Hilliges. Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation. In 2024 International Conference on 3D Vision (3DV) , pages 235–246, 2024. doi: 10.1109/3DV62453.2024.00016. 12

work page doi:10.1109/3dv62453.2024.00016 2024

[44] [44]

Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes

Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In 8th Annual Conference on Robot Learning

work page

[45] [45]

Rodrigues Network for Learning Robot Actions

Jialiang Zhang, Haoran Geng, Yang You, Congyue Deng, Pieter Abbeel, Jitendra Malik, and Leonidas Guibas. Rodrigues network for learning robot actions, 2025. URL https: //arxiv.org/abs/2506.02618

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Aloha unleashed: A simple recipe for robot dexterity

Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024

work page arXiv 2024

[48] [48]

10610948

Sun Zhaole, Jihong Zhu, and Robert B. Fisher. Dexdlo: Learning goal-conditioned dexterous policy for dynamic manipulation of deformable linear objects. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 16009–16015, 2024. doi: 10.1109/ ICRA57147.2024.10610754. 13 Appendix A. Implementation and training details, including senso...

work page arXiv 2024