ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
Pith reviewed 2026-05-19 09:42 UTC · model grok-4.3
The pith
ViTacFormer fuses vision and touch via cross-attention and autoregressive tactile prediction to reach roughly 50 percent higher success rates in real-world dexterous manipulation and complete 11-stage tasks lasting 2.5 minutes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViTacFormer couples a cross-attention encoder that fuses high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. An easy-to-challenging curriculum steadily refines the visual-tactile latent space. The learned representation then drives imitation learning for multi-fingered hands, producing approximately 50 percent higher success rates than prior state-of-the-art systems on real-world benchmarks and enabling the first autonomous completion of long-horizon tasks that require up to 11 sequential stages and 2.5 minutes of continuous operation.
What carries the argument
Cross-attention encoder fused with an autoregressive tactile prediction head, refined by an easy-to-challenging curriculum, that builds a shared visual-tactile latent space for imitation learning policies.
If this is right
- Imitation learning policies gain precision and adaptability for multi-fingered hands in contact-rich tasks.
- Manipulation remains effective in visually occluded settings where tactile feedback becomes primary.
- Continuous operation extends to sequences of 11 or more stages without external resets.
- Success rates rise across a range of challenging real-world benchmarks compared with earlier approaches.
Where Pith is reading between the lines
- The same fusion and prediction architecture could be tested on different robot hands or additional sensor types to check transfer.
- The curriculum progression might shorten training time for other fine-control robotics problems that mix vision and force data.
- Long-horizon stability suggests the representation could support more complex sequences in household or factory settings if generalization holds.
- Evaluating performance under changed lighting or partial sensor failure would test robustness claims directly.
Load-bearing premise
The cross-modal representation learned from the reported benchmarks and hardware will generalize to new objects, sensor calibrations, and unstructured environments instead of overfitting to the specific training conditions.
What would settle it
Deploy the trained policy on a collection of previously unseen objects with novel shapes, sizes, and surface properties while keeping the same hand and sensors, then measure whether success rates fall below the claimed improvement over prior methods.
Figures
read the original abstract
Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ViTacFormer, a cross-modal representation learning framework for visuo-tactile dexterous manipulation. It combines a cross-attention encoder for fusing vision and high-resolution touch with an autoregressive tactile prediction head and trains it using an easy-to-challenging curriculum. The learned representation is used for imitation learning on multi-fingered anthropomorphic hands, with claims of ~50% improvement in success rates over prior SOTA and pioneering long-horizon tasks of up to 11 stages lasting 2.5 minutes on real-world benchmarks.
Significance. Should the empirical claims prove robust upon detailed verification, this work would represent a notable advance in robotic manipulation by demonstrating how cross-modal visuo-tactile representations can enable precise, adaptive, and extended autonomous behaviors in dexterous tasks. The proposed architecture and curriculum offer a concrete approach to addressing the challenges of integrating tactile sensing with vision for unstructured settings.
major comments (2)
- Abstract: The abstract asserts approximately 50% higher success rates and the first autonomous completion of 11-stage tasks, but provides no details on experimental protocol, baseline comparisons, number of trials, or statistical significance, which are essential to substantiate these central performance claims.
- Experimental Results: Results are presented only for the fixed set of benchmark tasks using the same anthropomorphic hand and sensor setup; the absence of tests on held-out objects, deliberate calibration variations, or unstructured scene changes leaves the generalization of the learned cross-modal representation unverified, directly impacting the long-horizon autonomy claim.
minor comments (2)
- Methods: The description of the curriculum progression could benefit from more explicit details on the schedule and hyperparameters to allow reproducibility.
- Abstract: Clarify the specific hardware (e.g., hand model and tactile sensor type) used in the experiments for better context.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and address concerns about substantiation and generalization.
read point-by-point responses
-
Referee: Abstract: The abstract asserts approximately 50% higher success rates and the first autonomous completion of 11-stage tasks, but provides no details on experimental protocol, baseline comparisons, number of trials, or statistical significance, which are essential to substantiate these central performance claims.
Authors: We agree that the abstract would benefit from additional context to support the performance claims. We have revised the abstract to include a brief reference to the evaluation protocol, noting that results are drawn from real-world experiments with multiple trials per task and direct comparisons against prior state-of-the-art methods, with statistical significance reported in the main text. Full details on trial counts, baselines, and p-values remain in Sections 4 and 5 due to abstract length limits. revision: yes
-
Referee: Experimental Results: Results are presented only for the fixed set of benchmark tasks using the same anthropomorphic hand and sensor setup; the absence of tests on held-out objects, deliberate calibration variations, or unstructured scene changes leaves the generalization of the learned cross-modal representation unverified, directly impacting the long-horizon autonomy claim.
Authors: Our benchmark suite already incorporates object pose variations, lighting changes, and contact dynamics across the reported tasks to probe robustness. Nevertheless, we acknowledge that explicit tests on held-out objects and deliberate calibration shifts would provide stronger evidence for generalization. In the revised manuscript we have added a dedicated paragraph in the Discussion section that explicitly discusses this limitation, clarifies the scope of the current benchmarks, and outlines directions for future generalization experiments while preserving the claims supported by the existing long-horizon results. revision: partial
Circularity Check
No circularity: empirical performance claims rest on measured robot experiments
full rationale
The paper describes an architecture (cross-attention encoder plus autoregressive tactile prediction head) trained via an easy-to-challenging curriculum to produce a visuo-tactile latent space, then applies the resulting representation to imitation learning on an anthropomorphic hand. All headline numbers—approximately 50% higher success rates and successful execution of up to 11-stage, 2.5-minute tasks—are reported as outcomes of real-world benchmark trials after training. No equation, prediction, or uniqueness claim is shown to reduce by construction to a fitted parameter, self-citation, or input quantity; the central results are externally measured success rates on fixed hardware and task suites rather than quantities defined or forced inside the model itself.
Axiom & Free-Parameter Ledger
free parameters (2)
- Curriculum progression schedule
- Transformer and prediction-head hyperparameters
axioms (2)
- domain assumption Cross-attention can produce a fused representation that is more useful for control than separate modality encoders
- domain assumption Autoregressive tactile prediction improves downstream manipulation policy performance
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
easy-to-challenging curriculum that steadily refines the visual-tactile latent space
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
long-horizon dexterous manipulation tasks ... 11 sequential stages
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
-
Multimodal Diffusion Forcing for Forceful Manipulation
Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
-
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
DexJoCo is a benchmark and toolkit with 11 functionally grounded tasks, 1.1K trajectories, and empirical benchmarks for task-oriented dexterous manipulation on MuJoCo.
-
DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions
DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...
-
Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding
Contact-Grounded Policy predicts coupled robot-state and tactile trajectories with a diffusion model and maps them via a learned consistency function to executable targets for compliance controllers, outperforming sta...
-
Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.
-
Learning Versatile Humanoid Manipulation with Touch Dreaming
HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...
-
Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms
A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.
Reference graph
Works this paper leans on
-
[1]
Dexart: Benchmarking generalizable dexterous manipulation with articulated objects
Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 21190–21200, June 2023
work page 2023
-
[2]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control, 2024. https://arxiv. org/abs/2410.24164
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
A system for general in-hand object re-orientation
Tao Chen, Jie Xu, and Pulkit Agrawal. A system for general in-hand object re-orientation. In Conference on Robot Learning, pages 297–307. PMLR, 2022
work page 2022
-
[4]
Bi-dexhands: Towards human-level bimanual dexterous manipulation
Yuanpei Chen, Yiran Geng, Fangwei Zhong, Jiaming Ji, Jiechuang Jiang, Zongqing Lu, Hao Dong, and Yaodong Yang. Bi-dexhands: Towards human-level bimanual dexterous manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence , 46(5):2804–2818, 2024. doi: 10.1109/TPAMI.2023.3339515
- [5]
-
[6]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research , page 02783649241273668, 2023
work page 2023
-
[7]
Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, and He Wang. Open6dor: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7359–7366, 2024. doi: 10.1109/IROS58592.2024. 10802733
-
[8]
A touch, vision, and language dataset for multimodal alignment.arXiv preprint arXiv:2402.13232, 2024
Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232, 2024
-
[9]
Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via gener- alizable and actionable parts. arXiv preprint arXiv:2211.05272, 2022
-
[10]
Haoran Geng, Ziming Li, Yiran Geng, Jiayi Chen, Hao Dong, and He Wang. Partmanip: Learning cross-category generalizable part manipulation policy from point cloud observations. arXiv preprint arXiv:2303.16958, 2023
-
[11]
Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable articulated-object manipulation under language instructions, 2023
work page 2023
-
[12]
Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...
-
[13]
Dextreme: Transfer of agile in-hand manipulation from simulation to reality
Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, Yashraj Narang, Jean-Francois Lafleche, Dieter Fox, and Gavriel State. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. arXiv, 2022
work page 2022
-
[14]
Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lan- caster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, et al. Sparsh: Self-supervised touch representations for vision-based tactile sensing. arXiv preprint arXiv:2410.24090, 2024
-
[15]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020
work page 2020
-
[16]
Taoran Jiang, Liqian Ma, Yixuan Guan, Jiaojiao Meng, Weihang Chen, Zecui Zeng, Lusong Li, Dan Wu, Jing Xu, and Rui Chen. Dexsim2real 2: Building explicit world model for precise articulated object dexterous manipulation, 2024. URL https://arxiv.org/abs/ 2409.08750
-
[17]
Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation
Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation. arXiv preprint arXiv:2407.04689, 2024
-
[18]
Skillblender: Towards versatile humanoid whole-body loco-manipulation via skill blending,
Yuxuan Kuang, Haoran Geng, Amine Elhafsi, Tan-Dzung Do, Pieter Abbeel, Jitendra Malik, Marco Pavone, and Yue Wang. Skillblender: Towards versatile humanoid whole-body loco- manipulation via skill blending. arXiv preprint arXiv:2506.09366, 2025
-
[19]
Making sense of vision and touch: Learning multimodal representations for contact-rich tasks
Michelle A Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Jeannette Bohg. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics , 36(3): 582–596, 2020
work page 2020
-
[20]
Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B. Tenenbaum, and Chuang Gan. Dexdeform: Dexterous deformable object manipulation with human demonstrations and differentiable physics, 2023. URL https://arxiv.org/abs/2304.03223
-
[21]
Learning visuotactile skills with two multifingered hands
Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, and Jitendra Malik. Learning visuotactile skills with two multifingered hands. IEEE International Conference on Robotics & Automation (ICRA) , 2025
work page 2025
-
[22]
Physpart: Physically plausible part completion for interactable objects
Rundong Luo*, Haoran Geng*, Congyue Deng, Puhao Li, Zan Wang, Baoxiong Jia, Leonidas Guidbas, and Siyuan Huang. Physpart: Physically plausible part completion for interactable objects. International Conference on Robotics and Automation (ICRA) , 2025. URL https: //arxiv.org/abs/2408.13724
-
[23]
In-Hand Object Rotation via Rapid Motor Adaptation
Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-Hand Object Rotation via Rapid Motor Adaptation. In Conference on Robot Learning (CoRL) , 2022
work page 2022
-
[24]
General in-hand object rotation with vision and touch
Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, and Jitendra Malik. General in-hand object rotation with vision and touch. In Conference on Robot Learning, pages 2549–2564. PMLR, 2023
work page 2023
-
[25]
Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control, 2025. URL https://arxiv.org/abs/2505.22642
-
[26]
Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, and Pieter Abbeel. The power of the senses: Generalizable manipulation from vision and touch through masked multimodal learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 9698–9705. IEEE, 2024
work page 2024
-
[27]
Cliport: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning (CoRL), pages 894–906. PMLR, 2022. 11
work page 2022
-
[28]
Perceiver-actor: A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning (CoRL) , pages 785–799. PMLR, 2023
work page 2023
-
[29]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[30]
Octo: An open-source generalist robot policy, 2023
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy, 2023
work page 2023
-
[31]
Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. arXiv preprint arXiv:2304.00464, 2023
-
[32]
Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation,
Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simula- tion. arXiv preprint arXiv:2210.02697, 2022
-
[33]
Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy
Yuran Wang, Ruihai Wu, Yue Chen, Jiarui Wang, Jiaqi Liang, Ziyu Zhu, Haoran Geng, Jitendra Malik, Pieter Abbeel, and Hao Dong. Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy, 2025. URL https://arxiv.org/abs/2505.11032
-
[34]
Tianhao Wu, Jinzhou Li, Jiyao Zhang, Mingdong Wu, and Hao Dong. Canonical representation and force-based pretraining of 3d tactile for dexterous visuo-tactile policy learning, 2024. URL https://arxiv.org/abs/2409.17549
-
[35]
Learning to manipulate deformable objects without demonstrations, 2020
Yilin Wu, Wilson Yan, Thanard Kurutach, Lerrel Pinto, and Pieter Abbeel. Learning to manipulate deformable objects without demonstrations, 2020. URL https://arxiv.org/ abs/1910.13439
-
[36]
Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. arXiv preprint arXiv:2303.00938, 2023
-
[37]
Unit: Unified tactile representation for robot learning
Zhengtong Xu, Raghava Uppuluri, Xinwei Zhang, Cael Fitch, Philip Glen Crandall, Wan Shou, Dongyi Wang, and Yu She. Unit: Unified tactile representation for robot learning. arXiv preprint arXiv:2408.06481, 2024
-
[38]
Binding touch to everything: Learn- ing unified multimodal tactile representations
Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, et al. Binding touch to everything: Learn- ing unified multimodal tactile representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26340–26353, 2024
work page 2024
-
[39]
Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, and Xiaolong Wang. Rotating without seeing: Towards in-hand dexterity through touch. arXiv preprint arXiv:2303.10880, 2023
-
[40]
Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation
Kelin Yu, Yunhai Han, Qixian Wang, Vaibhav Saxena, Danfei Xu, and Ye Zhao. Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation. arXiv preprint arXiv:2310.16917, 2023
-
[41]
Robot synesthesia: In-hand manipulation with visuotactile sensing
Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, and Xiaolong Wang. Robot synesthesia: In-hand manipulation with visuotactile sensing. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 6558–6565. IEEE, 2024
work page 2024
-
[42]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation
Hui Zhang, Sammy Christen, Zicong Fan, Luocheng Zheng, Jemin Hwangbo, Jie Song, and Otmar Hilliges. Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation. In 2024 International Conference on 3D Vision (3DV) , pages 235–246, 2024. doi: 10.1109/3DV62453.2024.00016. 12
-
[44]
Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes
Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In 8th Annual Conference on Robot Learning
-
[45]
Rodrigues Network for Learning Robot Actions
Jialiang Zhang, Haoran Geng, Yang You, Congyue Deng, Pieter Abbeel, Jitendra Malik, and Leonidas Guibas. Rodrigues network for learning robot actions, 2025. URL https: //arxiv.org/abs/2506.02618
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Aloha unleashed: A simple recipe for robot dexterity
Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024
-
[48]
Sun Zhaole, Jihong Zhu, and Robert B. Fisher. Dexdlo: Learning goal-conditioned dexterous policy for dynamic manipulation of deformable linear objects. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 16009–16015, 2024. doi: 10.1109/ ICRA57147.2024.10610754. 13 Appendix A. Implementation and training details, including senso...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.