NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

Avirup Das; Soham Patil; Sourabh Bhosale; Spandan Roy

arxiv: 2606.24450 · v1 · pith:XJJ2BPOAnew · submitted 2026-06-23 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-25 23:53 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords contact estimation · dexterous manipulation · vision-based sensing · proprioception · reinforcement learning · in-hand manipulation · transformer model · pseudo-tactile signal

The pith

A robot can infer binary contact states from RGB-D vision and proprioception to enable dexterous in-hand manipulation without tactile sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a transformer-based multimodal framework that fuses RGB-D images with the robot's proprioceptive data to predict binary contact states during hand-object interactions. This prediction serves as a pseudo-tactile signal in place of dedicated hardware sensors. A single model trained across multiple objects produces contact estimates that support reinforcement learning agents for in-hand object reorientation tasks. The approach demonstrates generalization to novel objects and succeeds in both simulation and real-robot experiments.

Core claim

The central claim is that a multimodal transformer fusing RGB-D vision with proprioception can infer binary contact states accurately enough to act as a pseudo-tactile signal, enabling reinforcement learning policies for in-hand object reorientation that generalize to novel objects, with validation through both simulated and physical robot experiments.

What carries the argument

Transformer-based multimodal fusion of RGB-D vision and proprioception to output binary contact predictions as a pseudo-tactile signal.
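
The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the authors' architecture: it assumes a single attention head, a query formed by summing embeddings of the current and commanded joint positions, and one sigmoid output per finger. All names (`predict_contact`, `cross_attend`, `w_pose`, `w_com`, `w_out`), the random weights, and the sizes are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [dot(row, x) for row in w]


def cross_attend(query, tokens):
    """Single-head scaled dot-product attention: one query over many tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]


def predict_contact(visual_tokens, q, q_com, w_pose, w_com, w_out):
    """Return one contact probability per finger (assumed output layout)."""
    # Embed current and commanded joint positions, then sum into one query.
    query = [a + b for a, b in zip(matvec(w_pose, q), matvec(w_com, q_com))]
    fused = cross_attend(query, visual_tokens)
    logits = matvec(w_out, fused)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, N_JOINTS, N_TOKENS, N_FINGERS = 8, 16, 10, 4  # illustrative sizes only
visual_tokens = random_matrix(N_TOKENS, D, rng)  # stand-in for encoder output
q = [rng.uniform(-1, 1) for _ in range(N_JOINTS)]
q_com = [rng.uniform(-1, 1) for _ in range(N_JOINTS)]
probs = predict_contact(
    visual_tokens, q, q_com,
    random_matrix(D, N_JOINTS, rng),
    random_matrix(D, N_JOINTS, rng),
    random_matrix(N_FINGERS, D, rng),
)
contact = [p > 0.5 for p in probs]  # binary pseudo-tactile signal
```

The point of the sketch is only the data flow: proprioception forms the query, visual tokens supply keys and values, and the output is thresholded into binary contact. A trained model would learn the weights and would likely use multiple heads, layers, and temporal context.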

If this is right

A single contact prediction model trained on multiple objects enables generalization to novel objects in downstream RL tasks.
The inferred contact signal directly supports RL agents for in-hand object reorientation without requiring tactile hardware.
Validation occurs in both simulation and on a physical robot, confirming feasibility for real-world use.
The method provides a scalable alternative to dedicated tactile sensors specifically for binary contact estimation.
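
One way the inferred contact signal might reach an RL agent is as extra entries in the policy observation. The sketch below assumes the observation is a flat vector of current joint positions, commanded joint positions, and binarized contact flags, with a 0.5 threshold. The function names, the layout, and the threshold are assumptions for illustration and are not taken from the paper.

```python
def binarize(probs, threshold=0.5):
    """Turn contact probabilities into 0.0/1.0 flags."""
    return [1.0 if p >= threshold else 0.0 for p in probs]


def build_observation(q, q_com, contact_probs, threshold=0.5):
    """Concatenate proprioception with the binarized pseudo-tactile signal."""
    return list(q) + list(q_com) + binarize(contact_probs, threshold)


# Toy values for a three-joint, three-finger hand (illustrative only).
obs = build_observation(
    q=[0.1, -0.2, 0.3],
    q_com=[0.15, -0.2, 0.25],
    contact_probs=[0.92, 0.08, 0.5],
)
```

Because the policy sees only the flags, swapping a hardware tactile sensor for the predicted signal would not change the observation shape under this layout.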

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could allow existing vision-equipped robots to perform contact-dependent tasks without adding fragile hardware.
The binary contact signal might serve as a foundation for estimating richer contact properties such as force direction or slip in future extensions.
Policies trained this way could transfer more readily across different hand morphologies if the vision-proprioception fusion remains robust.
Integration with other modalities like audio could further improve contact inference in noisy or occluded scenarios.

Load-bearing premise

Binary contact states inferred from RGB-D vision and proprioception are accurate and informative enough to train effective reinforcement learning policies that generalize to novel objects.

What would settle it

An experiment in which an RL policy trained using the inferred contact signal fails to achieve reliable in-hand reorientation on novel objects in the real world, even though the same policy succeeds when given ground-truth contact data.
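
The comparison in this test can be made concrete as a success-rate gap with a rough uncertainty estimate. The sketch below assumes each episode is scored as 0 or 1 and uses a simple binomial standard error. The episode outcomes are made up for illustration and are not results from the paper.

```python
import math


def success_rate(outcomes):
    """Return (rate, standard error) for a list of 0/1 episode outcomes."""
    n = len(outcomes)
    p = sum(outcomes) / n
    return p, math.sqrt(p * (1.0 - p) / n)


def contact_gap(ground_truth_runs, predicted_runs):
    """Success-rate drop when ground-truth contact is replaced by predictions."""
    p_gt, se_gt = success_rate(ground_truth_runs)
    p_pred, se_pred = success_rate(predicted_runs)
    gap = p_gt - p_pred
    se_gap = math.sqrt(se_gt ** 2 + se_pred ** 2)
    return gap, se_gap


# Made-up outcomes for illustration only.
gt_runs = [1] * 18 + [0] * 2    # 18/20 successes with ground-truth contact
pred_runs = [1] * 15 + [0] * 5  # 15/20 successes with predicted contact
gap, se_gap = contact_gap(gt_runs, pred_runs)
```

A gap that is large relative to its standard error on novel objects would count against the load-bearing premise; a gap near zero would support it.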

Figures

Figures reproduced from arXiv: 2606.24450 by Avirup Das, Soham Patil, Sourabh Bhosale, Spandan Roy.

**Figure 1.** Training: We collect vision, proprioception and contact data through a simulation environment by running a pretrained in-hand rotation policy. NoContactNoWorries: From synchronized RGB-D and proprioception at time t, the frozen encoder Φ(I_t, D_t) produces spatial features that are downsampled into visual tokens v_t. Current and commanded joint configurations (q_t, q_t^com) are embedded by ψ_pose and ψ_com; t…

**Figure 2.** Data augmentation in simulation with randomized background

**Figure 4.** Object Sets for Experiments. (a) Five primitive objects seen during training: cuboid, pentagonal prism, extruded star, dodecahedron and stairs. (b) Novel objects: an extruded letter ‘R’ and a hexagonal prism held out from all training, used to evaluate zero-shot generalization. A. Ablations and Baselines To isolate the role of each sensing modality and architectural component, we evaluate several controll…

**Figure 5.** Visual Occlusion during In-Hand Manipulation. Simulated views from the wrist-mounted camera during interaction with the Hexagonal Prism. (Left) A lightly occluded configuration where fingertip contact regions are largely visible. (Right) A heavily occluded configuration where the object geometry obstructs the view of the distal phalanges. As quantified in Table II, vision-only variants degrade in such fram…

**Figure 6.** Statistical dependence between joint tracking error and contact

**Figure 7.** Real-World Experimental Evaluation. Examples of objects being used for downstream policy testing with predicted contacts. in-hand object rotation task. Our approach leverages pose-conditioned cross-attention and temporal modeling to resolve visual ambiguities and align these cues with motion intent. We validated the approach extensively in both simulation and on physical hardware. The model generalized rob…

Abstract

Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their body's pose and movement. Inspired by this embodied perceptual skill, we investigate whether a robot can learn to infer contact from vision, an approach that also offers a scalable alternative to tactile hardware specifically for binary contact estimation, which faces practical challenges in cost, fragility, and integration. We present NoContactNoWorries, a transformer-based multimodal framework that fuses RGB-D vision with the robot's proprioception to infer binary contact states as a pseudo-tactile signal for hand-object interactions. We validate by training a single contact prediction model on multiple objects and show that the inferred contact signal supports downstream reinforcement learning agents for in-hand object reorientation, generalizing to novel objects. Experiments in both simulation and on a real-world robot validate our approach, highlighting the feasibility of inferring contact from vision and proprioception. Project Page: https://soham2560.github.io/no-contact-no-worries/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that a transformer fusing RGB-D and proprioception can predict binary contact states to support RL reorientation policies that generalize to new objects, with validation both in simulation and on real hardware.

The core result here is that vision plus proprioception can stand in for tactile sensing on the binary contact task. They train one model across several objects, feed the predicted contacts into an RL policy for in-hand reorientation, and report that it works on held-out objects in both simulation and on a physical robot.

What stands out is the end-to-end check: the contact signal is not just evaluated in isolation but actually used downstream, and the real-robot transfer is shown. That is more than many vision-only contact papers deliver.

The main limitation is that the abstract gives no numbers. No accuracy on contact prediction, no comparison to proprioception-only or vision-only baselines, no ablation on the transformer architecture, and no error bars on the RL success rates. Without those, it is hard to know whether the method is solving a real gap or just clearing a low bar. If the full paper has solid tables and ablations, this concern shrinks; if it stays at the level of the abstract, the claim is harder to assess.

The work is aimed at people building dexterous hands who want to avoid fragile tactile hardware. It is a practical engineering paper rather than a theoretical one. The approach is incremental but the real-world validation gives it enough substance to warrant referee time. I would send it out for review rather than desk-reject, with the expectation that the authors will need to add quantitative comparisons and failure analysis.
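
The missing numbers this note asks for are straightforward to compute once per-frame contact labels and predictions are available. The sketch below shows one plausible way to report them for a single finger: accuracy, precision, recall, and F1 on binary labels. The label sequences are invented for illustration and are not data from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 contact labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented per-frame labels for one finger (1 = contact, 0 = no contact).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reporting these per finger and per object, for the full model and for vision-only and proprioception-only variants, would let a reader judge how much each modality contributes.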

Referee Report

1 major / 0 minor

Summary. The paper introduces NoContactNoWorries, a transformer-based multimodal framework fusing RGB-D vision and robot proprioception to infer binary contact states as a pseudo-tactile signal for in-hand dexterous manipulation. It claims that a single contact prediction model trained on multiple objects enables downstream RL agents to perform in-hand object reorientation, with generalization to novel objects, and validates the approach through experiments in both simulation and on a real-world robot.

Significance. If the empirical results hold with proper quantitative support, the work offers a scalable, hardware-free alternative to tactile sensing for binary contact estimation, which could broaden access to dexterous manipulation research. The single-model training across objects and the downstream RL integration represent potentially useful contributions if substantiated.

major comments (1)

[Abstract] The central claim of validation in simulation and on a real robot, with generalization to novel objects for supporting effective RL policies, is asserted without any reported metrics, training details, baselines, or error analysis. This absence is load-bearing because the soundness of the pseudo-tactile signal for downstream tasks cannot be assessed from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for stronger quantitative support in the abstract. We address the comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim of validation in simulation and on a real robot, with generalization to novel objects for supporting effective RL policies, is asserted without any reported metrics, training details, baselines, or error analysis. This absence is load-bearing because the soundness of the pseudo-tactile signal for downstream tasks cannot be assessed from the provided text.

Authors: We agree that the abstract, as currently written, is a high-level summary and does not include specific quantitative metrics, training details, baselines, or error analysis. The full manuscript provides these details in the experimental sections (contact prediction performance and ablations in simulation, RL policy results with baselines and generalization to novel objects, and real-robot validation). To address the concern and make the central claims more assessable from the abstract alone, we will revise the abstract to include a small number of key quantitative highlights while remaining within length limits.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical ML pipeline is self-contained

The paper describes an empirical training procedure for a multimodal transformer that maps RGB-D images and proprioceptive states to binary contact labels, then feeds the resulting pseudo-tactile signal into separate RL policies for reorientation. No equations, derivations, or parameter-fitting steps are presented that would allow any claimed prediction to reduce to its own training inputs by construction. Validation rests on held-out simulation and real-robot experiments across novel objects, which are externally falsifiable and do not rely on self-citation chains or uniqueness theorems. The approach therefore contains no load-bearing circular reductions of the kinds enumerated in the analysis criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As an empirical ML paper, the claim rests on standard supervised learning assumptions for contact classification and RL policy training. The abstract explicitly introduces no free parameters or invented entities; the single axiom listed below is an implicit domain assumption.

axioms (1)

Domain assumption: standard assumptions in deep learning hold for the transformer model trained on contact labels derived from simulation or real data.
The single model trained on multiple objects is assumed to produce usable contact signals for RL without further specification.

pith-pipeline@v0.9.1-grok · 5735 in / 1206 out tokens · 24917 ms · 2026-06-25T23:53:36.889066+00:00 · methodology

