Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

2018 · cs.RO · arXiv 1810.10191

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. We use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometry, configurations, and clearances, while being robust to external perturbations. Results for simulated and real robot experiments are presented.

representative citing papers

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

cs.RO · 2025-11-18 · unverdicted · novelty 6.0

MSDP pretrains a transformer encoder via masked multisensory reconstruction and feeds the embeddings into an asymmetric actor-critic RL setup, yielding faster learning and high real-robot success rates with only 6,000 interactions.

Grasping Using Tactile Sensing and Deep Calibration

cs.RO · 2019-07-23 · unverdicted · novelty 3.0

A tactile feedback approach for robot grasping evaluated on a real robot, using deep learning to eliminate bias in force-torque sensor data.

