arXiv: 2511.14427 · v3 · submitted 2025-11-18 · cs.RO · cs.LG

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki

Pith reviewed 2026-05-17 21:01 UTC · model grok-4.3

keywords: multisensory pretraining · masked autoencoding · robot reinforcement learning · contact-rich manipulation · transformer encoder · asymmetric architecture · sensor fusion · self-supervised learning

The pith

Self-supervised multisensory pretraining allows robots to learn contact-rich manipulation with few real-world trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MultiSensory Dynamic Pretraining (MSDP), which uses masked autoencoding to learn representations from vision, force, and proprioception data. A transformer encoder learns to reconstruct observations from partial sensor inputs, promoting cross-modal fusion. In the reinforcement learning phase, the frozen encoder feeds an asymmetric actor-critic: the critic uses cross-attention to extract task-specific dynamic features, while the actor receives stable pooled features. The paper reports faster convergence and robustness to sensor noise and changing object properties in both simulation and real-robot experiments, with high success rates on physical hardware after as few as 6,000 online interactions.

Core claim

MSDP trains a transformer encoder via masked autoencoding to reconstruct multisensory observations from subsets of sensor embeddings, fostering cross-modal prediction. For policy learning, the asymmetric architecture freezes the encoder and lets the critic extract dynamic features via cross-attention while the actor receives pooled representations, resulting in accelerated learning and robust performance in contact-rich tasks.
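To make the masking step concrete, here is a minimal NumPy sketch; it is not the authors' implementation. The token layout (four vision patches, one force token, one proprioception token), the 50% mask ratio, and the names `random_token_mask` and `masked_reconstruction_loss` are assumptions chosen for illustration, since the abstract states none of them. Scoring only the hidden tokens follows the usual masked-autoencoder convention and is likewise an assumption here.

```python
import numpy as np


def random_token_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array; True marks a token hidden from the encoder."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(targets, predictions, mask):
    """Mean squared error computed over the hidden tokens only."""
    diff = predictions[mask] - targets[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)

# Hypothetical layout: 4 vision patches + 1 force token + 1 proprioception token,
# each embedded in 8 dimensions.
tokens = rng.normal(size=(6, 8))

mask = random_token_mask(num_tokens=6, mask_ratio=0.5, rng=rng)
visible = tokens[~mask]  # only these tokens would be passed to the encoder

# Placeholder for the encoder + decoder output. A real model would predict
# the hidden tokens from `visible`; zeros are used here to keep the sketch small.
predictions = np.zeros_like(tokens)
loss = masked_reconstruction_loss(tokens, predictions, mask)
```

Because the mask is drawn over tokens from all sensors, a hidden force token can only be recovered from the visible vision and proprioception tokens, which is the cross-modal prediction pressure the core claim describes.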

What carries the argument

MultiSensory Dynamic Pretraining (MSDP), which applies masked autoencoding to multisensory observations with a transformer encoder, combined with an asymmetric actor-critic architecture for downstream reinforcement learning.
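The asymmetry between the two heads can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's code: the single learnable query over encoder embeddings matches the paper passage quoted further down this page, but the single attention head, the mean pooling for the actor, the dimensions, and the names `critic_features` and `actor_features` are all assumed. Random arrays stand in for the frozen encoder output and the learned parameters.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def critic_features(embeddings, query, w_k, w_v):
    """Single-head cross-attention: one query attends over the embeddings."""
    keys = embeddings @ w_k                          # (num_tokens, d)
    values = embeddings @ w_v                        # (num_tokens, d)
    scores = keys @ query / np.sqrt(query.shape[0])  # (num_tokens,)
    weights = softmax(scores)
    return weights @ values, weights


def actor_features(embeddings):
    """Mean pooling over tokens (the pooling type is an assumption)."""
    return embeddings.mean(axis=0)


rng = np.random.default_rng(0)
num_tokens, d = 6, 8

embeddings = rng.normal(size=(num_tokens, d))  # stand-in for frozen encoder output
query = rng.normal(size=d)                     # would be learned with the critic
w_k = rng.normal(size=(d, d))                  # key projection (hypothetical)
w_v = rng.normal(size=(d, d))                  # value projection (hypothetical)

critic_feat, attn = critic_features(embeddings, query, w_k, w_v)
actor_feat = actor_features(embeddings)
```

In this sketch only `query`, `w_k`, and `w_v` would receive gradients from the critic loss, so the critic can re-weight sensor tokens per task while the actor's input changes only when the observation does.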

If this is right

Accelerated learning in multiple contact-rich robot manipulation tasks
Robust performance under sensor noise and changes in object dynamics
High success rates on real robots with as few as 6,000 online interactions
Effective sensor fusion through cross-modal prediction during pretraining

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such pretraining could be applied to additional sensory modalities like touch to further improve manipulation skills.
The separation of stable actor input and dynamic critic features might generalize to other RL settings where representations need to balance consistency and adaptability.
Offline pretraining on multisensory data may help bridge simulation to real-world transfer in robotics.

Load-bearing premise

The representations learned by masked autoencoding on multisensory observations contain the dynamic, task-relevant features needed by the critic without requiring additional fine-tuning or task-specific adaptation during pretraining.

What would settle it

A direct comparison showing whether removing the masked autoencoding pretraining or the cross-attention mechanism in the critic leads to significantly lower success rates and less robustness in the real-robot contact-rich tasks.

Figures

Figures reproduced from arXiv: 2511.14427 by Gabriele Tiboni, Georgia Chalvatzaki, Rickmer Krohn, Vignesh Prasad.

**Figure 2.** The MSDP framework with MSDP-Encoder (left), Pretraining (top right) and downstream RL (bottom right): The current multisensory …

**Figure 3.** Multisensory contact-rich robot environments

**Figure 4.** Performance comparison between MSDP-P and MSDP-R to the baselines in Peg Insertion, Push Cube, Close Drawer Gently and …

**Figure 5.** Peg Insertion Sensor Ablation: Proprioception is crucial to …

**Figure 8.** Pretraining the representation for Peg Insertion with multiple …

**Figure 9.** Critic’s cross-attention maps in the Push Cube task. The …

**Figure 10.** Real world setup and experimental results. Our MSDP framework enables training RL policies directly in the real world, with first …

Original abstract

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: https://msdp-pearl.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MSDP's masked multisensory pretraining plus asymmetric frozen-encoder critic is a sensible architecture tweak for contact-rich RL, but the real-robot gains at 6,000 online interactions still need clearer ablations to show that the pretraining itself is doing the heavy lifting.

The paper's main move is to pretrain a transformer encoder with masked reconstruction across vision, force, and proprioception, then plug the frozen embeddings into an asymmetric actor-critic where the critic cross-attends for task features and the actor gets a pooled summary. That asymmetric split is the clearest new piece relative to standard multimodal RL or single-modality pretraining baselines. It targets a real issue: RL struggles with noisy, shifting multisensory signals in contact tasks, and the reconstruction objective at least forces some cross-modal fusion before policy learning starts. The reported robustness to sensor noise and object dynamics changes, plus the real-robot success after only 6,000 interactions, are the practical claims worth noting if they hold up in the full experiments. The approach is straightforward enough that someone could try the architecture on their own setup without too much overhead.

The soft spot is exactly the one the stress-test flags. Because pretraining is purely reconstructive and action-free, the embeddings may mostly capture static correlations rather than the forward dynamics or force transients the critic needs for value estimation. Without strong ablations that turn the pretraining off or swap in random embeddings and still measure the same gains, it's hard to know how much credit belongs to the masked autoencoding versus the asymmetric design or the task choices themselves. The abstract also leaves out quantitative baselines, pretraining dataset details, and statistical reporting, which makes the central claims harder to assess at face value.

This is for roboticists and RL people working on sample-efficient multisensory manipulation. A reader who wants concrete ideas for freezing encoders while letting the critic adapt via attention would get something usable here.

It deserves peer review because the topic matters and the architecture has enough structure to be worth referee scrutiny, even if the experiments will probably need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MultiSensory Dynamic Pretraining (MSDP), a self-supervised framework that pretrains a transformer encoder via masked autoencoding on multisensory observations (vision, force, proprioception) to induce cross-modal fusion. For downstream reinforcement learning, it introduces an asymmetric actor-critic architecture in which the critic extracts task-specific dynamic features from the frozen embeddings via cross-attention while the actor receives a pooled representation. The paper claims that this yields accelerated learning, robustness to sensor noise and object dynamics changes, and high real-robot success rates in contact-rich manipulation tasks using as few as 6,000 online interactions.

Significance. If the empirical claims are substantiated, the work could provide a practical route to improved sample efficiency and robustness in multisensory robotic RL by separating representation learning from task-specific fine-tuning. The asymmetric critic design and emphasis on cross-modal prediction address a recognized challenge in contact-rich settings.

major comments (2)

[Abstract] The headline claim of high real-robot success with only 6,000 interactions is presented without quantitative baselines, ablation studies, statistical details, or description of the pretraining corpus and downstream task suite. This absence prevents evaluation of whether the reported gains are driven by MSDP rather than task selection or architecture alone.
[Pretraining and downstream sections] The masked reconstruction objective is purely reconstructive and action-free. No analysis is supplied showing that the resulting embeddings encode forward dynamics or force transients required by the critic for value estimation; if the embeddings primarily capture static correlations, the robustness and sample-efficiency claims would rest on the asymmetric architecture rather than the pretraining.

minor comments (2)

[Method] Clarify the precise masking ratio, sensor-specific embedding dimensions, and reconstruction loss weighting across modalities to support reproducibility.
[Experiments] Real-robot results should report trial counts, success-rate confidence intervals, and perturbation magnitudes for the claimed robustness.
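For the confidence intervals requested in the last comment, one common choice at small trial counts is the Wilson score interval. The sketch below uses only the Python standard library; the function name `wilson_interval` and the counts (18 successes in 20 trials) are hypothetical and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical counts for illustration only.
low, high = wilson_interval(successes=18, trials=20)
print(f"success rate 0.90, 95% interval [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is close to 100%, which is the typical regime for a few dozen real-robot rollouts.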

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our contributions and indicating planned revisions to strengthen the manuscript.

Referee: [Abstract] The headline claim of high real-robot success with only 6,000 interactions is presented without quantitative baselines, ablation studies, statistical details, or description of the pretraining corpus and downstream task suite. This absence prevents evaluation of whether the reported gains are driven by MSDP rather than task selection or architecture alone.

Authors: We agree that the abstract, as a concise summary, omits specific quantitative details that would better contextualize the results. In the revised manuscript we will expand the abstract to report key success rates with statistical significance, reference the scale of the pretraining corpus, and briefly describe the downstream task suite. We will also explicitly note that full baselines, ablations, and statistical analyses appear in Sections 4 and 5. These additions will make it clearer that the reported gains are attributable to MSDP rather than task choice alone.
revision: yes

Referee: [Pretraining and downstream sections] The masked reconstruction objective is purely reconstructive and action-free. No analysis is supplied showing that the resulting embeddings encode forward dynamics or force transients required by the critic for value estimation; if the embeddings primarily capture static correlations, the robustness and sample-efficiency claims would rest on the asymmetric architecture rather than the pretraining.

Authors: The masked multisensory reconstruction objective is indeed reconstructive and action-free; however, because the model must reconstruct force and proprioceptive signals from partial or missing visual and proprioceptive inputs across time, it is forced to capture cross-modal temporal correlations and force transients. Our experiments already demonstrate that MSDP embeddings yield faster learning and greater robustness to object-dynamics changes and sensor noise than baselines that use the same asymmetric architecture without pretraining. To directly address the concern, we will add a new analysis subsection that probes the frozen embeddings for their ability to predict short-term force changes and state transitions, together with an ablation isolating the contribution of pretraining versus the critic’s cross-attention mechanism.
revision: partial
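A probing analysis of the kind promised in this response could take the form of a linear probe on the frozen embeddings. The following NumPy sketch is one possible shape for it, not a description of what the authors ran: the data are synthetic, and the embedding size, the three-axis force-change target, the ridge strength, and the names `fit_linear_probe` and `probe_r2` are all assumptions.

```python
import numpy as np


def add_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, targets, ridge=1e-3):
    """Closed-form ridge regression from embeddings to targets."""
    x = add_bias(features)
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_r2(features, targets, weights):
    """Pooled coefficient of determination on held-out data."""
    residual = targets - add_bias(features) @ weights
    total = targets - targets.mean(axis=0)
    return 1.0 - float(np.sum(residual ** 2) / np.sum(total ** 2))


rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 frozen embeddings (dim 16) and the change in a
# 3-axis force reading over a short horizon. Real data would come from logs.
embeddings = rng.normal(size=(200, 16))
true_map = rng.normal(size=(16, 3))
force_delta = embeddings @ true_map + 0.1 * rng.normal(size=(200, 3))

train, test = slice(0, 150), slice(150, 200)
weights = fit_linear_probe(embeddings[train], force_delta[train])
score = probe_r2(embeddings[test], force_delta[test], weights)
```

A held-out score near 1 would indicate that short-term force changes are linearly decodable from the embeddings; a score near 0 for pretrained embeddings, or no gap against randomly initialized ones, would support the referee's concern that the features are mostly static.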

Circularity Check

0 steps flagged

No significant circularity; pretraining objective independent of downstream task

The paper's core derivation separates masked autoencoding pretraining (reconstructive, action-free, on multisensory observations) from the downstream asymmetric actor-critic RL stage. No equations or steps reduce the reported success rates or robustness claims to a fitted quantity defined by the evaluation data itself. The pretraining loss is defined independently of task rewards, and the frozen embeddings are used without task-specific adaptation during pretraining. This satisfies the default expectation of a self-contained pipeline with no load-bearing self-definition or fitted-input-as-prediction patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or implementation details; therefore no free parameters, domain axioms, or invented entities can be identified beyond the standard assumption that transformer-based masked reconstruction yields transferable multisensory features.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "The critic uses a single cross-attention layer with a learnable query and the multisensory embeddings from the MSDP encoder as keys and values."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
