pith. machine review for the scientific record.

arxiv: 2511.14427 · v3 · submitted 2025-11-18 · 💻 cs.RO · cs.LG

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Pith reviewed 2026-05-17 21:01 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords multisensory pretraining · masked autoencoding · robot reinforcement learning · contact-rich manipulation · transformer encoder · asymmetric architecture · sensor fusion · self-supervised learning

The pith

Self-supervised multisensory pretraining allows robots to learn contact-rich manipulation with few real-world trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MultiSensory Dynamic Pretraining (MSDP) to learn useful representations from vision, force, and proprioception data using masked autoencoding. A transformer encoder learns to reconstruct observations from partial sensor inputs, which promotes cross-modal fusion. In the reinforcement learning phase, the frozen encoder feeds an asymmetric actor-critic: the critic uses cross-attention to extract task-specific dynamics, while the actor uses stable pooled features. The paper reports faster convergence and robustness to sensor noise and changing object properties in both simulation and real-robot experiments, with high success rates after as few as 6,000 online interactions on physical hardware.

Core claim

MSDP trains a transformer encoder via masked autoencoding to reconstruct multisensory observations from subsets of sensor embeddings, fostering cross-modal prediction. For policy learning, the asymmetric architecture freezes the encoder and lets the critic extract dynamic features via cross-attention while the actor receives pooled representations, resulting in accelerated learning and robust performance in contact-rich tasks.
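
The masking step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token layout, the `mask_ratio` of 0.5, the helper names `sample_visible_indices` and `reconstruction_loss`, and the choice to average the squared error over all tokens are assumptions made here, since the summary does not state them.

```python
import random


def sample_visible_indices(num_tokens, mask_ratio, rng):
    """Pick which sensor tokens the encoder is allowed to see."""
    num_visible = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_visible))


def reconstruction_loss(predicted, target):
    """Mean squared error over every token, visible or masked (assumed)."""
    total, count = 0.0, 0
    for pred_tok, tgt_tok in zip(predicted, target):
        for p, t in zip(pred_tok, tgt_tok):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical layout: 4 vision tokens, 1 force token, 1 proprioception token.
modalities = ["vision"] * 4 + ["force"] + ["proprio"]
rng = random.Random(0)
visible = sample_visible_indices(len(modalities), mask_ratio=0.5, rng=rng)
masked = [i for i in range(len(modalities)) if i not in visible]
```

In a full pipeline the encoder would receive only the `visible` tokens and a decoder would be trained to reproduce the complete observation, so that, for example, a masked force token has to be inferred from vision and proprioception.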

What carries the argument

MultiSensory Dynamic Pretraining (MSDP), which applies masked autoencoding to multisensory observations with a transformer encoder, combined with an asymmetric actor-critic architecture for downstream reinforcement learning.
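
The asymmetry between the two heads can be illustrated with a toy example. The sketch below assumes mean pooling for the actor and a single-query, single-head attention without learned projections for the critic; the function names, the embedding values, and the query vector are hypothetical, and the paper's actual layers may differ.

```python
import math


def mean_pool(tokens):
    """Actor input: average the frozen token embeddings into one vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def cross_attention(query, tokens):
    """Critic input: one query attends over the frozen token embeddings.

    Keys and values are the tokens themselves (no projections) to keep
    the sketch short; a real critic would use learned projections.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(tokens[0])
    attended = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return attended, weights


# Hypothetical frozen embeddings for three sensor tokens (dimension 2).
frozen_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
critic_query = [2.0, 0.0]  # stands in for a learned, task-specific query

actor_features = mean_pool(frozen_tokens)
critic_features, attention_weights = cross_attention(critic_query, frozen_tokens)
```

The pooled vector weights every token equally regardless of the task, whereas the attention weights depend on the query, which is the sense in which the critic can select task-specific features from the same frozen embeddings.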

If this is right

  • Accelerated learning in multiple contact-rich robot manipulation tasks
  • Robust performance under sensor noise and changes in object dynamics
  • High success rates on real robots with as few as 6,000 online interactions
  • Effective sensor fusion through cross-modal prediction during pretraining

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such pretraining could be applied to additional sensory modalities, such as tactile sensing, to further improve manipulation skills.
  • The separation of stable actor input and dynamic critic features might generalize to other RL settings where representations need to balance consistency and adaptability.
  • Offline pretraining on multisensory data may help bridge simulation to real-world transfer in robotics.

Load-bearing premise

The representations learned by masked autoencoding on multisensory observations contain the dynamic, task-relevant features needed by the critic without requiring additional fine-tuning or task-specific adaptation during pretraining.

What would settle it

A direct comparison showing whether removing the masked autoencoding pretraining or the cross-attention mechanism in the critic leads to significantly lower success rates and less robustness in the real-robot contact-rich tasks.

Figures

Figures reproduced from arXiv: 2511.14427 by Gabriele Tiboni, Georgia Chalvatzaki, Rickmer Krohn, Vignesh Prasad.

Figure 1: Multisensory Dynamic Pretraining fuses multiple sensors, …
Figure 2: The MSDP framework with MSDP-Encoder (left), Pretraining (top right) and downstream RL (bottom right): The current multisensory …
Figure 3: Multisensory contact-rich robot environments
Figure 4: Performance comparison between MSDP-P and MSDP-R to the baselines in Peg Insertion, Push Cube, Close Drawer Gently and …
Figure 5: Peg Insertion Sensor Ablation: Proprioception is crucial to …
Figure 8: Pretraining the representation for Peg Insertion with multiple …
Figure 9: Critic’s cross-attention maps in the Push Cube task. The …
Figure 10: Real world setup and experimental results. Our MSDP framework enables training RL policies directly in the real world, with first …
Original abstract

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: https://msdp-pearl.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MultiSensory Dynamic Pretraining (MSDP), a self-supervised framework that pretrains a transformer encoder via masked autoencoding on multisensory observations (vision, force, proprioception) to induce cross-modal fusion. For downstream reinforcement learning, it introduces an asymmetric actor-critic architecture in which the critic extracts task-specific dynamic features from the frozen embeddings via cross-attention while the actor receives a pooled representation. The paper claims that this yields accelerated learning, robustness to sensor noise and object dynamics changes, and high real-robot success rates in contact-rich manipulation tasks using as few as 6,000 online interactions.

Significance. If the empirical claims are substantiated, the work could provide a practical route to improved sample efficiency and robustness in multisensory robotic RL by separating representation learning from task-specific fine-tuning. The asymmetric critic design and emphasis on cross-modal prediction address a recognized challenge in contact-rich settings.

major comments (2)
  1. [Abstract] The headline claim of high real-robot success with only 6,000 interactions is presented without quantitative baselines, ablation studies, statistical details, or description of the pretraining corpus and downstream task suite. This absence prevents evaluation of whether the reported gains are driven by MSDP rather than task selection or architecture alone.
  2. [Pretraining and downstream sections] The masked reconstruction objective is purely reconstructive and action-free. No analysis is supplied showing that the resulting embeddings encode forward dynamics or force transients required by the critic for value estimation; if the embeddings primarily capture static correlations, the robustness and sample-efficiency claims would rest on the asymmetric architecture rather than the pretraining.
minor comments (2)
  1. [Method] Clarify the precise masking ratio, sensor-specific embedding dimensions, and reconstruction loss weighting across modalities to support reproducibility.
  2. [Experiments] Real-robot results should report trial counts, success-rate confidence intervals, and perturbation magnitudes for the claimed robustness.
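
As an illustration of the interval reporting requested in the second minor comment, the sketch below computes a Wilson score interval for a success rate. The choice of the Wilson interval and the example counts (18 successes in 20 trials) are assumptions for illustration only and are not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical numbers, not taken from the paper: 18 successes in 20 trials.
low, high = wilson_interval(18, 20)
```

With these hypothetical counts the interval is roughly 0.70 to 0.97, which shows how wide the uncertainty remains at typical real-robot trial counts even when the observed success rate is 90%.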

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our contributions and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of high real-robot success with only 6,000 interactions is presented without quantitative baselines, ablation studies, statistical details, or description of the pretraining corpus and downstream task suite. This absence prevents evaluation of whether the reported gains are driven by MSDP rather than task selection or architecture alone.

    Authors: We agree that the abstract, as a concise summary, omits specific quantitative details that would better contextualize the results. In the revised manuscript we will expand the abstract to report key success rates with statistical significance, reference the scale of the pretraining corpus, and briefly describe the downstream task suite. We will also explicitly note that full baselines, ablations, and statistical analyses appear in Sections 4 and 5. These additions will make it clearer that the reported gains are attributable to MSDP rather than task choice alone.

    Revision: yes

  2. Referee: [Pretraining and downstream sections] The masked reconstruction objective is purely reconstructive and action-free. No analysis is supplied showing that the resulting embeddings encode forward dynamics or force transients required by the critic for value estimation; if the embeddings primarily capture static correlations, the robustness and sample-efficiency claims would rest on the asymmetric architecture rather than the pretraining.

    Authors: The masked multisensory reconstruction objective is indeed reconstructive and action-free; however, because the model must reconstruct force and proprioceptive signals from partial or missing visual and proprioceptive inputs across time, it is forced to capture cross-modal temporal correlations and force transients. Our experiments already demonstrate that MSDP embeddings yield faster learning and greater robustness to object-dynamics changes and sensor noise than baselines that use the same asymmetric architecture without pretraining. To directly address the concern, we will add a new analysis subsection that probes the frozen embeddings for their ability to predict short-term force changes and state transitions, together with an ablation isolating the contribution of pretraining versus the critic’s cross-attention mechanism.

    Revision: partial
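
A probing analysis of the kind promised in this response could take the form of a linear probe on the frozen embeddings. The sketch below is one possible shape for it, under stated assumptions: the probe is a plain linear regression fitted by gradient descent, the "embeddings" and "force change" targets are synthetic stand-ins generated here, and none of the names or numbers come from the paper.

```python
import random


def fit_linear_probe(features, targets, lr=0.1, epochs=2000):
    """Fit weights and bias by full-batch gradient descent on squared error."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            grad_b += 2 * err / n
            for d in range(dim):
                grad_w[d] += 2 * err * x[d] / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def r_squared(features, targets, weights, bias):
    """Coefficient of determination of the probe's predictions."""
    preds = [sum(w * xi for w, xi in zip(weights, x)) + bias for x in features]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins: "embeddings" are random vectors and the "force change"
# is a linear function of them, so the probe should fit almost perfectly.
rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
force_delta = [2.0 * e[0] - 1.0 * e[1] + 0.5 * e[2] + 0.1 for e in embeddings]

w, b = fit_linear_probe(embeddings, force_delta)
score = r_squared(embeddings, force_delta, w, b)
```

On real data the probe would be fitted on one set of trajectories and scored on held-out ones; a high held-out score for pretrained embeddings and a low one for a randomly initialized encoder would support the claim that pretraining, and not only the critic's cross-attention, encodes force transients.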

Circularity Check

0 steps flagged

No significant circularity; pretraining objective independent of downstream task

Full rationale

The paper's core derivation separates masked autoencoding pretraining (reconstructive, action-free, on multisensory observations) from the downstream asymmetric actor-critic RL stage. No equations or steps reduce the reported success rates or robustness claims to a fitted quantity defined by the evaluation data itself. The pretraining loss is defined independently of task rewards, and the frozen embeddings are used without task-specific adaptation during pretraining. This satisfies the default expectation of a self-contained pipeline with no load-bearing self-definition or fitted-input-as-prediction patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or implementation details; therefore no free parameters, domain axioms, or invented entities can be identified beyond the standard assumption that transformer-based masked reconstruction yields transferable multisensory features.

pith-pipeline@v0.9.0 · 5529 in / 1257 out tokens · 59701 ms · 2026-05-17T21:01:17.451795+00:00 · methodology


