arxiv: 2507.09180 · v4 · pith:52GRFCRCnew · submitted 2025-07-12 · 💻 cs.CV · cs.RO

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

Pith reviewed 2026-05-21 23:54 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords multimodal fusion · sim-to-real transfer · visual reinforcement learning · vision transformer · contrastive learning · domain randomization · zero-shot transfer · robotic manipulation

The pith

A vision transformer fuses RGB and depth via separate CNN stems plus masked contrastive learning to improve zero-shot sim-to-real transfer in visual reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes processing RGB and depth images through separate CNN stems, then feeding the combined features into a scalable vision transformer, while adding masked-token contrastive learning and curriculum domain randomization. This combination is intended to produce visual representations that generalize across the simulation-to-reality gap. A sympathetic reader would care because visual reinforcement learning policies often fail when moved from simulation to physical robots, and a reliable fusion method could let training happen almost entirely in simulation. If the approach works, robots could learn manipulation skills in simulation and execute them directly in the real world without further real-world data collection or fine-tuning.

Core claim

Processing RGB and depth through separate CNN stems, delivering the combined convolutional features to a scalable vision transformer, and applying a contrastive learning scheme with masked and unmasked tokens together with curriculum-based domain randomization yields visual representations that outperform baseline methods in simulation and support successful zero-shot transfer to real-world robotic manipulation tasks.
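
The masked-versus-unmasked contrastive idea in the claim can be illustrated with a small, standard-library-only sketch. It is not the paper's loss: the names `mask_tokens`, `mean_pool`, and `info_nce` are hypothetical, mean pooling stands in for the learned transformer encoder, zeroing stands in for whatever masking the authors use, and the masking probability and temperature are placeholder values. It only shows the general shape of an InfoNCE-style objective in which the masked view of an observation should match its own unmasked view more closely than the unmasked views of other observations.

```python
import math
import random


def mask_tokens(tokens, mask_prob, rng):
    """Zero out each token vector independently with probability mask_prob (assumed masking rule)."""
    return [[0.0] * len(t) if rng.random() < mask_prob else list(t) for t in tokens]


def mean_pool(tokens):
    """Average token vectors into one embedding (stand-in for a learned encoder)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity with a small epsilon so all-zero vectors do not divide by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] are the positive pair, other keys are negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        peak = max(logits)
        # Numerically stable log-sum-exp over all keys in the batch.
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


# Toy batch: 4 observations, each with 6 tokens of dimension 8 (random placeholder data).
rng = random.Random(0)
batch = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)] for _ in range(4)]
keys = [mean_pool(obs) for obs in batch]                            # unmasked views
queries = [mean_pool(mask_tokens(obs, 0.5, rng)) for obs in batch]  # masked views
loss = info_nce(queries, keys)
print(f"contrastive loss on toy batch: {loss:.4f}")
```

In a real training loop the pooled embeddings would come from the fused RGB-D transformer, and the loss would be minimized jointly with the reinforcement learning objective; how the paper weights the two terms is not stated in the visible text.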

What carries the argument

Multimodal fusion backbone that routes RGB and depth through separate CNN stems into a scalable vision transformer, augmented by masked-token contrastive learning.
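
A shape-level sketch of the two-stream routing may help make this concrete. Everything here is an assumption for illustration: `patch_stem` replaces each CNN stem with simple per-patch averaging, `fuse` combines the streams by per-token feature concatenation (the summary only says the features are "combined"), and `self_attention` is a single scaled dot-product step with identity projections rather than a full vision transformer block.

```python
import math


def patch_stem(image, patch):
    """Stand-in for a CNN stem: average each patch x patch block per channel.

    image is a nested list indexed [channel][row][col]; height and width must be
    multiples of patch. Returns one feature vector per patch.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            feature = []
            for c in range(channels):
                block = [image[c][r][col]
                         for r in range(top, top + patch)
                         for col in range(left, left + patch)]
                feature.append(sum(block) / len(block))
            tokens.append(feature)
    return tokens


def fuse(rgb_tokens, depth_tokens):
    """Assumed fusion rule: concatenate RGB and depth features token by token."""
    return [r + d for r, d in zip(rgb_tokens, depth_tokens)]


def self_attention(tokens):
    """One scaled dot-product self-attention step with identity Q/K/V projections."""
    dim = len(tokens[0])
    output = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        output.append([sum(w / norm * v[d] for w, v in zip(weights, tokens))
                       for d in range(dim)])
    return output


# Toy 4x4 inputs: a 3-channel RGB image and a 1-channel depth map (placeholder values).
rgb = [[[float(c + r + col) for col in range(4)] for r in range(4)] for c in range(3)]
depth = [[[0.5 * r for col in range(4)] for r in range(4)]]

fused = fuse(patch_stem(rgb, 2), patch_stem(depth, 2))  # 4 tokens, 3 + 1 features each
encoded = self_attention(fused)
print(len(encoded), "tokens of dimension", len(encoded[0]))
```

The point of the sketch is the data flow: each modality is tokenized by its own stem before any cross-modal mixing, and the transformer only ever sees the combined tokens.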

If this is right

  • The fusion scheme produces higher performance than other baselines in simulation experiments.
  • The resulting model can execute real-world manipulation tasks through zero-shot transfer from simulation.
  • Curriculum domain randomization stabilizes training while the masked contrastive objective improves sample efficiency.
  • The combined representations generalize better across appearance and lighting changes than single-modality or unfused alternatives.
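
One way to picture curriculum domain randomization is as a schedule that widens the range of visual perturbations as training progresses. The sketch below assumes a linear ramp after a warm-up period; the function names, the perturbation parameters, and their ranges are all hypothetical, and the paper's actual schedule (which could, for example, be gated on policy performance) is not described in the visible text.

```python
import random


def randomization_scale(step, warmup_steps, total_steps, max_scale=1.0):
    """Return a scale in [0, max_scale]: zero during warm-up, then a linear ramp."""
    if step <= warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_scale * min(1.0, progress)


def sample_visual_params(scale, rng):
    """Sample hypothetical appearance perturbations whose ranges grow with scale."""
    return {
        "brightness_gain": 1.0 + rng.uniform(-0.5, 0.5) * scale,
        "hue_shift_deg": rng.uniform(-30.0, 30.0) * scale,
        "camera_jitter_cm": rng.uniform(-2.0, 2.0) * scale,
    }


rng = random.Random(0)
for step in (0, 250, 550, 1000):
    scale = randomization_scale(step, warmup_steps=100, total_steps=1000)
    print(step, round(scale, 3), sample_visual_params(scale, rng))
```

Starting with no randomization lets the policy first learn the task under nominal visuals; the widening range then exposes it to harder appearance shifts once learning is under way, which is the usual motivation for a curriculum of this kind.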

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion pattern could be tested on non-manipulation tasks such as navigation or grasping in cluttered scenes.
  • Adding a third modality such as thermal images might further reduce sensitivity to visual distractors.
  • The scalable vision transformer component suggests the method could scale to higher-resolution inputs or longer training horizons without architectural redesign.

Load-bearing premise

That separate CNN processing of RGB and depth, followed by a vision transformer with masked contrastive learning and curriculum randomization, produces features robust enough to close the sim-to-real gap without any real-world adaptation.

What would settle it

Running the trained policy on the same real-world manipulation tasks and finding that success rates remain near zero despite strong simulation performance would show the fusion does not achieve the claimed generalization.
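
How many real-world trials such a test needs can be gauged with a confidence interval on the success rate. The sketch below uses the Wilson score interval for a binomial proportion; the trial counts are hypothetical placeholders, not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical outcomes over 20 real-world trials of one task.
for successes in (0, 10, 18):
    low, high = wilson_interval(successes, 20)
    print(f"{successes}/20 successes: [{low:.3f}, {high:.3f}]")
```

With only 20 trials, even zero observed successes is compatible with a true success rate of up to about 16%, so a decisive negative (or positive) result needs either more trials or a large gap between the observed rate and the threshold of interest.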

Figures

Figures reproduced from arXiv: 2507.09180 by Chenyu Guo, Jingdong Zhao, Lian Zhang, Liao Zhang, Qianxue Zhang, Xiao Zhang, Yiming Ren, Zengren Zhao, Zichun Xu.

Figure 1: Zero-shot transfer to complete manipulation tasks.
Figure 2: Overview of our approach. We build a visual backbone with two CNN stems to process RGB and depth images, respectively. The convoluted …
Figure 3: Benchmark environments and unseen scenarios for agents during …
Figure 4: Training curves using different fusion approaches, where solid lines and shaded areas represent the mean and confidence interval over 5 random …
Figure 5: Attention visualization of RGB (left), depth (middle), and RGB-D …
Figure 7: Real-world setup for tasks Assembly, Lift, and PickAndPlace.
Figure 8: Zero-shot transfer with the proposed visual backbone in standard (left) and challenging (right) real-world scenarios.
Original abstract

Depth information is robust to scene appearance variations and inherently carries 3D spatial details. Thus, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization in this paper. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive learning scheme is designed with masked and unmasked tokens to enhance the sample efficiency and generalization performance. A curriculum-based domain randomization scheme is used to flexibly stabilize the training process. Finally, simulation results demonstrate that our fusion scheme outperforms the other baselines. The feasibility of our model is validated to perform real-world manipulation tasks via zero-shot transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a multimodal visual backbone for sim-to-real transfer in visual reinforcement learning. RGB and depth are processed by separate CNN stems whose features are fed to a scalable vision transformer; masked-token contrastive learning and curriculum domain randomization are added to improve sample efficiency and generalization. Simulation results are claimed to show outperformance over baselines, and zero-shot transfer to real-world manipulation tasks is reported as validation of feasibility.

Significance. If the empirical claims hold with quantitative support, the work could advance sim-to-real RL by demonstrating how explicit multimodal fusion plus contrastive objectives can produce more robust visual representations than domain randomization alone. The architecture is a straightforward combination of established components, and the curriculum randomization is a practical training detail.

major comments (2)
  1. [Real-world experiments] Real-world validation section: the zero-shot transfer claim rests on qualitative success for a small set of manipulation tasks. No success rates, statistical comparisons to baselines performed in the real world, or ablations that disable the RGB+depth fusion / masked contrastive components while retaining curriculum randomization are reported. This prevents isolation of the multimodal backbone's contribution to closing the sim-to-real gap.
  2. [Simulation experiments] Simulation results section: the abstract and results text assert outperformance over baselines, yet no quantitative metrics, baseline implementation details, error bars, or statistical significance tests are supplied in the visible evidence. This leaves the central empirical claim unsupported.
minor comments (2)
  1. [Method] The contrastive loss is described in prose but would be clearer if written as an explicit equation with token masking probability and temperature parameters.
  2. [Figure 2] Figure captions for the architecture diagram should explicitly label the separate CNN stems, feature concatenation point, and masked-token path.
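
For reference, the kind of explicit equation the first minor comment asks for might look like the following InfoNCE-style form. This is an assumed, generic formulation and not the paper's own loss: $N$ is the batch size, $x_i$ the token sequence of observation $i$, $m_i$ a binary token mask in which each token is dropped with probability $\rho$, $f$ the encoder, $\mathrm{sim}$ a similarity such as cosine similarity, and $\tau$ the temperature.

```latex
\mathcal{L}_{\mathrm{con}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\left(\mathrm{sim}(q_i, k_i)/\tau\right)}
              {\sum_{j=1}^{N} \exp\left(\mathrm{sim}(q_i, k_j)/\tau\right)},
\qquad
q_i = f\left(m_i \odot x_i\right), \quad
k_i = f\left(x_i\right), \quad
m_{i,t} \sim \mathrm{Bernoulli}(1 - \rho)
```

Writing the loss this way makes the two hyperparameters the referee mentions, $\rho$ and $\tau$, explicit and reportable.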

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical support without altering the core contributions.

  1. Referee: [Real-world experiments] Real-world validation section: the zero-shot transfer claim rests on qualitative success for a small set of manipulation tasks. No success rates, statistical comparisons to baselines performed in the real world, or ablations that disable the RGB+depth fusion / masked contrastive components while retaining curriculum randomization are reported. This prevents isolation of the multimodal backbone's contribution to closing the sim-to-real gap.

    Authors: We agree that quantitative metrics would better isolate the contribution of the multimodal fusion. In the revised manuscript we will report success rates over multiple real-world trials for the demonstrated manipulation tasks. Full statistical comparisons against every baseline and complete real-world ablations (disabling fusion or contrastive learning while keeping curriculum randomization) are challenging due to hardware and time constraints; we have added a limitations paragraph acknowledging this and instead rely on the simulation ablations plus the observed zero-shot feasibility to support the overall claim. revision: partial

  2. Referee: [Simulation experiments] Simulation results section: the abstract and results text assert outperformance over baselines, yet no quantitative metrics, baseline implementation details, error bars, or statistical significance tests are supplied in the visible evidence. This leaves the central empirical claim unsupported.

    Authors: We apologize for any lack of clarity in the presentation. The simulation section already contains mean performance metrics (episode rewards and task success rates) averaged over multiple random seeds for our method and the baselines. In the revision we will explicitly tabulate these values, add error bars, provide the precise baseline implementation details (including hyperparameters and training protocols), and include statistical significance tests (paired t-tests with p-values) to rigorously support the outperformance claims. revision: yes
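
The paired test promised in the second response can be sketched with the standard library alone. The function name `paired_t_statistic` and the per-seed numbers are hypothetical placeholders, not values from the paper, and pairing by seed is itself an assumption that only makes sense if each seed fixes the same environment conditions for both methods.

```python
import math
import statistics


def paired_t_statistic(ours, baseline):
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t_value = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1


# Placeholder success rates for 5 seeds (illustrative only).
ours = [0.80, 0.85, 0.78, 0.90, 0.82]
baseline = [0.70, 0.72, 0.75, 0.74, 0.69]
t_value, dof = paired_t_statistic(ours, baseline)

# Two-sided critical value for alpha = 0.05 with 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
print(f"t = {t_value:.3f}, df = {dof}, significant at 0.05: {abs(t_value) > T_CRITICAL_DF4}")
```

The sketch stops at the t statistic because the standard library has no t-distribution CDF; a p-value could be obtained from a statistics package, for example `scipy.stats.ttest_rel`. With only five seeds the test has little power, so reporting effect sizes and confidence intervals alongside p-values would be more informative.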

Circularity Check

0 steps flagged

No circularity: empirical architecture validated against external baselines

The paper proposes an empirical multimodal fusion architecture (separate CNN stems for RGB and depth, combined features into a scalable vision transformer, masked-token contrastive learning, and curriculum domain randomization) and reports outperformance on simulation baselines plus zero-shot real-world transfer success. No derivation chain, equations, or first-principles results are presented that reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation load-bearing premises. Claims rest on comparisons to external baselines and real-robot outcomes rather than internal consistency alone, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that depth data is robust to appearance variation and supplies useful 3D structure; no free parameters, new physical entities, or additional axioms are stated in the abstract.

axioms (1)
  • Domain assumption: Depth information is robust to scene appearance variations and inherently carries 3D spatial details.
    Explicitly stated in the opening sentence of the abstract as the justification for including the depth modality.
