Recognition: no theorem link
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
Pith reviewed 2026-05-11 00:43 UTC · model grok-4.3
The pith
A plug-and-play adapter with mixture-of-experts restoration and foreground masking recovers 95.3% of clean visual control performance under dynamic perturbations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From an information-bottleneck view, the work establishes that restoration-based representations force encoding of nuisance corruption information, and that instead anchoring to the clean foreground via masks avoids this while preserving task-critical content. The proposed ACO-MoE adapter implements this by combining a routed bank of restoration experts with a foreground-mask branch, pretrained solely on synthetic rendered data with automatic degradation pairs and masks, then deployed at inference on corrupted RGB alone without any labels or references.
What carries the argument
ACO-MoE, an agent-centric observation adapter that routes inputs through a mixture of restoration experts conditioned on a foreground mask branch to produce task-preserving cleaned observations.
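The routed-expert design can be illustrated with a minimal numpy sketch. Everything here is a hypothetical stand-in: the two-feature router, the brightness-threshold mask, and the expert callables are our own illustrative choices, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ACOMoESketch:
    """Toy sketch of a routed bank of restoration experts conditioned on a
    foreground mask. Expert count, routing features, and mask fusion are
    illustrative assumptions, not the actual ACO-MoE design."""

    def __init__(self, experts, gate_weights):
        self.experts = experts            # list of image -> image callables
        self.gate_weights = gate_weights  # (n_experts, n_features) router matrix

    def foreground_mask(self, obs):
        # Placeholder mask branch: threshold on brightness.
        # In the paper, the mask branch is learned from simulation-derived masks.
        return (obs > obs.mean()).astype(obs.dtype)

    def __call__(self, obs):
        feats = np.array([obs.mean(), obs.std()])   # toy routing features
        gates = softmax(self.gate_weights @ feats)  # soft routing over experts
        restored = sum(g * e(obs) for g, e in zip(gates, self.experts))
        # Anchor the output to the foreground, as the mask branch motivates.
        return restored * self.foreground_mask(obs)
```

A frozen adapter like this would sit between the corrupted observation and the downstream policy, taking only RGB at inference.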
Load-bearing premise
That the foreground masks derived from simulation accurately capture the task-relevant information and that the synthetic degradation model sufficiently represents real-world non-stationary corruptions for the adapter to transfer effectively.
What would settle it
Demonstrating that control performance with ACO-MoE falls to or below baseline levels when evaluated on real-world robot footage with actual dynamic perturbations not replicable in the synthetic benchmark.
original abstract
Real-world visual systems face time-varying perturbations, including weather, sensor noise, compression artifacts, and background distractions. Existing image restoration methods are typically designed for fixed corruption types and optimized for pixel-level fidelity, leaving open two questions: how restoration behaves under non-stationary corruption switching, and whether pixel-level fidelity preserves the task-relevant information needed by downstream models. To study this setting, we introduce the Visual Degraded Control Suite (VDCS), a benchmark that injects Markov-switching physical degradations into rendered scenes. We further identify a fundamental failure mode of reconstruction-based representations: faithfully reconstructing corrupted observations forces the latent state to encode corruption-specific nuisance information, thereby contaminating downstream models. From an information-bottleneck perspective, anchoring the representation to the clean foreground eliminates this contamination. Motivated by this analysis, we propose Agent-Centric Observations with Mixture-of-Experts (ACO-MoE), a frozen, plug-and-play observation adapter that combines a routed bank of restoration experts with a foreground-mask branch. ACO-MoE is pretrained entirely offline on synthetic rendered data with automatically generated degradation pairs and simulation-derived foreground masks, requiring no manual annotation. At inference time, it takes only corrupted RGB as input without corruption labels, clean reference frames, or foreground masks. Across VDCS, DMC-GB, and RoboSuite, ACO-MoE consistently improves downstream control with both model-free and model-based backbones, recovering 95.3% of clean-input performance under challenging Markov-switching corruptions. It also generalizes zero-shot to unseen visual perturbations excluded from adapter pretraining.
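The Markov-switching injection described in the abstract amounts to sampling a per-frame corruption schedule from a transition matrix. A minimal stdlib sketch follows; the mode set and transition probabilities are illustrative assumptions, not VDCS's actual values:

```python
import random

# Hypothetical corruption modes and transition matrix; VDCS's actual
# degradation set and switching dynamics are not given in the abstract.
MODES = ["clean", "noise", "blur", "compression"]
TRANSITIONS = {
    "clean":       [0.7, 0.1, 0.1, 0.1],
    "noise":       [0.2, 0.6, 0.1, 0.1],
    "blur":        [0.2, 0.1, 0.6, 0.1],
    "compression": [0.2, 0.1, 0.1, 0.6],
}

def markov_switching_schedule(n_steps, start="clean", seed=0):
    """Sample a per-frame corruption schedule from the Markov chain above."""
    rng = random.Random(seed)
    mode, schedule = start, []
    for _ in range(n_steps):
        schedule.append(mode)
        mode = rng.choices(MODES, weights=TRANSITIONS[mode])[0]
    return schedule
```

Each frame of a rendered episode would then be corrupted according to its scheduled mode, producing the non-stationary switching the benchmark studies.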
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Visual Degraded Control Suite (VDCS) benchmark for non-stationary visual degradations and proposes ACO-MoE, a frozen plug-and-play observation adapter that combines a routed mixture-of-experts restoration bank with a simulation-derived foreground-mask branch. Pretrained offline on synthetic rendered pairs, ACO-MoE is claimed to eliminate nuisance contamination in latent representations per an information-bottleneck analysis, recovering 95.3% of clean-input performance across VDCS, DMC-GB, and RoboSuite for both model-free and model-based controllers while generalizing zero-shot to unseen perturbations.
Significance. If the results and the foreground-anchoring justification hold, the work offers a practical, label-free adapter for robust visual control under dynamic real-world corruptions, potentially reducing the need for policy retraining or online adaptation in robotics applications.
major comments (3)
- [§3 (Information-Bottleneck Analysis)] The central motivation, that anchoring latents to clean foreground masks eliminates nuisance contamination without discarding task-critical information, is load-bearing for the performance claims, yet the analysis includes no quantitative bound or ablation demonstrating that policy-relevant cues (e.g., peripheral dynamics or shadows in RoboSuite manipulation) are retained; if the masks remove such context, downstream controllers would lose performance even with nuisances restored.
- [§5 (Experiments)] The reported 95.3% recovery and zero-shot generalization are the primary empirical support, but the results section gives insufficient detail on run counts, error bars, statistical tests, and component ablations (e.g., MoE routing vs. a single expert, mask branch vs. full-image input); without these, readers cannot verify that the improvements are not artifacts of the synthetic pretraining distribution or of weak baselines.
- [§4.2 (ACO-MoE Architecture)] The transfer assumption, that offline synthetic Markov-switching degradations plus simulation masks will handle real non-stationary corruptions at inference, is central, yet no analysis or cross-domain experiment quantifies the domain gap between rendered degradations and actual sensor or weather effects, risking an overstated robustness claim.
minor comments (2)
- [Notation] The notation for the information-bottleneck objective and expert routing could be made more explicit with a single equation block defining all mutual-information terms and gating weights.
- [Figures] Figure captions for mask visualizations should include quantitative metrics (e.g., IoU with clean foreground) to allow readers to assess information preservation directly.
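As a sketch of what the requested single equation block could contain (the symbols below are our own notation, not the paper's): an IB-style objective that keeps task information while penalizing mutual information with the corruption index, alongside the gated expert combination.

```latex
% IB objective: retain task information I(z; y), suppress corruption information I(z; c)
\min_{\phi}\ \Bigl[-I\bigl(z_\phi(x);\,y\bigr) + \beta\, I\bigl(z_\phi(x);\,c\bigr)\Bigr]
% Gated expert combination with foreground-mask anchoring
\hat{x} = m(x)\odot \sum_{k=1}^{K} g_k(x)\,E_k(x),
\qquad
g_k(x) = \frac{\exp\!\bigl(w_k^{\top} h(x)\bigr)}{\sum_{j=1}^{K}\exp\!\bigl(w_j^{\top} h(x)\bigr)}
```

where $x$ is the corrupted observation, $y$ the task signal, $c$ the corruption index, $z_\phi$ the latent, $m(x)$ the foreground-mask branch, $E_k$ the restoration experts, and $g_k$ the softmax gating weights over router features $h(x)$.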
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions.
point-by-point responses
-
Referee: [§3 (Information-Bottleneck Analysis)] The central motivation that anchoring latents to clean foreground masks eliminates nuisance contamination without discarding task-critical information is load-bearing for the performance claims, yet the analysis does not include a quantitative bound or ablation demonstrating that policy-relevant cues (e.g., peripheral dynamics or shadows in RoboSuite manipulation) are retained; if masks remove such context, downstream controllers would lose performance even with restored nuisances.
Authors: We agree that a more explicit demonstration of retained task-relevant information would strengthen the information-bottleneck argument in Section 3. The current analysis shows that foreground anchoring reduces mutual information with nuisance factors while the empirical recovery of 95.3% clean performance across environments (including RoboSuite) indicates that critical cues such as peripheral dynamics are preserved in practice. To directly address the concern, we will add a targeted ablation in the revised manuscript that isolates the mask branch's effect on control performance in tasks with prominent peripheral elements, quantifying any information loss. revision: yes
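The mask-quality metric raised in the minor comments (IoU against the clean foreground) is straightforward to compute; a minimal sketch, with the function name and empty-mask convention as our own choices:

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection-over-union between a predicted and a reference
    binary foreground mask, the metric suggested for figure captions."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: define IoU as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```

Reporting this per-task alongside the proposed ablation would quantify how much foreground information the mask branch preserves.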
-
Referee: [§5 (Experiments)] The reported 95.3% recovery and zero-shot generalization are the primary empirical support, but the results section provides insufficient detail on run counts, error bars, statistical tests, and component ablations (e.g., MoE routing vs. single expert, mask branch vs. full-image input); without these, it is impossible to verify that improvements are not due to the synthetic pretraining distribution or baseline weaknesses.
Authors: We acknowledge that the experimental reporting in Section 5 lacks sufficient statistical rigor and component-level ablations. In the revised manuscript we will report the exact number of independent runs (5 seeds per setting), include error bars on all performance plots, add statistical significance tests comparing ACO-MoE against baselines, and expand the ablation study to explicitly compare full ACO-MoE against a single-expert restoration variant and a mask-free full-image input variant. These additions will allow readers to verify that gains arise from the routed experts and foreground anchoring rather than pretraining artifacts. revision: yes
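The committed reporting protocol (per-seed runs with error bars) could be aggregated along these lines; a stdlib sketch under our own assumptions about bootstrap settings, not the authors' actual analysis code:

```python
import random

def mean_and_bootstrap_ci(returns, n_boot=10_000, alpha=0.05, seed=0):
    """Aggregate per-seed returns (e.g., 5 seeds per setting) into a mean
    and a bootstrap confidence interval for error bars."""
    rng = random.Random(seed)
    mean = sum(returns) / len(returns)
    # Resample seeds with replacement and record each bootstrap mean.
    boots = sorted(
        sum(rng.choice(returns) for _ in returns) / len(returns)
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean, (lo, hi)
```

Significance between ACO-MoE and a baseline could then be checked by whether their intervals (or a paired test over seeds) separate.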
-
Referee: [§4.2 (ACO-MoE Architecture)] The transfer assumption that offline synthetic Markov-switching degradations plus simulation masks will handle real non-stationary corruptions at inference is central, but no analysis or cross-domain experiment quantifies the domain gap between rendered degradations and actual sensor/weather effects, risking overstatement of robustness.
Authors: The referee correctly notes that our evaluation remains within synthetic domains and does not quantify the synthetic-to-real domain gap. While zero-shot generalization to unseen synthetic perturbations provides evidence of robustness inside the simulated distribution, we do not claim direct equivalence to real sensor or weather effects. In the revised manuscript we will add an explicit limitations paragraph in Section 4.2 and the conclusion discussing this gap and suggesting future real-robot validation protocols. revision: partial
Circularity Check
No circularity in derivation chain; performance claims are empirical
full rationale
The paper's chain begins with an information-bottleneck analysis identifying a failure mode in reconstruction-based latents, then motivates the ACO-MoE architecture (frozen adapter with routed experts and foreground-mask branch) as a plug-and-play solution pretrained offline on synthetic degradation pairs and simulation-derived masks. No equations, derivations, or fitted parameters are presented that reduce the reported 95.3% recovery or zero-shot generalization to inputs by construction. The central claims rest on downstream empirical evaluations across VDCS, DMC-GB, and RoboSuite with model-free and model-based controllers, without self-citations serving as load-bearing uniqueness theorems or ansatzes. The method is self-contained as an empirical architecture whose validity is tested externally on benchmarks rather than tautologically derived from its own definitions or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Anchoring the representation to the clean foreground eliminates contamination from corruption-specific nuisance information.
Reference graph
Works this paper leans on
-
[1]
Look where you look! Saliency-guided Q-networks for generalization in visual reinforcement learning
David Bertoin, Adil Zouitine, Mehdi Zouitine, and Emmanuel Rachelson. Look where you look! Saliency-guided Q-networks for generalization in visual reinforcement learning. Advances in Neural Information Processing Systems, 35:30693–30706, 2022
2022
-
[2]
Parameter-free online test-time adaptation
Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. Parameter-free online test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8344–8353, 2022
2022
-
[3]
Simple baselines for image restoration
Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European Conference on Computer Vision (ECCV), volume 13667, pages 17–33, 2022
2022
-
[4]
InstructIR: High-quality image restoration following human instructions
Marcos V Conde, Gregor Geigle, and Radu Timofte. InstructIR: High-quality image restoration following human instructions. In European Conference on Computer Vision, pages 1–21. Springer, 2024
2024
-
[5]
RobustBench: A standardized adversarial robustness benchmark
Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. RobustBench: A standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020
-
[6]
Image denoising by sparse 3-D transform-domain collaborative filtering
Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007
2007
-
[7]
Bryan LM de Oliveira, Luana GB Martins, Bruno Brandão, Murilo L da Luz, Telma W de L Soares, and Luckeciano C Melo. Sliding puzzles gym: A scalable benchmark for state representation in visual reinforcement learning. arXiv preprint arXiv:2410.14038, 2024
-
[8]
MambaIR: A simple baseline for image restoration with state-space model
Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. MambaIR: A simple baseline for image restoration with state-space model. In European Conference on Computer Vision (ECCV), pages 222–241. Springer, 2024
2024
-
[9]
Onerestore: A universal restoration framework for composite degradation
Yu Guo, Yuan Gao, Yuxu Lu, Huilin Zhu, Ryan Wen Liu, and Shengfeng He. Onerestore: A universal restoration framework for composite degradation. In European Conference on Computer Vision, pages 255–272. Springer, 2024
2024
-
[10]
Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, and Yuguang Fang. Neptune-x: Active x-to-maritime generation for universal maritime object detection. arXiv preprint arXiv:2509.20745, 2025
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018
-
[11]
Learning latent dynamics for planning from pixels
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, volume 97, pages 2555–2565. PMLR, 2019
2019
-
[12]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023
2023
-
[13]
Dsp-reg: Domain-sensitive parameter regularization for robust domain generalization, 2026
Xudong Han, Senkang Hu, Yihang Tao, Yu Guo, Philip Birch, Sam Tak Wu Kwong, and Yuguang Fang. Dsp-reg: Domain-sensitive parameter regularization for robust domain generalization, 2026
2026
-
[14]
Generalization in reinforcement learning by soft data augmentation
Nicklas Hansen and Xiaolong Wang. Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617, 2021
2021
-
[15]
Self-supervised policy adaptation during deployment
Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. In International Conference on Learning Representations, 2021
2021
-
[16]
Stabilizing deep Q-learning with ConvNets and vision transformers under data augmentation
Nicklas Hansen, Hao Su, and Xiaolong Wang. Stabilizing deep Q-learning with ConvNets and vision transformers under data augmentation. In Advances in Neural Information Processing Systems, volume 34, pages 3680–3693, 2021
2021
-
[17]
TD-MPC2: Scalable, Robust World Models for Continuous Control
Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023
2023
-
[18]
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019
2019
-
[19]
Agentscodriver: Large language model empowered collaborative driving with lifelong learning, 2024
Senkang Hu, Zhengru Fang, Zihan Fang, Yiqin Deng, Xianhao Chen, and Yuguang Fang. Agentscodriver: Large language model empowered collaborative driving with lifelong learning, 2024
2024
-
[20]
Senkang Hu, Zhengru Fang, Yiqin Deng, Xianhao Chen, Yuguang Fang, and Sam Kwong. Toward full- scene domain generalization in multi-agent collaborative bird’s eye view segmentation for connected and autonomous driving.IEEE Transactions on Intelligent Transportation Systems, 26(2):1783–1796, 2025
2025
-
[21]
Agentscomerge: Large language model empowered collaborative decision making for ramp merging
Senkang Hu, Zhengru Fang, Zihan Fang, Yiqin Deng, Xianhao Chen, Yuguang Fang, and Sam Tak Wu Kwong. Agentscomerge: Large language model empowered collaborative decision making for ramp merging. IEEE Transactions on Mobile Computing, 24(10):9791–9805, 2025
2025
-
[22]
Senkang Hu, Yong Dai, Yuzhi Zhao, Yihang Tao, Yu Guo, Zhengru Fang, Sam Tak Wu Kwong, and Yuguang Fang. Optimizing agentic reasoning with retrieval via synthetic semantic information gain reward.arXiv preprint arXiv:2602.00845, 2026
-
[23]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, 2023
2023
-
[24]
Spectrum random masking for generalization in image-based reinforcement learning
Yangru Huang, Peixi Peng, Yifan Zhao, Guangyao Chen, and Yonghong Tian. Spectrum random masking for generalization in image-based reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, pages 20393–20406, 2022
2022
-
[25]
Adaptive mixtures of local experts
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991
1991
-
[26]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lelio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
2024
-
[27]
Hierarchical mixtures of experts and the EM algorithm
Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994
1994
-
[28]
3D common corruptions and data augmentation
Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, and Amir Zamir. 3D common corruptions and data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18963–18974, 2022
2022
-
[29]
Kyungmin Kim, JB Lanier, Pierre Baldi, Charless Fowlkes, and Roy Fox. Make the pertinent salient: Task-relevant reconstruction for visual control with distractions. arXiv preprint arXiv:2410.09972, 2024
-
[30]
Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning From Pixels
Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020
-
[31]
DeblurGAN: Blind motion deblurring using conditional adversarial networks
Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8183–8192, 2018
2018
-
[32]
Reinforcement learning with augmented data
Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems, volume 33, pages 19884–19895, 2020
2020
-
[33]
CURL: Contrastive unsupervised representations for reinforcement learning
Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, volume 119, pages 5639–5650. PMLR, 2020
2020
-
[34]
All-in-one image restoration for unknown corruption
Boyun Li, Xiao Liu, Peng Hu, Zhongqin Wu, Jiancheng Lv, and Xi Peng. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17452–17462, 2022
2022
-
[35]
Instruct2see: Learning to remove any obstructions across distributions
Junhang Li, Yu Guo, Chuhua Xian, and Shengfeng He. Instruct2see: Learning to remove any obstructions across distributions. In International Conference on Machine Learning, pages 34453–34470. PMLR, 2025
2025
-
[36]
Policy-independent behavioral metric-based representation for deep reinforcement learning
Weijian Liao, Zongzhang Zhang, and Yang Yu. Policy-independent behavioral metric-based representation for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8746–8754, 2023
2023
-
[37]
MoE-LLaVA: Mixture of experts for large vision-language models
Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Yatian Pang, Peng Jin, Munan Ning, Jiebo Luo, and Li Yuan. MoE-LLaVA: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia, 2026
2026
-
[38]
TTT++: When does self-supervised test-time training fail or thrive?
Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. TTT++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 21808–21820, 2021
2021
-
[39]
Controlling vision-language models for multi-task image restoration
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B. Schön. Controlling vision-language models for multi-task image restoration. In International Conference on Learning Representations (ICLR), 2024
2024
-
[40]
Transformers are sample-efficient world models
Vincent Micheli, Eloi Alonso, and Francois Fleuret. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588, 2022
-
[41]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015
2015
-
[42]
Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. arXiv preprint arXiv:2410.13816, 2024
-
[43]
DMC-VB: A benchmark for representation learning for control with visual distractors
Joseph Ortiz, Antoine Dedieu, Wolfgang Lehrach, J Swaroop Guntupalli, Carter Wendelken, Ahmad Humayun, Sivaramakrishnan Swaminathan, Guangyao Zhou, Miguel Lázaro-Gredilla, and Kevin P Murphy. DMC-VB: A benchmark for representation learning for control with visual distractors. Advances in Neural Information Processing Systems, 37:6574–6602, 2024
2024
-
[44]
Model-based reinforcement learning with isolated imaginations
Minting Pan, Xiangming Zhu, Yitao Zheng, Yunbo Wang, and Xiaokang Yang. Model-based reinforcement learning with isolated imaginations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2788–2803, 2024
2024
-
[45]
PromptIR: Prompting for all-in-one blind image restoration
Vaishnav Potlapalli, Syed Waqas Zamir, Salman Khan, and Fahad Shahbaz Khan. PromptIR: Prompting for all-in-one blind image restoration. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023
2023
-
[46]
From sparse to soft mixtures of experts
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023
-
[47]
The colosseum: A benchmark for evaluating generalization for robotic manipulation
Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, and Dieter Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191, 2024
-
[48]
MoE-DiffIR: Task-customized diffusion priors for universal compressed image restoration
Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, and Zhibo Chen. MoE-DiffIR: Task-customized diffusion priors for universal compressed image restoration. In European Conference on Computer Vision (ECCV), volume 15067, pages 116–134, 2024
2024
-
[49]
Scaling vision with sparse mixture of experts
Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems, volume 34, pages 8583–8595, 2021
2021
-
[50]
Jan Robine, Marc Hoftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions.arXiv preprint arXiv:2303.07109, 2023
-
[51]
U-Net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, volume 9351, pages 234–241. Springer, 2015
2015
-
[52]
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017
2017
-
[53]
DriveX: Omni scene modeling for learning generalizable world knowledge in autonomous driving
Chen Shi, Shaoshuai Shi, Kehua Sheng, Bo Zhang, and Li Jiang. DriveX: Omni scene modeling for learning generalizable world knowledge in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28599–28609, 2025
2025
-
[54]
A simple framework for generalization in visual RL under dynamic scene perturbations
Wonil Song, Hyesong Choi, Kwanghoon Sohn, and Dongbo Min. A simple framework for generalization in visual RL under dynamic scene perturbations. volume 37, pages 121790–121826, 2024
2024
-
[55]
Austin Stone, Oscar Ramirez, Kurt Konolige, and Rico Jonschkowski. The distracting control suite: a challenging benchmark for reinforcement learning from pixels. arXiv preprint arXiv:2101.02722, 2021
-
[56]
Ruixiang Sun, Hongyu Zang, Xin Li, and Riashat Islam. Learning latent dynamic robust representations for world models. arXiv preprint arXiv:2405.06263, 2024
-
[57]
ProAgentBench: Evaluating LLM agents for proactive assistance with real-world data
Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, and Yang Li. ProAgentBench: Evaluating LLM agents for proactive assistance with real-world data. arXiv preprint arXiv:2602.04482, 2026
-
[58]
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018
2018
-
[59]
Focus-Then-Reuse: Fast adaptation in visual perturbation environments
Jiahui Wang, Chao Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu. Focus-Then-Reuse: Fast adaptation in visual perturbation environments. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[60]
GridFormer: Residual dense transformer with grid structure for image restoration in adverse weather conditions
Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, and Hongdong Li. GridFormer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. International Journal of Computer Vision, 132(10):4541–4563, 2024
2024
-
[61]
DriveDreamer: Towards real-world-driven world models for autonomous driving
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. In European Conference on Computer Vision (ECCV), pages 55–72. Springer, 2024
2024
-
[62]
Ziyu Wang, Yanjie Ze, Yifei Sun, Zhecheng Yuan, and Huazhe Xu. Generalizable visual reinforcement learning with segment anything model. arXiv preprint arXiv:2312.17116, 2023
-
[63]
DiffIR: Efficient diffusion model for image restoration
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, and Luc Van Gool. DiffIR: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13095–13105, 2023
2023
-
[64]
Image de-raining transformer
Jie Xiao, Xueyang Fu, Aiping Liu, Feng Wu, and Zheng-Jun Zha. Image de-raining transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12978–12995, 2022
2022
-
[65]
Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daume, Furong Huang, and Huazhe Xu. DrM: Mastering visual reinforcement learning through dormant ratio minimization. arXiv preprint arXiv:2310.19668, 2023
-
[66]
Learning interactive real-world simulators
Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023
-
[67]
Mastering visual continuous control: Improved data-augmented reinforcement learning
Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021
-
[68]
Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, et al. The latent space: Foundation, evolution, mechanism, ability, and outlook. arXiv preprint arXiv:2604.02029, 2026
-
[69]
Zhecheng Yuan, Zhengrong Xue, Bo Yuan, Xueqian Wang, Yi Wu, Yang Gao, and Huazhe Xu. Pre-trained image encoder for generalizable visual reinforcement learning. Advances in Neural Information Processing Systems, 35:13022–13037, 2022.
[70]
Zhecheng Yuan, Sizhe Yang, Pu Hua, Can Chang, Kaizhe Hu, and Huazhe Xu. RL-ViGen: A reinforcement learning benchmark for visual generalization. Advances in Neural Information Processing Systems, 36:6720–6747, 2023.
[71]
Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.
[72]
Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021.
[73]
Di Zhang, Bowen Lv, Hai Zhang, Feifan Yang, Junqiao Zhao, Hang Yu, Chang Huang, Hongtu Zhou, Chen Ye, et al. Focus On What Matters: Separated models for visual-based RL generalization. Advances in Neural Information Processing Systems, 37:116960–116986, 2024.
[74]
He Zhang, Vishwanath Sindagi, and Vishal M Patel. Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 30(11):3943–3956, 2019.
[75]
Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. STORM: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36:27147–27166, 2023.
[76]
Xu Zhang, Jiaqi Ma, Guoli Wang, Qian Zhang, Huan Zhang, and Lefei Zhang. Perceive-IR: Learning to perceive degradation better for all-in-one image restoration. IEEE Transactions on Image Processing, 2025.
[77]
Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, and Wenbo Ding. SAC Flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling. In International Conference on Learning Representations (ICLR), 2026.
[78]
Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé III, and Furong Huang. TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning. Advances in Neural Information Processing Systems, 36:48203–48225, 2023.
[79]
Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. OccWorld: Learning a 3D occupancy world model for autonomous driving. In European Conference on Computer Vision (ECCV), pages 55–72. Springer, 2024.
[80]
Gaoyue Zhou, Haizhou Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.