Generalization in offline RL: The structure is more important than the amount of pessimism

Matthijs T.J. Spaan; Max Weltevrede; Wendelin B\"ohmer

arxiv: 2607.02288 · v1 · pith:7S3D5H7Bnew · submitted 2026-07-02 · 💻 cs.LG · cs.AI

Generalization in offline RL: The structure is more important than the amount of pessimism

Max Weltevrede , Matthijs T.J. Spaan , Wendelin B\"ohmer This is my paper

Pith reviewed 2026-07-03 16:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords offline reinforcement learninggeneralizationpessimismsymmetriescontextual MDPsdata augmentationvalue function structure

0 comments

The pith

The structure of pessimism, not its amount, determines generalization in offline RL for contextual MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in offline reinforcement learning on contextual MDPs, whether a pessimistic value function generalizes well depends on whether its structure respects the symmetries of the optimal solution rather than on how conservative the pessimism is. The authors prove that a mildly pessimistic non-symmetric value function can generalize worse than an overly pessimistic symmetric one. Dataset coverage fixes the pessimism structure, so enforcing symmetry may require data augmentation applied as a consistency loss at policy extraction time instead of augmenting the full training set. This is shown to improve performance with IQL and CQL on a rotationally symmetric reacher task.

Core claim

In contextual MDPs, successful generalization under pessimism requires the pessimistic structure to match underlying symmetries of the optimal solution. A mildly pessimistic non-symmetric value function generalizes worse than an overly pessimistic symmetric one. The structure of pessimism is fixed by dataset coverage, and data augmentation works best when imposed through a consistency loss during policy extraction rather than by training on an augmented dataset.

What carries the argument

Symmetry-respecting pessimistic value function whose structure is fixed by dataset coverage in contextual MDPs, with consistency loss used to enforce symmetry at policy extraction.

If this is right

Overly pessimistic but symmetric value functions can achieve optimal generalization in CMDPs.
Mildly pessimistic but non-symmetric value functions can produce strictly worse generalization than overly pessimistic symmetric ones.
Dataset coverage determines the structure of pessimism, so symmetry must be enforced separately.
Data augmentation via consistency loss at policy extraction outperforms training on an augmented dataset for IQL and CQL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Enforcing symmetry could allow good generalization from smaller or less diverse offline datasets when the environment has known symmetries.
The same consistency-loss approach may extend to other RL domains with group symmetries such as robotics or navigation.
In environments without clear symmetries the benefit of forcing symmetry would need separate verification.

Load-bearing premise

The optimal solution possesses underlying symmetries such as rotational symmetry that the pessimistic value function must respect to generalize well.

What would settle it

An experiment on a rotationally symmetric CMDP where a mildly pessimistic non-symmetric value function is shown to generalize at least as well as an overly pessimistic symmetric one.

Figures

Figures reproduced from arXiv: 2607.02288 by Matthijs T.J. Spaan, Max Weltevrede, Wendelin B\"ohmer.

**Figure 1.** Figure 1: The agent trains on expert, mixed, and suboptimal datasets collected from context 1, with [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: Illustration of the one-step, rotationally invariant GTI-ZSPT instance used in the proof of [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗

**Figure 1.** Figure 1: The agent trains on expert, mixed, and suboptimal datasets collected from context 1, with [PITH_FULL_IMAGE:figures/full_fig_p030_1.png] view at source ↗

read the original abstract

While pessimism counteracts overestimation bias in offline reinforcement learning (RL), being overly conservative has been associated with hindering certain forms of generalization. However, in this paper we demonstrate that being overly pessimistic does not inherently prevent optimal generalization in contextual MDPs (CMDPs). Instead, we argue successful generalization depends not on the amount of pessimism, but whether the pessimistic structure respects the underlying symmetries of the optimal solution. We prove that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the dataset coverage. As such, enforcing a symmetric value function can be non-trivial, and might require techniques such as data augmentation (DA). Inspired by our theoretical results, we argue that DA can best be applied through a consistency loss during policy extraction, rather than the common practice of (regular) offline training on an augmented dataset. This is empirically validated using IQL and CQL on a rotationally symmetric reacher environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims structure of pessimism trumps its amount for generalization in offline RL on symmetric CMDPs, via a proof and a consistency-loss DA tweak, but the separation rests on the optimum having exact symmetries.

read the letter

The key point from this paper is that in offline reinforcement learning for contextual MDPs, the structure of the pessimism applied to the value function matters more for generalization than simply how pessimistic it is. They show through a proof that a mildly pessimistic non-symmetric value function can generalize worse than an overly pessimistic but symmetric one, provided the optimal solution has underlying symmetries.

What stands out as new is this explicit distinction and the proof linking it to dataset coverage properties. They also propose applying data augmentation via a consistency loss specifically during policy extraction, rather than augmenting the dataset for the main training. This is tested with IQL and CQL on a rotationally symmetric reacher environment.

The paper does a good job laying out why symmetry respect is important and how it arises from coverage. The theoretical result organizes some intuitions about when pessimism helps or hurts generalization.

There are some soft spots though. The central comparison assumes the optimal Q* possesses exact symmetries, such as rotational symmetry in the reacher CMDP, and that the pessimism structure either matches or violates that. If the environment lacks such symmetry or if the function approximator disrupts the induced structure, the separation between amount and structure may not hold up. The empirical part would benefit from more details on whether confounding factors were controlled in the reacher experiments.

This work is aimed at researchers in offline RL who care about generalization bounds and practical algorithm tweaks for symmetric control problems. A reader looking for new theoretical angles on pessimism would find it useful. It has a clear claim, a proof, and experiments, so it deserves to go to peer review even if revisions are needed to tighten the assumptions.

I would recommend sending it out for review.

Referee Report

2 major / 2 minor

Summary. The paper claims that in offline RL for contextual MDPs, generalization success depends on whether the structure of pessimism respects underlying symmetries of the optimal solution Q*, rather than the amount of pessimism. It proves that a mildly pessimistic non-symmetric value function can generalize worse than an overly pessimistic symmetric one, links pessimism structure to dataset coverage, proposes applying data augmentation via a consistency loss during policy extraction (instead of augmented offline training), and validates this empirically with IQL and CQL on a rotationally symmetric reacher environment.

Significance. If the result holds, the work provides a useful distinction between amount and structure of pessimism, with a concrete proof separating the two via symmetry and an empirical demonstration that consistency-loss DA can improve generalization. The explicit proof and the reacher experiments (with IQL/CQL) are strengths that make the contribution falsifiable and actionable for offline RL algorithm design.

major comments (2)

[§3] §3 (proof of the main claim): the derivation assumes the optimal Q* possesses exact symmetries (e.g., rotational symmetry in the reacher CMDP) that the pessimistic value function must match; this assumption is load-bearing because the constructed counter-example separating 'amount' from 'structure' of pessimism ceases to do so if the CMDP lacks the symmetry or if function approximation breaks the induced structure.
[§4] §4 (reacher experiments): the reported results do not appear to include controls that isolate dataset coverage symmetry from other confounding factors (e.g., coverage density or reward scaling); without such controls the empirical separation between structure and amount remains inconclusive.

minor comments (2)

[§2] Notation for the symmetry group and the induced pessimism operator should be introduced earlier and used consistently when linking dataset coverage to value-function structure.
[Abstract] The abstract and introduction could explicitly flag the symmetry assumption on Q* as a modeling choice rather than a general property of all CMDPs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The distinction between the amount and structure of pessimism is central to our contribution, and we address the concerns regarding the proof assumptions and experimental controls below. We believe the theoretical separation remains valid under the stated conditions of symmetric CMDPs, and we are prepared to strengthen the empirical section.

read point-by-point responses

Referee: [§3] §3 (proof of the main claim): the derivation assumes the optimal Q* possesses exact symmetries (e.g., rotational symmetry in the reacher CMDP) that the pessimistic value function must match; this assumption is load-bearing because the constructed counter-example separating 'amount' from 'structure' of pessimism ceases to do so if the CMDP lacks the symmetry or if function approximation breaks the induced structure.

Authors: The proof is explicitly constructed for CMDPs where Q* exhibits exact symmetries (e.g., rotational invariance in the reacher setting). Our claim is that, conditional on such symmetries existing, the structure of pessimism induced by dataset coverage determines generalization success more than its magnitude; the counter-example separates these factors precisely under that symmetry. We will revise §3 to state the assumption more prominently at the outset of the theorem and add a short discussion of the result's scope when symmetries are absent or broken by function approximation. revision: partial
Referee: [§4] §4 (reacher experiments): the reported results do not appear to include controls that isolate dataset coverage symmetry from other confounding factors (e.g., coverage density or reward scaling); without such controls the empirical separation between structure and amount remains inconclusive.

Authors: The reacher CMDP is designed with explicit rotational symmetry, and datasets are generated to produce coverage patterns that induce symmetric versus non-symmetric pessimism while holding the underlying MDP fixed. Nevertheless, we agree that explicit ablations varying coverage density independently of symmetry and checking sensitivity to reward scaling would further isolate the effect. We will add these controls to the revised experimental section. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation proceeds from external CMDP coverage properties to a proof comparing non-symmetric vs. symmetric pessimism structures, with the structure of pessimism explicitly tied to dataset coverage as an input rather than derived from the result itself. The claim that structure trumps amount of pessimism follows from this coverage-to-symmetry linkage without reducing any prediction or theorem to a fitted parameter or self-referential definition. The DA consistency-loss recommendation is presented as inspired by (rather than presupposing) the theoretical comparison, and no load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked in the abstract or described chain. The derivation therefore remains self-contained against the stated assumptions about symmetries and coverage.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are extractable beyond the standard CMDP setup assumed in offline RL.

pith-pipeline@v0.9.1-grok · 5722 in / 1103 out tokens · 16760 ms · 2026-07-03T16:39:48.325990+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 10 canonical work pages · 2 internal anchors

[1]

Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët

S. Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët. The Steady - State Control Problem for Markov Decision Processes . In Kaustubh R. Joshi, Markus Siegle, Mariëlle Stoelinga, and Pedro R. D'Argenio (eds.), Quantitative Evaluation of Systems - 10th International Conference , QEST 2013, Buenos Aires , Argentina , August 27-30, 2013. Proceedings ,...

work page doi:10.1007/978-3-642-40196-1_26 2013
[2]

Christensen

Abdulaziz Almuzairee, Nicklas Hansen, and Henrik I. Christensen. A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning . RLJ, 1: 0 130--157, 2024. URL https://rlj.cs.umass.edu/2024/papers/Paper26.html

2024
[3]

Uncertainty- Based Offline Reinforcement Learning with Diversified Q - Ensemble

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty- Based Offline Reinforcement Learning with Diversified Q - Ensemble . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing S...

2021
[4]

Learning with Pseudo - Ensembles

Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with Pseudo - Ensembles . In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal , Quebec , Canada , pp...

2014
[5]

Look Beneath the Surface : Exploiting Fundamental Symmetry for Sample - Efficient Offline RL

Peng Cheng, Xianyuan Zhan, Zhi-Hao Wu, Wenjia Zhang, Youfang Lin, Shoucheng Song, Han Wang, and Li Jiang. Look Beneath the Surface : Exploiting Fundamental Symmetry for Sample - Efficient Offline RL . In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Ann...

2023
[6]

Daesol Cho, Dongseok Shim, and H. Jin Kim. S2P : State -conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 20...

2022
[7]

Corrado, Yuxiao Qu, John U

Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, and Josiah P. Hanna. Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning . RLJ, 1: 0 198--215, 2024. URL https://rlj.cs.umass.edu/2024/papers/Paper33.html

2024
[8]

The Delft AI Cluster ( DAIC ), RRID : SCR \_025091, 2024

Delft AI Cluster (DAIC) . The Delft AI Cluster ( DAIC ), RRID : SCR \_025091, 2024. URL https://doc.daic.tudelft.nl/

2024
[9]

DelftBlue Supercomputer ( Phase 2), 2024

Delft High Performance Computing Centre (DHPC) . DelftBlue Supercomputer ( Phase 2), 2024. URL https://www.tudelft.nl/dhpc/ark:/44463/DelftBluePhase2

2024
[10]

A Minimalist Approach to Offline Reinforcement Learning

Scott Fujimoto and Shixiang Shane Gu. A Minimalist Approach to Offline Reinforcement Learning . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 20...

2021
[11]

Off- Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto, David Meger, and Doina Precup. Off- Policy Deep Reinforcement Learning without Exploration . In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning , ICML 2019, 9-15 June 2019, Long Beach , California , USA , volume 97 of Proceedings of Machine Learning Research , pp.\ 20...

2019
[12]

Fantastic Generalization Measures are Nowhere to be Found

Michael Gastpar, Ido Nachum, Jonathan Shafer, and Thomas Weinberger. Fantastic Generalization Measures are Nowhere to be Found . In The Twelfth International Conference on Learning Representations , ICLR 2024, Vienna , Austria , May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=NkmJotfL42

2024
[13]

Gerken and Pan Kessel

Jan E. Gerken and Pan Kessel. Emergent Equivariance in Deep Ensembles . In Forty-first International Conference on Machine Learning , ICML 2024, Vienna , Austria , July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=plXXbXjvQ9

2024
[14]

Contextual Markov Decision Processes

Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov Decision Processes . CoRR, abs/1502.02259, 2015. URL http://arxiv.org/abs/1502.02259. arXiv: 1502.02259

work page internal anchor Pith review Pith/arXiv arXiv 2015
[15]

Generalization in Reinforcement Learning by Soft Data Augmentation

Nicklas Hansen and Xiaolong Wang. Generalization in Reinforcement Learning by Soft Data Augmentation . In IEEE International Conference on Robotics and Automation , ICRA 2021, Xi 'an, China , May 30 - June 5, 2021 , pp.\ 13611--13617. IEEE, 2021. doi:10.1109/ICRA48506.2021.9561103

work page doi:10.1109/icra48506.2021.9561103 2021
[16]

Goal- Conditioned Data Augmentation for Offline Reinforcement Learning

Xingshuai Huang, Di Wu, and Benoit Boulet. Goal- Conditioned Data Augmentation for Offline Reinforcement Learning . Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=8K16dplpE0

2025
[17]

Neural Tangent Kernel : Convergence and Generalization in Neural Networks

Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural Tangent Kernel : Convergence and Generalization in Neural Networks . In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 20...

2018
[18]

K-mixup: Data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace

Junwoo Jang, Jungwoo Han, and Jinwhan Kim. K-mixup: Data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace. Expert Syst. Appl., 225: 0 120136, 2023. doi:10.1016/J.ESWA.2023.120136. URL https://doi.org/10.1016/j.eswa.2023.120136

work page doi:10.1016/j.eswa.2023.120136 2023
[19]

Fantastic Generalization Measures and Where to Find Them

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where to Find Them . In 8th International Conference on Learning Representations , ICLR 2020, Addis Ababa , Ethiopia , April 26-30, 2020 . OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH

2020
[20]

A Survey of Zero -shot Generalisation in Deep Reinforcement Learning

Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A Survey of Zero -shot Generalisation in Deep Reinforcement Learning . J. Artif. Intell. Res., 76: 0 201--264, 2023. doi:10.1613/JAIR.1.14174. URL https://doi.org/10.1613/jair.1.14174

work page doi:10.1613/jair.1.14174 2023
[21]

Offline Reinforcement Learning with Implicit Q - Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q - Learning . In The Tenth International Conference on Learning Representations , ICLR 2022, Virtual Event , April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=68n2s9ZJWF8

2022
[22]

Stabilizing Off - Policy Q - Learning via Bootstrapping Error Reduction

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing Off - Policy Q - Learning via Bootstrapping Error Reduction . In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Informatio...

2019
[23]

Conservative Q - Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q - Learning for Offline Reinforcement Learning . In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2...

2020
[24]

Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent . In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information P...

2019
[25]

GTA : Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning

Jaewoo Lee, Sujin Yun, Taeyoung Yun, and Jinkyoo Park. GTA : Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning . In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Infor...

2024
[26]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning : Tutorial , Review , and Perspectives on Open Problems . CoRR, abs/2005.01643, 2020. URL https://arxiv.org/abs/2005.01643. arXiv: 2005.01643

work page internal anchor Pith review Pith/arXiv arXiv 2005
[27]

Mildly Conservative Q - Learning for Offline Reinforcement Learning

Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly Conservative Q - Learning for Offline Reinforcement Learning . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans , LA ,...

2022
[28]

Reining Generalization in Offline Reinforcement Learning via Representation Distinction

Yi Ma, Hongyao Tang, Dong Li, and Zhaopeng Meng. Reining Generalization in Offline Reinforcement Learning via Representation Distinction . In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, ...

2023
[29]

Doubly Mild Generalization for Offline Reinforcement Learning

Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, and Xiangyang Ji. Doubly Mild Generalization for Offline Reinforcement Learning . In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Syste...

2024
[30]

Improving Zero - Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions

Bogdan Mazoure, Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Improving Zero - Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Informatio...

2022
[31]

The Generalization Gap in Offline Reinforcement Learning

Ishita Mediratta, Qingfei You, Minqi Jiang, and Roberta Raileanu. The Generalization Gap in Offline Reinforcement Learning . In The Twelfth International Conference on Learning Representations , ICLR 2024, Vienna , Austria , May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=3w6xuXDOdY

2024
[32]

Rusu, Joel Veness, Marc G

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...

work page doi:10.1038/nature14236 2015
[33]

Is Value Learning Really the Main Bottleneck in Offline RL ? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is Value Learning Really the Main Bottleneck in Offline RL ? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems ...

2024
[34]

Whitney, and Martin A

Cristina Pinneri, Sarah Bechtle, Markus Wulfmeier, Arunkumar Byravan, Jingwei Zhang, William F. Whitney, and Martin A. Riedmiller. Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning . CoRR, abs/2309.07578, 2023. doi:10.48550/ARXIV.2309.07578. URL https://doi.org/10.48550/arXiv.2309.07578. arXiv: 2309.07578

work page doi:10.48550/arxiv.2309.07578 2023
[35]

Automatic Data Augmentation for Generalization in Reinforcement Learning

Roberta Raileanu, Maxwell Goldstein, Denis Yarats, Ilya Kostrikov, and Rob Fergus. Automatic Data Augmentation for Generalization in Reinforcement Learning . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Inform...

2021
[36]

Strategically Conservative Q - Learning

Yutaka Shimizu, Joey Hong, Sergey Levine, and Masayoshi Tomizuka. Strategically Conservative Q - Learning . CoRR, abs/2406.04534, 2024. doi:10.48550/ARXIV.2406.04534. URL https://doi.org/10.48550/arXiv.2406.04534. arXiv: 2406.04534

work page doi:10.48550/arxiv.2406.04534 2024
[37]

S4RL : Surprisingly Simple Self - Supervision for Offline Reinforcement Learning in Robotics

Samarth Sinha, Ajay Mandlekar, and Animesh Garg. S4RL : Surprisingly Simple Self - Supervision for Offline Reinforcement Learning in Robotics . In Aleksandra Faust, David Hsu, and Gerhard Neumann (eds.), Conference on Robot Learning , 8-11 November 2021, London , UK , volume 164 of Proceedings of Machine Learning Research , pp.\ 907--917. PMLR, 2021. URL ...

2021
[38]

High-dimensional probability

Roman Vershynin. High-dimensional probability. Cambridge University Press Cambridge, UK, 2009

2009
[39]

Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting

Da Wang, Lin Li, Wei Wei, Qixian Yu, Jianye Hao, and Jiye Liang. Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting . In Forty-first International Conference on Machine Learning , ICML 2024, Vienna , Austria , July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=CV9PiQGt0i

2024
[40]

Zhiyong Wang, Chen Yang, John C. S. Lui, and Dongruo Zhou. Provable Zero - Shot Generalization in Offline Reinforcement Learning . In Forty-second International Conference on Machine Learning , ICML 2025, Vancouver , BC , Canada , July 13-19, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=1jx6bgemqg

2025
[41]

Zanger, Matthijs T

Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, and Wendelin Böhmer. How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning . CoRR, abs/2505.16581, 2025. doi:10.48550/ARXIV.2505.16581. URL https://doi.org/10.48550/arXiv.2505.16581. arXiv: 2505.16581

work page doi:10.48550/arxiv.2505.16581 2025
[42]

RTDiff : Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning

Qianlan Yang and Yu-Xiong Wang. RTDiff : Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning . In The Thirteenth International Conference on Learning Representations , ICLR 2025, Singapore , April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=0FK6tzqV76

2025
[43]

Rui Yang, Lin Yong, Xiaoteng Ma, Hao Hu, Chongjie Zhang, and Tong Zhang. What is Essential for Unseen Goal Generalization of Offline Goal -conditioned RL ? In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning , ICML 2023, 23-29 July 2023, Honolulu , H...

2023
[44]

Ward, Inderjit S

Shuo Yang, Yijun Dong, Rachel A. Ward, Inderjit S. Dhillon, Sujay Sanghavi, and Qi Lei. Sample Efficiency of Data Augmentation Consistency Regularization . In Francisco J. R. Ruiz, Jennifer G. Dy, and Jan-Willem van de Meent (eds.), International Conference on Artificial Intelligence and Statistics , 25-27 April 2023, Palau de Congressos , Valencia , Spai...

2023

[1] [1]

Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët

S. Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët. The Steady - State Control Problem for Markov Decision Processes . In Kaustubh R. Joshi, Markus Siegle, Mariëlle Stoelinga, and Pedro R. D'Argenio (eds.), Quantitative Evaluation of Systems - 10th International Conference , QEST 2013, Buenos Aires , Argentina , August 27-30, 2013. Proceedings ,...

work page doi:10.1007/978-3-642-40196-1_26 2013

[2] [2]

Christensen

Abdulaziz Almuzairee, Nicklas Hansen, and Henrik I. Christensen. A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning . RLJ, 1: 0 130--157, 2024. URL https://rlj.cs.umass.edu/2024/papers/Paper26.html

2024

[3] [3]

Uncertainty- Based Offline Reinforcement Learning with Diversified Q - Ensemble

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty- Based Offline Reinforcement Learning with Diversified Q - Ensemble . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing S...

2021

[4] [4]

Learning with Pseudo - Ensembles

Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with Pseudo - Ensembles . In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal , Quebec , Canada , pp...

2014

[5] [5]

Look Beneath the Surface : Exploiting Fundamental Symmetry for Sample - Efficient Offline RL

Peng Cheng, Xianyuan Zhan, Zhi-Hao Wu, Wenjia Zhang, Youfang Lin, Shoucheng Song, Han Wang, and Li Jiang. Look Beneath the Surface : Exploiting Fundamental Symmetry for Sample - Efficient Offline RL . In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Ann...

2023

[6] [6]

Daesol Cho, Dongseok Shim, and H. Jin Kim. S2P : State -conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 20...

2022

[7] [7]

Corrado, Yuxiao Qu, John U

Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, and Josiah P. Hanna. Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning . RLJ, 1: 0 198--215, 2024. URL https://rlj.cs.umass.edu/2024/papers/Paper33.html

2024

[8] [8]

The Delft AI Cluster ( DAIC ), RRID : SCR \_025091, 2024

Delft AI Cluster (DAIC) . The Delft AI Cluster ( DAIC ), RRID : SCR \_025091, 2024. URL https://doc.daic.tudelft.nl/

2024

[9] [9]

DelftBlue Supercomputer ( Phase 2), 2024

Delft High Performance Computing Centre (DHPC) . DelftBlue Supercomputer ( Phase 2), 2024. URL https://www.tudelft.nl/dhpc/ark:/44463/DelftBluePhase2

2024

[10] [10]

A Minimalist Approach to Offline Reinforcement Learning

Scott Fujimoto and Shixiang Shane Gu. A Minimalist Approach to Offline Reinforcement Learning . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 20...

2021

[11] [11]

Off- Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto, David Meger, and Doina Precup. Off- Policy Deep Reinforcement Learning without Exploration . In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning , ICML 2019, 9-15 June 2019, Long Beach , California , USA , volume 97 of Proceedings of Machine Learning Research , pp.\ 20...

2019

[12] [12]

Fantastic Generalization Measures are Nowhere to be Found

Michael Gastpar, Ido Nachum, Jonathan Shafer, and Thomas Weinberger. Fantastic Generalization Measures are Nowhere to be Found . In The Twelfth International Conference on Learning Representations , ICLR 2024, Vienna , Austria , May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=NkmJotfL42

2024

[13] [13]

Gerken and Pan Kessel

Jan E. Gerken and Pan Kessel. Emergent Equivariance in Deep Ensembles . In Forty-first International Conference on Machine Learning , ICML 2024, Vienna , Austria , July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=plXXbXjvQ9

2024

[14] [14]

Contextual Markov Decision Processes

Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov Decision Processes . CoRR, abs/1502.02259, 2015. URL http://arxiv.org/abs/1502.02259. arXiv: 1502.02259

work page internal anchor Pith review Pith/arXiv arXiv 2015

[15] [15]

Generalization in Reinforcement Learning by Soft Data Augmentation

Nicklas Hansen and Xiaolong Wang. Generalization in Reinforcement Learning by Soft Data Augmentation . In IEEE International Conference on Robotics and Automation , ICRA 2021, Xi 'an, China , May 30 - June 5, 2021 , pp.\ 13611--13617. IEEE, 2021. doi:10.1109/ICRA48506.2021.9561103

work page doi:10.1109/icra48506.2021.9561103 2021

[16] [16]

Goal- Conditioned Data Augmentation for Offline Reinforcement Learning

Xingshuai Huang, Di Wu, and Benoit Boulet. Goal- Conditioned Data Augmentation for Offline Reinforcement Learning . Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=8K16dplpE0

2025

[17] [17]

Neural Tangent Kernel : Convergence and Generalization in Neural Networks

Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural Tangent Kernel : Convergence and Generalization in Neural Networks . In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 20...

2018

[18] [18]

K-mixup: Data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace

Junwoo Jang, Jungwoo Han, and Jinwhan Kim. K-mixup: Data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace. Expert Syst. Appl., 225: 0 120136, 2023. doi:10.1016/J.ESWA.2023.120136. URL https://doi.org/10.1016/j.eswa.2023.120136

work page doi:10.1016/j.eswa.2023.120136 2023

[19] [19]

Fantastic Generalization Measures and Where to Find Them

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where to Find Them . In 8th International Conference on Learning Representations , ICLR 2020, Addis Ababa , Ethiopia , April 26-30, 2020 . OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH

2020

[20] [20]

A Survey of Zero -shot Generalisation in Deep Reinforcement Learning

Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A Survey of Zero -shot Generalisation in Deep Reinforcement Learning . J. Artif. Intell. Res., 76: 0 201--264, 2023. doi:10.1613/JAIR.1.14174. URL https://doi.org/10.1613/jair.1.14174

work page doi:10.1613/jair.1.14174 2023

[21] [21]

Offline Reinforcement Learning with Implicit Q - Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q - Learning . In The Tenth International Conference on Learning Representations , ICLR 2022, Virtual Event , April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=68n2s9ZJWF8

2022

[22] [22]

Stabilizing Off - Policy Q - Learning via Bootstrapping Error Reduction

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing Off - Policy Q - Learning via Bootstrapping Error Reduction . In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Informatio...

2019

[23] [23]

Conservative Q - Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q - Learning for Offline Reinforcement Learning . In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2...

2020

[24] [24]

Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent . In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information P...

2019

[25] [25]

GTA : Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning

Jaewoo Lee, Sujin Yun, Taeyoung Yun, and Jinkyoo Park. GTA : Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning . In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Infor...

2024

[26] [26]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning : Tutorial , Review , and Perspectives on Open Problems . CoRR, abs/2005.01643, 2020. URL https://arxiv.org/abs/2005.01643. arXiv: 2005.01643

work page internal anchor Pith review Pith/arXiv arXiv 2005

[27] [27]

Mildly Conservative Q - Learning for Offline Reinforcement Learning

Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly Conservative Q - Learning for Offline Reinforcement Learning . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans , LA ,...

2022

[28] [28]

Reining Generalization in Offline Reinforcement Learning via Representation Distinction

Yi Ma, Hongyao Tang, Dong Li, and Zhaopeng Meng. Reining Generalization in Offline Reinforcement Learning via Representation Distinction . In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, ...

2023

[29] [29]

Doubly Mild Generalization for Offline Reinforcement Learning

Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, and Xiangyang Ji. Doubly Mild Generalization for Offline Reinforcement Learning . In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Syste...

2024

[30] [30]

Improving Zero - Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions

Bogdan Mazoure, Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Improving Zero - Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Informatio...

2022

[31] [31]

The Generalization Gap in Offline Reinforcement Learning

Ishita Mediratta, Qingfei You, Minqi Jiang, and Roberta Raileanu. The Generalization Gap in Offline Reinforcement Learning . In The Twelfth International Conference on Learning Representations , ICLR 2024, Vienna , Austria , May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=3w6xuXDOdY

2024

[32] [32]

Rusu, Joel Veness, Marc G

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...

work page doi:10.1038/nature14236 2015

[33] [33]

Is Value Learning Really the Main Bottleneck in Offline RL ? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is Value Learning Really the Main Bottleneck in Offline RL ? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems ...

2024

[34] [34]

Whitney, and Martin A

Cristina Pinneri, Sarah Bechtle, Markus Wulfmeier, Arunkumar Byravan, Jingwei Zhang, William F. Whitney, and Martin A. Riedmiller. Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning . CoRR, abs/2309.07578, 2023. doi:10.48550/ARXIV.2309.07578. URL https://doi.org/10.48550/arXiv.2309.07578. arXiv: 2309.07578

work page doi:10.48550/arxiv.2309.07578 2023

[35] [35]

Automatic Data Augmentation for Generalization in Reinforcement Learning

Roberta Raileanu, Maxwell Goldstein, Denis Yarats, Ilya Kostrikov, and Rob Fergus. Automatic Data Augmentation for Generalization in Reinforcement Learning . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Inform...

2021

[36] [36]

Strategically Conservative Q - Learning

Yutaka Shimizu, Joey Hong, Sergey Levine, and Masayoshi Tomizuka. Strategically Conservative Q - Learning . CoRR, abs/2406.04534, 2024. doi:10.48550/ARXIV.2406.04534. URL https://doi.org/10.48550/arXiv.2406.04534. arXiv: 2406.04534

work page doi:10.48550/arxiv.2406.04534 2024

[37] [37]

S4RL : Surprisingly Simple Self - Supervision for Offline Reinforcement Learning in Robotics

Samarth Sinha, Ajay Mandlekar, and Animesh Garg. S4RL : Surprisingly Simple Self - Supervision for Offline Reinforcement Learning in Robotics . In Aleksandra Faust, David Hsu, and Gerhard Neumann (eds.), Conference on Robot Learning , 8-11 November 2021, London , UK , volume 164 of Proceedings of Machine Learning Research , pp.\ 907--917. PMLR, 2021. URL ...

2021

[38] [38]

High-dimensional probability

Roman Vershynin. High-dimensional probability. Cambridge University Press Cambridge, UK, 2009

2009

[39] [39]

Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting

Da Wang, Lin Li, Wei Wei, Qixian Yu, Jianye Hao, and Jiye Liang. Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting . In Forty-first International Conference on Machine Learning , ICML 2024, Vienna , Austria , July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=CV9PiQGt0i

2024

[40] [40]

Zhiyong Wang, Chen Yang, John C. S. Lui, and Dongruo Zhou. Provable Zero - Shot Generalization in Offline Reinforcement Learning . In Forty-second International Conference on Machine Learning , ICML 2025, Vancouver , BC , Canada , July 13-19, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=1jx6bgemqg

2025

[41] [41]

Zanger, Matthijs T

Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, and Wendelin Böhmer. How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning . CoRR, abs/2505.16581, 2025. doi:10.48550/ARXIV.2505.16581. URL https://doi.org/10.48550/arXiv.2505.16581. arXiv: 2505.16581

work page doi:10.48550/arxiv.2505.16581 2025

[42] [42]

RTDiff : Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning

Qianlan Yang and Yu-Xiong Wang. RTDiff : Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning . In The Thirteenth International Conference on Learning Representations , ICLR 2025, Singapore , April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=0FK6tzqV76

2025

[43] [43]

Rui Yang, Lin Yong, Xiaoteng Ma, Hao Hu, Chongjie Zhang, and Tong Zhang. What is Essential for Unseen Goal Generalization of Offline Goal -conditioned RL ? In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning , ICML 2023, 23-29 July 2023, Honolulu , H...

2023

[44] [44]

Ward, Inderjit S

Shuo Yang, Yijun Dong, Rachel A. Ward, Inderjit S. Dhillon, Sujay Sanghavi, and Qi Lei. Sample Efficiency of Data Augmentation Consistency Regularization . In Francisco J. R. Ruiz, Jennifer G. Dy, and Jan-Willem van de Meent (eds.), International Conference on Artificial Intelligence and Statistics , 25-27 April 2023, Palau de Congressos , Valencia , Spai...

2023