Generalization in offline RL: The structure is more important than the amount of pessimism
Pith reviewed 2026-07-03 16:39 UTC · model grok-4.3
The pith
The structure of pessimism, not its amount, determines generalization in offline RL for contextual MDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In contextual MDPs, successful generalization under pessimism requires the pessimistic structure to match underlying symmetries of the optimal solution. A mildly pessimistic non-symmetric value function generalizes worse than an overly pessimistic symmetric one. The structure of pessimism is fixed by dataset coverage, and data augmentation works best when imposed through a consistency loss during policy extraction rather than by training on an augmented dataset.
What carries the argument
Symmetry-respecting pessimistic value function whose structure is fixed by dataset coverage in contextual MDPs, with consistency loss used to enforce symmetry at policy extraction.
If this is right
- Overly pessimistic but symmetric value functions can achieve optimal generalization in CMDPs.
- Mildly pessimistic but non-symmetric value functions can produce strictly worse generalization than overly pessimistic symmetric ones.
- Dataset coverage determines the structure of pessimism, so symmetry must be enforced separately.
- Data augmentation via consistency loss at policy extraction outperforms training on an augmented dataset for IQL and CQL.
Where Pith is reading between the lines
- Enforcing symmetry could allow good generalization from smaller or less diverse offline datasets when the environment has known symmetries.
- The same consistency-loss approach may extend to other RL domains with group symmetries such as robotics or navigation.
- In environments without clear symmetries the benefit of forcing symmetry would need separate verification.
Load-bearing premise
The optimal solution possesses underlying symmetries such as rotational symmetry that the pessimistic value function must respect to generalize well.
What would settle it
An experiment on a rotationally symmetric CMDP where a mildly pessimistic non-symmetric value function is shown to generalize at least as well as an overly pessimistic symmetric one.
Figures
read the original abstract
While pessimism counteracts overestimation bias in offline reinforcement learning (RL), being overly conservative has been associated with hindering certain forms of generalization. However, in this paper we demonstrate that being overly pessimistic does not inherently prevent optimal generalization in contextual MDPs (CMDPs). Instead, we argue successful generalization depends not on the amount of pessimism, but whether the pessimistic structure respects the underlying symmetries of the optimal solution. We prove that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the dataset coverage. As such, enforcing a symmetric value function can be non-trivial, and might require techniques such as data augmentation (DA). Inspired by our theoretical results, we argue that DA can best be applied through a consistency loss during policy extraction, rather than the common practice of (regular) offline training on an augmented dataset. This is empirically validated using IQL and CQL on a rotationally symmetric reacher environment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in offline RL for contextual MDPs, generalization success depends on whether the structure of pessimism respects underlying symmetries of the optimal solution Q*, rather than the amount of pessimism. It proves that a mildly pessimistic non-symmetric value function can generalize worse than an overly pessimistic symmetric one, links pessimism structure to dataset coverage, proposes applying data augmentation via a consistency loss during policy extraction (instead of augmented offline training), and validates this empirically with IQL and CQL on a rotationally symmetric reacher environment.
Significance. If the result holds, the work provides a useful distinction between amount and structure of pessimism, with a concrete proof separating the two via symmetry and an empirical demonstration that consistency-loss DA can improve generalization. The explicit proof and the reacher experiments (with IQL/CQL) are strengths that make the contribution falsifiable and actionable for offline RL algorithm design.
major comments (2)
- [§3] §3 (proof of the main claim): the derivation assumes the optimal Q* possesses exact symmetries (e.g., rotational symmetry in the reacher CMDP) that the pessimistic value function must match; this assumption is load-bearing because the constructed counter-example separating 'amount' from 'structure' of pessimism ceases to do so if the CMDP lacks the symmetry or if function approximation breaks the induced structure.
- [§4] §4 (reacher experiments): the reported results do not appear to include controls that isolate dataset coverage symmetry from other confounding factors (e.g., coverage density or reward scaling); without such controls the empirical separation between structure and amount remains inconclusive.
minor comments (2)
- [§2] Notation for the symmetry group and the induced pessimism operator should be introduced earlier and used consistently when linking dataset coverage to value-function structure.
- [Abstract] The abstract and introduction could explicitly flag the symmetry assumption on Q* as a modeling choice rather than a general property of all CMDPs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The distinction between the amount and structure of pessimism is central to our contribution, and we address the concerns regarding the proof assumptions and experimental controls below. We believe the theoretical separation remains valid under the stated conditions of symmetric CMDPs, and we are prepared to strengthen the empirical section.
read point-by-point responses
-
Referee: [§3] §3 (proof of the main claim): the derivation assumes the optimal Q* possesses exact symmetries (e.g., rotational symmetry in the reacher CMDP) that the pessimistic value function must match; this assumption is load-bearing because the constructed counter-example separating 'amount' from 'structure' of pessimism ceases to do so if the CMDP lacks the symmetry or if function approximation breaks the induced structure.
Authors: The proof is explicitly constructed for CMDPs where Q* exhibits exact symmetries (e.g., rotational invariance in the reacher setting). Our claim is that, conditional on such symmetries existing, the structure of pessimism induced by dataset coverage determines generalization success more than its magnitude; the counter-example separates these factors precisely under that symmetry. We will revise §3 to state the assumption more prominently at the outset of the theorem and add a short discussion of the result's scope when symmetries are absent or broken by function approximation. revision: partial
-
Referee: [§4] §4 (reacher experiments): the reported results do not appear to include controls that isolate dataset coverage symmetry from other confounding factors (e.g., coverage density or reward scaling); without such controls the empirical separation between structure and amount remains inconclusive.
Authors: The reacher CMDP is designed with explicit rotational symmetry, and datasets are generated to produce coverage patterns that induce symmetric versus non-symmetric pessimism while holding the underlying MDP fixed. Nevertheless, we agree that explicit ablations varying coverage density independently of symmetry and checking sensitivity to reward scaling would further isolate the effect. We will add these controls to the revised experimental section. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's derivation proceeds from external CMDP coverage properties to a proof comparing non-symmetric vs. symmetric pessimism structures, with the structure of pessimism explicitly tied to dataset coverage as an input rather than derived from the result itself. The claim that structure trumps amount of pessimism follows from this coverage-to-symmetry linkage without reducing any prediction or theorem to a fitted parameter or self-referential definition. The DA consistency-loss recommendation is presented as inspired by (rather than presupposing) the theoretical comparison, and no load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked in the abstract or described chain. The derivation therefore remains self-contained against the stated assumptions about symmetries and coverage.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët
S. Akshay, Nathalie Bertrand, Serge Haddad, and Loïc Hélouët. The Steady - State Control Problem for Markov Decision Processes . In Kaustubh R. Joshi, Markus Siegle, Mariëlle Stoelinga, and Pedro R. D'Argenio (eds.), Quantitative Evaluation of Systems - 10th International Conference , QEST 2013, Buenos Aires , Argentina , August 27-30, 2013. Proceedings ,...
-
[2]
Christensen
Abdulaziz Almuzairee, Nicklas Hansen, and Henrik I. Christensen. A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning . RLJ, 1: 0 130--157, 2024. URL https://rlj.cs.umass.edu/2024/papers/Paper26.html
2024
-
[3]
Uncertainty- Based Offline Reinforcement Learning with Diversified Q - Ensemble
Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty- Based Offline Reinforcement Learning with Diversified Q - Ensemble . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing S...
2021
-
[4]
Learning with Pseudo - Ensembles
Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with Pseudo - Ensembles . In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal , Quebec , Canada , pp...
2014
-
[5]
Look Beneath the Surface : Exploiting Fundamental Symmetry for Sample - Efficient Offline RL
Peng Cheng, Xianyuan Zhan, Zhi-Hao Wu, Wenjia Zhang, Youfang Lin, Shoucheng Song, Han Wang, and Li Jiang. Look Beneath the Surface : Exploiting Fundamental Symmetry for Sample - Efficient Offline RL . In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Ann...
2023
-
[6]
Daesol Cho, Dongseok Shim, and H. Jin Kim. S2P : State -conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 20...
2022
-
[7]
Corrado, Yuxiao Qu, John U
Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, and Josiah P. Hanna. Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning . RLJ, 1: 0 198--215, 2024. URL https://rlj.cs.umass.edu/2024/papers/Paper33.html
2024
-
[8]
The Delft AI Cluster ( DAIC ), RRID : SCR \_025091, 2024
Delft AI Cluster (DAIC) . The Delft AI Cluster ( DAIC ), RRID : SCR \_025091, 2024. URL https://doc.daic.tudelft.nl/
2024
-
[9]
DelftBlue Supercomputer ( Phase 2), 2024
Delft High Performance Computing Centre (DHPC) . DelftBlue Supercomputer ( Phase 2), 2024. URL https://www.tudelft.nl/dhpc/ark:/44463/DelftBluePhase2
2024
-
[10]
A Minimalist Approach to Offline Reinforcement Learning
Scott Fujimoto and Shixiang Shane Gu. A Minimalist Approach to Offline Reinforcement Learning . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 20...
2021
-
[11]
Off- Policy Deep Reinforcement Learning without Exploration
Scott Fujimoto, David Meger, and Doina Precup. Off- Policy Deep Reinforcement Learning without Exploration . In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning , ICML 2019, 9-15 June 2019, Long Beach , California , USA , volume 97 of Proceedings of Machine Learning Research , pp.\ 20...
2019
-
[12]
Fantastic Generalization Measures are Nowhere to be Found
Michael Gastpar, Ido Nachum, Jonathan Shafer, and Thomas Weinberger. Fantastic Generalization Measures are Nowhere to be Found . In The Twelfth International Conference on Learning Representations , ICLR 2024, Vienna , Austria , May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=NkmJotfL42
2024
-
[13]
Gerken and Pan Kessel
Jan E. Gerken and Pan Kessel. Emergent Equivariance in Deep Ensembles . In Forty-first International Conference on Machine Learning , ICML 2024, Vienna , Austria , July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=plXXbXjvQ9
2024
-
[14]
Contextual Markov Decision Processes
Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov Decision Processes . CoRR, abs/1502.02259, 2015. URL http://arxiv.org/abs/1502.02259. arXiv: 1502.02259
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[15]
Generalization in Reinforcement Learning by Soft Data Augmentation
Nicklas Hansen and Xiaolong Wang. Generalization in Reinforcement Learning by Soft Data Augmentation . In IEEE International Conference on Robotics and Automation , ICRA 2021, Xi 'an, China , May 30 - June 5, 2021 , pp.\ 13611--13617. IEEE, 2021. doi:10.1109/ICRA48506.2021.9561103
-
[16]
Goal- Conditioned Data Augmentation for Offline Reinforcement Learning
Xingshuai Huang, Di Wu, and Benoit Boulet. Goal- Conditioned Data Augmentation for Offline Reinforcement Learning . Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=8K16dplpE0
2025
-
[17]
Neural Tangent Kernel : Convergence and Generalization in Neural Networks
Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural Tangent Kernel : Convergence and Generalization in Neural Networks . In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 20...
2018
-
[18]
Junwoo Jang, Jungwoo Han, and Jinwhan Kim. K-mixup: Data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace. Expert Syst. Appl., 225: 0 120136, 2023. doi:10.1016/J.ESWA.2023.120136. URL https://doi.org/10.1016/j.eswa.2023.120136
-
[19]
Fantastic Generalization Measures and Where to Find Them
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where to Find Them . In 8th International Conference on Learning Representations , ICLR 2020, Addis Ababa , Ethiopia , April 26-30, 2020 . OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH
2020
-
[20]
A Survey of Zero -shot Generalisation in Deep Reinforcement Learning
Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A Survey of Zero -shot Generalisation in Deep Reinforcement Learning . J. Artif. Intell. Res., 76: 0 201--264, 2023. doi:10.1613/JAIR.1.14174. URL https://doi.org/10.1613/jair.1.14174
-
[21]
Offline Reinforcement Learning with Implicit Q - Learning
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q - Learning . In The Tenth International Conference on Learning Representations , ICLR 2022, Virtual Event , April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=68n2s9ZJWF8
2022
-
[22]
Stabilizing Off - Policy Q - Learning via Bootstrapping Error Reduction
Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing Off - Policy Q - Learning via Bootstrapping Error Reduction . In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Informatio...
2019
-
[23]
Conservative Q - Learning for Offline Reinforcement Learning
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q - Learning for Offline Reinforcement Learning . In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2...
2020
-
[24]
Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington
Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent . In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information P...
2019
-
[25]
GTA : Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning
Jaewoo Lee, Sujin Yun, Taeyoung Yun, and Jinkyoo Park. GTA : Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning . In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Infor...
2024
-
[26]
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning : Tutorial , Review , and Perspectives on Open Problems . CoRR, abs/2005.01643, 2020. URL https://arxiv.org/abs/2005.01643. arXiv: 2005.01643
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[27]
Mildly Conservative Q - Learning for Offline Reinforcement Learning
Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly Conservative Q - Learning for Offline Reinforcement Learning . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans , LA ,...
2022
-
[28]
Reining Generalization in Offline Reinforcement Learning via Representation Distinction
Yi Ma, Hongyao Tang, Dong Li, and Zhaopeng Meng. Reining Generalization in Offline Reinforcement Learning via Representation Distinction . In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, ...
2023
-
[29]
Doubly Mild Generalization for Offline Reinforcement Learning
Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, and Xiangyang Ji. Doubly Mild Generalization for Offline Reinforcement Learning . In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Syste...
2024
-
[30]
Improving Zero - Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions
Bogdan Mazoure, Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Improving Zero - Shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions . In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Informatio...
2022
-
[31]
The Generalization Gap in Offline Reinforcement Learning
Ishita Mediratta, Qingfei You, Minqi Jiang, and Roberta Raileanu. The Generalization Gap in Offline Reinforcement Learning . In The Twelfth International Conference on Learning Representations , ICLR 2024, Vienna , Austria , May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=3w6xuXDOdY
2024
-
[32]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement l...
-
[33]
Is Value Learning Really the Main Bottleneck in Offline RL ? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M
Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is Value Learning Really the Main Bottleneck in Offline RL ? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems ...
2024
-
[34]
Cristina Pinneri, Sarah Bechtle, Markus Wulfmeier, Arunkumar Byravan, Jingwei Zhang, William F. Whitney, and Martin A. Riedmiller. Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning . CoRR, abs/2309.07578, 2023. doi:10.48550/ARXIV.2309.07578. URL https://doi.org/10.48550/arXiv.2309.07578. arXiv: 2309.07578
-
[35]
Automatic Data Augmentation for Generalization in Reinforcement Learning
Roberta Raileanu, Maxwell Goldstein, Denis Yarats, Ilya Kostrikov, and Rob Fergus. Automatic Data Augmentation for Generalization in Reinforcement Learning . In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Inform...
2021
-
[36]
Strategically Conservative Q - Learning
Yutaka Shimizu, Joey Hong, Sergey Levine, and Masayoshi Tomizuka. Strategically Conservative Q - Learning . CoRR, abs/2406.04534, 2024. doi:10.48550/ARXIV.2406.04534. URL https://doi.org/10.48550/arXiv.2406.04534. arXiv: 2406.04534
-
[37]
S4RL : Surprisingly Simple Self - Supervision for Offline Reinforcement Learning in Robotics
Samarth Sinha, Ajay Mandlekar, and Animesh Garg. S4RL : Surprisingly Simple Self - Supervision for Offline Reinforcement Learning in Robotics . In Aleksandra Faust, David Hsu, and Gerhard Neumann (eds.), Conference on Robot Learning , 8-11 November 2021, London , UK , volume 164 of Proceedings of Machine Learning Research , pp.\ 907--917. PMLR, 2021. URL ...
2021
-
[38]
High-dimensional probability
Roman Vershynin. High-dimensional probability. Cambridge University Press Cambridge, UK, 2009
2009
-
[39]
Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting
Da Wang, Lin Li, Wei Wei, Qixian Yu, Jianye Hao, and Jiye Liang. Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting . In Forty-first International Conference on Machine Learning , ICML 2024, Vienna , Austria , July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=CV9PiQGt0i
2024
-
[40]
Zhiyong Wang, Chen Yang, John C. S. Lui, and Dongruo Zhou. Provable Zero - Shot Generalization in Offline Reinforcement Learning . In Forty-second International Conference on Machine Learning , ICML 2025, Vancouver , BC , Canada , July 13-19, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=1jx6bgemqg
2025
-
[41]
Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, and Wendelin Böhmer. How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning . CoRR, abs/2505.16581, 2025. doi:10.48550/ARXIV.2505.16581. URL https://doi.org/10.48550/arXiv.2505.16581. arXiv: 2505.16581
-
[42]
RTDiff : Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning
Qianlan Yang and Yu-Xiong Wang. RTDiff : Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning . In The Thirteenth International Conference on Learning Representations , ICLR 2025, Singapore , April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=0FK6tzqV76
2025
-
[43]
Rui Yang, Lin Yong, Xiaoteng Ma, Hao Hu, Chongjie Zhang, and Tong Zhang. What is Essential for Unseen Goal Generalization of Offline Goal -conditioned RL ? In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning , ICML 2023, 23-29 July 2023, Honolulu , H...
2023
-
[44]
Ward, Inderjit S
Shuo Yang, Yijun Dong, Rachel A. Ward, Inderjit S. Dhillon, Sujay Sanghavi, and Qi Lei. Sample Efficiency of Data Augmentation Consistency Regularization . In Francisco J. R. Ruiz, Jennifer G. Dy, and Jan-Willem van de Meent (eds.), International Conference on Artificial Intelligence and Statistics , 25-27 April 2023, Palau de Congressos , Valencia , Spai...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.