A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration
Pith reviewed 2026-05-09 15:40 UTC · model grok-4.3
The pith
Fitted Q-iteration admits finite-sample performance bounds under adaptive data on general Borel spaces by chaining measure-theoretic probability with Bellman-operator contraction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces.
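Fitted Q-iteration alternates sampled Bellman backups with a regression fit. As a minimal illustration of that loop (not the paper's algorithm), here is tabular FQI on a random finite MDP, a toy stand-in for the general Borel-space setting; all names and constants below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, GAMMA, N_ITERS = 20, 3, 0.9, 60

# Random finite MDP standing in for a general Borel-space MDP (illustration only).
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(s' | s, a)
R = rng.uniform(0.0, 1.0, size=(S, A))       # bounded rewards, as the theory assumes

def fitted_q_iteration(n_samples_per_iter=500):
    """One illustrative FQI run: each iterate regresses on sampled Bellman targets."""
    Q = np.zeros((S, A))
    for _ in range(N_ITERS):
        # Sampling here is uniform for simplicity; in the paper's setting the
        # sampling distribution may depend on earlier iterates (adaptive data).
        s = rng.integers(0, S, size=n_samples_per_iter)
        a = rng.integers(0, A, size=n_samples_per_iter)
        s_next = np.array([rng.choice(S, p=P[si, ai]) for si, ai in zip(s, a)])
        y = R[s, a] + GAMMA * Q[s_next].max(axis=1)   # sampled Bellman targets
        # "Regression" step: per (s, a) cell, least squares reduces to the mean.
        pairs = {(int(si), int(ai)) for si, ai in zip(s, a)}
        Q_new = Q.copy()
        for si, ai in pairs:
            mask = (s == si) & (a == ai)
            Q_new[si, ai] = y[mask].mean()
        Q = Q_new
    return Q

Q_hat = fitted_q_iteration()
```

With rewards in [0, 1] and discount 0.9, every iterate stays within the range [0, 1/(1 − γ)] = [0, 10], which is the boundedness the finite-sample analysis relies on.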
What carries the argument
Chaining of measure-theoretic probability with Bellman-operator contraction in Banach spaces, with sequential Rademacher complexity bounding generalization under policy-dependent sampling.
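Concretely, chaining a contraction with per-iteration regression errors typically produces the classical error-propagation bound. A sketch in sup norm (the paper works with more general Banach-space norms): if each iterate satisfies $Q_{k+1} = \mathcal{T} Q_k + \varepsilon_k$, where $\mathcal{T}$ is a $\gamma$-contraction with fixed point $Q^*$ and $\varepsilon_k$ is the regression error, then

```latex
\[
\|Q_{k+1} - Q^{*}\| = \|\mathcal{T}Q_{k} - \mathcal{T}Q^{*} + \varepsilon_{k}\|
\le \gamma \,\|Q_{k} - Q^{*}\| + \|\varepsilon_{k}\|,
\]
% and unrolling the recursion over K iterations gives
\[
\|Q_{K} - Q^{*}\| \le \gamma^{K}\,\|Q_{0} - Q^{*}\|
+ \sum_{k=0}^{K-1} \gamma^{K-1-k}\,\|\varepsilon_{k}\|
\le \gamma^{K}\,\|Q_{0} - Q^{*}\| + \frac{\max_{k}\|\varepsilon_{k}\|}{1-\gamma}.
\]
```

The finite-sample content then comes from bounding each $\|\varepsilon_k\|$ via the sequential Rademacher complexity of the regression class.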
If this is right
- Fitted Q-iteration now possesses finite-sample guarantees that apply directly to policy-dependent data collection on general spaces.
- Sequential Rademacher complexity becomes a usable quantity for controlling estimation error inside the Bellman regression step of value iteration.
- A cumulative pathwise online regret bound is available for FQI in continuous spaces for the first time.
- The framework supplies the necessary foundations for formal finite-sample analysis of many modern deep RL algorithms.
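The second bullet treats sequential Rademacher complexity as a usable quantity. The sequential version is defined over binary trees of data (Rakhlin, Sridharan, and Tewari) and is hard to estimate directly; as a hedged proxy, the classical empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo. All names below are illustrative:

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the (classical) empirical Rademacher complexity
    E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ] of a finite function class,
    given preds[j, i] = f_j(x_i) on a fixed sample x_1..x_n.  This is an i.i.d.
    proxy: the sequential complexity in the paper dominates this quantity."""
    n = preds.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    sup_corr = (sigma @ preds.T / n).max(axis=1)         # best correlation per draw
    return float(sup_corr.mean())

# Sanity check: enlarging the class can only increase the estimate,
# since both calls reuse the same sign draws.
rng = np.random.default_rng(1)
x = rng.normal(size=64)
small_class = np.stack([np.sign(x), -np.sign(x)])
thresholds = np.stack([s * np.sign(x - t)
                       for t in np.linspace(-2.0, 2.0, 41) for s in (1.0, -1.0)])
big_class = np.concatenate([small_class, thresholds])
r_small = empirical_rademacher(small_class)
r_big = empirical_rademacher(big_class)
```

A practical check along the lines the review suggests would plot such an estimate against the observed generalization gap of the Bellman regression as the data grows.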
Where Pith is reading between the lines
- The same chaining argument could be applied to other value-based methods that alternate regression and policy improvement steps.
- The pathwise regret guarantee might be used to derive high-probability statements about the entire trajectory of an FQI run rather than only its final performance.
- Practical checks could compare the empirical growth of sequential Rademacher complexity against observed generalization error on benchmark continuous control tasks.
Load-bearing premise
The Markov decision process lives on general measurable Borel spaces, and adaptive, policy-dependent data collection preserves the conditions needed for sequential Rademacher complexity to control the generalization error.
What would settle it
An explicit MDP on a Borel space together with an adaptive data-collection schedule in which the observed Bellman-regression generalization error exceeds the sequential Rademacher complexity term while all other hypotheses of the bound hold.
Original abstract
While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a unified measure-theoretic framework for fitted Q-iteration (FQI) on general Borel spaces. It derives finite-sample performance bounds for adaptive, policy-dependent data collection by chaining measure-theoretic probability with Bellman-operator contractions in Banach spaces, proves that sequential Rademacher complexity controls Bellman-regression generalization error under such sampling, and extends the analysis to the first cumulative pathwise online regret guarantee for FQI in continuous spaces.
Significance. If the central derivations hold, the work would constitute a significant advance by bridging measure-theoretic MDP foundations, deterministic error propagation, and PAC-style generalization bounds, enabling rigorous finite-sample analysis of modern deep RL algorithms in continuous domains. The explicit handling of adaptive, policy-dependent data via sequential Rademacher complexity and the pathwise regret extension are clear strengths; the approach avoids reducing bounds to quantities fitted from the target data.
major comments (2)
- [Main theorem on finite-sample performance bound] The main performance bound (the chaining result) relies on sequential Rademacher complexity yielding uniform deviation bounds for the Bellman regression even though the data-generating distribution at each iteration is induced by the policy from the previous iterate. The manuscript must explicitly establish that the function class and regression operator remain measurable with respect to the filtration generated by the adaptive process; without this, the uniform bound does not apply to the moving-target backups and the subsequent contraction chaining fails to produce a valid guarantee.
- [Extension to cumulative pathwise regret] The pathwise online regret extension builds directly on the per-iteration performance bounds. Any gap in the adaptive generalization control (measurability or uniform deviation under policy-dependent sampling) would propagate to the cumulative regret statement, so the regret claim cannot be established until the core generalization step is verified for general Borel spaces.
minor comments (1)
- [Abstract] The abstract could more explicitly list the standing assumptions on the MDP (e.g., Borel measurability, bounded rewards) and on the function class to clarify the scope for readers.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of the manuscript's significance and for the detailed comments on measurability in the adaptive setting. We address each point below and will revise the manuscript to incorporate explicit clarifications and supporting lemmas.
Point-by-point responses
-
Referee: [Main theorem on finite-sample performance bound] The main performance bound (the chaining result) relies on sequential Rademacher complexity yielding uniform deviation bounds for the Bellman regression even though the data-generating distribution at each iteration is induced by the policy from the previous iterate. The manuscript must explicitly establish that the function class and regression operator remain measurable with respect to the filtration generated by the adaptive process; without this, the uniform bound does not apply to the moving-target backups and the subsequent contraction chaining fails to produce a valid guarantee.
Authors: We agree that explicit verification of measurability with respect to the adaptive filtration is essential for rigor in general Borel spaces. The manuscript invokes sequential Rademacher complexity under policy-dependent sampling and assumes the requisite measurability to apply the uniform deviation bounds, but we acknowledge that this step is not stated as a standalone lemma. In the revision we will insert a new lemma (placed before the main chaining theorem) that establishes: (i) the function class remains measurable with respect to the filtration generated by the sequence of policy-dependent data-generating distributions, and (ii) the regression operator preserves this measurability. The proof of the lemma relies on standard results for Borel spaces and the fact that each policy iterate is a measurable function of the preceding data. With this addition the uniform bounds apply directly to the moving-target backups, and the subsequent contraction chaining remains valid. This change strengthens the presentation without altering the stated results. revision: yes
-
Referee: [Extension to cumulative pathwise regret] The pathwise online regret extension builds directly on the per-iteration performance bounds. Any gap in the adaptive generalization control (measurability or uniform deviation under policy-dependent sampling) would propagate to the cumulative regret statement, so the regret claim cannot be established until the core generalization step is verified for general Borel spaces.
Authors: We concur that the cumulative pathwise regret guarantee is a direct consequence of the per-iteration bounds. Once the measurability lemma described above is added, the regret extension follows immediately from the same chaining argument used for the finite-sample performance bound. In the revised manuscript we will add a short clarifying paragraph in the regret section that explicitly references the new lemma and notes that the pathwise nature of the bound is preserved under the adaptive filtration. No further technical changes are required. revision: yes
Circularity Check
No circularity: derivation chains standard tools without reduction to inputs
Full rationale
The paper derives finite-sample bounds for FQI by chaining measure-theoretic probability, Banach-space contractions of the Bellman operator, and sequential Rademacher complexity applied to policy-dependent data on Borel spaces. No equation or theorem reduces a claimed prediction or performance bound to a quantity fitted from the target data by construction, nor does any load-bearing step invoke a self-citation chain, uniqueness theorem from the authors' prior work, or smuggled ansatz. The central claim that sequential Rademacher complexity controls Bellman-regression generalization is presented as a proved result under the stated measurability assumptions, not as a renaming or self-definition. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: MDPs are defined on general measurable Borel spaces
- standard math: the Bellman operator is a contraction in suitable Banach spaces
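The second axiom is standard and easy to check numerically: on a random finite MDP (a toy stand-in for the Banach-space setting), the Bellman optimality operator contracts in the sup norm with modulus at most γ. A minimal sketch, with all constants hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
S, A, GAMMA = 15, 4, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(s' | s, a)
R = rng.uniform(size=(S, A))                  # bounded rewards

def bellman(Q):
    """Bellman optimality operator: (T Q)(s, a) = R(s, a) + gamma * E_{s'}[max_a' Q(s', a')]."""
    return R + GAMMA * P @ Q.max(axis=1)

# Contraction check: ||T Q1 - T Q2||_inf <= gamma * ||Q1 - Q2||_inf
# for every random pair tried.
worst = 0.0
for _ in range(100):
    Q1, Q2 = rng.normal(size=(2, S, A))
    num = np.abs(bellman(Q1) - bellman(Q2)).max()
    den = np.abs(Q1 - Q2).max()
    worst = max(worst, num / den)
```

The worst observed ratio never exceeds γ, matching the textbook argument: the max over actions is a sup-norm nonexpansion and averaging over the kernel cannot increase the sup norm.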
Reference graph
Works this paper leans on
-
[1]
Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds
Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Advances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[2]
Near-optimal regret bounds for reinforcement learning
Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2008
2008
-
[3]
Minimax regret bounds for reinforcement learning
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning (ICML), 2017
2017
-
[4]
Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning
Anton Bakhtin, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, and Noam Brown. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. In International Conference on Learning Representations (ICLR), 2023
2023
-
[5]
Spectrally-normalized margin bounds for neural networks
Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[6]
A Course in Reinforcement Learning
Dimitri P. Bertsekas. A Course in Reinforcement Learning. Athena Scientific, Belmont, Massachusetts, 2nd edition, 2025
2025
-
[7]
Stochastic Optimal Control: The Discrete-Time Case
Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Academic Press, 1978
1978
-
[8]
Neuro-Dynamic Programming
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996
1996
-
[9]
Towards deeper deep reinforcement learning with spectral normalization
Johan Bjorck, Carla P. Gomes, and Kilian Q. Weinberger. Towards deeper deep reinforcement learning with spectral normalization. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[10]
Discounted dynamic programming
David Blackwell. Discounted dynamic programming. The Annals of Mathematical Statistics, 1965
1965
-
[11]
R-max - a general polynomial time algorithm for near-optimal reinforcement learning
Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002
2002
-
[12]
Randomized ensembled double Q-Learning: Learning fast without a model
Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double Q-Learning: Learning fast without a model. In International Conference on Learning Representations (ICLR), 2021
2021
-
[13]
Magnetic control of tokamak plasmas through deep reinforcement learning
Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 2022
2022
-
[14]
Kernel-based reinforcement learning: A finite-time analysis
Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, and Michal Valko. Kernel-based reinforcement learning: A finite-time analysis. In International Conference on Machine Learning (ICML), 2021
2021
-
[15]
Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature
Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[16]
Bilinear classes: A structural framework for provable generalization in RL
Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning (ICML), 2021
2021
-
[17]
Risk bounds and Rademacher complexity in batch reinforcement learning
Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and Rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning (ICML), 2021
2021
-
[18]
Reinforcement learning with Gaussian processes
Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In International Conference on Machine Learning (ICML), 2005
2005
-
[19]
Human-level play in the game of diplomacy by combining language models with strategic reasoning
FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 2022
2022
-
[20]
Error propagation for approximate policy and value iteration
Amir-Massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NeurIPS), 2010
2010
-
[21]
Average cost Markov decision processes with weakly continuous transition probabilities
Eugene A Feinberg, Pavlo O Kasyanov, and Nina V Zadoianchuk. Average cost Markov decision processes with weakly continuous transition probabilities. Mathematics of Operations Research, 2012
2012
-
[22]
The statistical complexity of interactive decision making
Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021
2021
-
[23]
Offline reinforcement learning: Fundamental barriers for value function approximation
Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory (COLT), 2022
2022
-
[24]
Addressing function approximation error in actor-critic methods
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML). PMLR, 2018
2018
-
[25]
A theory of regularized Markov decision processes
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning (ICML), 2019
2019
-
[26]
Spectral normalisation for deep reinforcement learning: An optimisation perspective
Florin Gogianu, Tudor Berariu, Mihaela Rosca, Claudia Clopath, Lucian Busoniu, and Razvan Pascanu. Spectral normalisation for deep reinforcement learning: An optimisation perspective. In International Conference on Machine Learning (ICML), 2021
2021
-
[27]
Size-independent sample complexity of neural networks
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference on Learning Theory (COLT), 2018
2018
-
[28]
Stable function approximation in dynamic programming
Geoffrey J Gordon. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995. Elsevier, 1995
1995
-
[29]
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018
2018
-
[30]
Discrete-Time Markov Control Processes: Basic Optimality Criteria
Onésimo Hernández-Lerma and Jean B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996
1996
-
[31]
Further Topics on Discrete-Time Markov Control Processes
Onésimo Hernández-Lerma and Jean B. Lasserre. Further Topics on Discrete-Time Markov Control Processes. Springer, 1999
1999
-
[32]
Contextual decision processes with low Bellman rank are PAC-learnable
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning (ICML), 2017
2017
-
[33]
Provably efficient reinforcement learning with linear function approximation
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory (COLT), 2020
2020
-
[34]
Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms
Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[35]
Approximately optimal approximate reinforcement learning
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002
2002
-
[36]
Near-optimal reinforcement learning in polynomial time
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002
2002
-
[37]
Spectral normalization for generative adversarial networks
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018
2018
-
[38]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015
2015
-
[39]
Error bounds for approximate policy iteration
Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning (ICML), 2003
2003
-
[40]
Performance bounds in L^p-norm for approximate value iteration
Rémi Munos. Performance bounds in L^p-norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007
2007
-
[41]
Finite-time bounds for fitted value iteration
Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008
2008
-
[42]
Kernel-based reinforcement learning
Dirk Ormoneit and Śaunak Sen. Kernel-based reinforcement learning. Machine Learning, 49(2–3):161–178, 2002
2002
-
[43]
(more) efficient reinforcement learning via posterior sampling
Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems (NeurIPS), 2013
2013
-
[44]
Online learning via sequential complexities
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 2015
2015
-
[45]
Bridging offline reinforcement learning and imitation learning: A tale of pessimism
Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[46]
Eluder dimension and the sample complexity of optimistic exploration
Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems (NeurIPS), 2013
2013
-
[47]
Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal
Manfred Schäl. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 1975
1975
-
[48]
Approximate modified policy iteration and its application to the game of Tetris
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 2015
2015
-
[49]
Data-efficient reinforcement learning with self-predictive representations
Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations (ICLR), 2021
2021
-
[50]
Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016
2016
-
[51]
CURL: Contrastive unsupervised representations for reinforcement learning
Aravind Srinivas, Michael Laskin, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning (ICML), 2020
2020
-
[52]
Decoupling representation learning from reinforcement learning
Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning (ICML), 2021
2021
-
[53]
Negative dynamic programming
Ralph E Strauch. Negative dynamic programming. The Annals of Mathematical Statistics, 1966
1966
-
[54]
Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees
Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, and Pierre Ménard. Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees. Advances in Neural Information Processing Systems (NeurIPS), 2022a
2022
-
[55]
From Dirichlet to Rubin: Optimistic exploration in RL without bonuses
Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, and Pierre Ménard. From Dirichlet to Rubin: Optimistic exploration in RL without bonuses. In International Conference on Machine Learning (ICML), 2022b
2022
-
[56]
Analysis of temporal-difference learning with function approximation
John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems (NeurIPS), 1996
1996
-
[57]
Kernelized reinforcement learning with order optimal regret bounds
Sattar Vakili and Julia Olkhovskaya. Kernelized reinforcement learning with order optimal regret bounds. In Advances in Neural Information Processing Systems (NeurIPS), 2023
2023
-
[58]
Grandmaster level in StarCraft II using multi-agent reinforcement learning
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019
2019
-
[59]
Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension
Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. In Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[60]
Provably efficient reinforcement learning with kernel and neural function approximations
Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, and Michael I. Jordan. Provably efficient reinforcement learning with kernel and neural function approximations. In Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[61]
Offline reinforcement learning with realizability and single-policy concentrability
Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory (COLT), 2022
2022