pith. machine review for the scientific record.

arxiv: 2605.05791 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 15:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords fitted Q-iteration · finite-sample bounds · adaptive data · measure-theoretic probability · sequential Rademacher complexity · online regret · Bellman operator · continuous spaces

The pith

Fitted Q-iteration admits finite-sample performance bounds under adaptive data on general Borel spaces by chaining measure-theoretic probability with Bellman-operator contraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper bridges three isolated traditions in reinforcement learning theory to give the first finite-sample analysis of fitted Q-iteration that works on general measurable spaces with policy-dependent data. It establishes a performance bound by combining measure-theoretic tools with the contraction property of the Bellman operator viewed in Banach spaces. Sequential Rademacher complexity is shown to control the generalization error that arises when the regression targets themselves depend on the data-collecting policy. The same machinery yields a cumulative pathwise online regret guarantee that holds for FQI in continuous state-action spaces. These results supply the formal scaffolding needed to analyze many modern deep RL algorithms beyond tabular or linear cases.
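
The phrase "chaining measure-theoretic probability with Bellman-operator contraction" matches the classical error-propagation template of approximate value iteration (in the style of Munos, 2007, and Munos and Szepesvári, 2008). As orientation only, not the paper's actual theorem, a representative bound of that shape, with T the Bellman optimality operator, γ the discount, and ε_k the regression error at iteration k:

```latex
% Representative error-propagation bound for approximate value iteration
% (shape only, in sup norm; the paper's result instead controls the
% epsilon_k terms via sequential Rademacher complexity under adaptive
% sampling, in weighted norms on general Borel spaces).
\[
  \|Q^{*} - Q^{\pi_K}\|_{\infty}
  \;\le\;
  \frac{2\gamma}{(1-\gamma)^{2}} \max_{0 \le k < K} \|\epsilon_k\|_{\infty}
  \;+\;
  \frac{2\gamma^{K+1}}{1-\gamma}\,\|Q^{*} - Q_{0}\|_{\infty},
  \qquad
  \epsilon_k \;=\; T Q_k - Q_{k+1}.
\]
```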

Core claim

Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces.

What carries the argument

Chaining of measure-theoretic probability with Bellman-operator contraction in Banach spaces, with sequential Rademacher complexity bounding generalization under policy-dependent sampling.
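
To make the object of analysis concrete, here is a self-contained toy FQI loop. It is not the paper's setting (a 1-D state and a fixed linear class stand in for general Borel spaces and a rich function class), but it makes visible the two sources of adaptivity the bound must handle: the behavior policy depends on the current iterate, and the regression targets move with it. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(s, a):
    """Toy 1-D MDP: drift left or right with noise; reward near the right edge."""
    s_next = float(np.clip(s + (0.1 if a == 1 else -0.1)
                           + 0.05 * rng.standard_normal(), 0.0, 1.0))
    return s_next, float(s_next > 0.8)

def featurize(s, a, n_bins=10):
    """One-hot (state-bin, action) features, so the regression is least squares."""
    phi = np.zeros(2 * n_bins)
    phi[a * n_bins + min(int(s * n_bins), n_bins - 1)] = 1.0
    return phi

def fqi(n_iters=50, n_samples=500, gamma=0.9, eps=0.1, n_bins=10):
    w = np.zeros(2 * n_bins)                      # Q_0 = 0
    q = lambda s, a, w=w: featurize(s, a, n_bins) @ w
    for _ in range(n_iters):
        # (1) policy-dependent data: epsilon-greedy in the *current* iterate Q_k
        S, A, R, S2, s = [], [], [], [], rng.random()
        for _ in range(n_samples):
            a = int(rng.integers(2)) if rng.random() < eps else int(q(s, 1) > q(s, 0))
            s2, r = step(s, a)
            S.append(s); A.append(a); R.append(r); S2.append(s2)
            s = s2
        # (2) moving targets: y_i = r_i + gamma * max_a' Q_k(s'_i, a')
        y = np.array([r + gamma * max(q(s2, 0), q(s2, 1)) for r, s2 in zip(R, S2)])
        # regression step: project the backup onto the (here linear) class
        Phi = np.array([featurize(s_, a_, n_bins) for s_, a_ in zip(S, A)])
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        q = lambda s, a, w=w: featurize(s, a, n_bins) @ w
    return w
```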

If this is right

  • Fitted Q-iteration now possesses finite-sample guarantees that apply directly to policy-dependent data collection on general spaces.
  • Sequential Rademacher complexity becomes a usable quantity for controlling estimation error inside the Bellman regression step of value iteration (its definition is recalled after this list).
  • A cumulative pathwise online regret bound is available for FQI in continuous spaces for the first time.
  • The framework supplies the necessary foundations for formal finite-sample analysis of many modern deep RL algorithms.
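
The second bullet leans on a specific quantity, so a reminder is worth having on the page. The standard definition, from Rakhlin, Sridharan, and Tewari (2015), replaces the fixed sample of classical Rademacher complexity with a binary tree of data points, so each point may depend on the earlier random signs, which is exactly the dependence structure adaptive data collection creates:

```latex
% Sequential Rademacher complexity of a class F (Rakhlin et al., 2015).
% The supremum runs over Z-valued binary trees z of depth n;
% z_t(sigma_{1:t-1}) is the point reached at level t by following the
% signs sigma_1, ..., sigma_{t-1}; the sigma_t are i.i.d. uniform on {-1,+1}.
\[
  \mathfrak{R}^{\mathrm{seq}}_{n}(\mathcal{F})
  \;=\;
  \sup_{\mathbf{z}} \;
  \mathbb{E}_{\sigma}\!\left[
    \sup_{f \in \mathcal{F}}
    \frac{1}{n} \sum_{t=1}^{n}
    \sigma_t \, f\big(\mathbf{z}_t(\sigma_{1:t-1})\big)
  \right].
\]
```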

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same chaining argument could be applied to other value-based methods that alternate regression and policy improvement steps.
  • The pathwise regret guarantee might be used to derive high-probability statements about the entire trajectory of an FQI run rather than only its final performance.
  • Practical checks could compare the empirical growth of sequential Rademacher complexity against observed generalization error on benchmark continuous control tasks; a toy version of such a check is sketched after this list.
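
A toy version of the third check, runnable before touching any control benchmark. The sequential complexity itself takes a supremum over data trees and is not directly computable, so this sketch substitutes the classical empirical Rademacher average on the realized sample as a tractable proxy (a constant-path tree shows it can only underestimate the sequential quantity), with a finite random linear class standing in for the paper's function class. Everything here is illustrative, nothing is from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(F_values, n_draws=200):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(x_i)
    for a finite class, given F_values[f_idx, i] = f(x_i). A path-wise
    proxy: the sequential version takes a supremum over data trees."""
    n = F_values.shape[1]
    sups = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        sups.append(np.max(F_values @ sigma) / n)
    return float(np.mean(sups))

# Toy check: does the complexity proxy track the observed train/test gap?
n, n_test, d, n_funcs = 100, 1000, 5, 500
X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
w_true = rng.standard_normal(d)
y, y_test = np.sign(X @ w_true), np.sign(X_test @ w_true)

# finite class: random linear predictors on the unit sphere
W = rng.standard_normal((n_funcs, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

losses = (np.sign(X @ W.T).T != y).mean(axis=1)      # train 0-1 loss per f
best = int(np.argmin(losses))
gap = (np.sign(X_test @ W[best]) != y_test).mean() - losses[best]
proxy = empirical_rademacher(np.sign(X @ W.T).T)     # f(x_i) in {-1, +1}

print(f"observed generalization gap: {gap:.3f}")
print(f"Rademacher proxy:            {proxy:.3f}")
```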

Load-bearing premise

The Markov decision process lives on general measurable Borel spaces, and adaptive, policy-dependent data collection preserves the conditions needed for sequential Rademacher complexity to control generalization error.
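
"Preserves the conditions" is, concretely, a filtration statement. A hedged sketch of the setup the premise requires, with notation introduced here rather than taken from the paper (D_k the batch collected at round k, Q_k the iterate fit on the first k batches):

```latex
% Adaptive data collection as a filtration (notation local to this sketch).
\[
  \mathcal{F}_k = \sigma(D_1, \dots, D_k), \qquad
  Q_k \ \text{is } \mathcal{F}_k\text{-measurable}, \qquad
  \pi_{k+1} = \mathrm{greedy}(Q_k) \ \text{is } \mathcal{F}_k\text{-measurable},
\]
\[
  D_{k+1} \mid \mathcal{F}_k \;\sim\; \mu_{\pi_{k+1}}, \qquad
  y = r + \gamma \sup_{a'} Q_k(s', a')
  \ \text{is } \mathcal{F}_k \vee \sigma(D_{k+1})\text{-measurable}.
\]
% The referee's first major comment below asks exactly that these
% measurability statements be proved rather than assumed on general Borel
% spaces, where the supremum over a' needs measurable-selection arguments.
```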

What would settle it

An explicit MDP on a Borel space together with an adaptive data-collection schedule in which the observed Bellman-regression generalization error exceeds the sequential Rademacher complexity term while all other hypotheses of the bound hold.

Original abstract

While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript develops a unified measure-theoretic framework for fitted Q-iteration (FQI) on general Borel spaces. It derives finite-sample performance bounds for adaptive, policy-dependent data collection by chaining measure-theoretic probability with Bellman-operator contractions in Banach spaces, proves that sequential Rademacher complexity controls Bellman-regression generalization error under such sampling, and extends the analysis to the first cumulative pathwise online regret guarantee for FQI in continuous spaces.

Significance. If the central derivations hold, the work would constitute a significant advance by bridging measure-theoretic MDP foundations, deterministic error propagation, and PAC-style generalization bounds, enabling rigorous finite-sample analysis of modern deep RL algorithms in continuous domains. The explicit handling of adaptive, policy-dependent data via sequential Rademacher complexity and the pathwise regret extension are clear strengths; the approach avoids reducing its bounds to quantities fitted from the target data.

major comments (2)
  1. [Main theorem on finite-sample performance bound] The main performance bound (the chaining result) relies on sequential Rademacher complexity yielding uniform deviation bounds for the Bellman regression even though the data-generating distribution at each iteration is induced by the policy from the previous iterate. The manuscript must explicitly establish that the function class and regression operator remain measurable with respect to the filtration generated by the adaptive process; without this, the uniform bound does not apply to the moving-target backups and the subsequent contraction chaining fails to produce a valid guarantee.
  2. [Extension to cumulative pathwise regret] The pathwise online regret extension builds directly on the per-iteration performance bounds. Any gap in the adaptive generalization control (measurability or uniform deviation under policy-dependent sampling) would propagate to the cumulative regret statement, so the regret claim cannot be established until the core generalization step is verified for general Borel spaces.
minor comments (1)
  1. [Abstract] The abstract could more explicitly list the standing assumptions on the MDP (e.g., Borel measurability, bounded rewards) and on the function class to clarify the scope for readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of the manuscript's significance and for the detailed comments on measurability in the adaptive setting. We address each point below and will revise the manuscript to incorporate explicit clarifications and supporting lemmas.

Point-by-point responses
  1. Referee: [Main theorem on finite-sample performance bound] The main performance bound (the chaining result) relies on sequential Rademacher complexity yielding uniform deviation bounds for the Bellman regression even though the data-generating distribution at each iteration is induced by the policy from the previous iterate. The manuscript must explicitly establish that the function class and regression operator remain measurable with respect to the filtration generated by the adaptive process; without this, the uniform bound does not apply to the moving-target backups and the subsequent contraction chaining fails to produce a valid guarantee.

    Authors: We agree that explicit verification of measurability with respect to the adaptive filtration is essential for rigor in general Borel spaces. The manuscript invokes sequential Rademacher complexity under policy-dependent sampling and assumes the requisite measurability to apply the uniform deviation bounds, but we acknowledge that this step is not stated as a standalone lemma. In the revision we will insert a new lemma (placed before the main chaining theorem) that establishes: (i) the function class remains measurable with respect to the filtration generated by the sequence of policy-dependent data-generating distributions, and (ii) the regression operator preserves this measurability. The proof of the lemma relies on standard results for Borel spaces and the fact that each policy iterate is a measurable function of the preceding data. With this addition the uniform bounds apply directly to the moving-target backups, and the subsequent contraction chaining remains valid. This change strengthens the presentation without altering the stated results. revision: yes

  2. Referee: [Extension to cumulative pathwise regret] The pathwise online regret extension builds directly on the per-iteration performance bounds. Any gap in the adaptive generalization control (measurability or uniform deviation under policy-dependent sampling) would propagate to the cumulative regret statement, so the regret claim cannot be established until the core generalization step is verified for general Borel spaces.

    Authors: We concur that the cumulative pathwise regret guarantee is a direct consequence of the per-iteration bounds. Once the measurability lemma described above is added, the regret extension follows immediately from the same chaining argument used for the finite-sample performance bound. In the revised manuscript we will add a short clarifying paragraph in the regret section that explicitly references the new lemma and notes that the pathwise nature of the bound is preserved under the adaptive filtration. No further technical changes are required. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chains standard tools without reduction to inputs

Full rationale

The paper derives finite-sample bounds for FQI by chaining measure-theoretic probability, Banach-space contractions of the Bellman operator, and sequential Rademacher complexity applied to policy-dependent data on Borel spaces. No equation or theorem reduces a claimed prediction or performance bound to a quantity fitted from the target data by construction, nor does any load-bearing step invoke a self-citation chain, a uniqueness theorem from the authors' prior work, or a smuggled ansatz. The central claim that sequential Rademacher complexity controls Bellman-regression generalization is presented as a proved result under the stated measurability assumptions, not as a renaming or self-definition. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard axioms from measure theory and functional analysis, with no free parameters and no invented entities.

axioms (2)
  • domain assumption MDPs are defined on general measurable Borel spaces
    Required to set up the measure-theoretic framework for arbitrary state-action spaces.
  • standard math Bellman operator is a contraction in suitable Banach spaces
    Standard property, combined with the probability bounds to yield the performance guarantee; a one-line proof sketch follows.
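
The second axiom is the one doing the quantitative work, and in the sup norm the standard argument is one line (modulo the measurable-selection care the paper's Borel-space setting demands):

```latex
% gamma-contraction of the Bellman optimality operator T in sup norm, where
% (TQ)(s,a) = r(s,a) + gamma * E_{s' ~ P(.|s,a)} [ sup_{a'} Q(s',a') ].
\[
  \|TQ_1 - TQ_2\|_{\infty}
  = \gamma \sup_{s,a}
    \Big|\int \big(\sup_{a'} Q_1(s',a') - \sup_{a'} Q_2(s',a')\big)\,
    P(ds' \mid s,a)\Big|
  \;\le\; \gamma\, \|Q_1 - Q_2\|_{\infty}.
\]
% Banach's fixed-point theorem then yields a unique fixed point Q* and the
% geometric gamma^K decay of the initialization error in the performance bound.
```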

pith-pipeline@v0.9.0 · 5523 in / 1276 out tokens · 40656 ms · 2026-05-09T15:40:34.553762+00:00 · methodology

