A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration
Pith reviewed 2026-05-09 15:40 UTC · model grok-4.3
The pith
Fitted Q-iteration admits finite-sample performance bounds under adaptive data on general Borel spaces by chaining measure-theoretic probability with Bellman-operator contraction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces.
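Fitted Q-iteration alternates sampled Bellman backups with a regression fit. As a minimal illustration of that loop (not the paper's algorithm), here is tabular FQI on a random finite MDP, a toy stand-in for the general Borel-space setting; all names and constants below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, GAMMA, N_ITERS = 20, 3, 0.9, 60

# Random finite MDP standing in for a general Borel-space MDP (illustration only).
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(s' | s, a)
R = rng.uniform(0.0, 1.0, size=(S, A))       # bounded rewards, as the theory assumes

def fitted_q_iteration(n_samples_per_iter=500):
    """One illustrative FQI run: each iterate regresses on sampled Bellman targets."""
    Q = np.zeros((S, A))
    for _ in range(N_ITERS):
        # Sampling here is uniform for simplicity; in the paper's setting the
        # sampling distribution may depend on earlier iterates (adaptive data).
        s = rng.integers(0, S, size=n_samples_per_iter)
        a = rng.integers(0, A, size=n_samples_per_iter)
        s_next = np.array([rng.choice(S, p=P[si, ai]) for si, ai in zip(s, a)])
        y = R[s, a] + GAMMA * Q[s_next].max(axis=1)   # sampled Bellman targets
        # "Regression" step: per (s, a) cell, least squares reduces to the mean.
        pairs = {(int(si), int(ai)) for si, ai in zip(s, a)}
        Q_new = Q.copy()
        for si, ai in pairs:
            mask = (s == si) & (a == ai)
            Q_new[si, ai] = y[mask].mean()
        Q = Q_new
    return Q

Q_hat = fitted_q_iteration()
```

With rewards in [0, 1] and discount 0.9, every iterate stays within the range [0, 1/(1 − γ)] = [0, 10], which is the boundedness the finite-sample analysis relies on.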
What carries the argument
Chaining of measure-theoretic probability with Bellman-operator contraction in Banach spaces, with sequential Rademacher complexity bounding generalization under policy-dependent sampling.
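Concretely, chaining a contraction with per-iteration regression errors typically produces the classical error-propagation bound. A sketch in sup norm (the paper works with more general Banach-space norms): if each iterate satisfies $Q_{k+1} = \mathcal{T} Q_k + \varepsilon_k$, where $\mathcal{T}$ is a $\gamma$-contraction with fixed point $Q^*$ and $\varepsilon_k$ is the regression error, then

```latex
\[
\|Q_{k+1} - Q^{*}\| = \|\mathcal{T}Q_{k} - \mathcal{T}Q^{*} + \varepsilon_{k}\|
\le \gamma \,\|Q_{k} - Q^{*}\| + \|\varepsilon_{k}\|,
\]
% and unrolling the recursion over K iterations gives
\[
\|Q_{K} - Q^{*}\| \le \gamma^{K}\,\|Q_{0} - Q^{*}\|
+ \sum_{k=0}^{K-1} \gamma^{K-1-k}\,\|\varepsilon_{k}\|
\le \gamma^{K}\,\|Q_{0} - Q^{*}\| + \frac{\max_{k}\|\varepsilon_{k}\|}{1-\gamma}.
\]
```

The finite-sample content then comes from bounding each $\|\varepsilon_k\|$ via the sequential Rademacher complexity of the regression class.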
If this is right
- Fitted Q-iteration now possesses finite-sample guarantees that apply directly to policy-dependent data collection on general spaces.
- Sequential Rademacher complexity becomes a usable quantity for controlling estimation error inside the Bellman regression step of value iteration.
- A cumulative pathwise online regret bound is available for FQI in continuous spaces for the first time.
- The framework supplies the necessary foundations for formal finite-sample analysis of many modern deep RL algorithms.
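The second bullet treats sequential Rademacher complexity as a usable quantity. The sequential version is defined over binary trees of data (Rakhlin, Sridharan, and Tewari) and is hard to estimate directly; as a hedged proxy, the classical empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo. All names below are illustrative:

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the (classical) empirical Rademacher complexity
    E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ] of a finite function class,
    given preds[j, i] = f_j(x_i) on a fixed sample x_1..x_n.  This is an i.i.d.
    proxy: the sequential complexity in the paper dominates this quantity."""
    n = preds.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    sup_corr = (sigma @ preds.T / n).max(axis=1)         # best correlation per draw
    return float(sup_corr.mean())

# Sanity check: enlarging the class can only increase the estimate,
# since both calls reuse the same sign draws.
rng = np.random.default_rng(1)
x = rng.normal(size=64)
small_class = np.stack([np.sign(x), -np.sign(x)])
thresholds = np.stack([s * np.sign(x - t)
                       for t in np.linspace(-2.0, 2.0, 41) for s in (1.0, -1.0)])
big_class = np.concatenate([small_class, thresholds])
r_small = empirical_rademacher(small_class)
r_big = empirical_rademacher(big_class)
```

A practical check along the lines the review suggests would plot such an estimate against the observed generalization gap of the Bellman regression as the data grows.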
Where Pith is reading between the lines
- The same chaining argument could be applied to other value-based methods that alternate regression and policy improvement steps.
- The pathwise regret guarantee might be used to derive high-probability statements about the entire trajectory of an FQI run rather than only its final performance.
- Practical checks could compare the empirical growth of sequential Rademacher complexity against observed generalization error on benchmark continuous control tasks.
Load-bearing premise
The Markov decision process lives on general measurable Borel spaces, and adaptive, policy-dependent data collection preserves the conditions needed for sequential Rademacher complexity to control the generalization error.
What would settle it
An explicit MDP on a Borel space together with an adaptive data-collection schedule in which the observed Bellman-regression generalization error exceeds the sequential Rademacher complexity term while all other hypotheses of the bound hold.
Original abstract
While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a unified measure-theoretic framework for fitted Q-iteration (FQI) on general Borel spaces. It derives finite-sample performance bounds for adaptive, policy-dependent data collection by chaining measure-theoretic probability with Bellman-operator contractions in Banach spaces, proves that sequential Rademacher complexity controls Bellman-regression generalization error under such sampling, and extends the analysis to the first cumulative pathwise online regret guarantee for FQI in continuous spaces.
Significance. If the central derivations hold, the work would constitute a significant advance by bridging measure-theoretic MDP foundations, deterministic error propagation, and PAC-style generalization bounds, enabling rigorous finite-sample analysis of modern deep RL algorithms in continuous domains. The explicit handling of adaptive, policy-dependent data via sequential Rademacher complexity and the pathwise regret extension are clear strengths; the approach avoids reducing bounds to quantities fitted from the target data.
major comments (2)
- [Main theorem on finite-sample performance bound] The main performance bound (the chaining result) relies on sequential Rademacher complexity yielding uniform deviation bounds for the Bellman regression even though the data-generating distribution at each iteration is induced by the policy from the previous iterate. The manuscript must explicitly establish that the function class and regression operator remain measurable with respect to the filtration generated by the adaptive process; without this, the uniform bound does not apply to the moving-target backups and the subsequent contraction chaining fails to produce a valid guarantee.
- [Extension to cumulative pathwise regret] The pathwise online regret extension builds directly on the per-iteration performance bounds. Any gap in the adaptive generalization control (measurability or uniform deviation under policy-dependent sampling) would propagate to the cumulative regret statement, so the regret claim cannot be established until the core generalization step is verified for general Borel spaces.
minor comments (1)
- [Abstract] The abstract could more explicitly list the standing assumptions on the MDP (e.g., Borel measurability, bounded rewards) and on the function class to clarify the scope for readers.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of the manuscript's significance and for the detailed comments on measurability in the adaptive setting. We address each point below and will revise the manuscript to incorporate explicit clarifications and supporting lemmas.
Point-by-point responses
-
Referee: [Main theorem on finite-sample performance bound] The main performance bound (the chaining result) relies on sequential Rademacher complexity yielding uniform deviation bounds for the Bellman regression even though the data-generating distribution at each iteration is induced by the policy from the previous iterate. The manuscript must explicitly establish that the function class and regression operator remain measurable with respect to the filtration generated by the adaptive process; without this, the uniform bound does not apply to the moving-target backups and the subsequent contraction chaining fails to produce a valid guarantee.
Authors: We agree that explicit verification of measurability with respect to the adaptive filtration is essential for rigor in general Borel spaces. The manuscript invokes sequential Rademacher complexity under policy-dependent sampling and assumes the requisite measurability to apply the uniform deviation bounds, but we acknowledge that this step is not stated as a standalone lemma. In the revision we will insert a new lemma (placed before the main chaining theorem) that establishes: (i) the function class remains measurable with respect to the filtration generated by the sequence of policy-dependent data-generating distributions, and (ii) the regression operator preserves this measurability. The proof of the lemma relies on standard results for Borel spaces and the fact that each policy iterate is a measurable function of the preceding data. With this addition the uniform bounds apply directly to the moving-target backups, and the subsequent contraction chaining remains valid. This change strengthens the presentation without altering the stated results. revision: yes
-
Referee: [Extension to cumulative pathwise regret] The pathwise online regret extension builds directly on the per-iteration performance bounds. Any gap in the adaptive generalization control (measurability or uniform deviation under policy-dependent sampling) would propagate to the cumulative regret statement, so the regret claim cannot be established until the core generalization step is verified for general Borel spaces.
Authors: We concur that the cumulative pathwise regret guarantee is a direct consequence of the per-iteration bounds. Once the measurability lemma described above is added, the regret extension follows immediately from the same chaining argument used for the finite-sample performance bound. In the revised manuscript we will add a short clarifying paragraph in the regret section that explicitly references the new lemma and notes that the pathwise nature of the bound is preserved under the adaptive filtration. No further technical changes are required. revision: yes
Circularity Check
No circularity: derivation chains standard tools without reduction to inputs
Full rationale
The paper derives finite-sample bounds for FQI by chaining measure-theoretic probability, Banach-space contractions of the Bellman operator, and sequential Rademacher complexity applied to policy-dependent data on Borel spaces. No equation or theorem reduces a claimed prediction or performance bound to a quantity fitted from the target data by construction, nor does any load-bearing step invoke a self-citation chain, uniqueness theorem from the authors' prior work, or smuggled ansatz. The central claim that sequential Rademacher complexity controls Bellman-regression generalization is presented as a proved result under the stated measurability assumptions, not as a renaming or self-definition. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: MDPs are defined on general measurable Borel spaces
- standard math: the Bellman operator is a contraction in suitable Banach spaces
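The second axiom is standard and easy to check numerically: on a random finite MDP (a toy stand-in for the Banach-space setting), the Bellman optimality operator contracts in the sup norm with modulus at most γ. A minimal sketch, with all constants hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
S, A, GAMMA = 15, 4, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(s' | s, a)
R = rng.uniform(size=(S, A))                  # bounded rewards

def bellman(Q):
    """Bellman optimality operator: (T Q)(s, a) = R(s, a) + gamma * E_{s'}[max_a' Q(s', a')]."""
    return R + GAMMA * P @ Q.max(axis=1)

# Contraction check: ||T Q1 - T Q2||_inf <= gamma * ||Q1 - Q2||_inf
# for every random pair tried.
worst = 0.0
for _ in range(100):
    Q1, Q2 = rng.normal(size=(2, S, A))
    num = np.abs(bellman(Q1) - bellman(Q2)).max()
    den = np.abs(Q1 - Q2).max()
    worst = max(worst, num / den)
```

The worst observed ratio never exceeds γ, matching the textbook argument: the max over actions is a sup-norm nonexpansion and averaging over the kernel cannot increase the sup norm.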
Reference graph
Works this paper leans on
-
[1]
Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds
Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds. Advances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[2]
Near-optimal regret bounds for reinforcement learning
Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2008
2008
-
[3]
Minimax regret bounds for reinforcement learning
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning (ICML), 2017
2017
-
[4]
Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning
Anton Bakhtin, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, and Noam Brown. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. In International Conference on Learning Representations (ICLR), 2023
2023
-
[5]
Spectrally-normalized margin bounds for neural networks
Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[6]
A Course in Reinforcement Learning
Dimitri P. Bertsekas. A Course in Reinforcement Learning. Athena Scientific, Belmont, Massachusetts, 2nd edition, 2025
2025
-
[7]
Stochastic Optimal Control: The Discrete-Time Case
Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Academic Press, 1978
1978
-
[8]
Neuro-Dynamic Programming
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996
1996
-
[9]
Towards deeper deep reinforcement learning with spectral normalization
Johan Bjorck, Carla P. Gomes, and Kilian Q. Weinberger. Towards deeper deep reinforcement learning with spectral normalization. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[10]
Discounted dynamic programming
David Blackwell. Discounted dynamic programming. The Annals of Mathematical Statistics, 1965
1965
-
[11]
R-max - a general polynomial time algorithm for near-optimal reinforcement learning
Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002
2002
-
[12]
Randomized ensembled double Q-Learning: Learning fast without a model
Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double Q-Learning: Learning fast without a model. In International Conference on Learning Representations (ICLR), 2021
2021
-
[13]
Magnetic control of tokamak plasmas through deep reinforcement learning
Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 2022
2022
-
[14]
Kernel-based reinforcement learning: A finite-time analysis
Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, and Michal Valko. Kernel-based reinforcement learning: A finite-time analysis. In International Conference on Machine Learning (ICML), 2021
2021
-
[15]
Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature
Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[16]
Bilinear classes: A structural framework for provable generalization in RL
Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning (ICML), 2021
2021
-
[17]
Risk bounds and Rademacher complexity in batch reinforcement learning
Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and Rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning (ICML), 2021
2021
-
[18]
Reinforcement learning with Gaussian processes
Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In International Conference on Machine Learning (ICML), 2005
2005
-
[19]
Human-level play in the game of diplomacy by combining language models with strategic reasoning
FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 2022
2022
-
[20]
Error propagation for approximate policy and value iteration
Amir-Massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NeurIPS), 2010
2010
-
[21]
Average cost Markov decision processes with weakly continuous transition probabilities
Eugene A Feinberg, Pavlo O Kasyanov, and Nina V Zadoianchuk. Average cost Markov decision processes with weakly continuous transition probabilities. Mathematics of Operations Research, 2012
2012
-
[22]
The statistical complexity of interactive decision making
Dylan J. Foster, Sham M. Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021
2021
-
[23]
Offline reinforcement learning: Fundamental barriers for value function approximation
Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory (COLT), 2022
2022
-
[24]
Addressing function approximation error in actor-critic methods
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML). PMLR, 2018
2018
-
[25]
A theory of regularized Markov decision processes
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning (ICML), 2019
2019
-
[26]
Spectral normalisation for deep reinforcement learning: An optimisation perspective
Florin Gogianu, Tudor Berariu, Mihaela Rosca, Claudia Clopath, Lucian Busoniu, and Razvan Pascanu. Spectral normalisation for deep reinforcement learning: An optimisation perspective. In International Conference on Machine Learning (ICML), 2021
2021
-
[27]
Size-independent sample complexity of neural networks
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference on Learning Theory (COLT), 2018
2018
-
[28]
Stable function approximation in dynamic programming
Geoffrey J Gordon. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995. Elsevier, 1995
1995
-
[29]
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018
2018
-
[30]
Discrete-Time Markov Control Processes: Basic Optimality Criteria
Onésimo Hernández-Lerma and Jean B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996
1996
-
[31]
Further Topics on Discrete-Time Markov Control Processes
Onésimo Hernández-Lerma and Jean B. Lasserre. Further Topics on Discrete-Time Markov Control Processes. Springer, 1999
1999
-
[32]
Contextual decision processes with low Bellman rank are PAC-learnable
Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning (ICML), 2017
2017
-
[33]
Provably efficient reinforcement learning with linear function approximation
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory (COLT), 2020
2020
-
[34]
Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms
Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[35]
Approximately optimal approximate reinforcement learning
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002
2002
-
[36]
Near-optimal reinforcement learning in polynomial time
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002
2002
-
[37]
Spectral normalization for generative adversarial networks
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018
2018
-
[38]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015
2015
-
[39]
Error bounds for approximate policy iteration
Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning (ICML), 2003
2003
-
[40]
Performance bounds in L^p-norm for approximate value iteration
Rémi Munos. Performance bounds in L^p-norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007
2007
-
[41]
Finite-time bounds for fitted value iteration
Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008
2008
-
[42]
Kernel-based reinforcement learning
Dirk Ormoneit and Śaunak Sen. Kernel-based reinforcement learning. Machine Learning, 49(2–3):161–178, 2002
2002
-
[43]
(more) efficient reinforcement learning via posterior sampling
Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems (NeurIPS), 2013
2013
-
[44]
Online learning via sequential complexities
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 2015
2015
-
[45]
Bridging offline reinforcement learning and imitation learning: A tale of pessimism
Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[46]
Eluder dimension and the sample complexity of optimistic exploration
Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems (NeurIPS), 2013
2013
-
[47]
Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal
Manfred Schäl. Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 1975
1975
-
[48]
Approximate modified policy iteration and its application to the game of Tetris
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 2015
2015
-
[49]
Data-efficient reinforcement learning with self-predictive representations
Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations (ICLR), 2021
2021
-
[50]
Mastering the game of Go with deep neural networks and tree search
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016
2016
-
[51]
CURL: Contrastive unsupervised representations for reinforcement learning
Aravind Srinivas, Michael Laskin, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning (ICML), 2020
2020
-
[52]
Decoupling representation learning from reinforcement learning
Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning (ICML), 2021
2021
-
[53]
Negative dynamic programming
Ralph E Strauch. Negative dynamic programming. The Annals of Mathematical Statistics, 1966
1966
-
[54]
Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees
Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, and Pierre Ménard. Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees. Advances in Neural Information Processing Systems (NeurIPS), 2022a
2022
-
[55]
From Dirichlet to Rubin: Optimistic exploration in RL without bonuses
Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, and Pierre Ménard. From Dirichlet to Rubin: Optimistic exploration in RL without bonuses. In International Conference on Machine Learning (ICML), 2022b
2022
-
[56]
Analysis of temporal-difference learning with function approximation
John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems (NeurIPS), 1996
1996
-
[57]
Kernelized reinforcement learning with order optimal regret bounds
Sattar Vakili and Julia Olkhovskaya. Kernelized reinforcement learning with order optimal regret bounds. In Advances in Neural Information Processing Systems (NeurIPS), 2023
2023
-
[58]
Grandmaster level in StarCraft II using multi-agent reinforcement learning
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019
2019
-
[59]
Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension
Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. In Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[60]
Provably efficient reinforcement learning with kernel and neural function approximations
Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, and Michael I. Jordan. Provably efficient reinforcement learning with kernel and neural function approximations. In Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[61]
Offline reinforcement learning with realizability and single-policy concentrability
Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory (COLT), 2022
2022