A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning

Abolfazl Hashemi; Ege C. Kaya

arxiv: 2605.06866 · v1 · submitted 2026-05-07 · 💻 cs.LG · math.OC

A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning

Ege C. Kaya , Abolfazl Hashemi This is my paper

Pith reviewed 2026-05-11 01:56 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords distributional reinforcement learningcategorical temporal differencefinite iteration boundsasynchronous stochastic approximationisometric embeddingCramér geometrymaximum mean discrepancypolicy evaluation

0 comments

The pith

Isometric embeddings convert asynchronous categorical TD updates into single-state contracting recursions, delivering finite-iteration guarantees for both discounted and fixed-horizon settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that scalar categorical temporal-difference learning in the Cramér geometry and multivariate signed-categorical temporal-difference learning in the maximum mean discrepancy geometry can be rewritten, after isometric embeddings, as asynchronous stochastic-approximation recursions. These recursions contract in the statewise supremum norm, which directly supplies non-asymptotic error bounds even when only one state is updated per step. The analysis covers i.i.d. sampling and Markovian trajectories for discounted problems as well as i.i.d. episodic sampling for undiscounted fixed-horizon problems. This closes the previous gap between existing synchronous theory and the online, single-state updates that are used in practice.

Core claim

After suitable isometric embeddings, both the scalar categorical temporal-difference learning algorithm in the Cramér geometry and the multivariate signed-categorical temporal-difference learning algorithm in the maximum mean discrepancy geometry take the form of asynchronous single-state stochastic-approximation recursions that contract in a statewise supremum norm. This permits finite-iteration guarantees in discounted problems under both i.i.d. and Markovian state sampling, and in undiscounted fixed-horizon problems under i.i.d. episodic sampling.

What carries the argument

Isometric embeddings of the categorical and signed-categorical updates that preserve contraction in the statewise supremum norm under the Cramér and maximum mean discrepancy geometries.

If this is right

Finite-iteration error bounds apply directly to discounted problems when states are sampled i.i.d.
The same contraction yields finite-iteration bounds when states arrive from a Markovian trajectory in discounted problems.
For undiscounted fixed-horizon tasks, finite-iteration guarantees hold under i.i.d. episodic sampling.
The analysis applies uniformly to both scalar Cramér updates and multivariate signed-categorical updates in the MMD geometry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar embedding techniques could be tested on other distributional Bellman operators to obtain comparable contraction results.
The statewise supremum norm contraction supplies a concrete way to allocate iteration budgets in online distributional RL implementations.
If the embedding approach generalizes, it may reduce the need for synchronous full-state updates in theoretical analyses of distributional methods.

Load-bearing premise

The chosen isometric embeddings must preserve the contraction property of the updates when measured in the statewise supremum norm.

What would settle it

An explicit MDP and sampling schedule where the embedded update fails to contract in the statewise supremum norm, or a numerical check showing that observed error after a predicted number of iterations exceeds the derived bound by more than the allowed slack.

read the original abstract

Recent non-asymptotic analyses have substantially advanced the theory of distributional policy evaluation, but they largely concern synchronous full-state updates under a generative model, model-based estimators, accelerated variants, or different approximation architectures. Standard categorical temporal-difference learning is typically used in a different regime. It asynchronously performs a single-state update at each iteration and, in online settings, is driven by a Markovian trajectory. This leaves an important gap between existing finite-iteration theory and the categorical recursions most closely aligned with practical distributional temporal-difference implementations. We bridge this gap for two categorical policy-evaluation methods: scalar categorical temporal-difference learning in the Cram\'er geometry and multivariate signed-categorical temporal-difference learning in the maximum mean discrepancy geometry. After suitable isometric embeddings, both algorithms take the form of asynchronous single-state stochastic-approximation recursions that contract in a statewise supremum norm. This permits finite-iteration guarantees in discounted problems under both i.i.d. and Markovian state sampling, and in undiscounted fixed-horizon problems under i.i.d. episodic sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper closes the gap by giving finite-iteration bounds for the exact asynchronous single-state categorical TD updates that get used in practice.

read the letter

The core contribution is a reduction that turns the standard async categorical TD and signed-categorical TD recursions into single-state stochastic-approximation updates that contract in the statewise sup norm. They do this for the Cramér geometry and the MMD geometry by applying suitable isometric embeddings, then invoke existing finite-time SA results. This covers discounted problems with i.i.d. or Markovian sampling and undiscounted fixed-horizon problems with episodic sampling. That directly addresses the mismatch the abstract flags between prior synchronous or generative-model analyses and the regime most implementations actually run in. The argument structure is standard and clean once the embeddings are granted, and the paper states the gap explicitly rather than burying it. Credit for that identification and for matching the practical sampling models. The main soft spot is the embedding step itself. The claim that the embedded operators remain contractions in the required norm has to be verified in detail; if the isometry preserves the contraction constant without extra factors or hidden assumptions on the support or the projection, the bounds follow, but that verification is load-bearing and not visible from the abstract. Minor additional point: the resulting sample-complexity expressions will likely be loose, as is typical for these SA reductions, but that does not invalidate the existence result. This is for readers who care about non-asymptotic theory in distributional RL and want guarantees that line up with deployed single-state TD rather than idealized synchronous versions. A specialist in RL theory will get a concrete, usable reduction. It is worth sending to peer review because the gap is real, the reduction is a natural move, and the technical claims are checkable once the full derivations are in front of a referee.

Referee Report

2 major / 2 minor

Summary. The paper develops finite-iteration theory for two asynchronous categorical distributional TD methods: scalar categorical TD learning under the Cramér geometry and multivariate signed-categorical TD learning under the maximum mean discrepancy geometry. After suitable isometric embeddings, both are recast as single-state asynchronous stochastic-approximation recursions that contract in the statewise supremum norm. This reduction yields explicit finite-iteration bounds for discounted problems under i.i.d. and Markovian state sampling as well as for undiscounted fixed-horizon problems under i.i.d. episodic sampling.

Significance. If the isometric embeddings are shown to preserve the required contraction properties without introducing extraneous factors, the result is significant: it closes the gap between existing non-asymptotic analyses (largely synchronous or generative-model based) and the asynchronous online regime used in practical distributional RL. The reduction to standard SA theory is a clean technical contribution that directly imports finite-time guarantees from the SA literature to the distributional setting.

major comments (2)

[§3.2] §3.2, the definition of the isometric embedding for the signed-categorical case: the claim that the embedding is isometric with respect to the statewise supremum norm must be accompanied by an explicit verification that the MMD geometry induces a norm equivalent to the one used for the contraction mapping; without this, the reduction to a contracting recursion in the sup norm is not yet load-bearing.
[§4.1] §4.1, Theorem 1 (finite-iteration bound for discounted Markovian sampling): the contraction factor after embedding appears to depend on the discount factor and the embedding dimension; the proof sketch must confirm that this factor remains strictly less than one uniformly over the state space, otherwise the finite-iteration guarantee does not follow from the cited SA results.

minor comments (2)

[§1] The abstract and §1 refer to 'statewise supremum norm' without an early equation reference; adding a pointer to the precise norm definition in §2 would improve readability.
[Table 1] Table 1 (comparison with prior non-asymptotic results) lists several works but omits the precise sampling assumptions used in each; clarifying the 'asynchronous online' column would help situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [§3.2] §3.2, the definition of the isometric embedding for the signed-categorical case: the claim that the embedding is isometric with respect to the statewise supremum norm must be accompanied by an explicit verification that the MMD geometry induces a norm equivalent to the one used for the contraction mapping; without this, the reduction to a contracting recursion in the sup norm is not yet load-bearing.

Authors: We thank the referee for this observation. The embedding in §3.2 is constructed from the reproducing kernel of the MMD so that the Euclidean distance in the feature space equals the MMD distance by definition. To make the reduction load-bearing, we will add a short lemma (new Lemma 3.3) immediately after the embedding definition that explicitly verifies the isometry with respect to the statewise supremum norm and confirms that the induced norm on the embedded space is identical to the original MMD geometry (equivalence constants equal to 1). This addition will be placed before the contraction argument and will not alter any subsequent results. revision: yes
Referee: [§4.1] §4.1, Theorem 1 (finite-iteration bound for discounted Markovian sampling): the contraction factor after embedding appears to depend on the discount factor and the embedding dimension; the proof sketch must confirm that this factor remains strictly less than one uniformly over the state space, otherwise the finite-iteration guarantee does not follow from the cited SA results.

Authors: We agree that the contraction factor requires explicit confirmation. Because the embedding is isometric, the Bellman operator in the embedded space contracts with modulus exactly equal to the discount factor γ < 1; this modulus is independent of embedding dimension and is uniform over the state space. The proof sketch in §4.1 already invokes standard asynchronous SA bounds that apply under a uniform contraction strictly less than one. We will expand the proof paragraph to state this derivation explicitly, including the observation that the isometry preserves the contraction modulus without introducing dimension-dependent factors, thereby directly justifying the application of the cited SA results. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation proceeds by defining isometric embeddings that map the categorical and signed-categorical TD updates onto asynchronous single-state stochastic-approximation recursions, then verifying that these embedded operators contract in the statewise supremum norm under the chosen geometries (Cramér and MMD). Finite-iteration bounds are obtained by invoking standard SA results for the resulting recursions under the listed sampling regimes. This chain rests on explicit verification of contraction preservation and external SA theory rather than any self-definitional reduction, fitted-parameter renaming, or load-bearing self-citation; the central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of isometric embeddings that convert categorical TD updates into statewise-supremum-norm contractions; this is a domain assumption whose validity is asserted but not inspectable from the abstract alone.

axioms (2)

domain assumption Suitable isometric embeddings exist that map the categorical and signed-categorical TD updates into recursions contracting in the statewise supremum norm.
Invoked in the abstract as the step that permits standard stochastic-approximation finite-iteration arguments.
standard math The sampling processes (i.i.d., Markovian, episodic) satisfy the conditions required by the underlying stochastic-approximation theorems.
Standard assumption for applying finite-time bounds to asynchronous recursions.

pith-pipeline@v0.9.0 · 5483 in / 1393 out tokens · 34689 ms · 2026-05-11T01:56:49.457955+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages

[1]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InInternational conference on machine learning, pages 263–272. PMLR, 2017

work page 2017
[2]

Bauschke and Patrick L

Heinz H. Bauschke and Patrick L. Combettes.Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 2nd edition, 2017. ISBN 3319483102

work page 2017
[3]

SIAM, 2017

Amir Beck.First-order methods in optimization. SIAM, 2017

work page 2017
[4]

A distributional perspective on rein- forcement learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on rein- forcement learning. InInternational conference on machine learning, pages 449–458. Pmlr, 2017

work page 2017
[5]

MIT Press, 2023

Marc G Bellemare, Will Dabney, and Mark Rowland.Distributional reinforcement learning. MIT Press, 2023

work page 2023
[6]

Neuro-dynamic programming, 1996

Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, 1996

work page 1996
[7]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–

work page
[8]

Springer

Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer

work page
[9]

Asynchronous stochastic approximations.SIAM Journal on Control and Optimization, 36(3):840–851, 1998

Vivek S Borkar. Asynchronous stochastic approximations.SIAM Journal on Control and Optimization, 36(3):840–851, 1998

work page 1998
[10]

Speedy categorical distributional reinforcement learning and complexity analysis.SIAM Journal on Mathematics of Data Science, 4(2):675–693, 2022

Markus Böck and Clemens Heitzinger. Speedy categorical distributional reinforcement learning and complexity analysis.SIAM Journal on Mathematics of Data Science, 4(2):675–693, 2022

work page 2022
[11]

Finite- sample analysis of contractive stochastic approximation using smooth convex envelopes.Ad- vances in Neural Information Processing Systems, 33:8223–8234, 2020

Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite- sample analysis of contractive stochastic approximation using smooth convex envelopes.Ad- vances in Neural Information Processing Systems, 33:8223–8234, 2020

work page 2020
[12]

Finite- sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite- sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

work page 2022
[13]

A lya- punov theory for finite-sample guarantees of markovian stochastic approximation.Operations Research, 72(4):1352–1367, 2024

Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A lya- punov theory for finite-sample guarantees of markovian stochastic approximation.Operations Research, 72(4):1352–1367, 2024

work page 2024
[14]

Implicit quantile networks for distributional reinforcement learning

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. InInternational conference on machine learning, pages 1096–1105. PMLR, 2018

work page 2018
[15]

Distributional reinforce- ment learning with quantile regression

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforce- ment learning with quantile regression. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018
[16]

Finite sample analyses for td (0) with function approximation

Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for td (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[17]

Td (λ) converges with probability 1.Machine Learning, 14(3):295–301, 1994

Peter Dayan and Terrence J Sejnowski. Td (λ) converges with probability 1.Machine Learning, 14(3):295–301, 1994

work page 1994
[18]

Fixed-horizon temporal difference methods for stable reinforcement learning

Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, and Daniel Graves. Fixed-horizon temporal difference methods for stable reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3741–3748, 2020. 11

work page 2020
[19]

A kernel two-sample test.The journal of machine learning research, 13(1):723–773, 2012

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.The journal of machine learning research, 13(1):723–773, 2012

work page 2012
[20]

Incremental learning of evaluation functions for absorbing markov chains: New methods and theorems.preprint, 1994

Leonid Gurvits, Long-Ji Lin, and Stephen José Hanson. Incremental learning of evaluation functions for absorbing markov chains: New methods and theorems.preprint, 1994

work page 1994
[21]

Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

work page 1972
[22]

Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

work page 1993
[23]

Markov decision processes with a new optimality criterion: Discrete time

Stratton C Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973

work page 1973
[24]

A utility criterion for markov decision processes.Management Science, 23 (1):43–49, 1976

Stratton C Jaquette. A utility criterion for markov decision processes.Management Science, 23 (1):43–49, 1976

work page 1976
[25]

Categorical distributional reinforcement learning with kullback-leibler divergence: Convergence and asymptotics

Tyler Kastner, Mark Rowland, Yunhao Tang, Murat A Erdogdu, and Amir-massoud Farahmand. Categorical distributional reinforcement learning with kullback-leibler divergence: Convergence and asymptotics. InForty-second International Conference on Machine Learning, 2025

work page 2025
[26]

Springer, 2003

Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003

work page 2003
[27]

American Mathematical Soc., 2017

David A Levin and Yuval Peres.Markov chains and mixing times, volume 107. American Mathematical Soc., 2017

work page 2017
[28]

A generalized reinforcement-learning model: Con- vergence and applications

Michael L Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Con- vergence and applications. InICML, volume 96, pages 310–318, 1996

work page 1996
[29]

L. Ljung. Analysis of recursive stochastic algorithms.IEEE Transactions on Automatic Control, 22(4):551–575, 1977. doi: 10.1109/TAC.1977.1101561

work page doi:10.1109/tac.1977.1101561 1977
[30]

A comparative analysis of expected and distributional reinforcement learning

Clare Lyle, Marc G Bellemare, and Pablo Samuel Castro. A comparative analysis of expected and distributional reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4504–4511, 2019

work page 2019
[31]

Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

work page 1965
[32]

Nonparametric return distribution approximation for reinforcement learning

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 799–806, 2010

work page 2010
[33]

Foundations and Trends® in Machine Learning , author =

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond.Found. Trends Mach. Learn., 10(1–2): 1–141, June 2017. ISSN 1935-8237. doi: 10.1561/2200000060. URL https://doi.org/10. 1561/2200000060

work page doi:10.1561/2200000060 2017
[34]

Finite time analysis of tem- poral difference learning with linear function approximation: Tail averaging and regularisation

Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of tem- poral difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023

work page 2023
[35]

Statistical efficiency of distributional temporal difference learning.Advances in Neural Information Processing Systems, 37:24724–24761, 2024

Yang Peng, Liangyu Zhang, and Zhihua Zhang. Statistical efficiency of distributional temporal difference learning.Advances in Neural Information Processing Systems, 37:24724–24761, 2024

work page 2024
[36]

A finite sample analysis of distributional td learning with linear function approximation.arXiv preprint arXiv:2502.14172, 2025

Yang Peng, Kaicheng Jin, Liangyu Zhang, and Zhihua Zhang. A finite sample analysis of distributional td learning with linear function approximation.arXiv preprint arXiv:2502.14172, 2025. 12

work page arXiv 2025
[37]

Finite-time analysis of asynchronous stochastic approxima- tion andq-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approxima- tion andq-learning. InConference on learning theory, pages 3185–3205. PMLR, 2020

work page 2020
[38]

A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

work page 1951
[39]

An analysis of categorical distributional reinforcement learning

Mark Rowland, Marc Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 29–37. PMLR, 2018

work page 2018
[40]

Statistics and samples in distributional reinforcement learning

Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. InInternational Conference on Machine Learning, pages 5528–5536. PMLR, 2019

work page 2019
[41]

Near-minimax-optimal distributional reinforcement learning with a generative model.Advances in Neural Information Processing Systems, 37:132774–132823, 2024

Mark Rowland, Li K Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, and Will Dabney. Near-minimax-optimal distributional reinforcement learning with a generative model.Advances in Neural Information Processing Systems, 37:132774–132823, 2024

work page 2024
[42]

A hilbert space embedding for distributions

Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. InInternational conference on algorithmic learning theory, pages 13–31. Springer, 2007

work page 2007
[43]

The variance of discounted markov decision processes.Journal of Applied Probability, 19(4):794–802, 1982

Matthew J Sobel. The variance of discounted markov decision processes.Journal of Applied Probability, 19(4):794–802, 1982

work page 1982
[44]

Finite-time error bounds for linear stochastic approximation and td learning

Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and td learning. InConference on learning theory, pages 2803–2830. PMLR, 2019

work page 2019
[45]

Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

work page 1988
[46]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

work page 1998
[47]

Analysis of temporal-difference learning with function approximation.Advances in neural information processing systems, 9, 1996

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation.Advances in neural information processing systems, 9, 1996

work page 1996
[48]

Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185–202, 1994

work page 1994
[49]

Foundations of mul- tivariate distributional reinforcement learning.Advances in Neural Information Processing Systems, 37:101297–101336, 2024

Harley Wiltzer, Jesse Farebrother, Arthur Gretton, and Mark Rowland. Foundations of mul- tivariate distributional reinforcement learning.Advances in Neural Information Processing Systems, 37:101297–101336, 2024

work page 2024
[50]

Distributional offline policy evaluation with predictive error guarantees

Runzhe Wu, Masatoshi Uehara, and Wen Sun. Distributional offline policy evaluation with predictive error guarantees. InInternational Conference on Machine Learning, pages 37685– 37712. PMLR, 2023

work page 2023
[51]

Fully parameterized quantile function for distributional reinforcement learning.Advances in neural information processing systems, 32, 2019

Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning.Advances in neural information processing systems, 32, 2019

work page 2019
[52]

Estimation and inference in distributional reinforcement learning.arXiv preprint arXiv:2309.17262, 2023

Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, and Zhihua Zhang. Estimation and inference in distributional reinforcement learning.arXiv preprint arXiv:2309.17262, 2023. 13 Appendix Table of Contents A Common finite-iteration ingredients in the block-supremum geometry 15 A.1 Norm comparison and smoothing . . . . . . . . . . . . . . . . . . . . . . ...

work page arXiv 2023
[53]

ifα k ≡αandα≤a 2/a3, then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞(1−a 2α)k + a4 a2 α,(67)

work page
[54]

ifα k =α/(k+h),α >1/a 2, and h≥max 1, αa3 a2 ,(68) then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞ h k+h a2α + 4e α2a4 a2α−1 · 1 k+h ,(69)

work page
[55]

ifα k =α/(k+h) z withz∈(0,1)and h≥max ( 1, αa3 a2 1/z , 2z a2α 1/(1−z)) ,(70) then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞ exp − a2α 1−z (k+h) 1−z −h 1−z + 2αa4 a2 · 1 (k+h) z . (71) Proof.Define the averaged asynchronous operator Hρ(U) :=U+ X s∈S ρ(s)Ps (OU)(s)−U(s) .(72) For each state blocks, Hρ(U)−H ρ(V) (s) = (1−ρ(s))(U(s)−V(s)) +ρ(s) (OU)(s)−(OV)(s) ....

work page
[56]

ifα k ≡α, then for allk≥t α, E ∥Wk∥2 2,∞ ≤ϕ 1c1(1−ϕ 2α)k−tα + ϕ3 ϕ2 c2 αtα;(86)

work page
[57]

ifα k =α/(k+h),α >1/ϕ 2, andh≥1, then for allk≥K, E ∥Wk∥2 2,∞ ≤ϕ 1c1 K+h k+h ϕ2α + 8e α2ϕ3c2 ϕ2α−1 · tk k+h ;(87)

work page
[58]

(88) Proof

ifα k =α/(k+h) z withz∈(0,1), then for allk≥K, E ∥Wk∥2 2,∞ ≤ϕ 1c1 exp − ϕ2α 1−z (k+h) 1−z −(K+h) 1−z + 4ϕ3c2α ϕ2 · tk (k+h) z . (88) Proof. This is the Markovian finite-iteration theorem of Chen et al. [13], translated into the present block-supremum notation and applied to the centered recursion Wk :=U k −U ⋆. The contraction factor is ¯βµ. The smoothnes...

work page
[59]

Recall from Section 3.1 that OC :=I C ◦Π Θ CT π ◦I −1 C .(98) Proposition 18(Embedded CTD contraction).For allU, U ′ ∈I C(F S C,Θ), ∥OCU− O CU ′∥2,∞ ≤ √γ∥U−U ′∥2,∞.(99) Proof

Taking the maximum over s∈ S gives the product-space identity. Recall from Section 3.1 that OC :=I C ◦Π Θ CT π ◦I −1 C .(98) Proposition 18(Embedded CTD contraction).For allU, U ′ ∈I C(F S C,Θ), ∥OCU− O CU ′∥2,∞ ≤ √γ∥U−U ′∥2,∞.(99) Proof. Let η:=I −1 C (U) and η′ :=I −1 C (U ′). By Proposition 17 and the contraction of OC in the supremum Cramér metric [5]...

work page

[1] [1]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InInternational conference on machine learning, pages 263–272. PMLR, 2017

work page 2017

[2] [2]

Bauschke and Patrick L

Heinz H. Bauschke and Patrick L. Combettes.Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 2nd edition, 2017. ISBN 3319483102

work page 2017

[3] [3]

SIAM, 2017

Amir Beck.First-order methods in optimization. SIAM, 2017

work page 2017

[4] [4]

A distributional perspective on rein- forcement learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on rein- forcement learning. InInternational conference on machine learning, pages 449–458. Pmlr, 2017

work page 2017

[5] [5]

MIT Press, 2023

Marc G Bellemare, Will Dabney, and Mark Rowland.Distributional reinforcement learning. MIT Press, 2023

work page 2023

[6] [6]

Neuro-dynamic programming, 1996

Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, 1996

work page 1996

[7] [7]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–

work page

[8] [8]

Springer

Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer

work page

[9] [9]

Asynchronous stochastic approximations.SIAM Journal on Control and Optimization, 36(3):840–851, 1998

Vivek S Borkar. Asynchronous stochastic approximations.SIAM Journal on Control and Optimization, 36(3):840–851, 1998

work page 1998

[10] [10]

Speedy categorical distributional reinforcement learning and complexity analysis.SIAM Journal on Mathematics of Data Science, 4(2):675–693, 2022

Markus Böck and Clemens Heitzinger. Speedy categorical distributional reinforcement learning and complexity analysis.SIAM Journal on Mathematics of Data Science, 4(2):675–693, 2022

work page 2022

[11] [11]

Finite- sample analysis of contractive stochastic approximation using smooth convex envelopes.Ad- vances in Neural Information Processing Systems, 33:8223–8234, 2020

Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite- sample analysis of contractive stochastic approximation using smooth convex envelopes.Ad- vances in Neural Information Processing Systems, 33:8223–8234, 2020

work page 2020

[12] [12]

Finite- sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite- sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

work page 2022

[13] [13]

A lya- punov theory for finite-sample guarantees of markovian stochastic approximation.Operations Research, 72(4):1352–1367, 2024

Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A lya- punov theory for finite-sample guarantees of markovian stochastic approximation.Operations Research, 72(4):1352–1367, 2024

work page 2024

[14] [14]

Implicit quantile networks for distributional reinforcement learning

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. InInternational conference on machine learning, pages 1096–1105. PMLR, 2018

work page 2018

[15] [15]

Distributional reinforce- ment learning with quantile regression

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforce- ment learning with quantile regression. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018

[16] [16]

Finite sample analyses for td (0) with function approximation

Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for td (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018

[17] [17]

Td (λ) converges with probability 1.Machine Learning, 14(3):295–301, 1994

Peter Dayan and Terrence J Sejnowski. Td (λ) converges with probability 1.Machine Learning, 14(3):295–301, 1994

work page 1994

[18] [18]

Fixed-horizon temporal difference methods for stable reinforcement learning

Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, and Daniel Graves. Fixed-horizon temporal difference methods for stable reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3741–3748, 2020. 11

work page 2020

[19] [19]

A kernel two-sample test.The journal of machine learning research, 13(1):723–773, 2012

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.The journal of machine learning research, 13(1):723–773, 2012

work page 2012

[20] [20]

Incremental learning of evaluation functions for absorbing markov chains: New methods and theorems.preprint, 1994

Leonid Gurvits, Long-Ji Lin, and Stephen José Hanson. Incremental learning of evaluation functions for absorbing markov chains: New methods and theorems.preprint, 1994

work page 1994

[21] [21]

Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

work page 1972

[22] [22]

Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993

work page 1993

[23] [23]

Markov decision processes with a new optimality criterion: Discrete time

Stratton C Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973

work page 1973

[24] [24]

A utility criterion for markov decision processes.Management Science, 23 (1):43–49, 1976

Stratton C Jaquette. A utility criterion for markov decision processes.Management Science, 23 (1):43–49, 1976

work page 1976

[25] [25]

Categorical distributional reinforcement learning with kullback-leibler divergence: Convergence and asymptotics

Tyler Kastner, Mark Rowland, Yunhao Tang, Murat A Erdogdu, and Amir-massoud Farahmand. Categorical distributional reinforcement learning with kullback-leibler divergence: Convergence and asymptotics. InForty-second International Conference on Machine Learning, 2025

work page 2025

[26] [26]

Springer, 2003

Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003

work page 2003

[27] [27]

American Mathematical Soc., 2017

David A Levin and Yuval Peres.Markov chains and mixing times, volume 107. American Mathematical Soc., 2017

work page 2017

[28] [28]

A generalized reinforcement-learning model: Con- vergence and applications

Michael L Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Con- vergence and applications. InICML, volume 96, pages 310–318, 1996

work page 1996

[29] [29]

L. Ljung. Analysis of recursive stochastic algorithms.IEEE Transactions on Automatic Control, 22(4):551–575, 1977. doi: 10.1109/TAC.1977.1101561

work page doi:10.1109/tac.1977.1101561 1977

[30] [30]

A comparative analysis of expected and distributional reinforcement learning

Clare Lyle, Marc G Bellemare, and Pablo Samuel Castro. A comparative analysis of expected and distributional reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4504–4511, 2019

work page 2019

[31] [31]

Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965

work page 1965

[32] [32]

Nonparametric return distribution approximation for reinforcement learning

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 799–806, 2010

work page 2010

[33] [33]

Foundations and Trends® in Machine Learning , author =

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond.Found. Trends Mach. Learn., 10(1–2): 1–141, June 2017. ISSN 1935-8237. doi: 10.1561/2200000060. URL https://doi.org/10. 1561/2200000060

work page doi:10.1561/2200000060 2017

[34] [34]

Finite time analysis of tem- poral difference learning with linear function approximation: Tail averaging and regularisation

Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of tem- poral difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023

work page 2023

[35] [35]

Statistical efficiency of distributional temporal difference learning.Advances in Neural Information Processing Systems, 37:24724–24761, 2024

Yang Peng, Liangyu Zhang, and Zhihua Zhang. Statistical efficiency of distributional temporal difference learning.Advances in Neural Information Processing Systems, 37:24724–24761, 2024

work page 2024

[36] [36]

A finite sample analysis of distributional td learning with linear function approximation.arXiv preprint arXiv:2502.14172, 2025

Yang Peng, Kaicheng Jin, Liangyu Zhang, and Zhihua Zhang. A finite sample analysis of distributional td learning with linear function approximation.arXiv preprint arXiv:2502.14172, 2025. 12

work page arXiv 2025

[37] [37]

Finite-time analysis of asynchronous stochastic approxima- tion andq-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approxima- tion andq-learning. InConference on learning theory, pages 3185–3205. PMLR, 2020

work page 2020

[38] [38]

A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

work page 1951

[39] [39]

An analysis of categorical distributional reinforcement learning

Mark Rowland, Marc Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 29–37. PMLR, 2018

work page 2018

[40] [40]

Statistics and samples in distributional reinforcement learning

Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. InInternational Conference on Machine Learning, pages 5528–5536. PMLR, 2019

work page 2019

[41] [41]

Near-minimax-optimal distributional reinforcement learning with a generative model.Advances in Neural Information Processing Systems, 37:132774–132823, 2024

Mark Rowland, Li K Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, and Will Dabney. Near-minimax-optimal distributional reinforcement learning with a generative model.Advances in Neural Information Processing Systems, 37:132774–132823, 2024

work page 2024

[42] [42]

A hilbert space embedding for distributions

Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. InInternational conference on algorithmic learning theory, pages 13–31. Springer, 2007

work page 2007

[43] [43]

The variance of discounted markov decision processes.Journal of Applied Probability, 19(4):794–802, 1982

Matthew J Sobel. The variance of discounted markov decision processes.Journal of Applied Probability, 19(4):794–802, 1982

work page 1982

[44] [44]

Finite-time error bounds for linear stochastic approximation and td learning

Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and td learning. InConference on learning theory, pages 2803–2830. PMLR, 2019

work page 2019

[45] [45]

Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988

work page 1988

[46] [46]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

work page 1998

[47] [47]

Analysis of temporal-difference learning with function approximation.Advances in neural information processing systems, 9, 1996

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation.Advances in neural information processing systems, 9, 1996

work page 1996

[48] [48]

Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185–202, 1994

work page 1994

[49] [49]

Foundations of mul- tivariate distributional reinforcement learning.Advances in Neural Information Processing Systems, 37:101297–101336, 2024

Harley Wiltzer, Jesse Farebrother, Arthur Gretton, and Mark Rowland. Foundations of mul- tivariate distributional reinforcement learning.Advances in Neural Information Processing Systems, 37:101297–101336, 2024

work page 2024

[50] [50]

Distributional offline policy evaluation with predictive error guarantees

Runzhe Wu, Masatoshi Uehara, and Wen Sun. Distributional offline policy evaluation with predictive error guarantees. InInternational Conference on Machine Learning, pages 37685– 37712. PMLR, 2023

work page 2023

[51] [51]

Fully parameterized quantile function for distributional reinforcement learning.Advances in neural information processing systems, 32, 2019

Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning.Advances in neural information processing systems, 32, 2019

work page 2019

[52] [52]

Estimation and inference in distributional reinforcement learning.arXiv preprint arXiv:2309.17262, 2023

Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, and Zhihua Zhang. Estimation and inference in distributional reinforcement learning.arXiv preprint arXiv:2309.17262, 2023. 13 Appendix Table of Contents A Common finite-iteration ingredients in the block-supremum geometry 15 A.1 Norm comparison and smoothing . . . . . . . . . . . . . . . . . . . . . . ...

work page arXiv 2023

[53] [53]

ifα k ≡αandα≤a 2/a3, then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞(1−a 2α)k + a4 a2 α,(67)

work page

[54] [54]

ifα k =α/(k+h),α >1/a 2, and h≥max 1, αa3 a2 ,(68) then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞ h k+h a2α + 4e α2a4 a2α−1 · 1 k+h ,(69)

work page

[55] [55]

ifα k =α/(k+h) z withz∈(0,1)and h≥max ( 1, αa3 a2 1/z , 2z a2α 1/(1−z)) ,(70) then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞ exp − a2α 1−z (k+h) 1−z −h 1−z + 2αa4 a2 · 1 (k+h) z . (71) Proof.Define the averaged asynchronous operator Hρ(U) :=U+ X s∈S ρ(s)Ps (OU)(s)−U(s) .(72) For each state blocks, Hρ(U)−H ρ(V) (s) = (1−ρ(s))(U(s)−V(s)) +ρ(s) (OU)(s)−(OV)(s) ....

work page

[56] [56]

ifα k ≡α, then for allk≥t α, E ∥Wk∥2 2,∞ ≤ϕ 1c1(1−ϕ 2α)k−tα + ϕ3 ϕ2 c2 αtα;(86)

work page

[57] [57]

ifα k =α/(k+h),α >1/ϕ 2, andh≥1, then for allk≥K, E ∥Wk∥2 2,∞ ≤ϕ 1c1 K+h k+h ϕ2α + 8e α2ϕ3c2 ϕ2α−1 · tk k+h ;(87)

work page

[58] [58]

(88) Proof

ifα k =α/(k+h) z withz∈(0,1), then for allk≥K, E ∥Wk∥2 2,∞ ≤ϕ 1c1 exp − ϕ2α 1−z (k+h) 1−z −(K+h) 1−z + 4ϕ3c2α ϕ2 · tk (k+h) z . (88) Proof. This is the Markovian finite-iteration theorem of Chen et al. [13], translated into the present block-supremum notation and applied to the centered recursion Wk :=U k −U ⋆. The contraction factor is ¯βµ. The smoothnes...

work page

[59] [59]

Recall from Section 3.1 that OC :=I C ◦Π Θ CT π ◦I −1 C .(98) Proposition 18(Embedded CTD contraction).For allU, U ′ ∈I C(F S C,Θ), ∥OCU− O CU ′∥2,∞ ≤ √γ∥U−U ′∥2,∞.(99) Proof

Taking the maximum over s∈ S gives the product-space identity. Recall from Section 3.1 that OC :=I C ◦Π Θ CT π ◦I −1 C .(98) Proposition 18(Embedded CTD contraction).For allU, U ′ ∈I C(F S C,Θ), ∥OCU− O CU ′∥2,∞ ≤ √γ∥U−U ′∥2,∞.(99) Proof. Let η:=I −1 C (U) and η′ :=I −1 C (U ′). By Proposition 17 and the contraction of OC in the supremum Cramér metric [5]...

work page