A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning
Pith reviewed 2026-05-11 01:56 UTC · model grok-4.3
The pith
Isometric embeddings convert asynchronous categorical TD updates into single-state contracting recursions, delivering finite-iteration guarantees for both discounted and fixed-horizon settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After suitable isometric embeddings, both the scalar categorical temporal-difference learning algorithm in the Cramér geometry and the multivariate signed-categorical temporal-difference learning algorithm in the maximum mean discrepancy geometry take the form of asynchronous single-state stochastic-approximation recursions that contract in a statewise supremum norm. This permits finite-iteration guarantees in discounted problems under both i.i.d. and Markovian state sampling, and in undiscounted fixed-horizon problems under i.i.d. episodic sampling.
What carries the argument
Isometric embeddings of the categorical and signed-categorical updates that preserve contraction in the statewise supremum norm under the Cramér and maximum mean discrepancy geometries.
If this is right
- Finite-iteration error bounds apply directly to discounted problems when states are sampled i.i.d.
- The same contraction yields finite-iteration bounds when states arrive from a Markovian trajectory in discounted problems.
- For undiscounted fixed-horizon tasks, finite-iteration guarantees hold under i.i.d. episodic sampling.
- The analysis applies uniformly to both scalar Cramér updates and multivariate signed-categorical updates in the MMD geometry.
Where Pith is reading between the lines
- Similar embedding techniques could be tested on other distributional Bellman operators to obtain comparable contraction results.
- The statewise supremum norm contraction supplies a concrete way to allocate iteration budgets in online distributional RL implementations.
- If the embedding approach generalizes, it may reduce the need for synchronous full-state updates in theoretical analyses of distributional methods.
Load-bearing premise
The chosen isometric embeddings must preserve the contraction property of the updates when measured in the statewise supremum norm.
What would settle it
An explicit MDP and sampling schedule where the embedded update fails to contract in the statewise supremum norm, or a numerical check showing that observed error after a predicted number of iterations exceeds the derived bound by more than the allowed slack.
read the original abstract
Recent non-asymptotic analyses have substantially advanced the theory of distributional policy evaluation, but they largely concern synchronous full-state updates under a generative model, model-based estimators, accelerated variants, or different approximation architectures. Standard categorical temporal-difference learning is typically used in a different regime. It asynchronously performs a single-state update at each iteration and, in online settings, is driven by a Markovian trajectory. This leaves an important gap between existing finite-iteration theory and the categorical recursions most closely aligned with practical distributional temporal-difference implementations. We bridge this gap for two categorical policy-evaluation methods: scalar categorical temporal-difference learning in the Cram\'er geometry and multivariate signed-categorical temporal-difference learning in the maximum mean discrepancy geometry. After suitable isometric embeddings, both algorithms take the form of asynchronous single-state stochastic-approximation recursions that contract in a statewise supremum norm. This permits finite-iteration guarantees in discounted problems under both i.i.d. and Markovian state sampling, and in undiscounted fixed-horizon problems under i.i.d. episodic sampling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops finite-iteration theory for two asynchronous categorical distributional TD methods: scalar categorical TD learning under the Cramér geometry and multivariate signed-categorical TD learning under the maximum mean discrepancy geometry. After suitable isometric embeddings, both are recast as single-state asynchronous stochastic-approximation recursions that contract in the statewise supremum norm. This reduction yields explicit finite-iteration bounds for discounted problems under i.i.d. and Markovian state sampling as well as for undiscounted fixed-horizon problems under i.i.d. episodic sampling.
Significance. If the isometric embeddings are shown to preserve the required contraction properties without introducing extraneous factors, the result is significant: it closes the gap between existing non-asymptotic analyses (largely synchronous or generative-model based) and the asynchronous online regime used in practical distributional RL. The reduction to standard SA theory is a clean technical contribution that directly imports finite-time guarantees from the SA literature to the distributional setting.
major comments (2)
- [§3.2] §3.2, the definition of the isometric embedding for the signed-categorical case: the claim that the embedding is isometric with respect to the statewise supremum norm must be accompanied by an explicit verification that the MMD geometry induces a norm equivalent to the one used for the contraction mapping; without this, the reduction to a contracting recursion in the sup norm is not yet load-bearing.
- [§4.1] §4.1, Theorem 1 (finite-iteration bound for discounted Markovian sampling): the contraction factor after embedding appears to depend on the discount factor and the embedding dimension; the proof sketch must confirm that this factor remains strictly less than one uniformly over the state space, otherwise the finite-iteration guarantee does not follow from the cited SA results.
minor comments (2)
- [§1] The abstract and §1 refer to 'statewise supremum norm' without an early equation reference; adding a pointer to the precise norm definition in §2 would improve readability.
- [Table 1] Table 1 (comparison with prior non-asymptotic results) lists several works but omits the precise sampling assumptions used in each; clarifying the 'asynchronous online' column would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
read point-by-point responses
-
Referee: [§3.2] §3.2, the definition of the isometric embedding for the signed-categorical case: the claim that the embedding is isometric with respect to the statewise supremum norm must be accompanied by an explicit verification that the MMD geometry induces a norm equivalent to the one used for the contraction mapping; without this, the reduction to a contracting recursion in the sup norm is not yet load-bearing.
Authors: We thank the referee for this observation. The embedding in §3.2 is constructed from the reproducing kernel of the MMD so that the Euclidean distance in the feature space equals the MMD distance by definition. To make the reduction load-bearing, we will add a short lemma (new Lemma 3.3) immediately after the embedding definition that explicitly verifies the isometry with respect to the statewise supremum norm and confirms that the induced norm on the embedded space is identical to the original MMD geometry (equivalence constants equal to 1). This addition will be placed before the contraction argument and will not alter any subsequent results. revision: yes
-
Referee: [§4.1] §4.1, Theorem 1 (finite-iteration bound for discounted Markovian sampling): the contraction factor after embedding appears to depend on the discount factor and the embedding dimension; the proof sketch must confirm that this factor remains strictly less than one uniformly over the state space, otherwise the finite-iteration guarantee does not follow from the cited SA results.
Authors: We agree that the contraction factor requires explicit confirmation. Because the embedding is isometric, the Bellman operator in the embedded space contracts with modulus exactly equal to the discount factor γ < 1; this modulus is independent of embedding dimension and is uniform over the state space. The proof sketch in §4.1 already invokes standard asynchronous SA bounds that apply under a uniform contraction strictly less than one. We will expand the proof paragraph to state this derivation explicitly, including the observation that the isometry preserves the contraction modulus without introducing dimension-dependent factors, thereby directly justifying the application of the cited SA results. revision: partial
Circularity Check
No significant circularity detected
full rationale
The derivation proceeds by defining isometric embeddings that map the categorical and signed-categorical TD updates onto asynchronous single-state stochastic-approximation recursions, then verifying that these embedded operators contract in the statewise supremum norm under the chosen geometries (Cramér and MMD). Finite-iteration bounds are obtained by invoking standard SA results for the resulting recursions under the listed sampling regimes. This chain rests on explicit verification of contraction preservation and external SA theory rather than any self-definitional reduction, fitted-parameter renaming, or load-bearing self-citation; the central claim therefore remains independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Suitable isometric embeddings exist that map the categorical and signed-categorical TD updates into recursions contracting in the statewise supremum norm.
- standard math The sampling processes (i.i.d., Markovian, episodic) satisfy the conditions required by the underlying stochastic-approximation theorems.
Reference graph
Works this paper leans on
-
[1]
Minimax regret bounds for reinforcement learning
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InInternational conference on machine learning, pages 263–272. PMLR, 2017
work page 2017
-
[2]
Heinz H. Bauschke and Patrick L. Combettes.Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 2nd edition, 2017. ISBN 3319483102
work page 2017
- [3]
-
[4]
A distributional perspective on rein- forcement learning
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on rein- forcement learning. InInternational conference on machine learning, pages 449–458. Pmlr, 2017
work page 2017
-
[5]
Marc G Bellemare, Will Dabney, and Mark Rowland.Distributional reinforcement learning. MIT Press, 2023
work page 2023
-
[6]
Neuro-dynamic programming, 1996
Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, 1996
work page 1996
-
[7]
A finite time analysis of temporal difference learning with linear function approximation
Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–
- [8]
-
[9]
Asynchronous stochastic approximations.SIAM Journal on Control and Optimization, 36(3):840–851, 1998
Vivek S Borkar. Asynchronous stochastic approximations.SIAM Journal on Control and Optimization, 36(3):840–851, 1998
work page 1998
-
[10]
Markus Böck and Clemens Heitzinger. Speedy categorical distributional reinforcement learning and complexity analysis.SIAM Journal on Mathematics of Data Science, 4(2):675–693, 2022
work page 2022
-
[11]
Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite- sample analysis of contractive stochastic approximation using smooth convex envelopes.Ad- vances in Neural Information Processing Systems, 33:8223–8234, 2020
work page 2020
-
[12]
Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite- sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022
work page 2022
-
[13]
Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A lya- punov theory for finite-sample guarantees of markovian stochastic approximation.Operations Research, 72(4):1352–1367, 2024
work page 2024
-
[14]
Implicit quantile networks for distributional reinforcement learning
Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. InInternational conference on machine learning, pages 1096–1105. PMLR, 2018
work page 2018
-
[15]
Distributional reinforce- ment learning with quantile regression
Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforce- ment learning with quantile regression. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018
work page 2018
-
[16]
Finite sample analyses for td (0) with function approximation
Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for td (0) with function approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018
work page 2018
-
[17]
Td (λ) converges with probability 1.Machine Learning, 14(3):295–301, 1994
Peter Dayan and Terrence J Sejnowski. Td (λ) converges with probability 1.Machine Learning, 14(3):295–301, 1994
work page 1994
-
[18]
Fixed-horizon temporal difference methods for stable reinforcement learning
Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, and Daniel Graves. Fixed-horizon temporal difference methods for stable reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3741–3748, 2020. 11
work page 2020
-
[19]
A kernel two-sample test.The journal of machine learning research, 13(1):723–773, 2012
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.The journal of machine learning research, 13(1):723–773, 2012
work page 2012
-
[20]
Leonid Gurvits, Long-Ji Lin, and Stephen José Hanson. Incremental learning of evaluation functions for absorbing markov chains: New methods and theorems.preprint, 1994
work page 1994
-
[21]
Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972
Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972
work page 1972
-
[22]
Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms.Advances in neural information processing systems, 6, 1993
work page 1993
-
[23]
Markov decision processes with a new optimality criterion: Discrete time
Stratton C Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973
work page 1973
-
[24]
A utility criterion for markov decision processes.Management Science, 23 (1):43–49, 1976
Stratton C Jaquette. A utility criterion for markov decision processes.Management Science, 23 (1):43–49, 1976
work page 1976
-
[25]
Tyler Kastner, Mark Rowland, Yunhao Tang, Murat A Erdogdu, and Amir-massoud Farahmand. Categorical distributional reinforcement learning with kullback-leibler divergence: Convergence and asymptotics. InForty-second International Conference on Machine Learning, 2025
work page 2025
-
[26]
Harold J Kushner and G George Yin.Stochastic approximation and recursive algorithms and applications. Springer, 2003
work page 2003
-
[27]
American Mathematical Soc., 2017
David A Levin and Yuval Peres.Markov chains and mixing times, volume 107. American Mathematical Soc., 2017
work page 2017
-
[28]
A generalized reinforcement-learning model: Con- vergence and applications
Michael L Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Con- vergence and applications. InICML, volume 96, pages 310–318, 1996
work page 1996
-
[29]
L. Ljung. Analysis of recursive stochastic algorithms.IEEE Transactions on Automatic Control, 22(4):551–575, 1977. doi: 10.1109/TAC.1977.1101561
-
[30]
A comparative analysis of expected and distributional reinforcement learning
Clare Lyle, Marc G Bellemare, and Pablo Samuel Castro. A comparative analysis of expected and distributional reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4504–4511, 2019
work page 2019
-
[31]
Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien.Bulletin de la Société mathématique de France, 93:273–299, 1965
work page 1965
-
[32]
Nonparametric return distribution approximation for reinforcement learning
Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 799–806, 2010
work page 2010
-
[33]
Foundations and Trends® in Machine Learning , author =
Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond.Found. Trends Mach. Learn., 10(1–2): 1–141, June 2017. ISSN 1935-8237. doi: 10.1561/2200000060. URL https://doi.org/10. 1561/2200000060
-
[34]
Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of tem- poral difference learning with linear function approximation: Tail averaging and regularisation. InInternational Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023
work page 2023
-
[35]
Yang Peng, Liangyu Zhang, and Zhihua Zhang. Statistical efficiency of distributional temporal difference learning.Advances in Neural Information Processing Systems, 37:24724–24761, 2024
work page 2024
-
[36]
Yang Peng, Kaicheng Jin, Liangyu Zhang, and Zhihua Zhang. A finite sample analysis of distributional td learning with linear function approximation.arXiv preprint arXiv:2502.14172, 2025. 12
-
[37]
Finite-time analysis of asynchronous stochastic approxima- tion andq-learning
Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approxima- tion andq-learning. InConference on learning theory, pages 3185–3205. PMLR, 2020
work page 2020
-
[38]
A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951
work page 1951
-
[39]
An analysis of categorical distributional reinforcement learning
Mark Rowland, Marc Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 29–37. PMLR, 2018
work page 2018
-
[40]
Statistics and samples in distributional reinforcement learning
Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. InInternational Conference on Machine Learning, pages 5528–5536. PMLR, 2019
work page 2019
-
[41]
Mark Rowland, Li K Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, and Will Dabney. Near-minimax-optimal distributional reinforcement learning with a generative model.Advances in Neural Information Processing Systems, 37:132774–132823, 2024
work page 2024
-
[42]
A hilbert space embedding for distributions
Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. InInternational conference on algorithmic learning theory, pages 13–31. Springer, 2007
work page 2007
-
[43]
Matthew J Sobel. The variance of discounted markov decision processes.Journal of Applied Probability, 19(4):794–802, 1982
work page 1982
-
[44]
Finite-time error bounds for linear stochastic approximation and td learning
Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and td learning. InConference on learning theory, pages 2803–2830. PMLR, 2019
work page 2019
-
[45]
Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988
Richard S Sutton. Learning to predict by the methods of temporal differences.Machine learning, 3(1):9–44, 1988
work page 1988
-
[46]
Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
work page 1998
-
[47]
John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation.Advances in neural information processing systems, 9, 1996
work page 1996
-
[48]
Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185–202, 1994
John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185–202, 1994
work page 1994
-
[49]
Harley Wiltzer, Jesse Farebrother, Arthur Gretton, and Mark Rowland. Foundations of mul- tivariate distributional reinforcement learning.Advances in Neural Information Processing Systems, 37:101297–101336, 2024
work page 2024
-
[50]
Distributional offline policy evaluation with predictive error guarantees
Runzhe Wu, Masatoshi Uehara, and Wen Sun. Distributional offline policy evaluation with predictive error guarantees. InInternational Conference on Machine Learning, pages 37685– 37712. PMLR, 2023
work page 2023
-
[51]
Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning.Advances in neural information processing systems, 32, 2019
work page 2019
-
[52]
Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, and Zhihua Zhang. Estimation and inference in distributional reinforcement learning.arXiv preprint arXiv:2309.17262, 2023. 13 Appendix Table of Contents A Common finite-iteration ingredients in the block-supremum geometry 15 A.1 Norm comparison and smoothing . . . . . . . . . . . . . . . . . . . . . . ...
-
[53]
ifα k ≡αandα≤a 2/a3, then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞(1−a 2α)k + a4 a2 α,(67)
-
[54]
ifα k =α/(k+h),α >1/a 2, and h≥max 1, αa3 a2 ,(68) then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞ h k+h a2α + 4e α2a4 a2α−1 · 1 k+h ,(69)
-
[55]
ifα k =α/(k+h) z withz∈(0,1)and h≥max ( 1, αa3 a2 1/z , 2z a2α 1/(1−z)) ,(70) then for allk≥0, E ∥Wk∥2 2,∞ ≤a 1∥W0∥2 2,∞ exp − a2α 1−z (k+h) 1−z −h 1−z + 2αa4 a2 · 1 (k+h) z . (71) Proof.Define the averaged asynchronous operator Hρ(U) :=U+ X s∈S ρ(s)Ps (OU)(s)−U(s) .(72) For each state blocks, Hρ(U)−H ρ(V) (s) = (1−ρ(s))(U(s)−V(s)) +ρ(s) (OU)(s)−(OV)(s) ....
-
[56]
ifα k ≡α, then for allk≥t α, E ∥Wk∥2 2,∞ ≤ϕ 1c1(1−ϕ 2α)k−tα + ϕ3 ϕ2 c2 αtα;(86)
-
[57]
ifα k =α/(k+h),α >1/ϕ 2, andh≥1, then for allk≥K, E ∥Wk∥2 2,∞ ≤ϕ 1c1 K+h k+h ϕ2α + 8e α2ϕ3c2 ϕ2α−1 · tk k+h ;(87)
-
[58]
ifα k =α/(k+h) z withz∈(0,1), then for allk≥K, E ∥Wk∥2 2,∞ ≤ϕ 1c1 exp − ϕ2α 1−z (k+h) 1−z −(K+h) 1−z + 4ϕ3c2α ϕ2 · tk (k+h) z . (88) Proof. This is the Markovian finite-iteration theorem of Chen et al. [13], translated into the present block-supremum notation and applied to the centered recursion Wk :=U k −U ⋆. The contraction factor is ¯βµ. The smoothnes...
-
[59]
Taking the maximum over s∈ S gives the product-space identity. Recall from Section 3.1 that OC :=I C ◦Π Θ CT π ◦I −1 C .(98) Proposition 18(Embedded CTD contraction).For allU, U ′ ∈I C(F S C,Θ), ∥OCU− O CU ′∥2,∞ ≤ √γ∥U−U ′∥2,∞.(99) Proof. Let η:=I −1 C (U) and η′ :=I −1 C (U ′). By Proposition 17 and the contraction of OC in the supremum Cramér metric [5]...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.