pith. machine review for the scientific record.

arxiv: 2605.11289 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC

Recognition: 2 Lean theorem links

Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning


Pith reviewed 2026-05-13 02:33 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords average-reward reinforcement learning · distributional reinforcement learning · quotient space · categorical representations · temporal difference learning · Bellman operator · non-expansive operators · convergence analysis

The pith

A quotient formulation resolves the ill-posedness of average-reward distributional RL by identifying bias laws up to a common translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a quotient-space approach to make distributional methods work with average-reward reinforcement learning. In average-reward settings, the bias value function is only defined up to an additive constant, which breaks standard distributional formulations on the real line. By identifying distributions up to a common translation in a quotient space and using a compatible categorical parameterization, the authors define a projected distributional operator that is non-expansive and possesses fixed points. This setup allows them to prove convergence properties for sampled updates, including almost-sure convergence of a one-state temporal-difference method under idealized conditions. A reader should care because it provides a theoretically grounded way to handle uncertainty in long-running tasks where average reward is the natural objective.
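The quotient idea can be made concrete with a small sketch. This is our illustration, not the paper's code: we take mean-centering across states as one plausible choice of representative for each translation class; `canonical_support` and `cat_mean` are hypothetical helper names.

```python
import numpy as np

def cat_mean(support, probs):
    """Mean of a categorical law with atoms `support` and weights `probs`."""
    return float(support @ probs)

def canonical_support(support, probs_by_state):
    """Zero-mean representative of the translation class: shift the common
    support so the state-averaged mean is zero. The weights are untouched,
    so the equivalence class is preserved."""
    shift = np.mean([cat_mean(support, p) for p in probs_by_state])
    return support - shift

support = np.linspace(0.0, 10.0, 11)
probs = [
    np.array([0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0], dtype=float),  # state 1
    np.array([0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0], dtype=float),    # state 2
]

# Profiles that differ by a common translation select the same representative,
# which is exactly what makes the quotient-level operator well-posed.
same = np.allclose(canonical_support(support, probs),
                   canonical_support(support + 3.0, probs))
```

Any other common shift gives the same canonical support, so quantities defined on the representative depend only on the equivalence class.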

Core claim

We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme.

What carries the argument

The quotient-categorical space, which identifies state-indexed bias laws up to a common translation, together with the projected average-reward distributional operator defined on it that is non-expansive in the coordinate Cramér metric.
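The metric carrying the argument can be sketched in a few lines. This is our reading of the abstract, not the paper's definitions: `cramer` computes the Cramér (ℓ2-of-CDF) distance on a common grid, `coord_cramer` takes the supremum over states, and the final lines verify that averaging toward a common law (a Krasnosel'skii–Mann-style relaxation) is non-expansive because the CDF is linear in the weights.

```python
import numpy as np

def cramer(p, q, dz):
    """Cramér (ell_2) distance between two categorical laws on a common,
    evenly spaced grid: the L2 norm of the difference of the CDFs."""
    diff = np.cumsum(p) - np.cumsum(q)
    return float(np.sqrt(dz * np.sum(diff ** 2)))

def coord_cramer(profile_p, profile_q, dz):
    """Coordinate (supremum-over-states) Cramér metric on profiles."""
    return max(cramer(p, q, dz) for p, q in zip(profile_p, profile_q))

rng = np.random.default_rng(1)
dz = 0.1
p = rng.dirichlet(np.ones(50))
q = rng.dirichlet(np.ones(50))
r = rng.dirichlet(np.ones(50))

# Averaging both laws toward a common law r shrinks the distance by
# exactly lam, since cumsum is linear in the weights: non-expansive.
lam = 0.3
d_mixed = cramer(lam * p + (1 - lam) * r, lam * q + (1 - lam) * r, dz)
```

The same linearity argument lifts to `coord_cramer` coordinate-wise, which is the structural reason relaxed (Mann-type) iterations of a non-expansive operator stay non-expansive.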

If this is right

  • The operator is well-defined, non-expansive, and admits fixed points, supporting iterative solution methods in the quotient space.
  • Sampled one-state TD updates converge almost surely with finite-iteration residual bounds under both i.i.d. and Markovian sampling in the centered-reward setting.
  • Augmenting the recursion with an online gain estimator produces a coupled scheme that remains non-expansive and converges under Markovian sampling.
  • Synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast with practical fixed-grid categorical representations.
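The coupled recursion in the second and third bullets can be sketched for the simplest case the paper analyzes: one self-looping state with i.i.d. rewards. This is a toy reconstruction under our own assumptions, not the authors' algorithm: `shift_project` is a standard categorical projection, the step sizes and reward model are invented, and the sketch illustrates the update mechanics rather than the convergence proof (the mean of the law random-walks, consistent with a non-expansive rather than contractive operator).

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.linspace(-10.0, 10.0, 201)

def shift_project(probs, c, grid):
    """Shift every atom by c, then project back onto the fixed grid by
    splitting each atom's mass linearly between its two neighbours
    (the standard categorical projection)."""
    out = np.zeros_like(probs)
    dz = grid[1] - grid[0]
    for w, z in zip(probs, grid):
        x = float(np.clip(z + c, grid[0], grid[-1]))
        j = min(int((x - grid[0]) / dz), len(grid) - 2)
        t = (x - grid[j]) / dz
        out[j] += w * (1 - t)
        out[j + 1] += w * t
    return out

# One self-looping state; rewards have unknown gain g* = 1.0.
g = 0.0
probs = np.full(len(support), 1.0 / len(support))
alpha, beta = 0.05, 0.05
for _ in range(2000):
    r = rng.normal(1.0, 0.5)                       # sampled reward
    g += beta * (r - g)                            # online gain estimator
    target = shift_project(probs, r - g, support)  # centered distributional target
    probs = (1 - alpha) * probs + alpha * target   # TD mixing step
```

After the run, the gain estimate sits near the true gain and the iterate remains a probability vector; the interesting theoretical content, which this sketch does not reproduce, is the almost-sure convergence of the coupled pair under Markovian sampling.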

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The symmetry-preserving parameterization could support extensions to multi-state asynchronous updates while retaining the same metric properties.
  • Practical algorithms built on this space might improve uncertainty-aware policies in continuing tasks such as resource allocation where gain is the primary objective.
  • Adaptive discretization schemes that explicitly preserve the translation quotient could close the gap between the ideal fixed-point analysis and deployed categorical representations.

Load-bearing premise

The idealized centered-reward setting required for almost-sure convergence of the one-state TD update, together with the assumption that the chosen categorical parameterization respects the quotient symmetry without introducing additional bias.

What would settle it

A demonstration that the projected operator fails to be non-expansive in the coordinate Cramér metric, or that the one-state TD update diverges under centered rewards, would falsify the central results.

Figures

Figures reproduced from arXiv: 2605.11289 by Abolfazl Hashemi, Aliasghar Pourghani, Ege C. Kaya, Vijay Gupta.

Figure 1: The left panel illustrates Lemma 4: translating both the law and support preserves the coefficient vector p. The right panel illustrates that projecting a translated law back onto the same support generally gives different coefficients q ≠ p. The application of Lemma 4 shows that projection onto Θc′ or onto Θc produces categorical families with the same coefficient vector q. Therefore, Gg is a well-defin…
Figure 2: (a) Centered exact and stochastic recursions validate the non-expansiveness and convergence predictions of Theorems 7, 13, and 14. (b) Both the product residual and gain error decrease, validating Theorem 19. (c) The wrong-centering ablation shows that using g = 0 as the centering targets G0 rather than the correctly centered operator G. The rate guide curves plot the theorem rates without their exact mult…
Figure 3: Categorical bias laws learned in the five-state MRP.
Figure 4: Continuous-state Pendulum-v1 evaluation. (a) We evaluate the MC-trained critic, the online estimated gain critic, and the naive raw-reward critic against the correctly centered projected operator G. Solid curves show the empirical supremum residual over validation states, and dotted curves show the residual averaged over states. (b) For the online-gain critic, we report the product residual dλ together wit…
Original abstract

Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a quotient-space formulation for average-reward distributional RL to handle translation invariance of bias laws, paired with a categorical parameterization that respects this symmetry. It defines a projected average-reward distributional operator shown to be well-defined, non-expansive in a coordinate Cramér metric, and to admit fixed points. Sampled recursions are analyzed as asynchronous relaxations of this operator; in an idealized centered-reward setting, a one-state TD update is proved to converge almost surely with finite-iteration residual bounds under i.i.d. and Markovian sampling. For unknown gain, the scheme is augmented with an online estimator, with proofs of non-expansiveness and Markovian convergence. Synchronous exact updates are shown to be gain-independent at the quotient-law level.

Significance. If the central claims hold, the work offers a principled resolution to the ill-posedness of distributional methods in average-reward RL by using quotient identifications, which is a notable technical contribution for continuing-task settings. The non-expansiveness result in the coordinate Cramér metric and the distinction between ideal quotient distributions and practical fixed-grid representations provide useful structural insights. Convergence guarantees under sampling, even if limited to the centered-reward case, strengthen the case for practical extensions of distributional RL beyond discounted settings.

major comments (2)
  1. [Abstract] Abstract: the almost-sure convergence and finite-iteration residual bounds for the one-state TD update are established only under an idealized centered-reward setting. The manuscript must verify that this centering is compatible with the quotient-categorical construction without external enforcement; otherwise the non-expansiveness of the mean-field map does not automatically yield the stochastic-approximation bounds under Markovian sampling, and an additional uniform-integrability or Lyapunov argument is required.
  2. [Abstract] Abstract: the categorical parameterization is asserted to respect the quotient symmetry without extra bias, yet the text supplies no analysis of discretization error on the support points. Such error could violate exact quotient invariance and re-introduce a translation-dependent drift, undermining both the non-expansiveness claim and the gain-independence result for synchronous updates.
minor comments (2)
  1. The abstract would be clearer if it briefly indicated the form of the projection operator and the precise definition of the coordinate Cramér metric used for non-expansiveness.
  2. Notation for the quotient identification and the gain estimator could be introduced earlier to improve readability of the convergence statements.
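The second major comment, that fixed-grid discretization can break exact translation invariance, can be seen in a two-line experiment echoing the paper's Figure 1. This is our toy reconstruction under invented names (`project` is a standard categorical projection, not the authors' code):

```python
import numpy as np

def project(atoms, weights, grid):
    """Categorical projection: split each atom's mass linearly between
    its two neighbouring grid points."""
    out = np.zeros(len(grid))
    dz = grid[1] - grid[0]
    for w, x in zip(weights, atoms):
        x = float(np.clip(x, grid[0], grid[-1]))
        j = min(int((x - grid[0]) / dz), len(grid) - 2)
        t = (x - grid[j]) / dz
        out[j] += w * (1 - t)
        out[j + 1] += w * t
    return out

grid = np.linspace(0.0, 4.0, 5)              # spacing 1.0
p = np.array([0.0, 1.0, 0.0, 0.0, 0.0])     # point mass at z = 1

# Translating the law AND the grid together keeps the coefficient vector p
# intact, but projecting the translated law back onto the ORIGINAL grid
# smears the mass across neighbours: q != p.
q = project(grid + 0.5, p, grid)             # the atom now sits at 1.5
```

Here `q` splits the unit mass evenly between the grid points at 1 and 2, so a half-cell translation changes the coefficients even though the underlying quotient class is unchanged, which is precisely the drift the referee asks the authors to bound.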

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive major comments. We address each point below and will revise the manuscript to strengthen the presentation of the idealized setting and to supply the requested analysis of discretization effects.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the almost-sure convergence and finite-iteration residual bounds for the one-state TD update are established only under an idealized centered-reward setting. The manuscript must verify that this centering is compatible with the quotient-categorical construction without external enforcement; otherwise the non-expansiveness of the mean-field map does not automatically yield the stochastic-approximation bounds under Markovian sampling, and an additional uniform-integrability or Lyapunov argument is required.

    Authors: We appreciate the referee's request for explicit verification. The centered-reward assumption selects the zero-mean representative within each equivalence class of the quotient, which is compatible with the construction by definition. The non-expansiveness and convergence statements are proved under this idealization. In the revision we will add a short subsection that confirms compatibility without external enforcement and supplies a Lyapunov argument justifying the passage from mean-field non-expansiveness to the almost-sure bounds under Markovian sampling. revision: yes

  2. Referee: [Abstract] Abstract: the categorical parameterization is asserted to respect the quotient symmetry without extra bias, yet the text supplies no analysis of discretization error on the support points. Such error could violate exact quotient invariance and re-introduce a translation-dependent drift, undermining both the non-expansiveness claim and the gain-independence result for synchronous updates.

    Authors: We thank the referee for identifying this omission. The categorical representation is defined on the quotient space so that translation invariance holds exactly at the level of equivalence classes. Nevertheless, we agree that a quantitative treatment of support-point discretization error is needed to guarantee that practical finite grids do not re-introduce drift. The revised manuscript will include a new paragraph bounding the discretization error in the coordinate Cramér metric and showing that the error remains translation-invariant at the quotient level, thereby preserving the non-expansiveness and gain-independence claims up to a controllable approximation term. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivations rely on explicit definitions and standard operator arguments.

full rationale

The paper introduces a quotient-space formulation identifying bias laws up to translation and a categorical parameterization respecting this symmetry, then defines the projected average-reward distributional operator and establishes its well-definedness, non-expansiveness in the coordinate Cramér metric, and existence of fixed points via direct arguments. Sampled recursions are analyzed as asynchronous relaxations of this operator, with convergence results (almost-sure convergence and residual bounds for one-state TD, plus coupled gain estimation) derived under explicitly stated assumptions such as the idealized centered-reward setting. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the idealized setting is declared as an assumption rather than derived from the operator itself. The final claim on gain-independence of synchronous updates at the quotient level follows from the symmetry definition without circular reduction. The derivation chain is self-contained against external RL operator theory benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract, the central claims rest on the existence of a well-behaved projected operator in the quotient space and on standard convergence arguments for asynchronous relaxations; no explicit free parameters are named.

axioms (2)
  • domain assumption Bellman operator properties carry over to the quotient-categorical space after projection
    Invoked when defining the projected average-reward distributional operator.
  • standard math The coordinate Cramér metric is a valid contraction metric on the quotient space
    Used to establish non-expansiveness.
invented entities (1)
  • Quotient-categorical space no independent evidence
    purpose: Identify state-indexed bias laws up to common translation so that distributional representations become well-defined
    New mathematical structure introduced to resolve the additive-constant ambiguity.
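One plausible way to metrize the invented structure, in our notation rather than the paper's (the abstract does not spell the formula out): take the per-state Cramér distance, its supremum over states, and then optimize out the common translation.

```latex
% Per-state Cramér distance and its coordinate (sup-over-states) version
\ell_2(\mu,\nu)^2 = \int_{\mathbb{R}} \bigl(F_\mu(z) - F_\nu(z)\bigr)^2 \, dz,
\qquad
\bar\ell_2(\eta,\nu) = \max_{s \in \mathcal{S}} \ell_2(\eta_s, \nu_s).

% Quotient metric: optimize out the common translation \tau_c : z \mapsto z + c
d\bigl([\eta],[\nu]\bigr) = \inf_{c \in \mathbb{R}} \bar\ell_2\bigl(\eta,\, (\tau_c)_{\#}\nu\bigr).
```

Under this reading, non-expansiveness of the projected operator in the coordinate metric descends to the quotient metric because a common translation commutes with the supremum over states.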

pith-pipeline@v0.9.0 · 5513 in / 1412 out tokens · 56948 ms · 2026-05-13T02:33:01.905507+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

