pith. machine review for the scientific record.

arxiv: 2605.11289 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC

Recognition: 2 Lean theorem links

Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning


Pith reviewed 2026-05-13 02:33 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords average-reward reinforcement learning · distributional reinforcement learning · quotient space · categorical representations · temporal difference learning · Bellman operator · non-expansive operators · convergence analysis

The pith

A quotient formulation resolves the ill-posedness of average-reward distributional RL by identifying bias laws up to a common translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a quotient-space approach to make distributional methods work with average-reward reinforcement learning. In average-reward settings, the bias value function is only defined up to an additive constant, which breaks standard distributional formulations on the real line. By identifying distributions up to a common translation in a quotient space and using a compatible categorical parameterization, the authors define a projected distributional operator that is non-expansive and possesses fixed points. This setup allows them to prove convergence properties for sampled updates, including almost-sure convergence of a one-state temporal-difference method under idealized conditions. A reader should care because it provides a theoretically grounded way to handle uncertainty in long-running tasks where average reward is the natural objective.
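The quotient idea can be made concrete with a small sketch. This is our illustration, not the paper's code: we take mean-centering across states as one plausible choice of representative for each translation class; `canonical_support` and `cat_mean` are hypothetical helper names.

```python
import numpy as np

def cat_mean(support, probs):
    """Mean of a categorical law with atoms `support` and weights `probs`."""
    return float(support @ probs)

def canonical_support(support, probs_by_state):
    """Zero-mean representative of the translation class: shift the common
    support so the state-averaged mean is zero. The weights are untouched,
    so the equivalence class is preserved."""
    shift = np.mean([cat_mean(support, p) for p in probs_by_state])
    return support - shift

support = np.linspace(0.0, 10.0, 11)
probs = [
    np.array([0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0], dtype=float),  # state 1
    np.array([0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0], dtype=float),    # state 2
]

# Profiles that differ by a common translation select the same representative,
# which is exactly what makes the quotient-level operator well-posed.
same = np.allclose(canonical_support(support, probs),
                   canonical_support(support + 3.0, probs))
```

Any other common shift gives the same canonical support, so quantities defined on the representative depend only on the equivalence class.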

Core claim

We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme.

What carries the argument

The quotient-categorical space, which identifies state-indexed bias laws up to a common translation, together with the projected average-reward distributional operator defined on it that is non-expansive in the coordinate Cramér metric.
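The metric carrying the argument can be sketched in a few lines. This is our reading of the abstract, not the paper's definitions: `cramer` computes the Cramér (ℓ2-of-CDF) distance on a common grid, `coord_cramer` takes the supremum over states, and the final lines verify that averaging toward a common law (a Krasnosel'skii–Mann-style relaxation) is non-expansive because the CDF is linear in the weights.

```python
import numpy as np

def cramer(p, q, dz):
    """Cramér (ell_2) distance between two categorical laws on a common,
    evenly spaced grid: the L2 norm of the difference of the CDFs."""
    diff = np.cumsum(p) - np.cumsum(q)
    return float(np.sqrt(dz * np.sum(diff ** 2)))

def coord_cramer(profile_p, profile_q, dz):
    """Coordinate (supremum-over-states) Cramér metric on profiles."""
    return max(cramer(p, q, dz) for p, q in zip(profile_p, profile_q))

rng = np.random.default_rng(1)
dz = 0.1
p = rng.dirichlet(np.ones(50))
q = rng.dirichlet(np.ones(50))
r = rng.dirichlet(np.ones(50))

# Averaging both laws toward a common law r shrinks the distance by
# exactly lam, since cumsum is linear in the weights: non-expansive.
lam = 0.3
d_mixed = cramer(lam * p + (1 - lam) * r, lam * q + (1 - lam) * r, dz)
```

The same linearity argument lifts to `coord_cramer` coordinate-wise, which is the structural reason relaxed (Mann-type) iterations of a non-expansive operator stay non-expansive.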

If this is right

  • The operator is well-defined, non-expansive, and admits fixed points, supporting iterative solution methods in the quotient space.
  • Sampled one-state TD updates converge almost surely with finite-iteration residual bounds under both i.i.d. and Markovian sampling in the centered-reward setting.
  • Augmenting the recursion with an online gain estimator produces a coupled scheme that remains non-expansive and converges under Markovian sampling.
  • Synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast with practical fixed-grid categorical representations.
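The coupled recursion in the second and third bullets can be sketched for the simplest case the paper analyzes: one self-looping state with i.i.d. rewards. This is a toy reconstruction under our own assumptions, not the authors' algorithm: `shift_project` is a standard categorical projection, the step sizes and reward model are invented, and the sketch illustrates the update mechanics rather than the convergence proof (the mean of the law random-walks, consistent with a non-expansive rather than contractive operator).

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.linspace(-10.0, 10.0, 201)

def shift_project(probs, c, grid):
    """Shift every atom by c, then project back onto the fixed grid by
    splitting each atom's mass linearly between its two neighbours
    (the standard categorical projection)."""
    out = np.zeros_like(probs)
    dz = grid[1] - grid[0]
    for w, z in zip(probs, grid):
        x = float(np.clip(z + c, grid[0], grid[-1]))
        j = min(int((x - grid[0]) / dz), len(grid) - 2)
        t = (x - grid[j]) / dz
        out[j] += w * (1 - t)
        out[j + 1] += w * t
    return out

# One self-looping state; rewards have unknown gain g* = 1.0.
g = 0.0
probs = np.full(len(support), 1.0 / len(support))
alpha, beta = 0.05, 0.05
for _ in range(2000):
    r = rng.normal(1.0, 0.5)                       # sampled reward
    g += beta * (r - g)                            # online gain estimator
    target = shift_project(probs, r - g, support)  # centered distributional target
    probs = (1 - alpha) * probs + alpha * target   # TD mixing step
```

After the run, the gain estimate sits near the true gain and the iterate remains a probability vector; the interesting theoretical content, which this sketch does not reproduce, is the almost-sure convergence of the coupled pair under Markovian sampling.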

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The symmetry-preserving parameterization could support extensions to multi-state asynchronous updates while retaining the same metric properties.
  • Practical algorithms built on this space might improve uncertainty-aware policies in continuing tasks such as resource allocation where gain is the primary objective.
  • Adaptive discretization schemes that explicitly preserve the translation quotient could close the gap between the ideal fixed-point analysis and deployed categorical representations.

Load-bearing premise

The idealized centered-reward setting required for almost-sure convergence of the one-state TD update, together with the assumption that the chosen categorical parameterization respects the quotient symmetry without introducing additional bias.

What would settle it

A demonstration that the projected operator fails to be non-expansive in the coordinate Cramér metric, or that the one-state TD update diverges under centered rewards, would falsify the central results.

Figures

Figures reproduced from arXiv: 2605.11289 by Abolfazl Hashemi, Aliasghar Pourghani, Ege C. Kaya, Vijay Gupta.

Figure 1: The left panel illustrates Lemma 4: translating both the law and support preserves the coefficient vector p. The right panel illustrates that projecting a translated law back onto the same support generally gives different coefficients q ≠ p. The application of Lemma 4 shows that projection onto Θc′ or onto Θc produces categorical families with the same coefficient vector q. Therefore, Gg is a well-defin…
Figure 2: (a) Centered exact and stochastic recursions validate the non-expansiveness and convergence predictions of Theorems 7, 13, and 14. (b) Both the product residual and gain error decrease, validating Theorem 19. (c) The wrong-centering ablation shows that using g = 0 as the centering targets G0 rather than the correctly centered operator G. The rate guide curves plot the theorem rates without their exact mult…
Figure 3: Categorical bias laws learned in the five-state MRP.
Figure 4: Continuous-state Pendulum-v1 evaluation. (a) We evaluate the MC-trained critic, the online estimated gain critic, and the naive raw-reward critic against the correctly centered projected operator G. Solid curves show the empirical supremum residual over validation states, and dotted curves show the residual averaged over states. (b) For the online-gain critic, we report the product residual dλ together wit…
Original abstract

Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a quotient-space formulation for average-reward distributional RL to handle translation invariance of bias laws, paired with a categorical parameterization that respects this symmetry. It defines a projected average-reward distributional operator shown to be well-defined, non-expansive in a coordinate Cramér metric, and to admit fixed points. Sampled recursions are analyzed as asynchronous relaxations of this operator; in an idealized centered-reward setting, a one-state TD update is proved to converge almost surely with finite-iteration residual bounds under i.i.d. and Markovian sampling. For unknown gain, the scheme is augmented with an online estimator, with proofs of non-expansiveness and Markovian convergence. Synchronous exact updates are shown to be gain-independent at the quotient-law level.

Significance. If the central claims hold, the work offers a principled resolution to the ill-posedness of distributional methods in average-reward RL by using quotient identifications, which is a notable technical contribution for continuing-task settings. The non-expansiveness result in the coordinate Cramér metric and the distinction between ideal quotient distributions and practical fixed-grid representations provide useful structural insights. Convergence guarantees under sampling, even if limited to the centered-reward case, strengthen the case for practical extensions of distributional RL beyond discounted settings.

major comments (2)
  1. [Abstract] Abstract: the almost-sure convergence and finite-iteration residual bounds for the one-state TD update are established only under an idealized centered-reward setting. The manuscript must verify that this centering is compatible with the quotient-categorical construction without external enforcement; otherwise the non-expansiveness of the mean-field map does not automatically yield the stochastic-approximation bounds under Markovian sampling, and an additional uniform-integrability or Lyapunov argument is required.
  2. [Abstract] Abstract: the categorical parameterization is asserted to respect the quotient symmetry without extra bias, yet the text supplies no analysis of discretization error on the support points. Such error could violate exact quotient invariance and re-introduce a translation-dependent drift, undermining both the non-expansiveness claim and the gain-independence result for synchronous updates.
minor comments (2)
  1. The abstract would be clearer if it briefly indicated the form of the projection operator and the precise definition of the coordinate Cramér metric used for non-expansiveness.
  2. Notation for the quotient identification and the gain estimator could be introduced earlier to improve readability of the convergence statements.
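The second major comment, that fixed-grid discretization can break exact translation invariance, can be seen in a two-line experiment echoing the paper's Figure 1. This is our toy reconstruction under invented names (`project` is a standard categorical projection, not the authors' code):

```python
import numpy as np

def project(atoms, weights, grid):
    """Categorical projection: split each atom's mass linearly between
    its two neighbouring grid points."""
    out = np.zeros(len(grid))
    dz = grid[1] - grid[0]
    for w, x in zip(weights, atoms):
        x = float(np.clip(x, grid[0], grid[-1]))
        j = min(int((x - grid[0]) / dz), len(grid) - 2)
        t = (x - grid[j]) / dz
        out[j] += w * (1 - t)
        out[j + 1] += w * t
    return out

grid = np.linspace(0.0, 4.0, 5)              # spacing 1.0
p = np.array([0.0, 1.0, 0.0, 0.0, 0.0])     # point mass at z = 1

# Translating the law AND the grid together keeps the coefficient vector p
# intact, but projecting the translated law back onto the ORIGINAL grid
# smears the mass across neighbours: q != p.
q = project(grid + 0.5, p, grid)             # the atom now sits at 1.5
```

Here `q` splits the unit mass evenly between the grid points at 1 and 2, so a half-cell translation changes the coefficients even though the underlying quotient class is unchanged, which is precisely the drift the referee asks the authors to bound.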

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive major comments. We address each point below and will revise the manuscript to strengthen the presentation of the idealized setting and to supply the requested analysis of discretization effects.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the almost-sure convergence and finite-iteration residual bounds for the one-state TD update are established only under an idealized centered-reward setting. The manuscript must verify that this centering is compatible with the quotient-categorical construction without external enforcement; otherwise the non-expansiveness of the mean-field map does not automatically yield the stochastic-approximation bounds under Markovian sampling, and an additional uniform-integrability or Lyapunov argument is required.

    Authors: We appreciate the referee's request for explicit verification. The centered-reward assumption selects the zero-mean representative within each equivalence class of the quotient, which is compatible with the construction by definition. The non-expansiveness and convergence statements are proved under this idealization. In the revision we will add a short subsection that confirms compatibility without external enforcement and supplies a Lyapunov argument justifying the passage from mean-field non-expansiveness to the almost-sure bounds under Markovian sampling. revision: yes

  2. Referee: [Abstract] Abstract: the categorical parameterization is asserted to respect the quotient symmetry without extra bias, yet the text supplies no analysis of discretization error on the support points. Such error could violate exact quotient invariance and re-introduce a translation-dependent drift, undermining both the non-expansiveness claim and the gain-independence result for synchronous updates.

    Authors: We thank the referee for identifying this omission. The categorical representation is defined on the quotient space so that translation invariance holds exactly at the level of equivalence classes. Nevertheless, we agree that a quantitative treatment of support-point discretization error is needed to guarantee that practical finite grids do not re-introduce drift. The revised manuscript will include a new paragraph bounding the discretization error in the coordinate Cramér metric and showing that the error remains translation-invariant at the quotient level, thereby preserving the non-expansiveness and gain-independence claims up to a controllable approximation term. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivations rely on explicit definitions and standard operator arguments.

full rationale

The paper introduces a quotient-space formulation identifying bias laws up to translation and a categorical parameterization respecting this symmetry, then defines the projected average-reward distributional operator and establishes its well-definedness, non-expansiveness in the coordinate Cramér metric, and existence of fixed points via direct arguments. Sampled recursions are analyzed as asynchronous relaxations of this operator, with convergence results (almost-sure convergence and residual bounds for one-state TD, plus coupled gain estimation) derived under explicitly stated assumptions such as the idealized centered-reward setting. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the idealized setting is declared as an assumption rather than derived from the operator itself. The final claim on gain-independence of synchronous updates at the quotient level follows from the symmetry definition without circular reduction. The derivation chain is self-contained against external RL operator theory benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract, the central claims rest on the existence of a well-behaved projected operator in the quotient space and on standard convergence arguments for asynchronous relaxations; no explicit free parameters are named.

axioms (2)
  • domain assumption Bellman operator properties carry over to the quotient-categorical space after projection
    Invoked when defining the projected average-reward distributional operator.
  • standard math The coordinate Cramér metric is a valid contraction metric on the quotient space
    Used to establish non-expansiveness.
invented entities (1)
  • Quotient-categorical space no independent evidence
    purpose: Identify state-indexed bias laws up to common translation so that distributional representations become well-defined
    New mathematical structure introduced to resolve the additive-constant ambiguity.
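One plausible way to metrize the invented structure, in our notation rather than the paper's (the abstract does not spell the formula out): take the per-state Cramér distance, its supremum over states, and then optimize out the common translation.

```latex
% Per-state Cramér distance and its coordinate (sup-over-states) version
\ell_2(\mu,\nu)^2 = \int_{\mathbb{R}} \bigl(F_\mu(z) - F_\nu(z)\bigr)^2 \, dz,
\qquad
\bar\ell_2(\eta,\nu) = \max_{s \in \mathcal{S}} \ell_2(\eta_s, \nu_s).

% Quotient metric: optimize out the common translation \tau_c : z \mapsto z + c
d\bigl([\eta],[\nu]\bigr) = \inf_{c \in \mathbb{R}} \bar\ell_2\bigl(\eta,\, (\tau_c)_{\#}\nu\bigr).
```

Under this reading, non-expansiveness of the projected operator in the coordinate metric descends to the quotient metric because a common translation commutes with the supremum over states.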

pith-pipeline@v0.9.0 · 5513 in / 1412 out tokens · 56948 ms · 2026-05-13T02:33:01.905507+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

