Recognition: 2 theorem links
Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning
Pith reviewed 2026-05-13 02:33 UTC · model grok-4.3
The pith
Quotient formulation resolves ill-posedness in average-reward distributional RL by using translation-invariant bias laws.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme.
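To fix ideas, here is one minimal way to write down the objects this claim names; the notation (finite state set S, CDFs F, a supremum over states, and an infimum over the common shift c) is an illustrative reading, not the paper's own definitions.

```latex
% Illustrative notation only; the paper's symbols may differ.
% Identification of state-indexed bias laws up to one common translation:
\[
  (\eta_s)_{s\in S} \sim (\nu_s)_{s\in S}
  \iff \exists\, c\in\mathbb{R}:\ \nu_s(A) = \eta_s(A - c)\ \ \text{for all } s\in S \text{ and Borel } A.
\]
% A coordinate (state-wise) Cramér metric on representatives, and the induced
% quotient metric obtained by minimizing over the common shift:
\[
  \bar\ell_2(\eta,\nu) = \sup_{s\in S}\Bigl(\int_{\mathbb{R}}\bigl(F_{\eta_s}(x)-F_{\nu_s}(x)\bigr)^2\,dx\Bigr)^{1/2},
  \qquad
  d\bigl([\eta],[\nu]\bigr) = \inf_{c\in\mathbb{R}}\ \bar\ell_2\bigl(\eta(\cdot - c),\,\nu\bigr).
\]
```

On this reading, non-expansiveness of the projected operator means d([T̂η],[T̂ν]) ≤ d([η],[ν]) for laws in the quotient-categorical space.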
What carries the argument
The quotient-categorical space, which identifies state-indexed bias laws up to a common translation, together with the projected average-reward distributional operator defined on it, which is non-expansive in the coordinate Cramér metric.
If this is right
- The operator is well-defined, non-expansive, and admits fixed points, supporting iterative solution methods in the quotient space.
- Sampled one-state TD updates converge almost surely with finite-iteration residual bounds under both i.i.d. and Markovian sampling in the centered-reward setting.
- Augmenting the recursion with an online gain estimator produces a coupled scheme that remains non-expansive and converges under Markovian sampling.
- Synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast with practical fixed-grid categorical representations.
Where Pith is reading between the lines
- The symmetry-preserving parameterization could support extensions to multi-state asynchronous updates while retaining the same metric properties.
- Practical algorithms built on this space might improve uncertainty-aware policies in continuing tasks such as resource allocation where gain is the primary objective.
- Adaptive discretization schemes that explicitly preserve the translation quotient could close the gap between the ideal fixed-point analysis and deployed categorical representations.
Load-bearing premise
The idealized centered-reward setting required for almost-sure convergence of the one-state TD update, together with the assumption that the chosen categorical parameterization respects the quotient symmetry without introducing additional bias.
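To make this premise concrete, here is a minimal runnable sketch of a coupled one-state categorical TD recursion of the kind described above; the grid, step sizes, reward law, and two-neighbour projection are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch (not the paper's algorithm): one-state categorical TD with an
# online gain estimate, in an idealized i.i.d.-reward setting. Grid, step
# sizes, and reward law are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

z = np.linspace(-5.0, 5.0, 51)        # fixed categorical support grid
p = np.full(z.size, 1.0 / z.size)     # current categorical bias law
rho = 0.0                             # online gain estimate


def project_to_grid(atoms, weights, grid):
    """Project a finitely supported law onto the fixed grid by splitting each
    atom's mass between its two neighbouring grid points (the standard
    categorical projection, known to be non-expansive in the Cramér metric)."""
    out = np.zeros(grid.size)
    dz = grid[1] - grid[0]
    clipped = np.clip(atoms, grid[0], grid[-1])
    idx = np.clip(((clipped - grid[0]) / dz).astype(int), 0, grid.size - 2)
    frac = (clipped - grid[idx]) / dz
    np.add.at(out, idx, weights * (1.0 - frac))
    np.add.at(out, idx + 1, weights * frac)
    return out


for k in range(1, 5001):
    alpha = 1.0 / k                       # law step size
    beta = 1.0 / k                        # gain step size
    r = rng.normal(loc=1.0, scale=0.5)    # sampled reward; true gain is 1.0
    rho += beta * (r - rho)               # Robbins-Monro gain update
    # One-state distributional target: shift the current law by the centered
    # reward r - rho, then project back onto the grid.
    target = project_to_grid(z + (r - rho), p, z)
    p = (1.0 - alpha) * p + alpha * target  # Krasnoselskii-Mann style mixing

print("gain estimate:", round(rho, 3), "mean of bias law:", round(float(z @ p), 3))
```

With both step sizes of order 1/k, the gain estimate tracks the mean reward while the law update is a Krasnoselskii-Mann style averaging of a non-expansive map, which is the structure the convergence claims lean on.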
What would settle it
A demonstration that the projected operator fails to be non-expansive in the coordinate Cramér metric, or that the one-state TD update diverges under centered rewards, would falsify the central results.
Original abstract
Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.
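A schematic calculation, under an assumed generic form of the exact synchronous backup (not necessarily the paper's operator), shows where the gain-independence in the abstract's last sentence can come from:

```latex
% Assumed generic synchronous update: (T_g eta)_s = Law(R(s,A,S') - g + G_{S'}),  G_{S'} ~ eta_{S'}.
% Changing the gain from g to g' shifts every coordinate by the same constant,
\[
  (\mathcal{T}_g\,\eta)_s(A) \;=\; (\mathcal{T}_{g'}\,\eta)_s\bigl(A + (g - g')\bigr)
  \quad\text{for all } s\in S,
\]
% so the two images lie in one translation class: [T_g eta] = [T_{g'} eta] in the quotient.
```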
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a quotient-space formulation for average-reward distributional RL to handle translation invariance of bias laws, paired with a categorical parameterization that respects this symmetry. It defines a projected average-reward distributional operator shown to be well-defined, non-expansive in a coordinate Cramér metric, and to admit fixed points. Sampled recursions are analyzed as asynchronous relaxations of this operator; in an idealized centered-reward setting, a one-state TD update is proved to converge almost surely with finite-iteration residual bounds under i.i.d. and Markovian sampling. For unknown gain, the scheme is augmented with an online estimator, with proofs of non-expansiveness and Markovian convergence. Synchronous exact updates are shown to be gain-independent at the quotient-law level.
Significance. If the central claims hold, the work offers a principled resolution to the ill-posedness of distributional methods in average-reward RL by using quotient identifications, which is a notable technical contribution for continuing-task settings. The non-expansiveness result in the coordinate Cramér metric and the distinction between ideal quotient distributions and practical fixed-grid representations provide useful structural insights. Convergence guarantees under sampling, even if limited to the centered-reward case, strengthen the case for practical extensions of distributional RL beyond discounted settings.
major comments (2)
- [Abstract] The almost-sure convergence and finite-iteration residual bounds for the one-state TD update are established only under an idealized centered-reward setting. The manuscript must verify that this centering is compatible with the quotient-categorical construction without external enforcement; otherwise the non-expansiveness of the mean-field map does not automatically yield the stochastic-approximation bounds under Markovian sampling, and an additional uniform-integrability or Lyapunov argument is required.
- [Abstract] The categorical parameterization is asserted to respect the quotient symmetry without extra bias, yet the text supplies no analysis of discretization error on the support points. Such error could violate exact quotient invariance and re-introduce a translation-dependent drift, undermining both the non-expansiveness claim and the gain-independence result for synchronous updates.
minor comments (2)
- The abstract would be clearer if it briefly indicated the form of the projection operator and the precise definition of the coordinate Cramér metric used for non-expansiveness.
- Notation for the quotient identification and the gain estimator could be introduced earlier to improve readability of the convergence statements.
Simulated Author's Rebuttal
We thank the referee for the positive summary and constructive major comments. We address each point below and will revise the manuscript to strengthen the presentation of the idealized setting and to supply the requested analysis of discretization effects.
Point-by-point responses
- Referee: [Abstract] The almost-sure convergence and finite-iteration residual bounds for the one-state TD update are established only under an idealized centered-reward setting. The manuscript must verify that this centering is compatible with the quotient-categorical construction without external enforcement; otherwise the non-expansiveness of the mean-field map does not automatically yield the stochastic-approximation bounds under Markovian sampling, and an additional uniform-integrability or Lyapunov argument is required.
Authors: We appreciate the referee's request for explicit verification. The centered-reward assumption selects the zero-mean representative within each equivalence class of the quotient, which is compatible with the construction by definition. The non-expansiveness and convergence statements are proved under this idealization. In the revision we will add a short subsection that confirms compatibility without external enforcement and supplies a Lyapunov argument justifying the passage from mean-field non-expansiveness to the almost-sure bounds under Markovian sampling. revision: yes
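For concreteness, one way the rebuttal's centering argument can be formalized; the reference-state convention s_0 below is an assumption made here for illustration.

```latex
% Fix a reference state s_0 and center by its mean (illustrative convention):
\[
  m_{s_0}(\eta) = \int_{\mathbb{R}} x\,\eta_{s_0}(dx),
  \qquad
  \mathrm{ctr}(\eta)_s(A) = \eta_s\bigl(A + m_{s_0}(\eta)\bigr).
\]
% If nu_s(A) = eta_s(A - c) for one common c, then m_{s_0}(nu) = m_{s_0}(eta) + c,
% hence ctr(nu) = ctr(eta): centering is constant on equivalence classes and acts
% as a section of the quotient map, so it selects a representative rather than
% imposing an external constraint.
```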
- Referee: [Abstract] The categorical parameterization is asserted to respect the quotient symmetry without extra bias, yet the text supplies no analysis of discretization error on the support points. Such error could violate exact quotient invariance and re-introduce a translation-dependent drift, undermining both the non-expansiveness claim and the gain-independence result for synchronous updates.
Authors: We thank the referee for identifying this omission. The categorical representation is defined on the quotient space so that translation invariance holds exactly at the level of equivalence classes. Nevertheless, we agree that a quantitative treatment of support-point discretization error is needed to guarantee that practical finite grids do not re-introduce drift. The revised manuscript will include a new paragraph bounding the discretization error in the coordinate Cramér metric and showing that the error remains translation-invariant at the quotient level, thereby preserving the non-expansiveness and gain-independence claims up to a controllable approximation term. revision: yes
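The promised discretization analysis can also be probed numerically. The sketch below, under assumed conventions (a fixed uniform grid, mean-zero centering before projection, the standard two-neighbour categorical projection), checks that the projected representative does not depend on which shifted copy of a law one starts from.

```python
# Minimal numerical probe (not the paper's analysis): center each member of a
# translation class to its mean-zero representative, project onto a fixed grid,
# and check that the projected representative is shift-independent.
import numpy as np

grid = np.linspace(-3.0, 3.0, 31)


def center(atoms, weights):
    """Mean-zero representative of the translation equivalence class."""
    return atoms - np.dot(atoms, weights), weights


def project_to_grid(atoms, weights, grid):
    """Two-neighbour categorical projection onto the fixed grid."""
    out = np.zeros(grid.size)
    dz = grid[1] - grid[0]
    clipped = np.clip(atoms, grid[0], grid[-1])
    idx = np.clip(((clipped - grid[0]) / dz).astype(int), 0, grid.size - 2)
    frac = (clipped - grid[idx]) / dz
    np.add.at(out, idx, weights * (1.0 - frac))
    np.add.at(out, idx + 1, weights * frac)
    return out


def cramer(p, q, grid):
    """l2 (Cramér) distance between two laws supported on the same grid."""
    dz = grid[1] - grid[0]
    return float(np.sqrt(np.sum((np.cumsum(p) - np.cumsum(q)) ** 2) * dz))


atoms = np.array([-1.3, 0.2, 0.9, 2.4])
weights = np.array([0.1, 0.4, 0.3, 0.2])

projected = []
for shift in (0.0, 5.0, -7.5):  # three members of one translation class
    a, w = center(atoms + shift, weights)
    projected.append(project_to_grid(a, w, grid))

gap = max(cramer(projected[0], q, grid) for q in projected[1:])
print("Cramér gap between projected representatives:", gap)  # expected: 0.0
```

What this check does not settle, and what the revision promises to bound, is the size of the grid error itself in the coordinate Cramér metric.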
Circularity Check
No significant circularity detected; derivations rely on explicit definitions and standard operator arguments
Full rationale
The paper introduces a quotient-space formulation identifying bias laws up to translation and a categorical parameterization respecting this symmetry, then defines the projected average-reward distributional operator and establishes its well-definedness, non-expansiveness in the coordinate Cramér metric, and existence of fixed points via direct arguments. Sampled recursions are analyzed as asynchronous relaxations of this operator, with convergence results (almost-sure convergence and residual bounds for one-state TD, plus coupled gain estimation) derived under explicitly stated assumptions such as the idealized centered-reward setting. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the idealized setting is declared as an assumption rather than derived from the operator itself. The final claim on gain-independence of synchronous updates at the quotient level follows from the symmetry definition without circular reduction. The derivation chain is self-contained against external RL operator theory benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Bellman operator properties carry over to the quotient-categorical space after projection
- standard math: the coordinate Cramér metric is a valid contraction metric on the quotient space
invented entities (1)
- Quotient-categorical space: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous)
We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. ... projected average-reward distributional operator ... non-expansive in a coordinate Cramér metric
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous)
Theorem 7 (Non-expansiveness of G_g) ... Corollary 8 (Existence of fixed points)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.