A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs
Pith reviewed 2026-05-08 16:51 UTC · model grok-4.3
The pith
A modified harmonic mean operator correctly computes average reward rates in non-stationary semi-Markov decision processes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In infinite-horizon SMDPs, the average reward rate under non-stationary reward and duration distributions is given by a modified harmonic mean operator applied to the rates rather than the ratio of cumulative reward to cumulative duration. The operator correctly computes the limiting rate even as distributions change over time. Theoretical properties of the operator are established, model-free algorithms are derived from it, and empirical comparisons show improved performance against standard ratio-based methods in non-stationary settings.
What carries the argument
The modified harmonic mean operator, which computes the correct long-run reward rate by combining time-varying instantaneous rates without requiring stationarity.
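This page does not reproduce the operator's definition, so the following is only a sketch of a textbook identity linking the two quantities named above, not the paper's modified operator. Assuming strictly positive rewards and durations (the sample values and function names are hypothetical), the cumulative ratio equals the reward-weighted harmonic mean of the per-transition rates, while the unweighted means of those rates generally differ from it.

```python
from statistics import fmean, harmonic_mean

# Hypothetical (reward, duration) samples for four SMDP transitions.
rewards = [2.0, 1.0, 6.0, 3.0]
durations = [1.0, 4.0, 2.0, 3.0]
rates = [r / t for r, t in zip(rewards, durations)]  # per-transition rates


def cumulative_ratio(rewards, durations):
    """Total reward divided by total duration."""
    return sum(rewards) / sum(durations)


def weighted_harmonic_mean(values, weights):
    """Harmonic mean of `values` with the given positive weights."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


ratio = cumulative_ratio(rewards, durations)
reward_weighted_hm = weighted_harmonic_mean(rates, rewards)

print("cumulative ratio:            ", ratio)
print("reward-weighted harmonic mean:", reward_weighted_hm)
print("unweighted arithmetic mean:   ", fmean(rates))
print("unweighted harmonic mean:     ", harmonic_mean(rates))
```

The first two printed values agree because weighting each rate by its reward turns the denominator back into the sum of durations; the two unweighted means land on either side of that value for this sample.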
If this is right
- Model-free learning algorithms become available for SMDPs that remain correct under non-stationary conditions.
- The algorithms optimize average reward rate without needing to detect or compensate for distribution shifts.
- Proven properties of the operator guarantee that the computed rate converges to the correct limiting value.
- Empirical tests demonstrate higher performance than ratio-based baselines when non-stationarity is present.
Where Pith is reading between the lines
- Similar rate operators could be adapted for other continuing RL settings that optimize per-unit-time objectives.
- The formulation may extend naturally to continuous-time Markov processes where durations are implicit.
- Practical deployment in drifting environments such as adaptive control or online resource allocation would provide direct tests of robustness.
Load-bearing premise
That the ratio of cumulative reward to cumulative duration misrepresents the true long-run average rate when rewards and durations are non-stationary, while the modified harmonic mean computes it correctly without introducing bias.
What would settle it
Construct a non-stationary SMDP with explicitly known time-varying reward and duration distributions, compute the analytically true limiting average rate, and verify whether the modified harmonic mean operator matches that rate while the cumulative ratio does not.
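As a minimal sketch of the baseline half of such a check, assuming a deliberately simple two-phase process with deterministic rewards and durations (all names and numbers are hypothetical), the following tracks the cumulative ratio next to the known per-phase rates. It does not implement the paper's operator, whose definition is not given on this page; it only computes the quantity a candidate operator would be compared against.

```python
def two_phase_stream(n1, n2):
    """Yield (reward, duration) pairs: phase 1 has rate 1.0, phase 2 has rate 0.25."""
    for _ in range(n1):
        yield 1.0, 1.0
    for _ in range(n2):
        yield 1.0, 4.0


def running_cumulative_ratio(stream):
    """Return total reward / total duration after each transition."""
    total_r = total_t = 0.0
    out = []
    for r, t in stream:
        total_r += r
        total_t += t
        out.append(total_r / total_t)
    return out


ratios = running_cumulative_ratio(two_phase_stream(1000, 1000))
end_phase1 = ratios[999]   # last transition of phase 1
end_phase2 = ratios[-1]    # last transition of phase 2
print(end_phase1, end_phase2)
```

After the switch, the cumulative ratio sits between the two phase rates and moves toward the phase-2 rate only as phase 2 comes to dominate the total duration. Whether that behaviour is the failure the paper describes depends on how the paper defines the target rate under non-stationarity.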
Original abstract
Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.
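To make the ratio-based approach mentioned in the abstract concrete, here is a minimal sketch, assuming a hypothetical one-state SMDP with two actions and deterministic rewards and durations. It is loosely in the style of tabular average-reward SMDP Q-learning with a cumulative-ratio rate estimate; it does not reproduce any specific published algorithm, and it is not the paper's method.

```python
import random

# Hypothetical one-state SMDP: each action maps to a fixed (reward, duration).
ACTIONS = {
    0: (1.0, 1.0),  # reward rate 1.0
    1: (3.0, 2.0),  # reward rate 1.5 (the better action)
}


def train(steps=5000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [0.0, 0.0]            # action values for the single state
    total_r = total_t = 0.0   # accumulated over greedy steps only
    rho = 0.0                 # rate estimate: total_r / total_t
    for _ in range(steps):
        greedy = max(range(2), key=lambda a: q[a])
        # Epsilon-greedy action selection.
        a = rng.randrange(2) if rng.random() < epsilon else greedy
        r, tau = ACTIONS[a]
        # Target: reward minus the duration priced at the current rate,
        # plus the best value of the (single) next state.
        q[a] += alpha * (r - rho * tau + max(q) - q[a])
        if a == greedy:
            total_r += r
            total_t += tau
            rho = total_r / total_t  # ratio of cumulative reward to duration
    return q, rho


q, rho = train()
print("estimated rate:", rho, "action values:", q)
```

In this stationary toy problem the rate estimate approaches 1.5, the rate of the better action. The paper's concern is what happens to this kind of estimate when the reward and duration distributions drift, which the sketch does not model.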
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a modified harmonic mean operator for average-reward reinforcement learning in infinite-horizon SMDPs. It asserts that the standard ratio of cumulative reward to cumulative duration becomes incorrect under non-stationary reward and duration distributions, and that the new operator correctly computes reward rates, yielding robust model-free algorithms. Theoretical properties are proven and empirical comparisons to existing methods are presented.
Significance. If the modified operator is shown to correctly recover the long-run average reward rate without introducing new biases or assumptions, the work would offer a useful alternative formulation for continuing-task SMDPs where non-stationarity is present, potentially improving robustness of model-free average-reward methods.
major comments (2)
- [Introduction] Introduction and abstract: The central assertion that the ratio of cumulative reward to cumulative duration 'can be incorrect' under non-stationarity requires a concrete counter-example or derivation. The long-run average reward is defined as lim (T→∞) [total reward up to T / total duration up to T] when the limit exists; the manuscript must specify the precise failure mode (e.g., non-existence of the limit, issue only in the Bellman operator, or only in online updates) to justify replacing it with a modified harmonic mean.
- [§3] Section on the modified harmonic mean operator (likely §3): The derivation must explicitly demonstrate (a) reduction to the standard ratio when reward/duration distributions are stationary, and (b) that the operator remains unbiased for the true long-run rate under the stated non-stationarity without additional assumptions. Include the fixed-point characterization or contraction mapping argument to confirm it solves the average-reward optimality equation.
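For reference, the definition that the first comment appeals to can be written out as follows. The symbols $R_i$, $\tau_i$, and $N(T)$ are notation assumed here for illustration and are not taken from the paper; the second line is the standard renewal-reward result for the stationary (i.i.d.) case, which is the case any replacement operator would be expected to recover.

```latex
% R_i: reward of the i-th transition; \tau_i > 0: its duration;
% N(T): number of transitions completed by time T.
\rho \;=\; \lim_{T \to \infty}
  \frac{\sum_{i=1}^{N(T)} R_i}{\sum_{i=1}^{N(T)} \tau_i}

% If the pairs (R_i, \tau_i) are i.i.d. with finite means and
% \mathbb{E}[\tau] > 0, then with probability one
\rho \;=\; \frac{\mathbb{E}[R]}{\mathbb{E}[\tau]}
```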
minor comments (2)
- Clarify the precise definition of the 'modified' harmonic mean (e.g., how the modification is parameterized or derived from first principles) upon first introduction, and contrast it directly with the arithmetic mean used in prior ratio-based methods.
- [Experiments] In the experimental section, specify the exact mechanism used to induce non-stationarity in the reward and duration distributions so that the robustness claim can be independently verified.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We will revise the manuscript to provide the requested clarifications, counter-example, and explicit derivations while preserving the core contributions.
- Referee: [Introduction] Introduction and abstract: The central assertion that the ratio of cumulative reward to cumulative duration 'can be incorrect' under non-stationarity requires a concrete counter-example or derivation. The long-run average reward is defined as lim (T→∞) [total reward up to T / total duration up to T] when the limit exists; the manuscript must specify the precise failure mode (e.g., non-existence of the limit, issue only in the Bellman operator, or only in online updates) to justify replacing it with a modified harmonic mean.
  Authors: We agree that a concrete counter-example and precise failure-mode specification would strengthen the introduction. In the revision we will add a simple non-stationary SMDP example (with time-varying reward and duration distributions) showing that the standard cumulative ratio, when used inside the Bellman operator, produces value estimates that deviate from the true long-run rate even though the limit exists. The failure mode is specific to the operator and online updates: the ratio does not yield the correct fixed point of the average-reward optimality equation under non-stationarity. We will also clarify this distinction in the abstract.
  revision: yes
- Referee: [§3] Section on the modified harmonic mean operator (likely §3): The derivation must explicitly demonstrate (a) reduction to the standard ratio when reward/duration distributions are stationary, and (b) that the operator remains unbiased for the true long-run rate under the stated non-stationarity without additional assumptions. Include the fixed-point characterization or contraction mapping argument to confirm it solves the average-reward optimality equation.
  Authors: We will expand Section 3 with an explicit lemma showing that the modified harmonic mean reduces exactly to the standard ratio under stationary distributions. We will also add a proof that the operator is unbiased for the long-run average reward rate under the paper’s non-stationarity model without introducing extra assumptions. Finally, we will include the fixed-point characterization of the operator and a contraction-mapping argument establishing that it solves the average-reward optimality equation.
  revision: yes
Circularity Check
No circularity: derivation of modified harmonic mean operator is self-contained from first principles
The paper motivates the issue with the standard reward/duration ratio under non-stationarity, then introduces and proves properties of a modified harmonic mean operator for SMDPs. No equations or steps reduce the claimed result to a fitted input, self-definition, or load-bearing self-citation chain. The operator is presented as derived independently, with theoretical proofs and empirical comparisons that do not rely on renaming or smuggling prior ansatzes. The central formulation stands on its own without the forbidden patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the existing ratio-of-rewards-to-durations approach is incorrect when rewards and durations are non-stationary in infinite-horizon SMDPs.
invented entities (1)
- modified harmonic mean operator (no independent evidence)