A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

Alicia Vidler; Erel Shtossel; Gal A. Kaminka; Uri Shaham

arxiv: 2605.04880 · v1 · pith:4MRX7NOBnew · submitted 2026-05-06 · 💻 cs.LG · cs.AI

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

Erel Shtossel , Alicia Vidler , Uri Shaham , Gal A. Kaminka This is my paper

Pith reviewed 2026-05-08 16:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords average reward reinforcement learningsemi-Markov decision processesharmonic mean operatornon-stationary environmentsmodel-free learninginfinite horizon taskscontinuing tasks

0 comments

The pith

A modified harmonic mean operator correctly computes average reward rates in non-stationary semi-Markov decision processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to correct how average reward per unit time is computed in continuing tasks modeled as semi-Markov decision processes. When rewards and action durations follow changing distributions over an infinite horizon, dividing cumulative reward by cumulative time produces an incorrect long-run rate. The authors replace this with a modified harmonic mean operator that combines instantaneous rates properly under non-stationarity. If the operator works as claimed, it supports model-free learning algorithms that optimize the true average reward rate without assuming fixed statistics. This matters for real continuing tasks such as robotics or scheduling where conditions drift.

Core claim

In infinite-horizon SMDPs, the average reward rate under non-stationary reward and duration distributions is given by a modified harmonic mean operator applied to the rates rather than the ratio of cumulative reward to cumulative duration. The operator correctly computes the limiting rate even as distributions change over time. Theoretical properties of the operator are established, model-free algorithms are derived from it, and empirical comparisons show improved performance against standard ratio-based methods in non-stationary settings.

What carries the argument

The modified harmonic mean operator, which computes the correct long-run reward rate by combining time-varying instantaneous rates without requiring stationarity.

If this is right

Model-free learning algorithms become available for SMDPs that remain correct under non-stationary conditions.
The algorithms optimize average reward rate without needing to detect or compensate for distribution shifts.
Proven properties of the operator guarantee that the computed rate converges to the correct limiting value.
Empirical tests demonstrate higher performance than ratio-based baselines when non-stationarity is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar rate operators could be adapted for other continuing RL settings that optimize per-unit-time objectives.
The formulation may extend naturally to continuous-time Markov processes where durations are implicit.
Practical deployment in drifting environments such as adaptive control or online resource allocation would provide direct tests of robustness.

Load-bearing premise

That the ratio of cumulative reward to cumulative duration misrepresents the true long-run average rate when rewards and durations are non-stationary, while the modified harmonic mean computes it correctly without introducing bias.

What would settle it

Construct a non-stationary SMDP with explicitly known time-varying reward and duration distributions, compute the analytically true limiting average rate, and verify whether the modified harmonic mean operator matches that rate while the cumulative ratio does not.

Figures

Figures reproduced from arXiv: 2605.04880 by Alicia Vidler, Erel Shtossel, Gal A. Kaminka, Uri Shaham.

**Figure 1.** Figure 1: Two-state SMDP with stochastic reward and sojourn view at source ↗

**Figure 2.** Figure 2: shows the first thousand steps of each of the reward (𝑆𝑖𝑛𝐿𝑜𝑔𝐷) and duration (𝐶𝑜𝑠𝐿𝑜𝑔𝐷) functions and their ratio: SinLogD(𝑡,10,0.001) CosLogD) 𝑡,10,0.0005) . The figure illustrates how quickly the non-stationary rewards and durations expand, while their nonstationary ratio expands much more slowly view at source ↗

**Figure 3.** Figure 3: shows a specific setting, where action B’s rewards and durations are defined as in view at source ↗

**Figure 5.** Figure 5: Comparing the success rate of the different algo view at source ↗

**Figure 4.** Figure 4: shows that smaller values lead to more difficult problems. log_scale values greater than 0.003 are “easier” because Action B dominates within the 1,000 steps of the learning. Smaller values make the settings harder (moving the crossover point further and further away). (a) SinLogD(𝑡, 10, 0.003) / CosLogD(𝑡, 10, 0.0015) (b) SinLogD(𝑡, 10, 0.0005) / CosLogD(𝑡, 10, 0.00025) view at source ↗

**Figure 6.** Figure 6: Win-Ratio of harmonic for all 21 segments of the view at source ↗

**Figure 7.** Figure 7: Win-Ratio of harmonic for all 21 segments of the view at source ↗

read the original abstract

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's claim that the standard reward-duration ratio fails under non-stationarity lacks justification, so the harmonic mean operator's advantage is unclear.

read the letter

The paper introduces a modified harmonic mean operator for average-reward RL in SMDPs, claiming it correctly handles non-stationary reward and duration distributions where the usual ratio of cumulative reward to duration becomes incorrect. This leads to model-free algorithms for continuing tasks. The operator itself is presented as new relative to prior ratio-based methods in the average-reward literature, and the abstract asserts proofs of its properties plus empirical comparisons showing efficacy over existing approaches. That formulation is the main concrete addition here. It targets a real setting—SMDPs with variable action durations in infinite horizons—and tries to build robustness into the learning rule without extra assumptions on stationarity. The work engages directly with the mechanics of average reward rates rather than defaulting to discounted formulations. The central soft spot is the motivation. The long-run average rate is defined as the limit of total reward over total duration whenever that limit exists. The abstract gives no derivation, counter-example, or precise failure mode showing where this ratio stops representing the intended objective under non-stationarity—whether in the Bellman operator, in online updates, or only when the limit fails to exist. Without that anchor, the switch to the harmonic mean reads as an alternative rather than a necessary correction, and any claim that it avoids new biases needs the full derivations to evaluate. The proofs and experiments are asserted but not inspectable from the abstract alone. This paper is for specialists already working on undiscounted RL and SMDP algorithms. A reader in that niche could extract the operator idea for further study, but only if the full text supplies the missing justification for why the ratio breaks. It deserves peer review so the derivations and experiments can be checked directly by people in the subfield.

Referee Report

2 major / 2 minor

Summary. The paper proposes a modified harmonic mean operator for average-reward reinforcement learning in infinite-horizon SMDPs. It asserts that the standard ratio of cumulative reward to cumulative duration becomes incorrect under non-stationary reward and duration distributions, and that the new operator correctly computes reward rates, yielding robust model-free algorithms. Theoretical properties are proven and empirical comparisons to existing methods are presented.

Significance. If the modified operator is shown to correctly recover the long-run average reward rate without introducing new biases or assumptions, the work would offer a useful alternative formulation for continuing-task SMDPs where non-stationarity is present, potentially improving robustness of model-free average-reward methods.

major comments (2)

[Introduction] Introduction and abstract: The central assertion that the ratio of cumulative reward to cumulative duration 'can be incorrect' under non-stationarity requires a concrete counter-example or derivation. The long-run average reward is defined as lim (T→∞) [total reward up to T / total duration up to T] when the limit exists; the manuscript must specify the precise failure mode (e.g., non-existence of the limit, issue only in the Bellman operator, or only in online updates) to justify replacing it with a modified harmonic mean.
[§3] Section on the modified harmonic mean operator (likely §3): The derivation must explicitly demonstrate (a) reduction to the standard ratio when reward/duration distributions are stationary, and (b) that the operator remains unbiased for the true long-run rate under the stated non-stationarity without additional assumptions. Include the fixed-point characterization or contraction mapping argument to confirm it solves the average-reward optimality equation.

minor comments (2)

Clarify the precise definition of the 'modified' harmonic mean (e.g., how the modification is parameterized or derived from first principles) upon first introduction, and contrast it directly with the arithmetic mean used in prior ratio-based methods.
[Experiments] In the experimental section, specify the exact mechanism used to induce non-stationarity in the reward and duration distributions so that the robustness claim can be independently verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We will revise the manuscript to provide the requested clarifications, counter-example, and explicit derivations while preserving the core contributions.

read point-by-point responses

Referee: [Introduction] Introduction and abstract: The central assertion that the ratio of cumulative reward to cumulative duration 'can be incorrect' under non-stationarity requires a concrete counter-example or derivation. The long-run average reward is defined as lim (T→∞) [total reward up to T / total duration up to T] when the limit exists; the manuscript must specify the precise failure mode (e.g., non-existence of the limit, issue only in the Bellman operator, or only in online updates) to justify replacing it with a modified harmonic mean.

Authors: We agree that a concrete counter-example and precise failure-mode specification would strengthen the introduction. In the revision we will add a simple non-stationary SMDP example (with time-varying reward and duration distributions) showing that the standard cumulative ratio, when used inside the Bellman operator, produces value estimates that deviate from the true long-run rate even though the limit exists. The failure mode is specific to the operator and online updates: the ratio does not yield the correct fixed point of the average-reward optimality equation under non-stationarity. We will also clarify this distinction in the abstract. revision: yes
Referee: [§3] Section on the modified harmonic mean operator (likely §3): The derivation must explicitly demonstrate (a) reduction to the standard ratio when reward/duration distributions are stationary, and (b) that the operator remains unbiased for the true long-run rate under the stated non-stationarity without additional assumptions. Include the fixed-point characterization or contraction mapping argument to confirm it solves the average-reward optimality equation.

Authors: We will expand Section 3 with an explicit lemma showing that the modified harmonic mean reduces exactly to the standard ratio under stationary distributions. We will also add a proof that the operator is unbiased for the long-run average reward rate under the paper’s non-stationarity model without introducing extra assumptions. Finally, we will include the fixed-point characterization of the operator and a contraction-mapping argument establishing that it solves the average-reward optimality equation. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation of modified harmonic mean operator is self-contained from first principles

full rationale

The paper motivates the issue with the standard reward/duration ratio under non-stationarity, then introduces and proves properties of a modified harmonic mean operator for SMDPs. No equations or steps reduce the claimed result to a fitted input, self-definition, or load-bearing self-citation chain. The operator is presented as derived independently, with theoretical proofs and empirical comparisons that do not rely on renaming or smuggling prior ansatzes. The central formulation stands on its own without the forbidden patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger populated from abstract claims only; full paper would likely add more entries on operator derivation and convergence assumptions.

axioms (1)

domain assumption Existing ratio-of-rewards-to-durations approach is incorrect when rewards and durations are non-stationary in infinite-horizon SMDPs
Stated directly in the abstract as the motivation for the new operator.

invented entities (1)

modified harmonic mean operator no independent evidence
purpose: Correct computation of reward rates under non-stationary conditions in SMDPs
New operator introduced by the paper; no independent evidence or external validation provided in abstract.

pith-pipeline@v0.9.0 · 5456 in / 1194 out tokens · 24603 ms · 2026-05-08T16:51:13.556790+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

János Aczél. 1948. On mean values.Bull. Amer. Math. Soc.54, 4 (1948), 392–400

work page 1948
[2]

2007.Aggregation Functions: A Guide for Practitioners

Gleb Beliakov, Andrea Pradera, and Tomasa Calvo. 2007.Aggregation Functions: A Guide for Practitioners. Springer, Berlin

work page 2007
[3]

2013.The Problem of HFT: Collected Writings on High Frequency Trading and Stock Market Structure Reform

Haim Bodek. 2013.The Problem of HFT: Collected Writings on High Frequency Trading and Stock Market Structure Reform. Decimus Capital Markets, LLC, CT, USA

work page 2013
[4]

Peter S. Bullen. 2003.Handbook of Means and Their Inequalities. Kluwer Academic Publishers, Dordrecht

work page 2003
[5]

Das, Abhijit Gosavi, Sridhar Mahadevan, and Nicholas Marchalleck

Tapas K. Das, Abhijit Gosavi, Sridhar Mahadevan, and Nicholas Marchalleck. 1999. Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning.Management Science45, 4 (1999), 560–574. https://doi.org/10.1287/ mnsc.45.4.560 arXiv:https://doi.org/10.1287/mnsc.45.4.560

work page doi:10.1287/mnsc.45.4.560 1999
[6]

Nassim Dehouche. 2021. Scale matters: The daily, weekly and monthly volatility and predictability of Bitcoin, Gold, and the S& P 500. arXiv:2103.00395 [q-fin.ST] https://arxiv.org/abs/2103.00395

work page arXiv 2021
[7]

Vektor Dewanto, George Dunn, Ali Eshragh, Marcus Gallagher, and Fred Roosta

work page
[8]

arXiv:2010.08920 https: //arxiv.org/abs/2010.08920

Average-reward model-free reinforcement learning: a systematic review and literature mapping.CoRRabs/2010.08920 (2020). arXiv:2010.08920 https: //arxiv.org/abs/2010.08920

work page arXiv 2010
[9]

Stanisław Drożdż, Ludovico Minati, Paweł Oświęcimka, Michał Stanuszek, and Marcin Watorek. 2018. Bitcoin Market Route to Maturity? Evidence from Re- turn Fluctuations, Temporal Correlations and Multiscaling Effects.Chaos: An Interdisciplinary Journal of Nonlinear Science28, 7 (2018), 071101

work page 2018
[10]

Abhijit Gosavi. 2004. Reinforcement learning for long-run average cost.European Journal of Operational Research155 (06 2004), 654–674. https://doi.org/10.1016/ S0377-2217(02)00874-3

work page 2004
[11]

Gray and Andrew Vogt

John E. Gray and Andrew Vogt. 2013. The Mean: Axiomatics, Generalizations, Applications. arXiv:1210.3908 [math.PR] https://arxiv.org/abs/1210.3908

work page arXiv 2013
[12]

Kolmogorov

Andrei N. Kolmogorov. 1930. Sur la notion de la moyenne.Atti della Accademia Nazionale dei Lincei. Rendiconti, Classe di Scienze Fisiche, Matematiche e Naturali 12 (1930), 388–391

work page 1930
[13]

Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, and Qianchuan Zhao

work page
[14]

arXiv:2106.03442 [cs.LG] https://arxiv.org/abs/2106.03442

Average-Reward Reinforcement Learning with Trust Region Methods. arXiv:2106.03442 [cs.LG] https://arxiv.org/abs/2106.03442

work page arXiv
[15]

Sridhar Mahadevan. 1996. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results.Machine Learning22, 1–3 (1996), 159–195. https://doi.org/10.1007/BF00114727

work page doi:10.1007/bf00114727 1996
[16]

Mitio Nagumo. 1930. Über eine Klasse der Mittelwerte.Japanese Journal of Mathematics7 (1930), 71–79

work page 1930
[17]

Puterman

Martin L. Puterman. 1994.Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York

work page 1994
[18]

Anton Schwartz. 1993. A reinforcement learning method for maximizing undis- counted rewards. InProceedings of the Tenth International Conference on Interna- tional Conference on Machine Learning(Amherst, MA, USA)(ICML’93). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 298–305

work page 1993
[19]

Rishi Shah, Yuqian Jiang, Justin Hart, and Peter Stone. 2020. Deep R-Learning for Continual Area Sweeping. arXiv:2006.00589 [cs.LG] https://arxiv.org/abs/ 2006.00589

work page arXiv 2020
[20]

Singpurwalla and Boya Lai

Nozer D. Singpurwalla and Boya Lai. 2020. What Does the "Mean" Really Mean? arXiv:2003.01973 [stat.OT] https://arxiv.org/abs/2003.01973

work page arXiv 2020
[21]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. 2018.Reinforcement Learning: An Intro- duction(2 ed.). MIT Press, http://incompleteideas.net/book/the-book-2nd.html. http://incompleteideas.net/book/the-book-2nd.html

work page 2018
[22]

1994.H-Learning: A Reinforcement Learning Method for Optimizing Undiscounted A verage Reward

Prasad Tadepalli and DoKyeong Ok. 1994.H-Learning: A Reinforcement Learning Method for Optimizing Undiscounted A verage Reward. Technical Report 94-30-1. Oregon State University, Department of Computer Science

work page 1994
[23]

all rewards

Yaoyue Tang, Karina Arias-Calluari, M. N. Najafi, Michael S. Harré, and Fernando Alonso-Marroquin. 2025. Stylized Facts of High-Frequency Bitcoin Time Series. arXiv:2402.11930 [q-fin.ST] https://arxiv.org/abs/2402.11930 A PROOF OF THEOREM 5 In real world scenarios, we need an average rate method that can extend to treat scenarios of: positive, negative an...

work page arXiv 2025

[1] [1]

János Aczél. 1948. On mean values.Bull. Amer. Math. Soc.54, 4 (1948), 392–400

work page 1948

[2] [2]

2007.Aggregation Functions: A Guide for Practitioners

Gleb Beliakov, Andrea Pradera, and Tomasa Calvo. 2007.Aggregation Functions: A Guide for Practitioners. Springer, Berlin

work page 2007

[3] [3]

2013.The Problem of HFT: Collected Writings on High Frequency Trading and Stock Market Structure Reform

Haim Bodek. 2013.The Problem of HFT: Collected Writings on High Frequency Trading and Stock Market Structure Reform. Decimus Capital Markets, LLC, CT, USA

work page 2013

[4] [4]

Peter S. Bullen. 2003.Handbook of Means and Their Inequalities. Kluwer Academic Publishers, Dordrecht

work page 2003

[5] [5]

Das, Abhijit Gosavi, Sridhar Mahadevan, and Nicholas Marchalleck

Tapas K. Das, Abhijit Gosavi, Sridhar Mahadevan, and Nicholas Marchalleck. 1999. Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning.Management Science45, 4 (1999), 560–574. https://doi.org/10.1287/ mnsc.45.4.560 arXiv:https://doi.org/10.1287/mnsc.45.4.560

work page doi:10.1287/mnsc.45.4.560 1999

[6] [6]

Nassim Dehouche. 2021. Scale matters: The daily, weekly and monthly volatility and predictability of Bitcoin, Gold, and the S& P 500. arXiv:2103.00395 [q-fin.ST] https://arxiv.org/abs/2103.00395

work page arXiv 2021

[7] [7]

Vektor Dewanto, George Dunn, Ali Eshragh, Marcus Gallagher, and Fred Roosta

work page

[8] [8]

arXiv:2010.08920 https: //arxiv.org/abs/2010.08920

Average-reward model-free reinforcement learning: a systematic review and literature mapping.CoRRabs/2010.08920 (2020). arXiv:2010.08920 https: //arxiv.org/abs/2010.08920

work page arXiv 2010

[9] [9]

Stanisław Drożdż, Ludovico Minati, Paweł Oświęcimka, Michał Stanuszek, and Marcin Watorek. 2018. Bitcoin Market Route to Maturity? Evidence from Re- turn Fluctuations, Temporal Correlations and Multiscaling Effects.Chaos: An Interdisciplinary Journal of Nonlinear Science28, 7 (2018), 071101

work page 2018

[10] [10]

Abhijit Gosavi. 2004. Reinforcement learning for long-run average cost.European Journal of Operational Research155 (06 2004), 654–674. https://doi.org/10.1016/ S0377-2217(02)00874-3

work page 2004

[11] [11]

Gray and Andrew Vogt

John E. Gray and Andrew Vogt. 2013. The Mean: Axiomatics, Generalizations, Applications. arXiv:1210.3908 [math.PR] https://arxiv.org/abs/1210.3908

work page arXiv 2013

[12] [12]

Kolmogorov

Andrei N. Kolmogorov. 1930. Sur la notion de la moyenne.Atti della Accademia Nazionale dei Lincei. Rendiconti, Classe di Scienze Fisiche, Matematiche e Naturali 12 (1930), 388–391

work page 1930

[13] [13]

Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, and Qianchuan Zhao

work page

[14] [14]

arXiv:2106.03442 [cs.LG] https://arxiv.org/abs/2106.03442

Average-Reward Reinforcement Learning with Trust Region Methods. arXiv:2106.03442 [cs.LG] https://arxiv.org/abs/2106.03442

work page arXiv

[15] [15]

Sridhar Mahadevan. 1996. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results.Machine Learning22, 1–3 (1996), 159–195. https://doi.org/10.1007/BF00114727

work page doi:10.1007/bf00114727 1996

[16] [16]

Mitio Nagumo. 1930. Über eine Klasse der Mittelwerte.Japanese Journal of Mathematics7 (1930), 71–79

work page 1930

[17] [17]

Puterman

Martin L. Puterman. 1994.Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York

work page 1994

[18] [18]

Anton Schwartz. 1993. A reinforcement learning method for maximizing undis- counted rewards. InProceedings of the Tenth International Conference on Interna- tional Conference on Machine Learning(Amherst, MA, USA)(ICML’93). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 298–305

work page 1993

[19] [19]

Rishi Shah, Yuqian Jiang, Justin Hart, and Peter Stone. 2020. Deep R-Learning for Continual Area Sweeping. arXiv:2006.00589 [cs.LG] https://arxiv.org/abs/ 2006.00589

work page arXiv 2020

[20] [20]

Singpurwalla and Boya Lai

Nozer D. Singpurwalla and Boya Lai. 2020. What Does the "Mean" Really Mean? arXiv:2003.01973 [stat.OT] https://arxiv.org/abs/2003.01973

work page arXiv 2020

[21] [21]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. 2018.Reinforcement Learning: An Intro- duction(2 ed.). MIT Press, http://incompleteideas.net/book/the-book-2nd.html. http://incompleteideas.net/book/the-book-2nd.html

work page 2018

[22] [22]

1994.H-Learning: A Reinforcement Learning Method for Optimizing Undiscounted A verage Reward

Prasad Tadepalli and DoKyeong Ok. 1994.H-Learning: A Reinforcement Learning Method for Optimizing Undiscounted A verage Reward. Technical Report 94-30-1. Oregon State University, Department of Computer Science

work page 1994

[23] [23]

all rewards

Yaoyue Tang, Karina Arias-Calluari, M. N. Najafi, Michael S. Harré, and Fernando Alonso-Marroquin. 2025. Stylized Facts of High-Frequency Bitcoin Time Series. arXiv:2402.11930 [q-fin.ST] https://arxiv.org/abs/2402.11930 A PROOF OF THEOREM 5 In real world scenarios, we need an average rate method that can extend to treat scenarios of: positive, negative an...

work page arXiv 2025