
arxiv: 2605.04880 · v1 · pith:4MRX7NOBnew · submitted 2026-05-06 · 💻 cs.LG · cs.AI

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

Pith reviewed 2026-05-08 16:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords average reward reinforcement learning · semi-Markov decision processes · harmonic mean operator · non-stationary environments · model-free learning · infinite horizon tasks · continuing tasks

The pith

A modified harmonic mean operator correctly computes average reward rates in non-stationary semi-Markov decision processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to correct how average reward per unit time is computed in continuing tasks modeled as semi-Markov decision processes. When rewards and action durations follow changing distributions over an infinite horizon, dividing cumulative reward by cumulative time produces an incorrect long-run rate. The authors replace this with a modified harmonic mean operator that combines instantaneous rates properly under non-stationarity. If the operator works as claimed, it supports model-free learning algorithms that optimize the true average reward rate without assuming fixed statistics. This matters for real continuing tasks such as robotics or scheduling where conditions drift.

Core claim

In infinite-horizon SMDPs, the average reward rate under non-stationary reward and duration distributions is given by a modified harmonic mean operator applied to the rates rather than the ratio of cumulative reward to cumulative duration. The operator correctly computes the limiting rate even as distributions change over time. Theoretical properties of the operator are established, model-free algorithms are derived from it, and empirical comparisons show improved performance against standard ratio-based methods in non-stationary settings.

What carries the argument

The modified harmonic mean operator, which computes the correct long-run reward rate by combining time-varying instantaneous rates without requiring stationarity.
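
As background for how a harmonic mean relates to a rate, the sketch below checks a standard identity on toy numbers: the ratio of cumulative reward to cumulative duration equals both the duration-weighted arithmetic mean and the reward-weighted harmonic mean of the per-step rates, while the unweighted mean of per-step rates generally differs. This is not the paper's modified operator, whose definition is not given on this page. The function names and the sample values are illustrative assumptions.

```python
from typing import Sequence


def cumulative_ratio(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Total reward divided by total duration."""
    return sum(rewards) / sum(durations)


def unweighted_mean_rate(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Plain average of the per-step rates r_i / d_i."""
    rates = [r / d for r, d in zip(rewards, durations)]
    return sum(rates) / len(rates)


def duration_weighted_mean_rate(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Arithmetic mean of per-step rates, each weighted by its duration."""
    rates = [r / d for r, d in zip(rewards, durations)]
    return sum(d * rho for d, rho in zip(durations, rates)) / sum(durations)


def reward_weighted_harmonic_rate(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Harmonic mean of per-step rates, each weighted by its reward.

    Assumes strictly positive rewards so that every rate is non-zero.
    """
    rates = [r / d for r, d in zip(rewards, durations)]
    return sum(rewards) / sum(r / rho for r, rho in zip(rewards, rates))


# Toy data (assumption): a short fast step followed by a long slow step.
rewards = [2.0, 6.0]
durations = [1.0, 6.0]

print(cumulative_ratio(rewards, durations))
print(duration_weighted_mean_rate(rewards, durations))
print(reward_weighted_harmonic_rate(rewards, durations))
print(unweighted_mean_rate(rewards, durations))
```

On this data the first three quantities coincide at 8/7, while the unweighted mean of the two per-step rates (2 and 1) is 1.5. The identity needs strictly positive rewards; handling zero or negative rewards would require something beyond this sketch.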

If this is right

  • Model-free learning algorithms become available for SMDPs that remain correct under non-stationary conditions.
  • The algorithms optimize average reward rate without needing to detect or compensate for distribution shifts.
  • Proven properties of the operator guarantee that the computed rate converges to the correct limiting value.
  • Empirical tests demonstrate higher performance than ratio-based baselines when non-stationarity is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rate operators could be adapted for other continuing RL settings that optimize per-unit-time objectives.
  • The formulation may extend naturally to continuous-time Markov processes where durations are implicit.
  • Practical deployment in drifting environments such as adaptive control or online resource allocation would provide direct tests of robustness.

Load-bearing premise

That the ratio of cumulative reward to cumulative duration misrepresents the true long-run average rate when rewards and durations are non-stationary, while the modified harmonic mean computes it correctly without introducing bias.

What would settle it

Construct a non-stationary SMDP with explicitly known time-varying reward and duration distributions, compute the analytically true limiting average rate, and verify whether the modified harmonic mean operator matches that rate while the cumulative ratio does not.
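
One way to set up such a check is sketched below, under loud assumptions: a single-action process with deterministic, drifting rewards and durations stands in for the non-stationary SMDP, and the drift formulas are invented for illustration rather than taken from the paper. They are chosen so that both the per-step rate and the cumulative ratio tend to 2.0, so the reference value does not depend on which definition of the long-run rate one prefers. Any candidate estimator, including an implementation of the paper's operator, could be added to the dictionary passed to `absolute_errors`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# An estimator maps (rewards, durations) to a single rate estimate.
Estimator = Callable[[Sequence[float], Sequence[float]], float]

# Known limiting rate of the illustrative process defined below.
TRUE_RATE = 2.0


def drifting_process(n_steps: int) -> Tuple[List[float], List[float]]:
    """Deterministic drifting rewards and durations (illustrative assumption).

    reward_t = 1 + 1/(t+1) drifts toward 1.0 and
    duration_t = 0.5 + 2/(t+1) drifts toward 0.5,
    so the rate tends to 2.0 as t grows.
    """
    rewards = [1.0 + 1.0 / (t + 1) for t in range(n_steps)]
    durations = [0.5 + 2.0 / (t + 1) for t in range(n_steps)]
    return rewards, durations


def cumulative_ratio(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Baseline estimator: total reward divided by total duration."""
    return sum(rewards) / sum(durations)


def absolute_errors(
    estimators: Dict[str, Estimator],
    rewards: Sequence[float],
    durations: Sequence[float],
    true_rate: float,
) -> Dict[str, float]:
    """Absolute error of each estimator against the known limiting rate."""
    return {
        name: abs(estimate(rewards, durations) - true_rate)
        for name, estimate in estimators.items()
    }


sample_rewards, sample_durations = drifting_process(20_000)
print(absolute_errors(
    {"cumulative_ratio": cumulative_ratio},
    sample_rewards,
    sample_durations,
    TRUE_RATE,
))
```

This particular process is too benign to separate the two approaches, since the baseline also converges on it. A decisive test would need a process on which the analytic limit and the cumulative ratio can actually disagree, which depends on the paper's precise non-stationarity model.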

Figures

Figures reproduced from arXiv: 2605.04880 by Alicia Vidler, Erel Shtossel, Gal A. Kaminka, Uri Shaham.

Figure 1: Two-state SMDP with stochastic reward and sojourn …
Figure 2: Shows the first thousand steps of each of the reward (SinLogD) and duration (CosLogD) functions and their ratio, SinLogD(t, 10, 0.001) / CosLogD(t, 10, 0.0005). The figure illustrates how quickly the non-stationary rewards and durations expand, while their non-stationary ratio expands much more slowly.
Figure 3: Shows a specific setting, where action B’s rewards and durations are defined as in …
Figure 4: Shows that smaller values lead to more difficult problems. log_scale values greater than 0.003 are “easier” because Action B dominates within the 1,000 steps of the learning. Smaller values make the settings harder (moving the crossover point further and further away). (a) SinLogD(t, 10, 0.003) / CosLogD(t, 10, 0.0015) (b) SinLogD(t, 10, 0.0005) / CosLogD(t, 10, 0.00025)
Figure 5: Comparing the success rate of the different algo …
Figure 6: Win-Ratio of harmonic for all 21 segments of the …
Figure 7: Win-Ratio of harmonic for all 21 segments of the …
Original abstract

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a modified harmonic mean operator for average-reward reinforcement learning in infinite-horizon SMDPs. It asserts that the standard ratio of cumulative reward to cumulative duration becomes incorrect under non-stationary reward and duration distributions, and that the new operator correctly computes reward rates, yielding robust model-free algorithms. Theoretical properties are proven and empirical comparisons to existing methods are presented.

Significance. If the modified operator is shown to correctly recover the long-run average reward rate without introducing new biases or assumptions, the work would offer a useful alternative formulation for continuing-task SMDPs where non-stationarity is present, potentially improving robustness of model-free average-reward methods.

major comments (2)
  1. [Introduction] Introduction and abstract: The central assertion that the ratio of cumulative reward to cumulative duration 'can be incorrect' under non-stationarity requires a concrete counter-example or derivation. The long-run average reward is defined as lim (T→∞) [total reward up to T / total duration up to T] when the limit exists; the manuscript must specify the precise failure mode (e.g., non-existence of the limit, issue only in the Bellman operator, or only in online updates) to justify replacing it with a modified harmonic mean.
  2. [§3] Section on the modified harmonic mean operator (likely §3): The derivation must explicitly demonstrate (a) reduction to the standard ratio when reward/duration distributions are stationary, and (b) that the operator remains unbiased for the true long-run rate under the stated non-stationarity without additional assumptions. Include the fixed-point characterization or contraction mapping argument to confirm it solves the average-reward optimality equation.
minor comments (2)
  1. Clarify the precise definition of the 'modified' harmonic mean (e.g., how the modification is parameterized or derived from first principles) upon first introduction, and contrast it directly with the arithmetic mean used in prior ratio-based methods.
  2. [Experiments] In the experimental section, specify the exact mechanism used to induce non-stationarity in the reward and duration distributions so that the robustness claim can be independently verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We will revise the manuscript to provide the requested clarifications, counter-example, and explicit derivations while preserving the core contributions.

Point-by-point responses
  1. Referee: [Introduction] Introduction and abstract: The central assertion that the ratio of cumulative reward to cumulative duration 'can be incorrect' under non-stationarity requires a concrete counter-example or derivation. The long-run average reward is defined as lim (T→∞) [total reward up to T / total duration up to T] when the limit exists; the manuscript must specify the precise failure mode (e.g., non-existence of the limit, issue only in the Bellman operator, or only in online updates) to justify replacing it with a modified harmonic mean.

    Authors: We agree that a concrete counter-example and precise failure-mode specification would strengthen the introduction. In the revision we will add a simple non-stationary SMDP example (with time-varying reward and duration distributions) showing that the standard cumulative ratio, when used inside the Bellman operator, produces value estimates that deviate from the true long-run rate even though the limit exists. The failure mode is specific to the operator and online updates: the ratio does not yield the correct fixed point of the average-reward optimality equation under non-stationarity. We will also clarify this distinction in the abstract. revision: yes

  2. Referee: [§3] Section on the modified harmonic mean operator (likely §3): The derivation must explicitly demonstrate (a) reduction to the standard ratio when reward/duration distributions are stationary, and (b) that the operator remains unbiased for the true long-run rate under the stated non-stationarity without additional assumptions. Include the fixed-point characterization or contraction mapping argument to confirm it solves the average-reward optimality equation.

    Authors: We will expand Section 3 with an explicit lemma showing that the modified harmonic mean reduces exactly to the standard ratio under stationary distributions. We will also add a proof that the operator is unbiased for the long-run average reward rate under the paper’s non-stationarity model without introducing extra assumptions. Finally, we will include the fixed-point characterization of the operator and a contraction-mapping argument establishing that it solves the average-reward optimality equation. revision: yes
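
For context on what the "standard ratio" converges to in the stationary case, the following sketch simulates independent, identically distributed rewards and durations. By the law of large numbers the ratio of sums approaches E[r] / E[d], whereas the mean of per-step ratios approaches E[r / d], which is a different quantity in general. The uniform distributions, the seed, and the function names are illustrative assumptions and do not come from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def simulate_stationary(n_steps: int, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Draw i.i.d. rewards ~ U(0, 2) and durations ~ U(0.5, 1.5), independently."""
    rng = random.Random(seed)
    rewards = [rng.uniform(0.0, 2.0) for _ in range(n_steps)]
    durations = [rng.uniform(0.5, 1.5) for _ in range(n_steps)]
    return rewards, durations


def ratio_of_sums(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Cumulative reward over cumulative duration."""
    return sum(rewards) / sum(durations)


def mean_of_ratios(rewards: Sequence[float], durations: Sequence[float]) -> float:
    """Unweighted average of the per-step rates."""
    return sum(r / d for r, d in zip(rewards, durations)) / len(rewards)


sim_rewards, sim_durations = simulate_stationary(200_000, seed=0)

# Analytic targets for these distributions:
#   E[r] / E[d] = 1.0 / 1.0 = 1.0
#   E[r / d]    = E[r] * E[1/d] = 1.0 * ln(1.5 / 0.5) = ln 3
print(ratio_of_sums(sim_rewards, sim_durations), 1.0)
print(mean_of_ratios(sim_rewards, sim_durations), math.log(3.0))
```

With this many samples the two estimates should land near 1.0 and ln 3 (about 1.0986) respectively. A lemma of the kind promised in the response would need to show that the modified operator agrees with the first of these limits under stationarity.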

Circularity Check

0 steps flagged

No circularity: derivation of modified harmonic mean operator is self-contained from first principles

Full rationale

The paper motivates the issue with the standard reward/duration ratio under non-stationarity, then introduces and proves properties of a modified harmonic mean operator for SMDPs. No equations or steps reduce the claimed result to a fitted input, self-definition, or load-bearing self-citation chain. The operator is presented as derived independently, with theoretical proofs and empirical comparisons that do not rely on renaming or smuggling prior ansatzes. The central formulation stands on its own without the forbidden patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger populated from abstract claims only; full paper would likely add more entries on operator derivation and convergence assumptions.

axioms (1)
  • domain assumption: The existing ratio-of-rewards-to-durations approach is incorrect when rewards and durations are non-stationary in infinite-horizon SMDPs.
    Stated directly in the abstract as the motivation for the new operator.
invented entities (1)
  • modified harmonic mean operator (no independent evidence)
    purpose: Correct computation of reward rates under non-stationary conditions in SMDPs
    New operator introduced by the paper; no independent evidence or external validation provided in abstract.

pith-pipeline@v0.9.0 · 5456 in / 1194 out tokens · 24603 ms · 2026-05-08T16:51:13.556790+00:00 · methodology

