Empirical Evaluation of Policy-Based Reinforcement Learning for Dynamic Service Control in an M/M/1 Queue

Gabriel Nicolosi; Joseph Walton

arxiv: 2604.14406 · v2 · submitted 2026-04-15 · 🧮 math.OC

Empirical Evaluation of Policy-Based Reinforcement Learning for Dynamic Service Control in an M/M/1 Queue

Joseph Walton , Gabriel Nicolosi This is my paper

Pith reviewed 2026-05-10 12:10 UTC · model grok-4.3

classification 🧮 math.OC

keywords reinforcement learningM/M/1 queueservice rate controlpolicy gradient methodssemi-Markov decision processqueuing controldynamic optimization

0 comments

The pith

Three policy-based RL algorithms learn to control service rates in an M/M/1 queue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the use of policy-based reinforcement learning for adjusting service speeds in a single-server queue to trade off customer waiting times against energy consumption. It trains REINFORCE, Actor-Critic, and Proximal Policy Optimization in a simulation with either the current queue length or an extended state that includes the previous length. Performance is measured by how fast the algorithms improve, how many samples they need, the quality of the final policies, and how much they deviate from the best fixed-rate solution.

Core claim

Three policy-based reinforcement learning algorithms, namely REINFORCE, Actor-Critic (A2C), and Proximal Policy Optimization (PPO), are trained in a simulated environment using two state representations: the instantaneous queue length and an augmented state that includes a one-step queue history. Performance is evaluated in terms of convergence speed, sampling efficiency, policy quality, and pseudo-regret relative to the steady-state optimum.

What carries the argument

The semi-Markov decision process (SMDP) in which the agent chooses a service rate from a finite set at the start of each new service to minimize an objective combining system congestion and energy costs.

If this is right

The trained policies would achieve lower combined costs than any fixed service rate when arrival rates vary.
An augmented state with queue history would allow policies to react to recent trends rather than only the present length.
PPO is expected to require fewer samples than REINFORCE due to its clipped objective for stable updates.
These methods could be used to manage resources in computing clusters or manufacturing lines without solving the full optimization problem analytically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the state to include more history or arrival rate estimates might further improve performance in non-stationary environments.
Testing the approach on multi-server queues would show if the same algorithms scale without major changes.
If the learned policies are deployed online, monitoring their regret in real time could provide a way to detect when retraining is needed.

Load-bearing premise

The assumption that decisions about service rates are made only when a new customer begins service and that the cost function properly balances congestion penalties with energy use.

What would settle it

If long-running simulations of the learned policies show average costs higher than those from the optimal steady-state service rate, the evaluation would indicate that the algorithms did not find better dynamic controls.

Figures

Figures reproduced from arXiv: 2604.14406 by Gabriel Nicolosi, Joseph Walton.

**Figure 1.** Figure 1: Average reward learning curves versus number of samples for two state representations. Shaded regions [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: Difference between the expected queue length under the policy at [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

While reinforcement learning has been increasingly applied to stochastic control, few studies have systematically examined policy-based methods in queuing environments modeled as a semi-Markov decision process (SMDP). To address this gap, we investigate how policy-based reinforcement learning (RL) algorithms perform when applied to the control of service rates in an M/M/1 queue, a common queuing model for manufacturing, computing, and service systems. The problem is formulated as an SMDP in which decisions occur at each new service, allowing an agent to select different service rates from a finite set of speeds, aiming to minimize an objective function that manages system congestion and energy costs. Three policy-based reinforcement learning algorithms, namely REINFORCE, Actor-Critic (A2C), and Proximal Policy Optimization (PPO), are trained in a simulated environment using two state representations: the instantaneous queue length and an augmented state that includes a one-step queue history. Performance is evaluated in terms of convergence speed, sampling efficiency, policy quality, and pseudo-regret relative to the steady-state optimum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a clean empirical comparison of three standard policy-gradient algorithms on M/M/1 service-rate control, with no new method but usable numbers on convergence and efficiency, though the pseudo-regret benchmark needs clearer definition.

read the letter

This paper runs a straightforward bake-off of REINFORCE, A2C, and PPO on the problem of choosing service rates in an M/M/1 queue formulated as an SMDP. Decisions happen at service completions, the agent picks from a discrete set of speeds, and the goal is to balance congestion and energy costs. They test two state representations: plain queue length and queue length plus one-step history. The results focus on convergence speed, sampling efficiency, policy quality, and pseudo-regret relative to a steady-state optimum. That is basically the whole contribution—no new algorithm, no theorem, just numbers from simulation runs on this classic model. What the work does well is keep the setup simple and reproducible. The two state choices make sense for checking whether extra history helps, and measuring practical quantities like how many samples each method needs before it stabilizes is the right thing to report for this kind of control task. The experiments stay within a standard queueing model that actually appears in manufacturing and computing, so the numbers could serve as a reference point. The soft spot is the benchmark itself. The abstract calls the comparison target the “steady-state optimum” but does not say whether this is the best constant rate or the true optimal dynamic policy obtained by solving the optimality equations. If it is only the former, then pseudo-regret does not measure distance to the best possible policy and the “policy quality” claims become harder to read. The paper also stays tightly scoped to one queue and three algorithms, so it does not move the broader theory forward. This is the sort of paper that would interest people who apply RL to stochastic service systems and want a concrete data point on these three methods. It is not for readers looking for new frameworks or proofs. I would send it to peer review. The experimental questions are well-posed and the simulation approach is standard, so a referee could usefully press on the benchmark definition and implementation details without the paper being fundamentally broken.

Referee Report

2 major / 2 minor

Summary. The paper empirically evaluates three policy-based RL algorithms (REINFORCE, A2C, and PPO) applied to dynamic service-rate control in an M/M/1 queue formulated as an SMDP. Decisions occur at service completions, with the agent selecting from a finite set of service rates to minimize a combined congestion-plus-energy cost. Two state representations are tested (instantaneous queue length and one-step history augmentation). Algorithms are trained via simulation and compared on convergence speed, sampling efficiency, policy quality, and pseudo-regret relative to a steady-state optimum.

Significance. If the reported performance differences hold under a correctly computed benchmark, the study supplies useful empirical guidance on the relative strengths of policy-gradient methods in a canonical stochastic control setting. The SMDP formulation and dual state representations are reasonable choices that align with standard practice in queueing control. The work is primarily incremental rather than foundational, as it applies known algorithms without new theoretical results or parameter-free derivations.

major comments (2)

[Evaluation methodology] Evaluation methodology (implicit in abstract and results description): The pseudo-regret is reported relative to an unspecified 'steady-state optimum.' In an SMDP whose optimal long-run average cost is obtained by solving the optimality equations over the countable state space, it is essential to clarify whether the benchmark is (i) the optimal dynamic policy or (ii) merely the best fixed service rate. If the latter, the metric does not bound suboptimality of the learned policies and cannot support claims about policy quality.
[Experimental design] Experimental design: No information is given on the number of independent runs, variance of the reported metrics, or statistical tests used to compare convergence speed and sampling efficiency across REINFORCE, A2C, and PPO. Without these, the comparative claims rest on single-run trajectories whose reliability cannot be assessed.

minor comments (2)

[Abstract] Abstract: The phrase 'pseudo-regret relative to the steady-state optimum' should be accompanied by a brief definition or citation to the relevant RL literature (e.g., regret definitions in average-cost SMDPs).
[Problem formulation] Notation: The objective function combining congestion and energy costs is described only qualitatively; an explicit mathematical expression (e.g., holding cost plus power cost) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comments point by point below, providing clarifications and indicating the revisions we will implement.

read point-by-point responses

Referee: [Evaluation methodology] Evaluation methodology (implicit in abstract and results description): The pseudo-regret is reported relative to an unspecified 'steady-state optimum.' In an SMDP whose optimal long-run average cost is obtained by solving the optimality equations over the countable state space, it is essential to clarify whether the benchmark is (i) the optimal dynamic policy or (ii) merely the best fixed service rate. If the latter, the metric does not bound suboptimality of the learned policies and cannot support claims about policy quality.

Authors: We agree that the benchmark must be explicitly defined. In the manuscript, the 'steady-state optimum' denotes the best fixed service rate: for each candidate rate we compute the long-run average cost under the M/M/1 steady-state distribution and retain the minimum. This is not the optimal dynamic policy, which would require solving the SMDP optimality equations over the countable state space. We acknowledge that the pseudo-regret therefore measures improvement relative to the best static policy rather than a bound on suboptimality to the true optimum. We will revise the abstract, Section 3, and the results discussion to state this definition clearly, explain its computation, note the computational difficulty of the dynamic optimum, and adjust all claims about policy quality accordingly. revision: yes
Referee: [Experimental design] Experimental design: No information is given on the number of independent runs, variance of the reported metrics, or statistical tests used to compare convergence speed and sampling efficiency across REINFORCE, A2C, and PPO. Without these, the comparative claims rest on single-run trajectories whose reliability cannot be assessed.

Authors: We accept this criticism. All reported curves and tables were obtained from 10 independent runs with distinct random seeds; means and standard deviations are shown in the figures, and pairwise Welch t-tests were used to assess statistical significance of differences in convergence speed and final performance. We will add a dedicated paragraph in the Experimental Setup section (and a short note in the caption of each results figure) that states the number of runs, how variance is reported, and the statistical procedure employed. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical simulation study with independent benchmark

full rationale

The paper is an empirical evaluation of REINFORCE, A2C, and PPO on an M/M/1 queue formulated as an SMDP. It reports simulation-based metrics (convergence speed, sampling efficiency, policy quality, pseudo-regret) without any claimed mathematical derivations, fitted parameters presented as predictions, or self-citation chains. The steady-state optimum benchmark is computed separately (standard DP solution for the countable-state average-cost problem) and does not reduce to the RL training data or outputs by construction. No self-definitional steps, ansatzes, or renamings appear in the provided abstract or description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on standard assumptions from queuing theory and reinforcement learning without introducing new free parameters, axioms beyond the domain, or invented entities.

axioms (1)

domain assumption The M/M/1 queue can be modeled as a semi-Markov decision process with decisions at service initiation points.
Explicitly stated in the abstract as the problem formulation.

pith-pipeline@v0.9.0 · 5482 in / 1340 out tokens · 52299 ms · 2026-05-10T12:10:02.772673+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

Optimal Control of a Queue with variable Service Rates,

T. B. Crabill, “Optimal Control of a Queue with variable Service Rates,” Ph.D. dissertation, Cornell University, 1969

work page 1969
[2]

Optimal Decision in Queueing,

H. Sabeti, “Optimal Decision in Queueing,” California University, Berkeley, Operations Research Center, Tech. Rep., 1970

work page 1970
[3]

Dynamic Control of a Queue with Adjustable Service Rate,

J. M. George and J. M. Harrison, “Dynamic Control of a Queue with Adjustable Service Rate,”Operations Research, vol. 49, no. 5, pp. 720–731, 2001

work page 2001
[4]

Optimal Control of a Service Facility with Variable Exponential Service Times and Constant Arrival Rate,

T. B. Crabill, “Optimal Control of a Service Facility with Variable Exponential Service Times and Constant Arrival Rate,”Management Science, vol. 18, no. 9, pp. 560–566, 1972

work page 1972
[5]

M. L. Puterman,Markov Decision Processes: Discrete Stochastic Dynamic Programming, ser. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc, 1994

work page 1994
[6]

Adaptive service rate control of an M/M/1 queue with server breakdowns,

Y . Zheng, J. Julaiti, and G. Pang, “Adaptive service rate control of an M/M/1 queue with server breakdowns,” Queueing Systems, vol. 106, no. 1-2, pp. 159–191, 2024

work page 2024
[7]

Queueing Network Controls via Deep Reinforcement Learning,

J. G. Dai and M. Gluzman, “Queueing Network Controls via Deep Reinforcement Learning,”Stochastic Systems, vol. 12, no. 1, pp. 30–67, 2022

work page 2022
[8]

Simple statistical gradient-following algorithms for connectionist reinforcement learning,

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Ma- chine Learning, vol. 8, no. 3, pp. 229–256, 1992

work page 1992
[9]

Actor-Critic Algorithms,

V . R. Konda and J. N. Tsitsiklis, “Actor-Critic Algorithms,” inAdvances in Neural Information Processing Sys- tems, 1999

work page 1999
[10]

Proximal Policy Optimization Algorithms,

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” 2017

work page 2017
[11]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Prob- lems,

S. Bubeck and N. Cesa-Bianchi, “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Prob- lems,”F oundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012

work page 2012
[12]

Learning Algorithms for Minimizing Queue Length Regret,

T. Stahlbuhk, B. Shrader, and E. Modiano, “Learning Algorithms for Minimizing Queue Length Regret,”IEEE Transactions on Information Theory, vol. 67, no. 3, pp. 1759–1781, 2021

work page 2021

[1] [1]

Optimal Control of a Queue with variable Service Rates,

T. B. Crabill, “Optimal Control of a Queue with variable Service Rates,” Ph.D. dissertation, Cornell University, 1969

work page 1969

[2] [2]

Optimal Decision in Queueing,

H. Sabeti, “Optimal Decision in Queueing,” California University, Berkeley, Operations Research Center, Tech. Rep., 1970

work page 1970

[3] [3]

Dynamic Control of a Queue with Adjustable Service Rate,

J. M. George and J. M. Harrison, “Dynamic Control of a Queue with Adjustable Service Rate,”Operations Research, vol. 49, no. 5, pp. 720–731, 2001

work page 2001

[4] [4]

Optimal Control of a Service Facility with Variable Exponential Service Times and Constant Arrival Rate,

T. B. Crabill, “Optimal Control of a Service Facility with Variable Exponential Service Times and Constant Arrival Rate,”Management Science, vol. 18, no. 9, pp. 560–566, 1972

work page 1972

[5] [5]

M. L. Puterman,Markov Decision Processes: Discrete Stochastic Dynamic Programming, ser. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc, 1994

work page 1994

[6] [6]

Adaptive service rate control of an M/M/1 queue with server breakdowns,

Y . Zheng, J. Julaiti, and G. Pang, “Adaptive service rate control of an M/M/1 queue with server breakdowns,” Queueing Systems, vol. 106, no. 1-2, pp. 159–191, 2024

work page 2024

[7] [7]

Queueing Network Controls via Deep Reinforcement Learning,

J. G. Dai and M. Gluzman, “Queueing Network Controls via Deep Reinforcement Learning,”Stochastic Systems, vol. 12, no. 1, pp. 30–67, 2022

work page 2022

[8] [8]

Simple statistical gradient-following algorithms for connectionist reinforcement learning,

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Ma- chine Learning, vol. 8, no. 3, pp. 229–256, 1992

work page 1992

[9] [9]

Actor-Critic Algorithms,

V . R. Konda and J. N. Tsitsiklis, “Actor-Critic Algorithms,” inAdvances in Neural Information Processing Sys- tems, 1999

work page 1999

[10] [10]

Proximal Policy Optimization Algorithms,

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” 2017

work page 2017

[11] [11]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Prob- lems,

S. Bubeck and N. Cesa-Bianchi, “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Prob- lems,”F oundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012

work page 2012

[12] [12]

Learning Algorithms for Minimizing Queue Length Regret,

T. Stahlbuhk, B. Shrader, and E. Modiano, “Learning Algorithms for Minimizing Queue Length Regret,”IEEE Transactions on Information Theory, vol. 67, no. 3, pp. 1759–1781, 2021

work page 2021