Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks

Aneta Vulgarakis Feljan; Athanasios Karapantelakis; Jendrik Seipp; Kristina Levina; Nikolaos Pappas

arxiv: 2604.07411 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI

Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

keywords: reinforcement learning · reward machines · sleep control · mobile networks · energy efficiency · QoS constraints · time-averaged performance

The pith

Reinforcement learning with reward machines enables sleep decisions in mobile networks that respect time-averaged quality-of-service constraints while saving energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mobile networks consume increasing power as they densify, so turning components off when possible saves energy but risks harming service quality. The challenge is that quality guarantees are often time averages, such as limits on average packet drops for urgent traffic, which depend on past performance and make standard reinforcement learning ineffective. This paper proposes adding reward machines to the RL setup so that an abstract state tracks cumulative violations of these averages. The resulting policies can then decide sleep intervals that trade off immediate power savings against long-term service compliance. A sympathetic reader would care because this offers a systematic way to manage the energy-QoS tradeoff in future dense networks without manual tuning for each traffic type.

Core claim

The paper establishes that reward machines account for history dependence by maintaining an abstract state that explicitly tracks the QoS constraint violations over time, converting the non-Markovian reward into one that reinforcement learning can optimize directly for sleep-control decisions.

What carries the argument

Reward machines that maintain an abstract state to track cumulative QoS violations over time, converting non-Markovian rewards from time-averaged constraints into Markovian ones for RL optimization.
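As a rough illustration of what such an abstract state could look like, the sketch below implements a toy reward machine as a saturating counter of per-slot QoS violations. The class name `ViolationCounterRM`, the helper `augmented_state`, the number of states, and the penalty value are all hypothetical choices made for this example; the paper's actual construction may differ.

```python
from typing import Tuple


class ViolationCounterRM:
    """Toy reward machine: a saturating counter of per-slot QoS violations.

    Hypothetical construction for illustration; the paper's RM may differ.
    """

    def __init__(self, num_states: int = 4, penalty: float = 1.0) -> None:
        self.num_states = num_states  # abstract states 0 .. num_states - 1
        self.penalty = penalty        # assumed penalty applied in the worst state
        self.u = 0                    # current abstract state

    def step(self, violated: bool, energy_saved: float) -> Tuple[int, float]:
        """Advance on one slot's label and return (new state, reward)."""
        if violated:
            self.u = min(self.u + 1, self.num_states - 1)
        else:
            self.u = max(self.u - 1, 0)
        at_limit = self.u == self.num_states - 1
        reward = energy_saved - (self.penalty if at_limit else 0.0)
        return self.u, reward


def augmented_state(env_state: Tuple[int, ...], rm: ViolationCounterRM) -> Tuple[int, ...]:
    """Product-MDP observation: environment state plus the RM state."""
    return env_state + (rm.u,)


rm = ViolationCounterRM(num_states=3, penalty=5.0)
trace = [rm.step(violated, energy_saved=1.0) for violated in (True, True, False)]
print(trace)                        # [(1, 1.0), (2, -4.0), (1, 1.0)]
print(augmented_state((0, 7), rm))  # (0, 7, 1)
```

The reward in each slot depends only on the current abstract state and the current label, so an agent that observes the augmented tuple faces a Markovian problem even though the underlying constraint is history dependent.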

If this is right

Sleep decisions can satisfy time-averaged packet drop rates for deadline-constrained traffic.
Sleep decisions can satisfy time-averaged minimum-throughput guarantees for constant-rate users.
The method applies across diverse traffic patterns and QoS requirements in next-generation networks.
Energy management for network components becomes a principled optimization rather than a heuristic search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reward-machine construction could extend to other cumulative performance metrics common in wireless systems, such as average delay or energy fairness.
Real deployments might combine the abstract state tracker with online traffic estimators to adjust violation thresholds dynamically.
The approach suggests that reward-machine augmentation could improve RL sample efficiency in other resource allocation settings with long-term constraints.

Load-bearing premise

Reward machines can be designed to track cumulative QoS violations in a compact way that keeps the augmented state space manageable for the RL agent to learn policies satisfying the constraints.

What would settle it

A simulation of realistic traffic in which the learned policy exceeds the allowed time-averaged packet drop rate or throughput violation threshold, or in which the state space grows too large for training to converge.

Figures

Figures reproduced from arXiv: 2604.07411 by Aneta Vulgarakis Feljan, Athanasios Karapantelakis, Jendrik Seipp, Kristina Levina, Nikolaos Pappas.

**Figure 1.** One RBS with G radio units (RUs) serving heterogeneous traffic. 1) Network Topology: In the system, N users communicate with the single RBS over wireless fading links (one link per user). Let $\mathcal{N} = \{1, \ldots, N\}$ be the set of all users. At each time slot t, a central controller dynamically decides which RUs to put into sleep and for how long. Sleeping RUs wake themselves up after the sleep duration has elap…

**Figure 2.** Power consumption, energy efficiency, and constraint satisfaction for …

**Figure 3.** Power cycling and converged sleep mode (SM) distribution.

**Figure 4.** Policy analysis via SM distribution of each agent.

Original abstract

Energy efficiency in mobile networks is crucial for sustainable telecommunications infrastructure, particularly as network densification continues to increase power consumption. Sleep mechanisms for the components in mobile networks can reduce energy use, but deciding which components to put to sleep, when, and for how long while preserving quality of service (QoS) remains a difficult optimisation problem. In this paper, we utilise reinforcement learning with reward machines (RMs) to make sleep-control decisions that balance immediate energy savings and long-term QoS impact, i.e. time-averaged packet drop rates for deadline-constrained traffic and time-averaged minimum-throughput guarantees for constant-rate users. A challenge is that time-averaged constraints depend on cumulative performance over time rather than immediate performance. As a result, the effective reward is non-Markovian, and optimal actions depend on operational history rather than the instantaneous system state. RMs account for the history dependence by maintaining an abstract state that explicitly tracks the QoS constraint violations over time. Our framework provides a principled, scalable approach to energy management for next-generation mobile networks under diverse traffic patterns and QoS requirements.
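The history dependence described in the abstract can be made concrete with a short worked form. The notation below is assumed for illustration and is not taken from the paper: $d_t \in [0, 1]$ is the fraction of deadline-constrained packets dropped in slot $t$, and $\delta$ is the allowed time-averaged drop rate.

```latex
% Assumed notation (not from the paper):
%   d_t    : drop fraction in slot t, with 0 <= d_t <= 1
%   \delta : allowed time-averaged drop rate
\bar{d}_t = \frac{1}{t}\sum_{\tau=1}^{t} d_\tau \le \delta
\qquad\text{(time-averaged constraint)}

\bar{d}_t = \bar{d}_{t-1} + \frac{1}{t}\bigl(d_t - \bar{d}_{t-1}\bigr),
\qquad \bar{d}_0 = 0
\qquad\text{(equivalent recursive form)}
```

Whether slot $t$ violates the constraint depends on all of $d_1, \ldots, d_t$, so a reward defined on the instantaneous network state alone is non-Markovian. The recursive form shows that carrying $(\bar{d}_{t-1}, t)$, or a discretised summary of it, alongside the network state is enough to compute the next value from $d_t$, which is the role the abstract assigns to the RM's abstract state.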

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Reward machines turn the non-Markovian QoS averages into trackable states for RL sleep control, but the paper still needs to prove the augmented state stays small enough under real traffic loads.

The core move is using reward machines to monitor cumulative packet drops and throughput shortfalls so the RL agent can learn sleep policies that respect long-term averages instead of just reacting to the current slot. That addresses the history dependence cleanly without forcing the designer to hand-craft extra state variables for every running sum. The approach is a direct application of existing RM machinery to a concrete network problem, and the abstract lays out the motivation without obvious circularity or invented entities.

What stands out is the explicit separation between immediate energy reward and the RM-tracked constraint violations; that separation makes the non-Markovian part legible to standard RL algorithms.

The main weakness is the scalability claim. The stress-test concern holds: standard RM constructions for averages rely on counters or discretised sums, and nothing in the provided description bounds how many RM states appear once you have multiple users, deadline classes, or bursty traffic. If the product state space grows linearly with user count or requires fine granularity to keep violation tracking accurate, Q-learning or policy gradients will struggle to find feasible policies. The abstract asserts manageability but does not report state-space sizes or scaling experiments, so that part remains an assumption rather than a demonstrated result.

This paper is for people already working on RL for resource management who want a structured way to encode time-average constraints. It deserves a serious referee because the framing is coherent and the problem is practical; a reviewer can check whether the full experiments close the state-space gap and whether the learned policies actually meet the QoS targets in simulation. I would send it out rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a reinforcement learning framework that augments standard RL with reward machines (RMs) to solve sleep-control decisions in mobile networks. The RMs maintain an abstract state that tracks cumulative, time-averaged QoS violations (packet-drop rates for deadline traffic and minimum-throughput shortfalls for constant-rate users), thereby converting the non-Markovian constraint problem into a Markovian product MDP on which an RL agent can be trained to trade off immediate energy savings against long-term QoS.

Significance. If the RM-augmented state space remains compact and the learned policies reliably meet the time-averaged constraints, the work supplies a concrete, reusable technique for embedding long-horizon QoS requirements into RL-based network control. The approach is a direct application of existing RM theory to a high-impact domain (energy-efficient 5G/6G densification) and could be extended to other cumulative-constraint problems in communications.

major comments (2)

[§3 and §4] §3 (Reward Machine Construction) and §4 (Product MDP): the manuscript does not supply an explicit upper bound on the number of RM states or on the discretization granularity of the running QoS counters as a function of the number of users, traffic intensity, or averaging window length. Without such a bound, the product state space can grow linearly or worse with user count, undermining the abstract claim that the framework is 'scalable' for realistic multi-user scenarios.
[§5] §5 (Experimental Evaluation): the reported simulations use only small-scale topologies (few base stations, limited user counts). No ablation or scaling plot shows how RM state cardinality or learning sample complexity grows with user density or traffic variability, leaving the central scalability assertion unsupported by evidence.

minor comments (2)

[§2–§3] The notation for the RM transition function and the mapping from system state to RM input alphabet should be introduced once and used consistently; currently the same symbols appear with slightly different meanings in §2 and §3.
[Figure 2] Figure 2 (RM diagram) would benefit from an explicit legend indicating which RM states correspond to 'violation accumulated' versus 'within tolerance' regimes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger scalability analysis. We address each major comment below and will revise the manuscript accordingly.

Referee: [§3 and §4] §3 (Reward Machine Construction) and §4 (Product MDP): the manuscript does not supply an explicit upper bound on the number of RM states or on the discretization granularity of the running QoS counters as a function of the number of users, traffic intensity, or averaging window length. Without such a bound, the product state space can grow linearly or worse with user count, undermining the abstract claim that the framework is 'scalable' for realistic multi-user scenarios.

Authors: We agree that an explicit upper bound was not derived in the original manuscript. The RM states are defined by the discretized values of the cumulative QoS counters (one per constraint type per user or aggregate group). With L discretization levels, M users, and K constraint types, the RM state cardinality is at most L^{M*K}, and the product MDP state space is the original MDP states multiplied by this quantity. In the revised manuscript we will add a dedicated paragraph in §4 that states this bound formally, discusses how the averaging window length and traffic intensity influence the required L to maintain constraint accuracy, and provides practical guidelines for selecting L to keep the state space tractable for typical 5G user densities. revision: yes
Referee: [§5] §5 (Experimental Evaluation): the reported simulations use only small-scale topologies (few base stations, limited user counts). No ablation or scaling plot shows how RM state cardinality or learning sample complexity grows with user density or traffic variability, leaving the central scalability assertion unsupported by evidence.

Authors: The experiments in the current version were chosen to isolate the effect of the RM augmentation on policy quality and constraint satisfaction in controlled settings. We acknowledge that this leaves the scaling behavior unexamined. In the revision we will add (i) a plot of RM state cardinality versus number of users for fixed discretization, (ii) an ablation of learning curves (sample complexity) across increasing user densities and traffic variability, and (iii) a brief discussion relating the observed growth rates to the theoretical bound introduced in §4. These additions will directly support the scalability claim with both theoretical and empirical evidence. revision: yes
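To give a feel for the bound quoted in the first response, the sketch below evaluates L^{M*K} for a few user counts. The function name `rm_state_bound` and the chosen values (L = 4 discretization levels, K = 2 constraint types) are assumptions for illustration, not figures from the paper.

```python
def rm_state_bound(levels: int, users: int, constraint_types: int) -> int:
    """Upper bound L**(M*K) on the number of RM states, as stated in the rebuttal."""
    return levels ** (users * constraint_types)


# Illustrative values only: L = 4 levels and K = 2 constraint types are assumptions.
bounds = {m: rm_state_bound(levels=4, users=m, constraint_types=2) for m in (1, 2, 5, 10)}
for users, bound in bounds.items():
    print(f"M={users:>2}: at most {bound:,} RM states")
```

Under these assumed values the bound goes from 16 states for one user to 1,048,576 for five, so it is exponential in M*K rather than linear. Tracking one counter per aggregate group, as the response mentions, would replace M with the number of groups in the exponent.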

Circularity Check

0 steps flagged

No circularity: application of established RM formalism to sleep control without self-referential derivations

The paper describes an application of reward machines to encode time-averaged QoS constraints for RL-based sleep decisions. No equations, parameter fits, or derivations are shown that reduce the claimed scalability or policy optimality to inputs by construction. The non-Markovian handling is attributed to the standard RM construction (abstract state tracking cumulative violations), which is imported from prior literature rather than redefined here. Central claims rest on the tractability of the product MDP, but this is presented as an engineering choice rather than a mathematical reduction to the paper's own fitted values or self-citations. No load-bearing self-citation chains or ansatzes are exhibited in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only. No free parameters, invented entities, or explicit axioms are stated. The central idea rests on the domain assumption that an abstract state from the reward machine can capture the necessary history for time-averaged QoS constraints.

axioms (1)

domain assumption An abstract state maintained by the reward machine can track cumulative QoS constraint violations over time.
Invoked when the abstract states that RMs account for history dependence by maintaining an abstract state that explicitly tracks QoS constraint violations.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] D. López-Pérez, A. De Domenico, N. Piovesan, G. Xinli, H. Bao, S. Qitao, and M. Debbah, “A survey on 5G radio access network energy efficiency: Massive MIMO, lean carrier design, sleep modes, and machine learning,” IEEE Communications Surveys & Tutorials, vol. 24, no. 1, 2022.

[2] G. Auer, V. Giannini, C. Desset, I. Godor, P. Skillermark, M. Olsson, M. A. Imran, D. Sabella, M. J. Gonzalez, O. Blume et al., “How much energy is needed to run a wireless network?” IEEE Wireless Communications, vol. 18, no. 5, pp. 40–49, 2011.

[3] M. Imran et al., “INFSO-ICT-247733 EARTH deliverable D2.3: Energy efficiency analysis of the reference systems, areas of improvements and target breakdown,” Tech. Rep., 2012.

[4] 3GPP, “Study on Network Energy Savings for NR,” 3rd Generation Partnership Project (3GPP), Technical Report, 2024, Release 18, Technical Specification Group Radio Access Network.

[5] M. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool Publishers, 2010.

[6] Y. Cui, V. K. N. Lau, R. Wang, H. Huang, and S. Zhang, “A survey on delay-aware resource control for wireless systems—large deviation theory, stochastic Lyapunov drift, and distributed stochastic learning,” IEEE Transactions on Information Theory, vol. 58, no. 3, 2012.

[7] M. Moltafet, M. Leinonen, M. Codreanu, and N. Pappas, “Power minimization for age of information constrained dynamic control in wireless sensor networks,” IEEE Transactions on Communications, vol. 70, no. 1, pp. 419–432, 2021.

[8] A. Taleb Zadeh Kasgari, “Reliable low latency machine learning for resource management in wireless networks,” 2022.

[9] E. Altman, Constrained Markov Decision Processes. Routledge, 2021.

[10] Z. Jiang, B. Krishnamachari, S. Zhou, and Z. Niu, “Optimal sleeping mechanism for multiple servers with MMPP-based bursty traffic arrival,” IEEE Wireless Communications Letters, vol. 7, no. 3, pp. 436–439, 2017.

[11] J. Luo and N. Pappas, “Semantic-aware remote estimation of multiple Markov sources under constraints,” IEEE Transactions on Communications, vol. 73, no. 11, pp. 11 093–11 105, 2025.

[12] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in International Conference on Machine Learning. PMLR, 2017, pp. 22–31.

[13] A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by PID Lagrangian methods,” in International Conference on Machine Learning. PMLR, 2020, pp. 9133–9143.

[14] R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Reward machines: Exploiting reward function structure in reinforcement learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 173–208, 2022.

[15] E. N. Gilbert, “Capacity of a burst-noise channel,” Bell System Technical Journal, vol. 39, no. 5, pp. 1253–1265, 1960.

[16] M. Arana-Catania, A. Sonee, A.-M. Khan, K. Fatehi, Y. Tang, B. Jin, A. Soligo, D. Boyle, R. Calinescu, P. Yadav et al., “Explainable reinforcement and causal learning for improving trust to 6G stakeholders,” IEEE Open Journal of the Communications Society, 2025.

[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[18] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.

[19] A. El Amine, J.-P. Chaiban, H. A. H. Hassan, P. Dini, L. Nuaymi, and R. Achkar, “Energy optimization with multi-sleeping control in 5G heterogeneous networks using reinforcement learning,” IEEE Transactions on Network and Service Management, vol. 19, no. 4, 2022.

[20] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, 2021.
