Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks
Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3
The pith
Reinforcement learning with reward machines enables sleep decisions in mobile networks that respect time-averaged quality-of-service constraints while saving energy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that reward machines account for history dependence by maintaining an abstract state that explicitly tracks the QoS constraint violations over time, converting the non-Markovian reward into one that reinforcement learning can optimize directly for sleep-control decisions.
What carries the argument
Reward machines that maintain an abstract state to track cumulative QoS violations over time, converting non-Markovian rewards from time-averaged constraints into Markovian ones for RL optimization.
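To make the mechanism concrete, here is a minimal sketch of a reward machine whose abstract state is a capped counter of net QoS violations. It is not the paper's construction, which this page does not reproduce. The class name `ViolationRM`, the labels `"drop"` and `"ok"`, the cap, and the reward values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ViolationRM:
    """Toy reward machine: the abstract state u is a capped violation counter.

    Every name and number here is an illustrative assumption,
    not something taken from the paper.
    """
    cap: int = 3            # hypothetical: u ranges over 0..cap; u == cap means budget exhausted
    penalty: float = -10.0  # hypothetical penalty applied while the budget is exhausted
    u: int = 0              # current abstract state

    def step(self, label: str, energy_saved: float) -> float:
        """Advance the abstract state on a label and return the reward."""
        if label == "drop":
            self.u = min(self.u + 1, self.cap)
        elif label == "ok":
            self.u = max(self.u - 1, 0)
        else:
            raise ValueError(f"unknown label: {label}")
        # The reward depends only on the new abstract state and the immediate saving.
        return energy_saved + (self.penalty if self.u == self.cap else 0.0)


rm = ViolationRM()
rewards = [rm.step(label, 1.0) for label in ["drop", "drop", "drop", "ok"]]
print(rewards)  # [1.0, 1.0, -9.0, 1.0]
```

The same "drop" label earns a different reward on the third step than on the first, because the counter has recorded the earlier drops. That history lives in `u`, so the reward is Markovian once `u` is part of the state.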
If this is right
- Sleep decisions can satisfy time-averaged packet drop rates for deadline-constrained traffic.
- Sleep decisions can satisfy time-averaged minimum-throughput guarantees for constant-rate users.
- The method applies across diverse traffic patterns and QoS requirements in next-generation networks.
- Energy management for network components becomes a principled optimization rather than a heuristic search.
Where Pith is reading between the lines
- The reward-machine construction could extend to other cumulative performance metrics common in wireless systems, such as average delay or energy fairness.
- Real deployments might combine the abstract state tracker with online traffic estimators to adjust violation thresholds dynamically.
- The approach suggests that reward-machine augmentation could improve RL sample efficiency in other resource allocation settings with long-term constraints.
Load-bearing premise
Reward machines can be designed to track cumulative QoS violations in a compact way that keeps the augmented state space manageable for the RL agent to learn policies satisfying the constraints.
What would settle it
A simulation of realistic traffic in which the learned policy exceeds the allowed time-averaged packet drop rate or throughput violation threshold, or in which the state space grows too large for training to converge.
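As a rough illustration of the first half of that test, the sketch below computes a time-averaged drop rate from a logged trace and compares it with an allowed threshold. The trace, the threshold values, and the function names are hypothetical. The paper's actual evaluation metric may be defined over a different window.

```python
def time_averaged_rate(events):
    """Fraction of slots with a drop in a 0/1 trace (1 = packet dropped)."""
    if not events:
        raise ValueError("empty trace")
    return sum(events) / len(events)


def violates(events, max_rate):
    """True if the time-averaged drop rate exceeds the allowed rate."""
    return time_averaged_rate(events) > max_rate


# Hypothetical trace: 2 drops in 10 slots.
trace = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
print(time_averaged_rate(trace))   # 0.2
print(violates(trace, 0.1))        # True
print(violates(trace, 0.25))       # False
```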
Original abstract
Energy efficiency in mobile networks is crucial for sustainable telecommunications infrastructure, particularly as network densification continues to increase power consumption. Sleep mechanisms for the components in mobile networks can reduce energy use, but deciding which components to put to sleep, when, and for how long while preserving quality of service (QoS) remains a difficult optimisation problem. In this paper, we utilise reinforcement learning with reward machines (RMs) to make sleep-control decisions that balance immediate energy savings and long-term QoS impact, i.e. time-averaged packet drop rates for deadline-constrained traffic and time-averaged minimum-throughput guarantees for constant-rate users. A challenge is that time-averaged constraints depend on cumulative performance over time rather than immediate performance. As a result, the effective reward is non-Markovian, and optimal actions depend on operational history rather than the instantaneous system state. RMs account for the history dependence by maintaining an abstract state that explicitly tracks the QoS constraint violations over time. Our framework provides a principled, scalable approach to energy management for next-generation mobile networks under diverse traffic patterns and QoS requirements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a reinforcement learning framework that augments standard RL with reward machines (RMs) to solve sleep-control decisions in mobile networks. The RMs maintain an abstract state that tracks cumulative, time-averaged QoS violations (packet-drop rates for deadline traffic and minimum-throughput shortfalls for constant-rate users), thereby converting the non-Markovian constraint problem into a Markovian product MDP on which an RL agent can be trained to trade off immediate energy savings against long-term QoS.
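The product construction mentioned in the summary can be sketched as follows: the agent's state is the pair (environment state, RM state), and each step advances both. The toy queue model, the buffer size, the counter cap, and the reward values are assumptions for illustration only. They are not the system model of the manuscript.

```python
def env_step(queue, action, arrival):
    """Hypothetical toy cell. action 1 = sleep, 0 = awake; arrival is 0 or 1.

    Returns (next_queue, label, energy_saved).
    """
    if action == 1:                          # asleep: nothing is served, energy is saved
        queue = queue + arrival
        energy_saved = 1.0
    else:                                    # awake: serve one packet if any
        queue = max(queue + arrival - 1, 0)
        energy_saved = 0.0
    label = "drop" if queue > 2 else "ok"    # hypothetical buffer of 2 packets
    return min(queue, 2), label, energy_saved


def rm_step(u, label, cap=3):
    """RM transition on a capped violation counter, plus the RM's reward term."""
    u = min(u + 1, cap) if label == "drop" else max(u - 1, 0)
    return u, (-10.0 if u == cap else 0.0)


def product_step(state, action, arrival):
    """One step of the product MDP over pairs (queue, u)."""
    queue, u = state
    queue, label, energy_saved = env_step(queue, action, arrival)
    u, rm_reward = rm_step(u, label)
    return (queue, u), energy_saved + rm_reward


print(product_step((2, 2), 1, 1))  # ((2, 3), -9.0)
print(product_step((0, 0), 0, 1))  # ((0, 0), 0.0)
```

An off-the-shelf RL algorithm can then be trained on the pair `(queue, u)` as if it were an ordinary state, which is the point of the construction.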
Significance. If the RM-augmented state space remains compact and the learned policies reliably meet the time-averaged constraints, the work supplies a concrete, reusable technique for embedding long-horizon QoS requirements into RL-based network control. The approach is a direct application of existing RM theory to a high-impact domain (energy-efficient 5G/6G densification) and could be extended to other cumulative-constraint problems in communications.
major comments (2)
- [§3 and §4] §3 (Reward Machine Construction) and §4 (Product MDP): the manuscript does not supply an explicit upper bound on the number of RM states or on the discretization granularity of the running QoS counters as a function of the number of users, traffic intensity, or averaging window length. Without such a bound, the product state space can grow linearly or worse with user count, undermining the abstract claim that the framework is 'scalable' for realistic multi-user scenarios.
- [§5] §5 (Experimental Evaluation): the reported simulations use only small-scale topologies (few base stations, limited user counts). No ablation or scaling plot shows how RM state cardinality or learning sample complexity grows with user density or traffic variability, leaving the central scalability assertion unsupported by evidence.
minor comments (2)
- [§2–§3] The notation for the RM transition function and the mapping from system state to RM input alphabet should be introduced once and used consistently; currently the same symbols appear with slightly different meanings in §2 and §3.
- [Figure 2] Figure 2 (RM diagram) would benefit from an explicit legend indicating which RM states correspond to 'violation accumulated' versus 'within tolerance' regimes.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for stronger scalability analysis. We address each major comment below and will revise the manuscript accordingly.
- Referee: [§3 and §4] §3 (Reward Machine Construction) and §4 (Product MDP): the manuscript does not supply an explicit upper bound on the number of RM states or on the discretization granularity of the running QoS counters as a function of the number of users, traffic intensity, or averaging window length. Without such a bound, the product state space can grow linearly or worse with user count, undermining the abstract claim that the framework is 'scalable' for realistic multi-user scenarios.
  Authors: We agree that an explicit upper bound was not derived in the original manuscript. The RM states are defined by the discretized values of the cumulative QoS counters (one per constraint type per user or aggregate group). With L discretization levels, M users, and K constraint types, the RM state cardinality is at most L^{M*K}, and the size of the product MDP state space is at most the number of original MDP states multiplied by this quantity. In the revised manuscript we will add a dedicated paragraph in §4 that states this bound formally, discusses how the averaging window length and traffic intensity influence the L required to maintain constraint accuracy, and provides practical guidelines for selecting L to keep the state space tractable for typical 5G user densities.
  Revision: yes
- Referee: [§5] §5 (Experimental Evaluation): the reported simulations use only small-scale topologies (few base stations, limited user counts). No ablation or scaling plot shows how RM state cardinality or learning sample complexity grows with user density or traffic variability, leaving the central scalability assertion unsupported by evidence.
  Authors: The experiments in the current version were chosen to isolate the effect of the RM augmentation on policy quality and constraint satisfaction in controlled settings. We acknowledge that this leaves the scaling behavior unexamined. In the revision we will add (i) a plot of RM state cardinality versus number of users for fixed discretization, (ii) an ablation of learning curves (sample complexity) across increasing user densities and traffic variability, and (iii) a brief discussion relating the observed growth rates to the theoretical bound introduced in §4. These additions will directly support the scalability claim with both theoretical and empirical evidence.
  Revision: yes
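The bound quoted in the first response, L^{M*K} RM states, is easy to tabulate, and doing so shows why the referee's scalability concern matters. The sketch below evaluates it for a few hypothetical parameter choices. The specific values of L, M, K and the environment state count are assumptions, not figures from the manuscript.

```python
def rm_state_bound(levels: int, users: int, constraints: int) -> int:
    """Upper bound L**(M*K) on RM states: one counter per user per constraint type."""
    return levels ** (users * constraints)


def product_state_bound(env_states: int, levels: int, users: int, constraints: int) -> int:
    """Upper bound on product MDP states: environment states times RM states."""
    return env_states * rm_state_bound(levels, users, constraints)


# Hypothetical setting: L = 10 levels, K = 2 constraint types.
for users in (1, 2, 4):
    print(users, rm_state_bound(10, users, 2))
# 1 100
# 2 10000
# 4 100000000

print(product_state_bound(50, 10, 4, 2))  # 5000000000
```

With per-user counters the bound grows exponentially in the number of users. Tracking one counter per aggregate group, the alternative the authors mention, corresponds to replacing M with the number of groups, which is the M = 1 row above when all users share a single counter.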
Circularity Check
No circularity: application of established RM formalism to sleep control without self-referential derivations
The paper describes an application of reward machines to encode time-averaged QoS constraints for RL-based sleep decisions. No equations, parameter fits, or derivations are shown that reduce the claimed scalability or policy optimality to inputs by construction. The non-Markovian handling is attributed to the standard RM construction (abstract state tracking cumulative violations), which is imported from prior literature rather than redefined here. Central claims rest on the tractability of the product MDP, but this is presented as an engineering choice rather than a mathematical reduction to the paper's own fitted values or self-citations. No load-bearing self-citation chains or ansatzes are exhibited in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: an abstract state maintained by the reward machine can track cumulative QoS constraint violations over time.