Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning
Pith reviewed 2026-05-18 12:11 UTC · model grok-4.3
The pith
A PPO-based xApp using deep reinforcement learning jointly optimizes transmit power, bandwidth slicing, and user scheduling in O-RAN heterogeneous networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a near-real-time RAN intelligent controller (Near-RT RIC) xApp utilising deep reinforcement learning (DRL) to jointly optimise transmit power, bandwidth slicing, and user scheduling. Leveraging real-world network topologies, we benchmark proximal policy optimisation (PPO) and twin delayed deep deterministic policy gradient (TD3) against standard heuristics. Our results demonstrate that the PPO-based xApp achieves a superior trade-off, reducing network energy consumption by up to 70% in dense scenarios and improving user fairness by more than 30% compared to throughput-greedy baselines. These findings validate the feasibility of centralised, energy-aware AI orchestration in future 6G architectures.
What carries the argument
The PPO-based xApp in the Near-RT RIC that learns a policy for simultaneous control of transmit power, bandwidth slicing, and user scheduling.
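The clipped surrogate that PPO maximises (the paper passage quoted further down this page calls it the PPO clipped surrogate) can be sketched as follows. This is a minimal NumPy illustration of the standard PPO objective, not the authors' implementation. The function name `ppo_clip_objective`, the clip range `eps=0.2`, and the sample numbers are assumptions made for the example.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate, averaged over a batch.

    ratio     : pi_new(a|s) / pi_old(a|s) for each sample
    advantage : advantage estimate for each sample
    eps       : clip range (0.2 is a common default, assumed here)
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The element-wise minimum removes any gain from pushing the
    # probability ratio outside [1 - eps, 1 + eps].
    return float(np.mean(np.minimum(unclipped, clipped)))

# Illustrative batch of two samples (made-up numbers).
objective = ppo_clip_objective(ratio=[1.5, 0.5], advantage=[1.0, -1.0])
```

In the first sample the ratio is clipped down to 1.2; in the second it is clipped up to 0.8, so neither sample rewards a large policy step.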
If this is right
- Network energy consumption falls by up to 70% in dense scenarios.
- User fairness rises by more than 30% compared with throughput-greedy methods.
- PPO delivers a better energy-fairness balance than TD3 or standard heuristics.
- Centralised AI orchestration becomes feasible for energy-aware 6G resource allocation.
Where Pith is reading between the lines
- If the simulation-to-reality gap is small, operators could adopt similar xApps to cut operating costs in dense urban deployments.
- Joint multi-objective DRL may generalise to other wireless settings where power, spectrum, and scheduling interact strongly.
- Adding explicit mobility or interference dynamics to the training loop would test whether the reported gains remain stable.
Load-bearing premise
The simulated real-world network topologies and user load patterns used for benchmarking accurately predict performance in actual deployed heterogeneous networks.
What would settle it
Deploying the PPO xApp in a live heterogeneous network testbed and checking whether energy consumption drops by 70% and fairness rises by 30% relative to the same baselines.
Original abstract
Dynamic resource allocation in open radio access network (O-RAN) heterogeneous networks (HetNets) presents a complex optimisation challenge under varying user loads. We propose a near-real-time RAN intelligent controller (Near-RT RIC) xApp utilising deep reinforcement learning (DRL) to jointly optimise transmit power, bandwidth slicing, and user scheduling. Leveraging real-world network topologies, we benchmark proximal policy optimisation (PPO) and twin delayed deep deterministic policy gradient (TD3) against standard heuristics. Our results demonstrate that the PPO-based xApp achieves a superior trade-off, reducing network energy consumption by up to 70% in dense scenarios and improving user fairness by more than 30% compared to throughput-greedy baselines. These findings validate the feasibility of centralised, energy-aware AI orchestration in future 6G architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Near-RT RIC xApp that employs deep reinforcement learning (PPO and TD3) to jointly optimize transmit power, bandwidth slicing, and user scheduling in O-RAN HetNets. It benchmarks these agents against standard heuristics on real-world network topologies and reports that PPO achieves up to 70% lower energy consumption in dense scenarios and more than 30% better user fairness than throughput-greedy baselines, thereby supporting the feasibility of centralised energy-aware AI orchestration for 6G.
Significance. If the simulation results prove robust, the work would contribute to the timely problem of multi-objective resource allocation in open RAN architectures. The explicit comparison of PPO versus TD3 and the focus on energy-fairness trade-offs are strengths. However, the absence of experimental methodology details prevents a confident assessment of whether the headline gains are reproducible or generalisable.
major comments (2)
- [Abstract] Abstract: the quantitative claims of 'up to 70% energy reduction' and 'more than 30% fairness improvement' are presented without any description of the number of independent runs, statistical tests, confidence intervals, or error bars, rendering it impossible to determine whether the data support the stated superiority.
- [Evaluation] Evaluation section (implied by benchmarking description): the central claim that the simulator using 'real-world network topologies' and 'user load patterns' validates feasibility for deployed 6G networks rests on an unverified assumption; no evidence is supplied that the model captures small-scale fading correlation, O-RAN control-loop delays, or bursty traffic, so the reported deltas may be simulation-specific artifacts.
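The first major comment asks for run counts, spread, and confidence intervals. A minimal sketch of that kind of summary is shown below, assuming one energy figure per random seed for each method. The arrays are synthetic placeholders generated in the code, not results from the paper; the helper name `summarise_runs` is hypothetical, and the interval uses a normal approximation (z = 1.96) rather than a t-distribution.

```python
import numpy as np

def summarise_runs(values, z=1.96):
    """Return mean, sample standard deviation, and an approximate 95% CI."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)                    # sample standard deviation
    half_width = z * std / np.sqrt(values.size)
    return mean, std, (mean - half_width, mean + half_width)

# Synthetic placeholder data: 20 seeds per method (NOT the paper's numbers).
rng = np.random.default_rng(0)
baseline_energy = rng.normal(100.0, 5.0, size=20)
ppo_energy = rng.normal(40.0, 5.0, size=20)

# Per-seed relative energy reduction in percent, then its summary.
reduction_pct = 100.0 * (baseline_energy - ppo_energy) / baseline_energy
mean, std, ci = summarise_runs(reduction_pct)
print(f"energy reduction: {mean:.1f}% +/- {std:.1f} (approx. 95% CI {ci[0]:.1f} to {ci[1]:.1f})")
```

Reporting the per-seed reduction rather than a single best case would also show whether an "up to 70%" figure is typical or an outlier.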
minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the precise definitions of the energy-consumption and fairness metrics used for the reported percentages.
- [Method] Notation for the joint action space (power, bandwidth slice, scheduling) should be introduced consistently before the DRL formulation.
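The joint action space that the second minor comment refers to can be illustrated by decoding one flat policy output into its three parts. The layout below (sigmoid for per-base-station power, softmax for slice shares, top-k scoring for scheduling) and every name and dimension in it are assumptions for illustration; this page does not say how the paper encodes its actions.

```python
import numpy as np

def decode_action(raw, n_bs, n_slices, n_users, p_max_w, k_sched):
    """Split a flat action vector into power, bandwidth shares, and a schedule."""
    raw = np.asarray(raw, dtype=float)
    assert raw.size == n_bs + n_slices + n_users
    power_raw, slice_raw, sched_raw = np.split(raw, [n_bs, n_bs + n_slices])

    # Transmit power per base station, squashed into (0, p_max_w) watts.
    power_w = p_max_w / (1.0 + np.exp(-power_raw))

    # Bandwidth shares per slice: a softmax, so the shares sum to 1.
    e = np.exp(slice_raw - slice_raw.max())
    slice_share = e / e.sum()

    # Scheduling: pick the k users with the highest scores.
    scheduled = np.sort(np.argsort(sched_raw)[-k_sched:])
    return power_w, slice_share, scheduled

# Toy example: 3 base stations, 2 slices, 4 users, schedule 2 users.
raw = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9, -0.3, 0.5])
power_w, slice_share, scheduled = decode_action(
    raw, n_bs=3, n_slices=2, n_users=4, p_max_w=20.0, k_sched=2)
```

Fixing the order and range of each component up front is one way to keep the notation consistent between the problem statement and the DRL formulation.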
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments identify areas where additional clarity and transparency can strengthen the presentation of our results. We address each major comment below and outline the planned revisions.
Point-by-point responses
- Referee: [Abstract] Abstract: the quantitative claims of 'up to 70% energy reduction' and 'more than 30% fairness improvement' are presented without any description of the number of independent runs, statistical tests, confidence intervals, or error bars, rendering it impossible to determine whether the data support the stated superiority.
  Authors: We agree that the abstract would be improved by referencing the statistical basis of the reported figures. In the revised manuscript we will update the abstract to note that the headline results are averages computed over 20 independent runs with different random seeds, and we will direct readers to the evaluation section where mean values, standard deviations, and error bars are presented. We will also add a brief statement confirming that the observed improvements were consistent across runs.
  Revision: yes
- Referee: [Evaluation] Evaluation section (implied by benchmarking description): the central claim that the simulator using 'real-world network topologies' and 'user load patterns' validates feasibility for deployed 6G networks rests on an unverified assumption; no evidence is supplied that the model captures small-scale fading correlation, O-RAN control-loop delays, or bursty traffic, so the reported deltas may be simulation-specific artifacts.
  Authors: We acknowledge the validity of this observation. Our simulator incorporates publicly available real-world base-station locations and historical user-load traces, which go beyond purely synthetic random deployments. However, the channel model follows standard 3GPP path-loss and log-normal shadowing assumptions without explicit spatial correlation for small-scale fading, and the traffic model does not include fine-grained burstiness or O-RAN-specific control-loop latencies. In the revision we will expand the evaluation section to explicitly list these modeling choices and add a dedicated limitations paragraph discussing their implications for direct extrapolation to live 6G networks.
  Revision: partial
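The channel assumptions described in the second response (distance-dependent path loss plus log-normal shadowing, with no small-scale fading) amount to an SINR calculation along the following lines. This is a sketch under stated assumptions: the generic log-distance path-loss form, the exponent 3.5, the 40 dB reference loss, the 8 dB shadowing standard deviation, the transmit powers, the noise power, and the random node positions are all placeholders, not values from the paper or from a specific 3GPP scenario. Only the node counts (3 macro + 10 micro base stations, 50 users) come from the passage quoted on this page.

```python
import numpy as np

def sinr_db(bs_xy, ue_xy, tx_power_dbm, rng,
            pl_exponent=3.5, pl_ref_db=40.0, shadow_std_db=8.0,
            noise_dbm=-100.0):
    """Per-user downlink SINR in dB with strongest-BS association."""
    # Pairwise user-to-BS distances, floored at 1 m so log10 stays finite.
    d = np.linalg.norm(ue_xy[:, None, :] - bs_xy[None, :, :], axis=2)
    d = np.maximum(d, 1.0)

    # Log-distance path loss plus independent log-normal shadowing (in dB).
    path_loss_db = pl_ref_db + 10.0 * pl_exponent * np.log10(d)
    shadowing_db = rng.normal(0.0, shadow_std_db, size=d.shape)
    rx_dbm = tx_power_dbm[None, :] - path_loss_db - shadowing_db
    rx_mw = 10.0 ** (rx_dbm / 10.0)

    # Each user attaches to its strongest BS; all others interfere.
    serving = rx_mw.argmax(axis=1)
    signal = rx_mw[np.arange(len(ue_xy)), serving]
    interference = rx_mw.sum(axis=1) - signal
    noise_mw = 10.0 ** (noise_dbm / 10.0)
    return 10.0 * np.log10(signal / (interference + noise_mw)), serving

rng = np.random.default_rng(0)
bs_xy = rng.uniform(0.0, 1000.0, size=(13, 2))      # placeholder positions (m)
ue_xy = rng.uniform(0.0, 1000.0, size=(50, 2))
tx_power_dbm = np.array([46.0] * 3 + [30.0] * 10)   # assumed macro / micro power
sinr, serving = sinr_db(bs_xy, ue_xy, tx_power_dbm, rng)
```

Because shadowing is drawn independently per link here, the sketch shares the limitation the referee raises: it has no spatial correlation and no fast fading.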
Circularity Check
No circularity: empirical simulation benchmarks are independent of input definitions
The paper reports performance deltas obtained by training PPO and TD3 agents inside a simulator and comparing them to throughput-greedy heuristics on the same simulated topologies and load patterns. These numbers are direct outputs of the experimental runs rather than algebraic identities, fitted parameters renamed as predictions, or results that reduce to self-citations. No uniqueness theorems, ansatzes smuggled via prior work, or self-definitional loops appear in the derivation chain. The evaluation is therefore self-contained against external benchmarks (the heuristics), satisfying the condition for a zero-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- DRL hyperparameters
axioms (1)
- domain assumption Simulated topologies and load patterns sufficiently represent real heterogeneous network behavior
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $r_t = \kappa \sum T_U - \beta \sum P_{BS} + \phi \cdot \mathrm{Fairness}_t$ (Jain index); PPO clipped surrogate $L^{CLIP}$ and TD3 clipped double Q-learning for continuous power/bandwidth actions
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Satellite-derived topology with 3 macro + 10 micro BS, 50 users, path-loss + log-normal shadowing SINR model
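The reward quoted in the first passage above combines total user throughput, total base-station power, and Jain's fairness index. A minimal sketch of that computation follows. The weights `kappa`, `beta`, and `phi` are placeholder values, the units are left unspecified, and the function names are hypothetical; none of these come from the paper.

```python
import numpy as np

def jain_index(x):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    x = np.asarray(x, dtype=float)
    denom = x.size * np.sum(x ** 2)
    # Return 0.0 when every allocation is zero to avoid dividing by zero.
    return float(np.sum(x) ** 2 / denom) if denom > 0 else 0.0

def reward(user_throughput, bs_power, kappa=1.0, beta=0.1, phi=10.0):
    """Weighted throughput minus weighted power plus weighted fairness."""
    return float(kappa * np.sum(user_throughput)
                 - beta * np.sum(bs_power)
                 + phi * jain_index(user_throughput))

equal_share = jain_index([1.0, 1.0, 1.0, 1.0])    # all users served equally
single_user = jain_index([1.0, 0.0, 0.0, 0.0])    # one user gets everything
r = reward(user_throughput=[2.0, 2.0], bs_power=[10.0])
```

The index is 1 when all users receive the same throughput and 1/n when a single user receives everything, which is why a throughput-greedy policy scores poorly on the fairness term.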
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.