Deep Reinforcement Learning-Based Dynamic Resource Allocation in Cell-Free Massive MIMO
Pith reviewed 2026-05-16 12:43 UTC · model grok-4.3
The pith
Deep reinforcement learning maps large-scale fading to antenna and power coefficients in cell-free massive MIMO. In a simulated system with 40 APs and 20 users, this delivers 50 percent higher energy efficiency and a runtime more than 3000 times shorter than sequential convex approximation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A reinforcement learning agent can learn an effective mapping from large-scale fading coefficients to three scalars: an activation ratio, an antenna coefficient, and a power coefficient. These scalars are then substituted into closed-form spectral-efficiency and energy-efficiency expressions to determine the number of active antennas at each access point and the power allocated to each user. The claimed result is substantially higher energy efficiency at far lower computational cost than direct optimization of the original mixed-integer variables.
What carries the argument
A DRL agent learns a mapping from large-scale fading coefficients to an AP activation ratio, an antenna coefficient, and a power coefficient. These three values are then used inside closed-form expressions to set the active antennas and the user powers.
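To make the role of the three scalars concrete, here is a minimal Python sketch of how such outputs could be decoded into per-AP antenna counts and per-user powers. The paper's actual closed-form rules are not reproduced in this review, so the function name `decode_allocation`, the AP ranking rule, the rounding, and the uniform power split are all illustrative assumptions.

```python
import math


def decode_allocation(beta, activation_ratio, antenna_coef, power_coef,
                      antennas_per_ap, p_max):
    """Turn three scalars in [0, 1] into per-AP antenna counts and per-user powers.

    beta[m][k] is the large-scale fading coefficient between AP m and user k.
    The ranking and rounding rules below are illustrative assumptions.
    """
    num_aps = len(beta)
    num_users = len(beta[0])

    # Assumption: rank APs by summed large-scale fading and keep the strongest.
    num_active = max(1, math.ceil(activation_ratio * num_aps))
    ranked = sorted(range(num_aps), key=lambda m: sum(beta[m]), reverse=True)
    active = set(ranked[:num_active])

    # Assumption: every active AP switches on the same fraction of its antennas.
    per_ap = max(1, math.ceil(antenna_coef * antennas_per_ap))
    antennas = [per_ap if m in active else 0 for m in range(num_aps)]

    # Assumption: every user receives the same fraction of the power budget.
    powers = [power_coef * p_max for _ in range(num_users)]
    return antennas, powers


# Toy input: 4 APs, 2 users, 4 antennas per AP.
beta = [[0.9, 0.1], [0.2, 0.3], [0.05, 0.02], [0.4, 0.6]]
antennas, powers = decode_allocation(beta, 0.5, 0.75, 0.5,
                                     antennas_per_ap=4, p_max=1.0)
print(antennas, powers)
```

The point of the sketch is only the dimensionality: the agent outputs three numbers regardless of how many APs and users there are, and a fixed rule expands them into the full allocation.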
Load-bearing premise
The reinforcement learning agent trained on particular channel statistics must produce coefficients that still improve energy efficiency when the resulting antenna and power settings are evaluated on new channel realizations.
What would settle it
Deploy the learned policy on a fresh set of large-scale fading realizations drawn from a different statistical distribution than the training data and check whether the resulting energy efficiency drops below the value achieved by sequential convex approximation on the same realizations.
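The test described above can be sketched as a small Python harness. This is a rough outline under stated assumptions: `sample_lsf_db`, `shortfall_rate`, the -100 dB mean, and the two stand-in evaluators are hypothetical placeholders, to be replaced by the trained policy's energy efficiency and the SCA result on the same realizations.

```python
import random


def sample_lsf_db(num_aps, num_users, shadow_std_db, rng):
    """Draw large-scale fading values in dB: a fixed mean plus Gaussian shadowing.

    The -100 dB mean is a placeholder assumption, not a value from the paper.
    """
    return [[-100.0 + rng.gauss(0.0, shadow_std_db) for _ in range(num_users)]
            for _ in range(num_aps)]


def shortfall_rate(policy_ee, baseline_ee, realizations):
    """Fraction of realizations on which the policy falls below the baseline."""
    worse = sum(1 for lsf in realizations if policy_ee(lsf) < baseline_ee(lsf))
    return worse / len(realizations)


rng = random.Random(0)
# Hypothetical shift: a larger shadowing spread at test time than in training.
test_set = [sample_lsf_db(4, 2, shadow_std_db=12.0, rng=rng) for _ in range(50)]


# Stand-in evaluators (assumptions): swap in the learned policy's EE and SCA's EE.
def toy_policy_ee(lsf):
    return 1.0


def toy_baseline_ee(lsf):
    return 0.8


print(shortfall_rate(toy_policy_ee, toy_baseline_ee, test_set))
```

A shortfall rate near zero on shifted realizations would support the generalization premise; a high rate would undercut the headline gain.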
Original abstract
In this paper, we consider power allocation and antenna activation of cell-free massive multiple-input multiple-output (CFmMIMO) systems. We first derive closed-form expressions for the system spectral efficiency (SE) and energy efficiency (EE) as functions of the power allocation coefficients and the number of active antennas at the access points (APs). Then, we aim to enhance the EE through jointly optimizing antenna activation and power control. This task leads to a non-convex and mixed-integer design problem with high-dimensional design variables. To address this, we propose a novel DRL-based framework, in which the agent learns to map large-scale fading coefficients to AP activation ratio, antenna coefficient, and power coefficient. These coefficients are then employed to determine the number of active antennas per AP and the power factors assigned to users based on closed-form expressions. By optimizing these parameters instead of directly controlling antenna selection and power allocation, the proposed method transforms the intractable optimization into a low-dimensional learning task. Our extensive simulations demonstrate the efficiency and scalability of the proposed scheme. Specifically, in a CFmMIMO system with 40 APs and 20 users, it achieves a 50% EE improvement and 3350 times run time reduction compared to the conventional sequential convex approximation method.
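For orientation, total energy efficiency in this literature is conventionally the achievable sum rate divided by the total consumed power. The form below is that generic definition, not the paper's closed-form expression; the symbols $B$ (bandwidth in Hz), $\mathrm{SE}_k$ (spectral efficiency of user $k$ in bit/s/Hz), $K$ (number of users), and $P_{\mathrm{total}}$ (total power consumption in W) are notation assumed here.

```latex
\mathrm{EE} \;=\; \frac{B \sum_{k=1}^{K} \mathrm{SE}_k}{P_{\mathrm{total}}}
\quad \text{[bit/Joule]}
```

Activating more antennas or raising transmit power typically increases the numerator and the denominator at the same time, which is why the joint antenna and power problem is a genuine trade-off rather than a monotone one.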
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives closed-form expressions for spectral efficiency (SE) and energy efficiency (EE) in cell-free massive MIMO (CFmMIMO) systems as functions of power allocation coefficients and the number of active antennas per access point (AP). It then proposes a deep reinforcement learning (DRL) framework in which an agent learns a mapping from large-scale fading coefficients to three low-dimensional parameters (AP activation ratio, antenna coefficient, and power coefficient). These parameters are inserted into the closed-form expressions to determine antenna activation and user power factors. Simulations for a 40-AP, 20-user setup report a 50% EE improvement and 3350-fold runtime reduction relative to sequential convex approximation (SCA).
Significance. If the DRL policy generalizes reliably, the approach would convert an intractable high-dimensional non-convex mixed-integer optimization into a low-dimensional learning task, delivering both higher EE and orders-of-magnitude faster execution than conventional solvers. This could enable scalable, near-real-time resource allocation in large CFmMIMO deployments where SCA becomes prohibitive.
major comments (2)
- [Abstract / Simulation Results] Abstract and simulation section: The reported 50% EE gain and 3350x runtime reduction are obtained by substituting the DRL outputs (activation ratio, antenna coefficient, power coefficient) into the derived closed-form EE expression. No results are shown for test large-scale fading realizations drawn from distributions with different AP/user geometries or shadowing variances; without such out-of-distribution evaluation, the headline gains rest on an unverified generalization assumption and may not hold for unseen channels.
- [Proposed DRL Framework] Proposed framework section: The claim that optimizing the three learned coefficients reliably solves the original non-convex problem is central, yet the manuscript provides no analytic bound or optimality gap analysis showing how close the resulting EE lies to the global optimum obtained by exhaustive search or tighter relaxations on small instances.
minor comments (1)
- [Notation] Notation for the three learned coefficients (activation ratio, antenna coefficient, power coefficient) is introduced without a summary table relating them to the original variables; adding such a table would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.
- Referee: [Abstract / Simulation Results] Abstract and simulation section: The reported 50% EE gain and 3350x runtime reduction are obtained by substituting the DRL outputs (activation ratio, antenna coefficient, power coefficient) into the derived closed-form EE expression. No results are shown for test large-scale fading realizations drawn from distributions with different AP/user geometries or shadowing variances; without such out-of-distribution evaluation, the headline gains rest on an unverified generalization assumption and may not hold for unseen channels.
  Authors: We appreciate this observation. The current results use a representative 40-AP, 20-user setup in which large-scale fading coefficients (including AP/user positions and shadowing) are randomly generated according to the standard model, with the DRL agent trained across many realizations drawn from this distribution. To address concerns about generalization, we will add new simulation results in the revised manuscript that evaluate the learned policy on out-of-distribution cases, including different AP densities, user distributions, and shadowing variances. These additions will provide direct evidence regarding the robustness of the reported gains.
  Revision: yes
- Referee: [Proposed DRL Framework] Proposed framework section: The claim that optimizing the three learned coefficients reliably solves the original non-convex problem is central, yet the manuscript provides no analytic bound or optimality gap analysis showing how close the resulting EE lies to the global optimum obtained by exhaustive search or tighter relaxations on small instances.
  Authors: We agree that an analytic optimality bound would strengthen the claims. However, obtaining a rigorous closed-form bound is intractable for this high-dimensional mixed-integer non-convex problem. Our validation relies on consistent outperformance of the SCA benchmark in both EE and runtime. In the revision we will add empirical comparisons on smaller instances (where exhaustive search or tighter relaxations become feasible) to quantify the gap, together with an expanded discussion of the approximation limitations.
  Revision: partial
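A small-instance gap measurement of the kind discussed in this exchange could be sketched in Python as follows. Everything here is a toy under stated assumptions: `toy_ee` is a made-up surrogate and not the paper's closed-form expression, the power grid is arbitrary, and the candidate allocation merely stands in for whatever the learned policy would output.

```python
import itertools
import math


def toy_ee(antennas, power):
    """Toy energy-efficiency surrogate (an assumption, not the paper's formula)."""
    total_antennas = sum(antennas)
    if total_antennas == 0:
        return 0.0
    rate = math.log2(1.0 + total_antennas * power)
    consumed = 1.0 + 0.2 * total_antennas + power
    return rate / consumed


def exhaustive_best(num_aps, max_antennas, power_grid):
    """Enumerate every antenna-count vector and grid power; return the best EE."""
    best = 0.0
    for antennas in itertools.product(range(max_antennas + 1), repeat=num_aps):
        for power in power_grid:
            best = max(best, toy_ee(antennas, power))
    return best


def optimality_gap(candidate_ee, best_ee):
    """Relative shortfall of a candidate against the enumerated optimum."""
    return (best_ee - candidate_ee) / best_ee


power_grid = [0.25, 0.5, 0.75, 1.0]
best = exhaustive_best(num_aps=3, max_antennas=4, power_grid=power_grid)
candidate = toy_ee((2, 2, 0), 0.5)  # hypothetical stand-in for the policy output
gap = optimality_gap(candidate, best)
print(f"gap = {gap:.3f}")
```

Enumeration grows as (max_antennas + 1) to the power of the number of APs, so this check is only feasible for a handful of APs, which is exactly why it complements rather than replaces the SCA comparison at full scale.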
Circularity Check
No significant circularity; closed-form derivations and DRL mapping remain independent of target results.
The paper derives closed-form SE and EE expressions from the underlying CFmMIMO channel model and statistics as an initial step. It then trains a DRL agent to produce low-dimensional coefficients (activation ratio, antenna coefficient, power coefficient) from large-scale fading inputs, which are substituted into the pre-derived closed-forms for performance evaluation. This structure does not reduce any claimed EE improvement or runtime gain to a fitted parameter by construction, nor does it invoke self-citations or uniqueness theorems to justify the core mapping. The comparison to SCA is performed externally via simulation, keeping the derivation chain self-contained against the system model.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: closed-form expressions for spectral efficiency and energy efficiency accurately capture system performance as functions of power coefficients and active antennas.