Deep Reinforcement Learning-Based Dynamic Resource Allocation in Cell-Free Massive MIMO

Hien Quoc Ngo; Markku Juntti; Nhan Thanh Nguyen; Phuong Nam Tran

arxiv: 2601.13934 · v4 · submitted 2026-01-20 · 📡 eess.SP


Pith reviewed 2026-05-16 12:43 UTC · model grok-4.3


keywords: cell-free massive MIMO · deep reinforcement learning · energy efficiency · power allocation · antenna activation · resource allocation · spectral efficiency


The pith

Deep reinforcement learning maps large-scale fading coefficients to three compact coefficients (AP activation ratio, antenna coefficient, and power coefficient) in cell-free massive MIMO, delivering 50 percent higher energy efficiency and a 3350-fold runtime reduction relative to sequential convex approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first derives closed-form expressions for spectral efficiency and energy efficiency in cell-free massive MIMO as functions of power allocation coefficients and the number of active antennas per access point. It then frames joint antenna activation and power control as a non-convex mixed-integer problem and solves it with a deep reinforcement learning agent that learns a direct mapping from large-scale fading coefficients to three compact parameters: AP activation ratio, antenna coefficient, and power coefficient. These parameters are inserted into the closed-form expressions to set the actual number of active antennas and user powers. The reduction from high-dimensional discrete decisions to low-dimensional continuous coefficients makes the learning task tractable. Simulations with 40 access points and 20 users confirm the approach yields a 50 percent energy-efficiency gain together with a 3350-fold runtime reduction relative to conventional sequential convex approximation.
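The mapping from three scalars to per-AP antenna counts and per-user powers can be illustrated with a minimal sketch. The paper's closed-form expressions are not reproduced on this page, so the decoding rule below is a hypothetical stand-in: the function name `decode_action`, the strongest-AP ranking, the uniform antenna count, the fractional power rule, and the parameter `n_max` are all assumptions made for illustration only.

```python
import numpy as np

def decode_action(beta, rho, a, p, n_max=8):
    """Hypothetical decoding of three scalars into antenna counts and powers.

    This is NOT the paper's closed-form rule, only an illustrative stand-in.
    beta : (M, K) array of large-scale fading coefficients
    rho  : AP activation ratio in [0, 1]
    a    : antenna coefficient in [0, 1]
    p    : power coefficient in [0, 1]
    """
    M, K = beta.shape
    # Rank APs by total large-scale fading and keep the strongest fraction rho.
    strength = beta.sum(axis=1)
    n_active_aps = max(1, int(round(rho * M)))
    active = np.argsort(strength)[::-1][:n_active_aps]
    # Every active AP switches on the same number of antennas (at least one).
    antennas = np.zeros(M, dtype=int)
    antennas[active] = max(1, int(round(a * n_max)))
    # Fractional power rule: users with weaker links get a larger share as p grows.
    user_gain = beta[active].sum(axis=0)
    weights = user_gain ** (-p)
    power = weights / weights.sum()
    return antennas, power

rng = np.random.default_rng(0)
beta = rng.uniform(1e-3, 1.0, size=(40, 20))  # placeholder LSF values, not a channel model
antennas, power = decode_action(beta, rho=0.5, a=0.25, p=0.5)
```

Under these assumptions the agent emits 3 numbers regardless of network size, while the decoded output still has one antenna count per AP (40 here) and one power factor per user (20 here).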

Core claim

A reinforcement learning agent can learn an effective mapping from large-scale fading coefficients to activation ratio, antenna coefficient, and power coefficient; these three scalars are then substituted into closed-form spectral-efficiency and energy-efficiency expressions to determine the number of active antennas at each access point and the power allocated to each user, producing substantially higher energy efficiency at far lower computational cost than direct optimization of the original mixed-integer variables.

What carries the argument

The DRL agent that learns a mapping from large-scale fading coefficients to AP activation ratio, antenna coefficient, and power coefficient, which are then used inside closed-form expressions to set active antennas and user powers.

Load-bearing premise

The reinforcement learning agent trained on particular channel statistics must produce coefficients that still improve energy efficiency when the resulting antenna and power settings are evaluated on new channel realizations.

What would settle it

Deploy the learned policy on a fresh set of large-scale fading realizations drawn from a different statistical distribution than the training data and check whether the resulting energy efficiency drops below the value achieved by sequential convex approximation on the same realizations.
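A test of this kind could be organized as in the following sketch. Everything here is an assumption for illustration: the toy path-loss constants, the square deployment area, and the callables `policy_ee` and `baseline_ee`, which stand in for "EE achieved by the learned policy" and "EE achieved by sequential convex approximation" on a given realization. Neither is implemented here.

```python
import numpy as np

def sample_lsf(rng, n_ap, n_user, shadow_std_db, area_m=1000.0):
    """Toy large-scale fading: distance-based path loss plus log-normal shadowing.

    The path-loss constants are illustrative, not taken from the paper.
    """
    ap = rng.uniform(0, area_m, size=(n_ap, 2))
    ue = rng.uniform(0, area_m, size=(n_user, 2))
    # Pairwise AP-user distances, offset by 1 m to avoid log10(0).
    dist = np.linalg.norm(ap[:, None, :] - ue[None, :, :], axis=2) + 1.0
    path_loss_db = -30.0 - 36.7 * np.log10(dist)
    shadow_db = rng.normal(0.0, shadow_std_db, size=dist.shape)
    return 10.0 ** ((path_loss_db + shadow_db) / 10.0)

def fraction_policy_wins(policy_ee, baseline_ee, rng, shadow_std_db, n_trials=200):
    """Fraction of fresh realizations on which the policy's EE beats the baseline's."""
    wins = 0
    for _ in range(n_trials):
        beta = sample_lsf(rng, 40, 20, shadow_std_db)
        wins += int(policy_ee(beta) > baseline_ee(beta))
    return wins / n_trials

rng = np.random.default_rng(1)
beta_example = sample_lsf(rng, 40, 20, shadow_std_db=8.0)
```

Calling `fraction_policy_wins` with a shadowing standard deviation different from the one used in training, and with real evaluators in place of the two callables, would give a direct estimate of how often the learned policy still outperforms the baseline on shifted data.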

Figures

Figures reproduced from arXiv: 2601.13934 by Hien Quoc Ngo, Markku Juntti, Nhan Thanh Nguyen, Phuong Nam Tran.

**Figure 2.** Convergence, energy efficiency, and runtime performance of the proposed DRL scheme compared with benchmarks.

Original abstract

In this paper, we consider power allocation and antenna activation of cell-free massive multiple-input multiple-output (CFmMIMO) systems. We first derive closed-form expressions for the system spectral efficiency (SE) and energy efficiency (EE) as functions of the power allocation coefficients and the number of active antennas at the access points (APs). Then, we aim to enhance the EE through jointly optimizing antenna activation and power control. This task leads to a non-convex and mixed-integer design problem with high-dimensional design variables. To address this, we propose a novel DRL-based framework, in which the agent learns to map large-scale fading coefficients to AP activation ratio, antenna coefficient, and power coefficient. These coefficients are then employed to determine the number of active antennas per AP and the power factors assigned to users based on closed-form expressions. By optimizing these parameters instead of directly controlling antenna selection and power allocation, the proposed method transforms the intractable optimization into a low-dimensional learning task. Our extensive simulations demonstrate the efficiency and scalability of the proposed scheme. Specifically, in a CFmMIMO system with 40 APs and 20 users, it achieves a 50% EE improvement and 3350 times run time reduction compared to the conventional sequential convex approximation method.
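For orientation, the structure of the design problem described in the abstract can be written generically as below. The notation is an assumption, since the paper's own symbols and closed-form expressions are not shown on this page: $N_m$ is the number of active antennas at AP $m$, $N$ the number of antennas installed per AP, $\eta_k$ the power factor of user $k$, $\beta_{mk}$ the large-scale fading coefficient, $B$ the bandwidth, and $f_m$, $g_k$ placeholders for the closed-form decoding rules. Transmit power constraints are omitted because their exact form is not stated here.

```latex
% Generic energy-efficiency maximization (mixed-integer, non-convex)
\max_{\{N_m\},\,\{\eta_k\}} \quad
  \mathrm{EE} = \frac{B \sum_{k=1}^{K} \mathrm{SE}_k\bigl(\{N_m\},\{\eta_k\}\bigr)}
                     {P_{\mathrm{total}}\bigl(\{N_m\},\{\eta_k\}\bigr)}
\qquad \text{s.t.} \quad N_m \in \{0, 1, \dots, N\}, \quad \eta_k \ge 0 .

% Low-dimensional reparameterization learned by the agent \pi_\theta
(\rho, a, p) = \pi_\theta\bigl(\{\beta_{mk}\}\bigr), \qquad
N_m = f_m\bigl(\rho, a; \{\beta_{mk}\}\bigr), \qquad
\eta_k = g_k\bigl(p; \{\beta_{mk}\}\bigr).
```

Written this way, the search over $M$ integers and $K$ continuous variables is replaced by a search over the three scalars $(\rho, a, p)$.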

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reduces joint antenna activation and power control in CFmMIMO to a low-dimensional DRL task by learning three coefficients from large-scale fading that plug into closed-form EE expressions, with a reported 50% EE gain and a 3350-fold runtime reduction versus SCA.


The main point is that this work derives closed-form SE and EE expressions first, then trains a DRL agent to output just an activation ratio, antenna coefficient, and power coefficient from the large-scale fading coefficients. Those three numbers then set the number of active antennas per AP and the power factors for users. This sidesteps the original high-dimensional mixed-integer non-convex problem and turns it into a lower-dimensional learning task that runs fast at inference time.

Referee Report

2 major / 1 minor

Summary. The manuscript derives closed-form expressions for spectral efficiency (SE) and energy efficiency (EE) in cell-free massive MIMO (CFmMIMO) systems as functions of power allocation coefficients and the number of active antennas per access point (AP). It then proposes a deep reinforcement learning (DRL) framework in which an agent learns a mapping from large-scale fading coefficients to three low-dimensional parameters (AP activation ratio, antenna coefficient, and power coefficient). These parameters are inserted into the closed-form expressions to determine antenna activation and user power factors. Simulations for a 40-AP, 20-user setup report a 50% EE improvement and 3350-fold runtime reduction relative to sequential convex approximation (SCA).

Significance. If the DRL policy generalizes reliably, the approach would convert an intractable high-dimensional non-convex mixed-integer optimization into a low-dimensional learning task, delivering both higher EE and orders-of-magnitude faster execution than conventional solvers. This could enable scalable, near-real-time resource allocation in large CFmMIMO deployments where SCA becomes prohibitive.

major comments (2)

[Abstract / Simulation Results] Abstract and simulation section: The reported 50% EE gain and 3350x runtime reduction are obtained by substituting the DRL outputs (activation ratio, antenna coefficient, power coefficient) into the derived closed-form EE expression. No results are shown for test large-scale fading realizations drawn from distributions with different AP/user geometries or shadowing variances; without such out-of-distribution evaluation, the headline gains rest on an unverified generalization assumption and may not hold for unseen channels.
[Proposed DRL Framework] Proposed framework section: The claim that optimizing the three learned coefficients reliably solves the original non-convex problem is central, yet the manuscript provides no analytic bound or optimality gap analysis showing how close the resulting EE lies to the global optimum obtained by exhaustive search or tighter relaxations on small instances.
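The optimality-gap check requested in the second major comment could look like the following sketch on a small instance. The EE surrogate `toy_ee`, its power-model constants, and the fixed `candidate` allocation are hypothetical placeholders and are not the paper's expressions or the agent's output. Only the exhaustive search over antenna counts is the point.

```python
import itertools
import numpy as np

def toy_ee(antennas, beta, p_fixed=1.0, p_per_antenna=0.2):
    """Hypothetical EE surrogate: sum rate with array gain over a linear power model."""
    if antennas.sum() == 0:
        return 0.0
    signal = (antennas[:, None] * beta).sum(axis=0)  # per-user aggregate gain
    rate = np.log2(1.0 + signal).sum()
    power = p_fixed * np.count_nonzero(antennas) + p_per_antenna * antennas.sum()
    return rate / power

def exhaustive_best(beta, n_max):
    """Enumerate every antenna-count vector in {0..n_max}^M and return the best."""
    M = beta.shape[0]
    best = max(itertools.product(range(n_max + 1), repeat=M),
               key=lambda n: toy_ee(np.array(n), beta))
    best = np.array(best)
    return best, toy_ee(best, beta)

rng = np.random.default_rng(0)
beta = rng.uniform(0.1, 1.0, size=(4, 3))      # 4 APs, 3 users: 5**4 = 625 candidates
best_n, best_ee = exhaustive_best(beta, n_max=4)
candidate = np.full(4, 2)                       # stand-in for a learned policy's output
gap = 1.0 - toy_ee(candidate, beta) / best_ee   # relative optimality gap
```

The search space grows as `(n_max + 1) ** M`, so the same enumeration with 40 APs and `n_max=4` would cover roughly 9.1 × 10^27 candidates, which is why such a comparison is only feasible on small instances.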

minor comments (1)

[Notation] Notation for the three learned coefficients (activation ratio, antenna coefficient, power coefficient) is introduced without a summary table relating them to the original variables; adding such a table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.


Referee: [Abstract / Simulation Results] Abstract and simulation section: The reported 50% EE gain and 3350x runtime reduction are obtained by substituting the DRL outputs (activation ratio, antenna coefficient, power coefficient) into the derived closed-form EE expression. No results are shown for test large-scale fading realizations drawn from distributions with different AP/user geometries or shadowing variances; without such out-of-distribution evaluation, the headline gains rest on an unverified generalization assumption and may not hold for unseen channels.

Authors: We appreciate this observation. The current results use a representative 40-AP, 20-user setup in which large-scale fading coefficients (including AP/user positions and shadowing) are randomly generated according to the standard model, with the DRL agent trained across many realizations drawn from this distribution. To address concerns about generalization, we will add new simulation results in the revised manuscript that evaluate the learned policy on out-of-distribution cases, including different AP densities, user distributions, and shadowing variances. These additions will provide direct evidence regarding the robustness of the reported gains.

revision: yes

Referee: [Proposed DRL Framework] Proposed framework section: The claim that optimizing the three learned coefficients reliably solves the original non-convex problem is central, yet the manuscript provides no analytic bound or optimality gap analysis showing how close the resulting EE lies to the global optimum obtained by exhaustive search or tighter relaxations on small instances.

Authors: We agree that an analytic optimality bound would strengthen the claims. However, obtaining a rigorous closed-form bound is intractable for this high-dimensional mixed-integer non-convex problem. Our validation relies on consistent outperformance of the SCA benchmark in both EE and runtime. In the revision we will add empirical comparisons on smaller instances (where exhaustive search or tighter relaxations become feasible) to quantify the gap, together with an expanded discussion of the approximation limitations.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; closed-form derivations and DRL mapping remain independent of target results.


The paper derives closed-form SE and EE expressions from the underlying CFmMIMO channel model and statistics as an initial step. It then trains a DRL agent to produce low-dimensional coefficients (activation ratio, antenna coefficient, power coefficient) from large-scale fading inputs, which are substituted into the pre-derived closed-forms for performance evaluation. This structure does not reduce any claimed EE improvement or runtime gain to a fitted parameter by construction, nor does it invoke self-citations or uniqueness theorems to justify the core mapping. The comparison to SCA is performed externally via simulation, keeping the derivation chain self-contained against the system model.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only view limits full audit. The method rests on derived closed-form SE/EE expressions whose accuracy is assumed, plus standard wireless channel models. No invented entities are introduced.

axioms (1)

Domain assumption: closed-form expressions for spectral efficiency and energy efficiency accurately capture system performance as functions of power coefficients and active antennas.
Invoked to enable the DRL mapping; derivation details are not provided in the abstract.



