Two-Layer Reinforcement Learning-Assisted Joint Beamforming and Trajectory Optimization for Multi-UAV Downlink Communications
Pith reviewed 2026-05-16 13:44 UTC · model grok-4.3
The pith
A two-layer learning system pairs graph neural networks for sub-millisecond beamforming with multi-agent reinforcement learning for trajectory planning to raise sum rates in multi-UAV downlink networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. Beamforming is handled by GraphNorm-augmented GNNs that model the dynamic interference topology explicitly, and the trajectory sub-problem is solved by multi-agent RL under centralized training with decentralized execution. Together, the two layers are claimed to produce higher system sum rates than either numerical solvers or standard deep-learning baselines while meeting real-time latency constraints.
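To make the headline metric concrete, here is a minimal sketch of a downlink sum rate. It assumes a standard multi-user MISO model in which each user treats every other user's beam as interference. The paper's exact system model is not reproduced on this page, so the function name `sum_rate`, the argument layout, and the default unit noise power are illustrative assumptions.

```python
import math


def sum_rate(channels, beams, noise_power=1.0):
    """Return sum_k log2(1 + SINR_k) in bit/s/Hz.

    channels[k]: complex channel vector towards user k (assumed layout).
    beams[j]:    complex beamforming vector carrying user j's data.
    """
    def inner(h, w):
        # Hermitian inner product h^H w.
        return sum(complex(a).conjugate() * b for a, b in zip(h, w))

    total = 0.0
    for k, h_k in enumerate(channels):
        signal = abs(inner(h_k, beams[k])) ** 2
        # Power leaking into user k from beams meant for other users.
        interference = sum(
            abs(inner(h_k, w_j)) ** 2
            for j, w_j in enumerate(beams)
            if j != k
        )
        total += math.log2(1.0 + signal / (interference + noise_power))
    return total
```

With two users on orthogonal channels and matched unit-power beams, each SINR equals 1 at unit noise power, so each user contributes one bit/s/Hz.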
What carries the argument
A hierarchically decoupled framework that uses a topology-aware GNN beamformer on the short timescale and multi-agent proximal policy optimization for decentralized trajectory planning on the long timescale.
If this is right
- Beamforming decisions can be computed at sub-millisecond latency without solving non-convex programs at every slot.
- UAVs can learn cooperative trajectory policies that improve collective sum rate without requiring a central controller at inference time.
- The same separation of timescales can be reused for other coupled resource-allocation tasks that mix fast radio-frequency variables with slow physical movement variables.
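The timescale separation in the points above can be sketched as a nested loop: a slow outer loop that moves the UAVs and a fast inner loop that recomputes beams every slot. This is an assumed control flow, not the authors' code; `plan_step`, `beamform`, and `rate_fn` are hypothetical placeholders for the trajectory policy, the beamformer, and the rate metric.

```python
def run_episode(positions, plan_step, beamform, rate_fn, n_frames, slots_per_frame):
    """Average per-slot rate over one episode of a two-timescale scheme.

    plan_step(positions)       -> new positions (long timescale, once per frame)
    beamform(positions, slot)  -> beams         (short timescale, every slot)
    rate_fn(positions, beams)  -> float         (e.g. a sum rate)
    All three callables are placeholders supplied by the caller.
    """
    total = 0.0
    for _ in range(n_frames):
        # Long timescale: positions change once per frame.
        positions = plan_step(positions)
        for slot in range(slots_per_frame):
            # Short timescale: beams are refreshed while positions stay fixed.
            beams = beamform(positions, slot)
            total += rate_fn(positions, beams)
    return total / (n_frames * slots_per_frame), positions
```

Keeping the two learners behind plain callables is a design choice of this sketch: either layer can be swapped (for example, a numerical solver in place of the GNN) without touching the loop.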
Where Pith is reading between the lines
- The framework may extend to uplink or full-duplex UAV scenarios if the GNN graph is redefined to include uplink interference edges.
- Adding a third layer for energy or regulatory constraints could be tested by extending the Markov decision process state with battery or no-fly-zone indicators.
- Real-world validation would require replacing the paper's perfect-CSI assumption with online channel estimation and checking whether the GNN still generalizes.
Load-bearing premise
That performance measured under idealized channel models and perfect channel-state information will remain high when real hardware imperfections and imperfect channel estimates are present.
What would settle it
A field trial on actual UAV hardware that records a drop in achieved sum rate below the simulated baseline once measured channel-state information error exceeds the level assumed in the paper's simulations.
Original abstract
Unmanned aerial vehicles (UAVs) are pivotal for future 6G non-terrestrial networks, yet their high mobility creates a complex coupled optimization problem for beamforming and trajectory design. Existing numerical methods suffer from prohibitive latency, while standard deep learning often ignores dynamic interference topology, limiting scalability. To address these issues, this paper proposes a hierarchically decoupled framework synergizing graph neural networks (GNNs) with multi-agent reinforcement learning. Specifically, on the short timescale, we develop a topology-aware GNN beamformer by incorporating GraphNorm. By modeling the dynamic UAV-user association as a time-varying heterogeneous graph, this method explicitly extracts interference patterns to achieve sub-millisecond inference. On the long timescale, trajectory planning is modeled as a decentralized partially observable Markov decision process and solved via the multi-agent proximal policy optimization algorithm under the centralized training with decentralized execution paradigm, facilitating cooperative behaviors. Extensive simulation results demonstrate that the proposed framework significantly outperforms state-of-the-art optimization heuristics and deep learning baselines in terms of system sum rate, convergence speed, and generalization capability.
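For readers unfamiliar with GraphNorm, the sketch below applies the normalization defined by Cai et al. to the node features of a single graph: each feature is shifted by a learnable fraction `alpha` of its mean over the graph's nodes, scaled by the standard deviation of the shifted values, then passed through an affine map (`gamma`, `beta`). It is written in plain Python for clarity. How the paper wires this into its heterogeneous GNN is not shown on this page, so the function signature here is an assumption.

```python
import math


def graph_norm(h, alpha, gamma, beta, eps=1e-5):
    """GraphNorm over one graph.

    h:     list of node feature vectors, shape [num_nodes][num_features].
    alpha: per-feature learnable mean-shift weights.
    gamma: per-feature learnable scales.
    beta:  per-feature learnable offsets.
    """
    n, d = len(h), len(h[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        mu = sum(row[j] for row in h) / n
        # Shift by alpha * mean instead of the full mean.
        shifted = [row[j] - alpha[j] * mu for row in h]
        # Variance is taken of the shifted values, as in the GraphNorm paper.
        var = sum(s * s for s in shifted) / n
        scale = gamma[j] / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = scale * shifted[i] + beta[j]
    return out
```

Setting `alpha` to 1 gives a per-graph standardization with full mean subtraction; a learned `alpha` below 1 keeps part of the mean, which the GraphNorm authors motivate as preserving information that full mean subtraction would discard.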
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hierarchically decoupled two-layer framework for joint beamforming and trajectory optimization in multi-UAV downlink communications. Short-timescale beamforming is handled by a topology-aware GNN incorporating GraphNorm to model dynamic UAV-user associations as time-varying heterogeneous graphs for sub-millisecond inference. Long-timescale trajectory planning is formulated as a decentralized POMDP and solved using multi-agent proximal policy optimization (MAPPO) under centralized training with decentralized execution. Extensive simulations are reported to demonstrate significant gains over optimization heuristics and deep learning baselines in sum rate, convergence speed, and generalization.
Significance. The hierarchical GNN-MAPPO decoupling addresses a relevant scalability challenge in 6G non-terrestrial networks by separating interference-aware beamforming from cooperative trajectory planning. The explicit use of graph structure for dynamic interference topology and the CTDE paradigm for multi-agent cooperation are technically sound ideas that could enable real-time operation if the performance advantages hold. However, the idealized simulation regime limits the assessed significance for practical deployment.
Major comments (2)
- [Simulation results section] The central claims of superior sum rate, convergence, and generalization rest on simulations under perfect instantaneous CSI and standard path-loss models, with no reported error bars, no description of baseline implementation details, and no data exclusion rules. This prevents independent verification of the stated gains and is load-bearing for the outperformance claim.
- [Proposed framework section] The short-timescale GNN beamformer and long-timescale MAPPO planner are evaluated only under perfect CSI; no robustness curves are provided under CSI estimation error (e.g., 5-15% normalized MSE) or hardware impairments. This mismatch between assumed and actual interference topology directly affects whether the hierarchical decoupling retains its reported advantage.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity and verifiability of our work. We address each major comment below and have revised the manuscript accordingly.
- Referee: [Simulation results section] The central claims of superior sum rate, convergence, and generalization rest on simulations under perfect instantaneous CSI and standard path-loss models, with no reported error bars, no description of baseline implementation details, and no data exclusion rules. This prevents independent verification of the stated gains and is load-bearing for the outperformance claim.
  Authors: We agree that additional details are required for independent verification. In the revised manuscript, we have added error bars (mean ± one standard deviation over 100 independent Monte Carlo runs) to all sum-rate, convergence, and generalization plots in Section IV. We have also expanded the simulation setup subsection with complete baseline implementation details, including the exact optimization algorithms, iteration limits, and neural-network hyperparameters used for the deep-learning baselines. We explicitly state that no data exclusion rules were applied; all simulation runs are retained and averaged. These changes are now included in the main text and a new appendix.
  Revision: yes
- Referee: [Proposed framework section] The short-timescale GNN beamformer and long-timescale MAPPO planner are evaluated only under perfect CSI; no robustness curves are provided under CSI estimation error (e.g., 5-15% normalized MSE) or hardware impairments. This mismatch between assumed and actual interference topology directly affects whether the hierarchical decoupling retains its reported advantage.
  Authors: The original evaluations assume perfect CSI to isolate the benefits of the hierarchical GNN-MAPPO decoupling. To address robustness, we have added a new figure and accompanying analysis in the revised simulation section showing sum-rate performance under CSI estimation errors with normalized MSE ranging from 0% to 15%. The results confirm that the proposed framework retains its advantage over baselines, although the margin narrows at higher error levels. For hardware impairments, we have inserted a limitations paragraph explaining that incorporating specific models (e.g., phase noise or quantization) would require extending the channel model beyond the paper's scope; we discuss how the framework could be adapted in future work. This constitutes a partial but substantive revision.
  Revision: partial
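The two checks discussed in these responses, error bars over Monte Carlo runs and robustness to CSI error at a given normalized MSE, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the error model is additive circularly symmetric complex Gaussian noise scaled to the target NMSE, the reported spread is the sample standard deviation, and the names `add_csi_error` and `mean_and_std` are hypothetical.

```python
import math
import random
import statistics


def add_csi_error(h, nmse, rng):
    """Return a noisy channel estimate with E||h_hat - h||^2 / ||h||^2 = nmse.

    Assumes additive complex Gaussian estimation error (an assumption of
    this sketch; the page does not state the error model).
    """
    power = sum(abs(x) ** 2 for x in h) / len(h)
    # Split the per-entry error variance equally between real and imaginary parts.
    sigma = math.sqrt(nmse * power / 2.0)
    return [x + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for x in h]


def mean_and_std(run_once, n_runs=100, seed=0):
    """Run `run_once(rng)` n_runs times with independent seeds.

    Returns (mean, sample standard deviation) of the returned floats,
    suitable for "mean ± one standard deviation" error bars.
    """
    master = random.Random(seed)
    samples = [
        run_once(random.Random(master.getrandbits(32))) for _ in range(n_runs)
    ]
    return statistics.mean(samples), statistics.stdev(samples)
```

A robustness curve would then call `mean_and_std` once per NMSE level (for example 0.0, 0.05, 0.10, 0.15), with `run_once` drawing a channel, corrupting it through `add_csi_error`, computing beams from the corrupted estimate, and scoring them against the true channel.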
Circularity Check
No significant circularity in derivation chain
The paper presents a hierarchical framework with GNN-based short-timescale beamforming and MAPPO-based long-timescale trajectory planning. Claims of outperformance rest on simulation comparisons to external baselines under idealized CSI and channel models. No equations reduce outputs to inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing steps rely on self-citations that collapse the central result. The derivation remains self-contained against the stated simulation benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Dynamic UAV-user associations can be represented as time-varying heterogeneous graphs whose interference patterns are extractable by GNNs.
- Domain assumption: Trajectory planning constitutes a decentralized partially observable Markov decision process solvable by MAPPO under CTDE.
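For context on the second assumption, MAPPO trains each agent's policy with the standard PPO clipped surrogate. Under CTDE, the advantage estimates come from a critic that sees global state during training, while the probability ratio depends only on the agent's local-observation policy. The sketch below computes that surrogate for a batch; the function name and the default clip range of 0.2 are illustrative assumptions, not values taken from the paper.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate (to be maximized).

    ratios[i]:     pi_new(a_i | o_i) / pi_old(a_i | o_i) for sample i.
    advantages[i]: advantage estimate for sample i (from a centralized
                   critic in the CTDE setting).
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum gives a pessimistic bound on the improvement.
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)
```

The clipping removes the incentive to push the ratio outside `[1 - clip_eps, 1 + clip_eps]`, which limits how far one update can move a policy away from the one that collected the data.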