Reinforcement Learning-Based Trajectory Design for the Aerial Base Stations

Behzad Khamidehi; Elvino S. Sousa

arxiv: 1906.09550 · v2 · pith:ZO42D72Jnew · submitted 2019-06-23 · 📡 eess.SP · cs.AI · cs.LG · cs.NI

Pith reviewed 2026-05-25 18:14 UTC · model grok-4.3

classification 📡 eess.SP · cs.AI · cs.LG · cs.NI

keywords Q-learning · trajectory optimization · aerial base stations · sum-rate maximization · power allocation · sub-channel assignment · reinforcement learning · distributed algorithm

The pith

Q-learning lets aerial base stations learn optimal trajectories from reward signals that reflect network topology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to set flight paths for multiple aerial base stations so that the total data rate to served users is as large as possible. It separates the overall task into two parts: choosing the paths and assigning power plus sub-channels to users. A distributed Q-learning procedure lets each station improve its path choices using only the reward values it receives, with little need to share data back to a central controller. The authors report that this model-free method succeeds because the rewards embed enough details about user locations and channel conditions.

Core claim

Dividing the multi-ABS sum-rate maximization task into a trajectory sub-problem and a joint power and sub-channel assignment sub-problem allows a distributed Q-learning algorithm to train each aerial base station on reward signals that carry topology information, producing trajectories that increase delivered rates without large amounts of information exchange.

What carries the argument

Q-learning, a model-free reinforcement learning technique that updates action-value estimates from received rewards to discover policies for trajectory, power, and sub-channel choices.
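
As a rough illustration of the update described above, the following is a minimal tabular Q-learning sketch. The action names, the state encoding as grid coordinates, and the values of `alpha`, `gamma`, and `epsilon` are assumptions chosen for illustration; the page does not give the paper's actual state space, action space, or hyperparameters.

```python
import random
from collections import defaultdict

# Hypothetical action set; the paper's actual action space is not given here.
ACTIONS = ["north", "south", "east", "west", "hover"]


def make_q_table():
    """Q-table mapping (state, action) pairs to value estimates, default 0."""
    return defaultdict(float)


def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]
```

The learner never sees a channel or propagation model; everything it knows about the environment arrives through `reward`, which is why the quality of the reward signal matters so much for the paper's claim.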

If this is right

The algorithm runs in a distributed fashion and requires only modest information exchange with the core network.
Joint optimization of trajectories together with power and sub-channel assignments is achieved through the same learning process.
Performance gains appear in simulations even though no explicit model of the propagation environment is supplied to the learner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-driven approach could be tested on problems where the number of aerial base stations or users changes over time.
If the reward design can be made robust to partial observability, the method might apply to scenarios with limited sensing at each station.
Comparing the learned trajectories against solutions from centralized optimization solvers would quantify the price of the distributed, model-free restriction.

Load-bearing premise

The reward signals given to each aerial base station contain enough information about the locations and channel conditions of the users to guide effective trajectory choices.

What would settle it

Run the same simulation setup but replace the topology-dependent rewards with random or constant values and check whether the learned trajectories still produce higher sum-rates than a non-learning baseline.
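
The ablation described above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: a one-dimensional corridor of eight cells, a single user at the far end, a made-up rate function, and a constant reward of zero as the ablated signal. It is not the paper's simulation setup.

```python
import math
import random

N_CELLS = 8            # hypothetical 1-D corridor of grid cells
USER_CELL = 7          # hypothetical user location (far end of the corridor)
MOVES = (-1, 0, +1)    # fly left, hover, fly right


def rate(cell):
    """Toy rate in bit/s/Hz that decays with the distance to the user."""
    d = abs(cell - USER_CELL)
    return math.log2(1.0 + 10.0 / (1.0 + d * d))


def step(cell, move):
    """Deterministic move, clipped to the corridor."""
    return min(N_CELLS - 1, max(0, cell + move))


def greedy(q, cell):
    """Index of the best move; ties go to the first move in MOVES."""
    return max(range(len(MOVES)), key=lambda i: q[cell][i])


def train(reward_fn, episodes=2000, horizon=30, alpha=0.1, gamma=0.9,
          epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and random starts."""
    rng = random.Random(seed)
    q = [[0.0] * len(MOVES) for _ in range(N_CELLS)]
    for _ in range(episodes):
        cell = rng.randrange(N_CELLS)
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.randrange(len(MOVES))
            else:
                a = greedy(q, cell)
            nxt = step(cell, MOVES[a])
            target = reward_fn(nxt) + gamma * max(q[nxt])
            q[cell][a] += alpha * (target - q[cell][a])
            cell = nxt
    return q


def evaluate(q, start=0, horizon=20):
    """Average true rate collected by the greedy policy from `start`."""
    cell, total = start, 0.0
    for _ in range(horizon):
        cell = step(cell, MOVES[greedy(q, cell)])
        total += rate(cell)
    return total / horizon


informed = evaluate(train(rate))               # topology-dependent reward
ablated = evaluate(train(lambda cell: 0.0))    # constant (zero) reward
print(f"informed: {informed:.3f}  ablated: {ablated:.3f}")
```

With a constant zero reward every Q-value stays at zero, so the greedy policy reduces to the tie-breaking rule and the agent never approaches the user. A random reward, the other ablation mentioned above, would need averaging over several seeds before any conclusion could be drawn.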

Figures

Figures reproduced from arXiv: 1906.09550 by Behzad Khamidehi, Elvino S. Sousa.

**Figure 2.** Final trajectory of the ABSs. … trained to find their trajectories based on the topology of the network. This figure also shows that, in addition to moving from the initial position toward the final position, the ABSs repeatedly decrease their distances to their associated users. This is essential for the ABSs since by reducing the distance to a user, the link quality between the ABS and the aforementioned use…

Original abstract

In this paper, the trajectory optimization problem for a multi-aerial base station (ABS) communication network is investigated. The objective is to find the trajectory of the ABSs so that the sum-rate of the users served by each ABS is maximized. To reach this goal, along with the optimal trajectory design, optimal power and sub-channel allocation is also of great importance to support the users with the highest possible data rates. To solve this complicated problem, we divide it into two sub-problems: ABS trajectory optimization sub-problem, and joint power and sub-channel assignment sub-problem. Then, based on the Q-learning method, we develop a distributed algorithm which solves these sub-problems efficiently, and does not need significant amount of information exchange between the ABSs and the core network. Simulation results show that although Q-learning is a model-free reinforcement learning technique, it has a remarkable capability to train the ABSs to optimize their trajectories based on the received reward signals, which carry decent information from the topology of the network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies standard Q-learning to multi-ABS trajectory and resource allocation in a distributed setup, but the central claim about local rewards carrying enough topology information rests on thin evidence.

The paper splits the joint trajectory-plus-resource problem for multiple aerial base stations into two sub-problems and applies Q-learning to each in a distributed fashion that requires little information exchange. That split and the emphasis on minimal communication are the main concrete moves. The distributed framing fits a practical scenario where core-network coordination is limited, and the authors correctly note that model-free RL can in principle learn from reward signals without an explicit channel model. Those are the usable parts.

The rest is thin. The abstract states that simulations show the method works and that the rewards carry decent topology information, yet it gives no numbers, baselines, convergence plots, or setup parameters. Without those, it is impossible to tell whether the learned trajectories beat simple heuristics or even centralized greedy allocation.

The reward-sufficiency concern stands: if each ABS only observes its own served users' rates, the reward signal is local and cannot directly encode cross-ABS interference or global topology. The paper would need to demonstrate either that local observations are still sufficient or that some implicit coordination emerges; the current text does not.

This work is aimed at researchers already working on RL for drone or temporary wireless networks. A reader in that niche might pick up the sub-problem decomposition as a starting point, but the lack of quantitative evidence and the open question on reward sufficiency make it hard to treat as a solid reference. It deserves peer review so the authors can supply the missing simulation details and test whether the distributed rewards actually support the claimed global optimization.

Referee Report

2 major / 0 minor

Summary. The paper investigates trajectory optimization in multi-ABS networks to maximize user sum-rate. It decomposes the joint problem into an ABS trajectory sub-problem and a joint power/sub-channel assignment sub-problem, then proposes a distributed Q-learning algorithm that solves both without significant information exchange between the ABSs and the core network. The central claim is that model-free Q-learning can train the ABSs to optimize trajectories because the received reward signals carry decent information from the network topology, as demonstrated by simulations.

Significance. If the simulation evidence were provided and the reward-topology link were shown to hold under the distributed constraint, the work would illustrate a practical route for applying model-free RL to decentralized trajectory design in aerial networks, reducing reliance on centralized coordination.

major comments (2)

[Abstract] The claim that reward signals 'carry decent information from the topology of the network' is load-bearing for the central assertion that Q-learning can produce globally useful trajectories. The same paragraph states that the algorithm 'does not need significant amount of information exchange between the ABSs and the core network'; this distributed constraint implies each ABS's reward is computed from local observations only, which cannot directly encode cross-ABS interference or the full topology and therefore undermines the learning guarantee.
[Abstract] The paper invokes 'simulation results' to support the 'remarkable capability' of Q-learning, yet supplies no description of the simulation setup, number of ABSs/users, channel models, baselines, quantitative metrics (e.g., sum-rate improvement, convergence), or statistical significance. Without these details the empirical claim cannot be evaluated and cannot serve as load-bearing evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation.

Referee: [Abstract] The claim that reward signals 'carry decent information from the topology of the network' is load-bearing for the central assertion that Q-learning can produce globally useful trajectories. The same paragraph states that the algorithm 'does not need significant amount of information exchange between the ABSs and the core network'; this distributed constraint implies each ABS's reward is computed from local observations only, which cannot directly encode cross-ABS interference or the full topology and therefore undermines the learning guarantee.

Authors: We appreciate the referee's observation on the distributed setting. Each ABS computes its reward from the sum-rate of its locally served users; because these rates are affected by inter-cell interference (which depends on the locations, powers, and sub-channel choices of neighboring ABSs), the scalar reward implicitly encodes topology-dependent effects without requiring explicit message passing. The Q-table therefore learns policies that mitigate such interference. We will revise the abstract to state this mechanism more explicitly and add a short clarifying paragraph in the algorithm description (Section III).

revision: yes

Referee: [Abstract] The paper invokes 'simulation results' to support the 'remarkable capability' of Q-learning, yet supplies no description of the simulation setup, number of ABSs/users, channel models, baselines, quantitative metrics (e.g., sum-rate improvement, convergence), or statistical significance. Without these details the empirical claim cannot be evaluated and cannot serve as load-bearing evidence.

Authors: We agree that the abstract is too terse on the empirical evidence. The full manuscript (Section IV) reports results for 2–4 ABSs serving 10–20 users under a 3GPP urban macro channel model with distance-dependent path loss, log-normal shadowing, and Rayleigh fading; baselines include static hovering, random-walk trajectories, and centralized exhaustive search; metrics show 15–35 % sum-rate gains and convergence within roughly 800–1200 episodes. We will expand the abstract with a concise summary of these parameters and the main quantitative outcomes.

revision: yes
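
The mechanism invoked in the first response (a local sum-rate reward that reflects neighboring ABSs only through interference) can be sketched as follows. The free-space-style gain, the fixed altitude, the equal transmit powers, and the assumption that all ABSs share one sub-channel are illustrative placeholders, not the paper's channel model or resource allocation.

```python
import math


def channel_gain(abs_pos, user_pos, altitude=100.0, path_loss_exp=2.0):
    """Toy distance-based gain; a placeholder for the paper's channel model."""
    dx = abs_pos[0] - user_pos[0]
    dy = abs_pos[1] - user_pos[1]
    dist_sq = dx * dx + dy * dy + altitude * altitude
    return dist_sq ** (-path_loss_exp / 2.0)


def local_reward(own, served_users, neighbours, tx_power=1.0, noise=1e-9):
    """Sum-rate (bit/s/Hz) of the users served by the ABS located at `own`.

    Neighbouring ABSs enter only as interference in each user's SINR, so the
    scalar reward changes with their positions even though no messages are
    exchanged. All ABSs are assumed to transmit on the same sub-channel.
    """
    total = 0.0
    for user in served_users:
        signal = tx_power * channel_gain(own, user)
        interference = sum(tx_power * channel_gain(n, user) for n in neighbours)
        total += math.log2(1.0 + signal / (noise + interference))
    return total


users = [(0.0, 0.0)]
print(local_reward((0.0, 0.0), users, []))               # no neighbours
print(local_reward((0.0, 0.0), users, [(1000.0, 0.0)]))  # distant neighbour
print(local_reward((0.0, 0.0), users, [(100.0, 0.0)]))   # nearby neighbour
```

The three printed values decrease as the interferer moves closer, which is the sense in which a local reward can carry some topology information. Whether that signal is rich enough to coordinate several simultaneously learning ABSs is the referee's open question, and this sketch does not settle it.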

Circularity Check

0 steps flagged

No circularity: simulation-driven Q-learning with external rewards

The paper decomposes the problem into trajectory and power/sub-channel sub-problems, then applies standard Q-learning in a distributed fashion. The central claim that reward signals carry topology information is presented as an empirical outcome of the simulations rather than a derivation that reduces to fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are shown that collapse by construction to the inputs. The approach relies on external simulation rewards and standard RL, making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach assumes reward signals encode useful topology information and that sub-problem decomposition is valid.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] I. Bor-Yaliniz and H. Yanikomeroglu, “The New Frontier in RAN Heterogeneity: Multi-Tier Drone-Cells,” IEEE Commun. Mag., vol. 54, no. 11, pp. 48–55, Nov. 2016.

[2] Y. Zeng, J. Lyu, and R. Zhang, “Cellular-Connected UAV: Potential, Challenges, and Promising Technologies,” IEEE Wireless Commun., vol. 26, no. 1, pp. 120–127, Feb. 2019.

[3] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless Communications with Unmanned Aerial Vehicles: Opportunities and Challenges,” IEEE Commun. Mag., vol. 54, no. 5, pp. 36–42, May 2016.

[4] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim, “Placement Optimization of UAV-Mounted Mobile Base Stations,” IEEE Commun. Lett., vol. 21, no. 3, pp. 604–607, Mar. 2017.

[5] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, “3-D Placement of an Unmanned Aerial Vehicle Base Station (UAV-BS) for Energy-Efficient Maximal Coverage,” IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–437, Aug. 2017.

[6] R. Ghanavi, E. Kalantari, M. Sabbaghian, H. Yanikomeroglu, and A. Yongacoglu, “Efficient 3D aerial base station placement considering users mobility by reinforcement learning,” in Proc. IEEE WCNC, Apr. 2018, pp. 1–6.

[7] Q. Wu, Y. Zeng, and R. Zhang, “Joint Trajectory and Communication Design for Multi-UAV Enabled Wireless Networks,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.

[8] F. Cheng, S. Zhang, Z. Li, Y. Chen, N. Zhao, F. R. Yu, and V. C. M. Leung, “UAV Trajectory Optimization for Data Offloading at the Edge of Multiple Cells,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6732–6736, Jul. 2018.

[9] Q. Wu and R. Zhang, “Common Throughput Maximization in UAV-Enabled OFDMA Systems with Delay Consideration,”

[10] “Common Throughput Maximization in UAV-Enabled OFDMA Systems with Delay Consideration,” [Online]. Available: http://arxiv.org/abs/1801.00444

[11] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP Altitude for Maximum Coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Dec. 2014.

[12] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.

[13] C. J. C. H. Watkins and P. Dayan, “Q-Learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.

[14] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
