
arxiv: 1906.09550 · v2 · pith:ZO42D72Jnew · submitted 2019-06-23 · 📡 eess.SP · cs.AI · cs.LG · cs.NI

Reinforcement Learning-Based Trajectory Design for the Aerial Base Stations

Pith reviewed 2026-05-25 18:14 UTC · model grok-4.3

classification 📡 eess.SP · cs.AI · cs.LG · cs.NI
keywords Q-learning, trajectory optimization, aerial base stations, sum-rate maximization, power allocation, sub-channel assignment, reinforcement learning, distributed algorithm

The pith

Q-learning lets aerial base stations learn optimal trajectories from reward signals that reflect network topology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to set flight paths for multiple aerial base stations so that the total data rate to served users is as large as possible. It separates the overall task into two parts: choosing the paths and assigning power plus sub-channels to users. A distributed Q-learning procedure lets each station improve its path choices using only the reward values it receives, with little need to share data back to a central controller. The authors report that this model-free method succeeds because the rewards embed enough details about user locations and channel conditions.

Core claim

Dividing the multi-ABS sum-rate maximization task into a trajectory sub-problem and a joint power and sub-channel assignment sub-problem allows a distributed Q-learning algorithm to train each aerial base station on reward signals that carry topology information, producing trajectories that increase delivered rates without large amounts of information exchange.

What carries the argument

Q-learning, a model-free reinforcement learning technique that updates action-value estimates from received rewards to discover policies for trajectory, power, and sub-channel choices.
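
As a rough illustration of the update this description refers to, here is a minimal sketch of one tabular Q-learning step with epsilon-greedy action selection. The grid-cell state, the four movement actions, and the names `q_update`, `epsilon_greedy`, and `ACTIONS` are assumptions made for the example; this summary does not give the paper's actual state, action, or reward definitions.

```python
import random
from collections import defaultdict

# Hypothetical movement actions for one aerial base station (ABS).
ACTIONS = ("north", "south", "east", "west")


def q_update(q, state, action, reward, next_state,
             actions=ACTIONS, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def epsilon_greedy(q, state, epsilon, rng, actions=ACTIONS):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


# Usage: the table starts at zero and is refined from observed rewards only.
q_table = defaultdict(float)
q_update(q_table, (0, 0), "east", reward=1.0, next_state=(1, 0))
next_action = epsilon_greedy(q_table, (0, 0), epsilon=0.1, rng=random.Random(0))
```

The learner never sees a propagation model; everything it knows about the environment enters through the scalar `reward` argument, which is why the reward design matters so much in this setting.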

If this is right

  • The algorithm runs in a distributed fashion and requires only modest information exchange with the core network.
  • Joint optimization of trajectories together with power and sub-channel assignments is achieved through the same learning process.
  • Performance gains appear in simulations even though no explicit model of the propagation environment is supplied to the learner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-driven approach could be tested on problems where the number of aerial base stations or users changes over time.
  • If the reward design can be made robust to partial observability, the method might apply to scenarios with limited sensing at each station.
  • Comparing the learned trajectories against solutions from centralized optimization solvers would quantify the price of the distributed, model-free restriction.

Load-bearing premise

The reward signals given to each aerial base station contain enough information about the locations and channel conditions of the users to guide effective trajectory choices.

What would settle it

Run the same simulation setup but replace the topology-dependent rewards with random or constant values and check whether the learned trajectories still produce higher sum-rates than a non-learning baseline.
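
The shape of such an ablation can be sketched on a toy problem. Everything below is an assumption made for illustration and not the paper's setup: a single ABS on a 1-D corridor of seven cells, one user, an inverse-square rate model, and a uniformly random behaviour policy with random start cells (Q-learning is off-policy, so this still learns the greedy policy). The sketch trains once with a rate-based reward and once with a constant reward, then compares both against a hovering baseline.

```python
import math
import random

POSITIONS = range(7)        # hypothetical 1-D corridor of grid cells
ACTIONS = (-1, 0, 1)        # move left, hover, move right
USER_POS, START_POS, HORIZON = 5, 0, 12


def rate(pos, snr0=10.0, height=1.0):
    """Toy rate (bit/s/Hz) to a single user; a placeholder channel model."""
    dist_sq = height ** 2 + (pos - USER_POS) ** 2
    return math.log2(1.0 + snr0 / dist_sq)


def step(pos, action):
    """Move along the corridor, clipping at both ends."""
    return min(max(pos + action, 0), len(POSITIONS) - 1)


def train(reward_fn, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning driven only by the scalar reward_fn(next_position)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in POSITIONS for a in ACTIONS}
    for _ in range(episodes):
        pos = rng.randrange(len(POSITIONS))
        for _ in range(HORIZON):
            action = rng.choice(ACTIONS)
            nxt = step(pos, action)
            target = reward_fn(nxt) + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(pos, action)] += alpha * (target - q[(pos, action)])
            pos = nxt
    return q


def rollout_sum_rate(q=None):
    """True sum-rate of a greedy rollout; q=None means hover at the start."""
    pos, total = START_POS, 0.0
    for _ in range(HORIZON):
        action = 0 if q is None else max(ACTIONS, key=lambda a: q[(pos, a)])
        pos = step(pos, action)
        total += rate(pos)
    return total


hover = rollout_sum_rate()                            # non-learning baseline
informed = rollout_sum_rate(train(rate))              # topology-dependent reward
ablated = rollout_sum_rate(train(lambda pos: 1.0))    # constant reward
print(f"hover={hover:.2f} informed={informed:.2f} ablated={ablated:.2f}")
```

With a constant reward all action values converge to the same number, so the greedy trajectory is decided by numerical ties and carries no information about where the user is. A real version of the test would repeat this over several seeds and use the paper's multi-ABS environment.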

Figures

Figures reproduced from arXiv: 1906.09550 by Behzad Khamidehi, Elvino S. Sousa.

Figure 1: Interaction between agent and environment.
Figure 2: Final trajectory of the ABSs. trained to find their trajectories based on the topology of the network. This figure also shows that in addition to moving from the initial position toward the final position, the ABSs repeatedly decrease their distances to their associated users. This is essential for the ABSs since by reducing the distance to a user, the link quality between the ABS and the aforementioned use…
Original abstract

In this paper, the trajectory optimization problem for a multi-aerial base station (ABS) communication network is investigated. The objective is to find the trajectory of the ABSs so that the sum-rate of the users served by each ABS is maximized. To reach this goal, along with the optimal trajectory design, optimal power and sub-channel allocation is also of great importance to support the users with the highest possible data rates. To solve this complicated problem, we divide it into two sub-problems: ABS trajectory optimization sub-problem, and joint power and sub-channel assignment sub-problem. Then, based on the Q-learning method, we develop a distributed algorithm which solves these sub-problems efficiently, and does not need significant amount of information exchange between the ABSs and the core network. Simulation results show that although Q-learning is a model-free reinforcement learning technique, it has a remarkable capability to train the ABSs to optimize their trajectories based on the received reward signals, which carry decent information from the topology of the network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper investigates trajectory optimization in multi-ABS networks to maximize user sum-rate. It decomposes the joint problem into an ABS trajectory sub-problem and a joint power/sub-channel assignment sub-problem, then proposes a distributed Q-learning algorithm that solves both without significant information exchange between the ABSs and the core network. The central claim is that model-free Q-learning can train the ABSs to optimize trajectories because the received reward signals carry decent information from the network topology, as demonstrated by simulations.

Significance. If the simulation evidence were provided and the reward-topology link were shown to hold under the distributed constraint, the work would illustrate a practical route for applying model-free RL to decentralized trajectory design in aerial networks, reducing reliance on centralized coordination.

major comments (2)
  1. [Abstract] Abstract: The claim that reward signals 'carry decent information from the topology of the network' is load-bearing for the central assertion that Q-learning can produce globally useful trajectories. The same paragraph states that the algorithm requires 'no significant amount of information exchange' between ABSs; this distributed constraint implies each ABS's reward is computed from local observations only, which cannot encode cross-ABS interference or the full topology and therefore undermines the learning guarantee.
  2. [Abstract] Abstract: The paper invokes 'simulation results' to support the 'remarkable capability' of Q-learning, yet supplies no description of the simulation setup, number of ABSs/users, channel models, baselines, quantitative metrics (e.g., sum-rate improvement, convergence), or statistical significance. Without these details the empirical claim cannot be evaluated and is not load-bearing evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation.

point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that reward signals 'carry decent information from the topology of the network' is load-bearing for the central assertion that Q-learning can produce globally useful trajectories. The same paragraph states that the algorithm requires 'no significant amount of information exchange' between ABSs; this distributed constraint implies each ABS's reward is computed from local observations only, which cannot encode cross-ABS interference or the full topology and therefore undermines the learning guarantee.

    Authors: We appreciate the referee's observation on the distributed setting. Each ABS computes its reward from the sum-rate of its locally served users; because these rates are affected by inter-cell interference (which depends on the locations, powers, and sub-channel choices of neighboring ABSs), the scalar reward implicitly encodes topology-dependent effects without requiring explicit message passing. The Q-table therefore learns policies that mitigate such interference. We will revise the abstract to state this mechanism more explicitly and add a short clarifying paragraph in the algorithm description (Section III). revision: yes

  2. Referee: [Abstract] Abstract: The paper invokes 'simulation results' to support the 'remarkable capability' of Q-learning, yet supplies no description of the simulation setup, number of ABSs/users, channel models, baselines, quantitative metrics (e.g., sum-rate improvement, convergence), or statistical significance. Without these details the empirical claim cannot be evaluated and is not load-bearing evidence.

    Authors: We agree that the abstract is too terse on the empirical evidence. The full manuscript (Section IV) reports results for 2–4 ABSs serving 10–20 users under a 3GPP urban macro channel model with distance-dependent path loss, log-normal shadowing, and Rayleigh fading; baselines include static hovering, random-walk trajectories, and centralized exhaustive search; metrics show 15–35% sum-rate gains and convergence within roughly 800–1200 episodes. We will expand the abstract with a concise summary of these parameters and the main quantitative outcomes. revision: yes
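
The mechanism in the first response, a local sum-rate reward that implicitly reflects neighbouring ABSs through interference, can be sketched as follows. This is a hedged illustration only: the inverse-square `channel_gain`, the assumption that every ABS transmits on the same sub-channel, the noise level, and the names `local_reward` and `users_of` are all hypothetical and not taken from the paper.

```python
import math


def channel_gain(abs_pos, user_pos, height=100.0):
    """Inverse-square gain from an ABS at altitude `height`; a placeholder model."""
    dx = abs_pos[0] - user_pos[0]
    dy = abs_pos[1] - user_pos[1]
    return 1.0 / (dx * dx + dy * dy + height * height)


def local_reward(k, abs_positions, tx_power, users_of, noise=1e-9):
    """Sum-rate of the users served by ABS k, with all other ABSs as interferers."""
    total = 0.0
    for user in users_of[k]:
        signal = tx_power[k] * channel_gain(abs_positions[k], user)
        interference = sum(
            tx_power[j] * channel_gain(abs_positions[j], user)
            for j in range(len(abs_positions))
            if j != k
        )
        total += math.log2(1.0 + signal / (noise + interference))
    return total


# Usage: ABS 0 measures only its own users' rates, yet the value changes
# when the neighbouring ABS 1 moves closer.
users_of = {0: [(0.0, 0.0)], 1: [(500.0, 0.0)]}
reward_far = local_reward(0, [(0.0, 0.0), (500.0, 0.0)], [1.0, 1.0], users_of)
reward_near = local_reward(0, [(0.0, 0.0), (100.0, 0.0)], [1.0, 1.0], users_of)
print(reward_far > reward_near)
```

The scalar tells ABS 0 that something got worse, but not which neighbour moved or where; whether that is enough to learn coordinated trajectories is exactly the point the referee raises.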

Circularity Check

0 steps flagged

No circularity: simulation-driven Q-learning with external rewards

full rationale

The paper decomposes the problem into trajectory and power/sub-channel sub-problems, then applies standard Q-learning in a distributed fashion. The central claim that reward signals carry topology information is presented as an empirical outcome of the simulations rather than a derivation that reduces to fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are shown that collapse by construction to the inputs. The approach relies on external simulation rewards and standard RL, making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach assumes reward signals encode useful topology information and that sub-problem decomposition is valid.

pith-pipeline@v0.9.0 · 5713 in / 922 out tokens · 20397 ms · 2026-05-25T18:14:12.848232+00:00 · methodology

