pith. machine review for the scientific record.

arxiv: 2605.01041 · v3 · submitted 2026-05-01 · 💻 cs.MA · cs.AI · cs.GT · cs.LG · cs.RO

Recognition: 2 theorem links


Separation Assurance between Heterogeneous Fleets of Small Unmanned Aerial Systems via Multi-Agent Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:23 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.GT · cs.LG · cs.RO
keywords multi-agent reinforcement learning · separation assurance · unmanned aerial systems · heterogeneous fleets · tactical deconfliction · policy equilibrium · urban airspace management

The pith

Two fleets of small unmanned aircraft can learn separate policies that reach an equilibrium ensuring safe separation in dense airspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether reinforcement learning can enable different companies to operate their own drone fleets without causing conflicts in the same urban airspace. Each fleet trains its own policy using an attention-enhanced actor-critic method while keeping its training private. Simulations of package deliveries over Dallas demonstrate that the policies converge to maintain safe distances between aircraft from both fleets. The approach also proves more effective at resolving conflicts than standard rule-based methods in several tested scenarios, although outcomes can disadvantage fleets with inferior equipment.

Core claim

Two fleets, each with its own shared PPOA2C policy, can reach an equilibrium that maintains safe separation. While two PPOA2C policies outperform two strong rule-based baselines in conflict resolution, a PPOA2C policy exhibits safer interaction with a rule-based policy, indicating the adaptive capability of PPOA2C policies. Equilibria between similar policy types tend to favor fleets with stronger configurations, underscoring the need for fairness-aware conflict management in heterogeneous sUAS operations.

What carries the argument

An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework in which each fleet independently trains a shared policy for its homogeneous aircraft to handle both intra-fleet and inter-fleet deconfliction.
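
The paper's exact network is not shown on this page; as a reading aid, the following minimal sketch illustrates the attention-enhanced actor-critic pattern described above, in which an ownship embedding attends over a variable number of nearby intruders before the policy and value heads. Dimensions, action counts, and names such as AttentionActorCritic are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): an attention-enhanced actor-critic
# in which an ownship embedding attends over a variable number of intruder
# embeddings before the actor and critic heads. All sizes are illustrative.
import torch
import torch.nn as nn

class AttentionActorCritic(nn.Module):
    def __init__(self, own_dim=8, intruder_dim=7, hidden=64, n_actions=5):
        super().__init__()
        self.own_enc = nn.Sequential(nn.Linear(own_dim, hidden), nn.ReLU())
        self.intr_enc = nn.Sequential(nn.Linear(intruder_dim, hidden), nn.ReLU())
        # Ownship embedding is the query; intruder embeddings are keys/values.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.actor = nn.Linear(2 * hidden, n_actions)   # e.g. discrete heading/speed changes
        self.critic = nn.Linear(2 * hidden, 1)

    def forward(self, own_obs, intruder_obs):
        # own_obs: (B, own_dim); intruder_obs: (B, N, intruder_dim), N may vary per scenario
        q = self.own_enc(own_obs).unsqueeze(1)          # (B, 1, hidden)
        kv = self.intr_enc(intruder_obs)                # (B, N, hidden)
        ctx, _ = self.attn(q, kv, kv)                   # (B, 1, hidden) attention context
        feat = torch.cat([q.squeeze(1), ctx.squeeze(1)], dim=-1)
        logits = self.actor(feat)                       # action logits for the shared fleet policy
        value = self.critic(feat).squeeze(-1)           # state-value estimate for the advantage
        return logits, value

# Usage: each fleet trains its own instance with a PPO-clipped advantage
# actor-critic update and shares the weights across its homogeneous aircraft.
policy = AttentionActorCritic()
logits, value = policy(torch.randn(4, 8), torch.randn(4, 3, 7))
```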

If this is right

  • Policies can converge without a central authority coordinating between companies.
  • Learned policies adapt safely when interacting with non-learning rule-based systems.
  • Equilibria tend to benefit fleets with better sensing or communication ranges.
  • Fairness considerations become necessary for long-term multi-company operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adding more than two fleets might still work if each maintains independent training.
  • Real-world testing with actual flight data would be needed to confirm generalization beyond the Dallas scenario.
  • Incorporating fairness terms into the reward function could mitigate advantages for stronger fleets.
  • The method could inform regulations for shared urban airspace among multiple operators.

Load-bearing premise

The simulated environment and reward functions accurately capture the dynamics, sensing ranges, communication capabilities, and mission constraints of real small unmanned aerial systems.

What would settle it

Running the learned policies in a physical test with actual drones, or in a higher-fidelity simulator that includes wind, sensor noise, and variable mission demands, to check whether separation is still maintained.

Figures

Figures reproduced from arXiv: 2605.01041 by Hyeong Tae Kim, Iman Sharifi, Maheed Hatem Ahmed, Mahsa Ghasemi, Peng Wei.

Figure 1. Use-case scenario based on Frisco, a suburban area near Dallas, Texas.
Figure 2. Average total rewards gained by both Co. A and Co. B.
Figure 3. Average total reward and number of successful agents, with average NMACs per episode, for both fleets.
original abstract

In the envisioned future dense urban airspace, multiple companies will operate heterogeneous fleets of small unmanned aerial systems (sUASs), where each fleet includes several homogeneous aircraft with identical policies and configurations, e.g., equipage, sensing, and communication ranges, making tactical deconfliction highly complex for the aircraft. This paper aims to address two core questions: (1) Can tactical deconfliction policies converge or reach an equilibrium to ensure a conflict-free airspace when companies operate heterogeneous fleets of homogeneous aircraft? (2) If so, will the converged policies discriminate against companies operating sUASs with weaker configurations? We investigate a multi-agent reinforcement learning paradigm in which homogeneous aircraft within heterogeneous fleets operate concurrently to perform package delivery missions over Dallas, Texas, USA. An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework is employed to resolve intra- and inter-fleet conflicts, with each fleet independently training its own policy while preserving privacy. Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation. While two PPOA2C policies outperform two strong rule-based baselines in terms of conflict resolution, a PPOA2C policy exhibits safer interaction with a rule-based policy, indicating adaptive capabilities of PPOA2C policies. Furthermore, we conducted extensive policy-configuration evaluations, which reveal that equilibria between similar policy types tend to favor fleets with stronger configurations. Even under similar configurations but different policy types, the equilibrium favors one of the heterogeneous policies, underscoring the need for fairness-aware conflict management in heterogeneous sUAS operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates multi-agent reinforcement learning for tactical separation assurance between two heterogeneous fleets of small unmanned aerial systems (sUAS) performing package-delivery missions in a simulated Dallas environment. Each fleet independently trains a shared attention-enhanced PPOA2C policy for its homogeneous aircraft while preserving privacy. The central claim is that distinct policies can converge to an equilibrium maintaining safe separation, outperform rule-based baselines in conflict resolution, exhibit safer cross-policy interactions, and that post-hoc policy-configuration evaluations show equilibria tending to favor fleets with stronger configurations.

Significance. If the simulation results hold under more rigorous validation, the work provides evidence that decentralized, privacy-preserving RL can address deconfliction in dense heterogeneous urban airspace without central coordination. The independent per-fleet training and attention mechanism are technical strengths, and the observation that equilibria may disadvantage weaker configurations offers a useful caution for fairness in multi-operator systems. The approach is a solid contribution to RL applications in aviation, though its broader impact depends on bridging the gap from idealized simulation to real dynamics.

major comments (3)
  1. [Experimental results] Experimental results (as summarized in the abstract): The claims that policies 'reach an equilibrium' and 'outperform two strong rule-based baselines' lack any reported details on the number of independent runs, variance or confidence intervals on metrics, statistical significance tests, or sensitivity to PPOA2C hyperparameters and reward weights. This is load-bearing for the equilibrium and outperformance assertions, as RL outcomes are known to be sensitive to initialization and tuning.
  2. [Simulation environment] Simulation environment and reward design (central to all experiments): The custom multi-agent simulator with fixed sensing/communication ranges and mission constraints over Dallas is not validated against real sUAS flight data, nor is robustness to unmodeled effects (wind, sensor noise, dropouts) analyzed. Since the equilibrium convergence and baseline comparisons rest entirely on this environment's fidelity, the absence of validation limits the reliability of the reported equilibria.
  3. [Policy-configuration evaluations] Policy-configuration evaluations (abstract): The finding that 'equilibria between similar policy types tend to favor fleets with stronger configurations' and that 'the equilibrium favors one of the heterogeneous policies' is based on post-hoc comparisons. Without pre-specified protocols, correction for multiple testing, or explicit definitions of 'favor' via primary metrics (e.g., conflict rate per episode), these results risk selection effects and weaken the fairness-related conclusions.
minor comments (2)
  1. [Abstract] The abstract introduces 'PPOA2C' without expanding the acronym or briefly describing how the attention mechanism augments the Advantage Actor-Critic architecture.
  2. [Presentation] No learning curves, per-episode conflict rates, or equilibrium metric tables are referenced in the provided summary, which would aid clarity in presenting convergence behavior.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and commitments to revise the manuscript where the concerns are valid and actionable.

point-by-point responses
  1. Referee: Experimental results (as summarized in the abstract): The claims that policies 'reach an equilibrium' and 'outperform two strong rule-based baselines' lack any reported details on the number of independent runs, variance or confidence intervals on metrics, statistical significance tests, or sensitivity to PPOA2C hyperparameters and reward weights. This is load-bearing for the equilibrium and outperformance assertions, as RL outcomes are known to be sensitive to initialization and tuning.

    Authors: We agree that the current manuscript insufficiently reports statistical details supporting the equilibrium and outperformance claims. In the revision we will add results aggregated over multiple independent training runs (with specific seed counts), report means with standard deviations and confidence intervals for key metrics such as conflict rate and separation distance, and include statistical significance tests (e.g., paired t-tests) against the rule-based baselines; a sketch of this kind of aggregation follows these responses. We will also briefly discuss hyperparameter choices and note that the attention mechanism was introduced partly to improve training stability, while acknowledging that exhaustive sensitivity analysis remains future work. revision: yes

  2. Referee: Simulation environment and reward design (central to all experiments): The custom multi-agent simulator with fixed sensing/communication ranges and mission constraints over Dallas is not validated against real sUAS flight data, nor is robustness to unmodeled effects (wind, sensor noise, dropouts) analyzed. Since the equilibrium convergence and baseline comparisons rest entirely on this environment's fidelity, the absence of validation limits the reliability of the reported equilibria.

    Authors: The simulator is a custom abstraction capturing core mission geometry, sensing ranges, and package-delivery constraints over a Dallas map; it is not claimed to be a high-fidelity digital twin. We cannot validate it against proprietary real-world sUAS flight logs. We will add an explicit limitations subsection that states the idealized dynamics, fixed ranges, and lack of robustness testing to wind, sensor noise, or communication dropouts, while arguing that the environment still isolates the multi-agent deconfliction question under controlled conditions. revision: partial

  3. Referee: Policy-configuration evaluations (abstract): The finding that 'equilibria between similar policy types tend to favor fleets with stronger configurations' and that 'the equilibrium favors one of the heterogeneous policies' is based on post-hoc comparisons. Without pre-specified protocols, correction for multiple testing, or explicit definitions of 'favor' via primary metrics (e.g., conflict rate per episode), these results risk selection effects and weaken the fairness-related conclusions.

    Authors: We will revise the relevant section to define 'favor' explicitly via primary metrics (conflict rate per episode and mission completion time) and to present the configuration sweeps as exploratory rather than confirmatory. We acknowledge the post-hoc nature of the comparisons and the absence of multiple-testing correction; these will be stated as caveats. The systematic enumeration of configuration pairs will be retained but framed with the appropriate qualifiers. revision: yes
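
For concreteness, here is a minimal sketch of the aggregation and paired testing the authors commit to in response 1. The seed counts, metric values, and function names are illustrative placeholders, not results from the paper.

```python
# Sketch (illustrative, not the paper's data): aggregate a per-seed metric such
# as conflict rate and compare PPOA2C against a rule-based baseline evaluated
# on the same seeds, as the revision commits to reporting.
import numpy as np
from scipy import stats

def summarize(metric_per_seed, confidence=0.95):
    """Mean, sample standard deviation, and t-based confidence interval across seeds."""
    x = np.asarray(metric_per_seed, dtype=float)
    mean, sd = x.mean(), x.std(ddof=1)
    half = stats.t.ppf(0.5 + confidence / 2, df=len(x) - 1) * sd / np.sqrt(len(x))
    return mean, sd, (mean - half, mean + half)

# Placeholder per-seed conflict rates (five seeds), not reported results.
ppoa2c   = [0.012, 0.015, 0.011, 0.014, 0.013]
baseline = [0.021, 0.019, 0.024, 0.020, 0.022]
print("PPOA2C:  ", summarize(ppoa2c))
print("baseline:", summarize(baseline))
t_stat, p_value = stats.ttest_rel(ppoa2c, baseline)  # paired: same seeds for both policies
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```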

standing simulated objections not resolved
  • Validation of the custom simulator against real sUAS flight data or analysis of robustness to unmodeled effects such as wind, sensor noise, and communication dropouts.

Circularity Check

0 steps flagged

No circularity: empirical RL results are outputs of simulation training, not reductions by construction

full rationale

The paper's central claim rests on experimental outcomes from training distinct PPOA2C policies in a custom multi-agent simulation of package-delivery missions. There is no derivation chain that reduces the equilibrium result to fitted parameters or to self-citation of the paper's own equations. The simulation environment and reward functions are explicit inputs to the training process; the observed equilibria and performance comparisons against baselines are reported as empirical findings, not tautological predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. This is a standard experimental RL setup with independent policy training per fleet.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard RL assumptions plus simulation-specific modeling choices; no new physical entities are postulated.

free parameters (2)
  • PPOA2C hyperparameters (learning rate, clip range, attention parameters)
    Tuned to achieve policy convergence in the Dallas delivery scenario; values are not reported in the abstract.
  • Reward function weights for separation, mission completion, and efficiency
    Designed to balance safety and performance; directly shapes the learned equilibria.
axioms (2)
  • domain assumption The sUAS environment is treated as a partially observable Markov decision process for each agent
    Standard assumption enabling independent policy training in multi-agent RL.
  • domain assumption Simulation dynamics and sensor models are sufficiently realistic for policy transfer
    Invoked when claiming real-world relevance of the equilibria.

pith-pipeline@v0.9.0 · 5624 in / 1274 out tokens · 35348 ms · 2026-05-11T02:23:32.932965+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework is employed to resolve intra- and inter-fleet conflicts... Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation.

  • IndisputableMonolith/Foundation/Cost.lean · Jcost_pos_of_ne_one · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    R_LoS = -1 if d < d_NMAC, linear penalty between d_NMAC and d_LoWC; R_V, R_A, R_M, R_T penalties/bonuses
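
The quoted passage compresses the reward structure; as a reading aid, here is a minimal sketch of only the piecewise R_LoS term it describes, with placeholder values for the d_NMAC and d_LoWC thresholds (the paper's actual values and units are not reproduced here, and the R_V, R_A, R_M, R_T terms are omitted).

```python
# Sketch of the quoted R_LoS term (thresholds are placeholders, not the paper's
# values): full penalty inside the near mid-air collision radius, a linear ramp
# up to the loss-of-well-clear radius, and no penalty beyond it.
def r_los(d, d_nmac=150.0, d_lowc=600.0):
    if d < d_nmac:
        return -1.0
    if d < d_lowc:
        return -(d_lowc - d) / (d_lowc - d_nmac)  # -1 at d_nmac, 0 at d_lowc
    return 0.0

assert r_los(100.0) == -1.0 and r_los(150.0) == -1.0 and r_los(600.0) == 0.0
```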

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

  1. [1]

    Unmanned aircraft systems (UASs): current state, emerging technologies, and future trends

    G. Ariante and G. Del Core, “Unmanned aircraft systems (UASs): current state, emerging technologies, and future trends,” Drones, vol. 9, no. 1, p. 59, 2025

  2. [2]

    Small unmanned aircraft: Theory and practice

    R. W. Beard and T. W. McLain, Small unmanned aircraft: Theory and practice. Princeton University Press, 2012

  3. [3]

    Review of deep reinforcement learning approaches for conflict resolution in air traffic control

    Z. Wang, W. Pan, H. Li, X. Wang, and Q. Zuo, “Review of deep reinforcement learning approaches for conflict resolution in air traffic control,” Aerospace, vol. 9, no. 6, p. 294, 2022

  4. [4]

    Service-oriented separation assurance for small UAS traffic management

    G. Hunter and P. Wei, “Service-oriented separation assurance for small UAS traffic management,” in 2019 Integrated Communications, Navigation and Surveillance Conference (ICNS), pp. 1–11, IEEE, 2019

  5. [5]

    An integrated localization and control framework for multi-agent formation

    Y. Cai and Y. Shen, “An integrated localization and control framework for multi-agent formation,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1941–1956, 2019

  6. [6]

    Markov decision process-based distributed conflict resolution for drone air traffic management

    H. Y. Ong and M. J. Kochenderfer, “Markov decision process-based distributed conflict resolution for drone air traffic management,” Journal of Guidance, Control, and Dynamics, vol. 40, no. 1, pp. 69–80, 2017

  7. [7]

    Autonomous separation assurance in a high-density en route sector: A deep multi-agent reinforcement learning approach

    M. Brittain and P. Wei, “Autonomous separation assurance in a high-density en route sector: A deep multi-agent reinforcement learning approach,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 3256–3262, IEEE, 2019

  8. [8]

    One to any: Distributed conflict resolution with deep multi-agent reinforcement learning and long short-term memory

    M. W. Brittain and P. Wei, “One to any: Distributed conflict resolution with deep multi-agent reinforcement learning and long short-term memory,” in AIAA Scitech 2021 Forum, p. 1952, 2021

  9. [9]

    Safety enhancement for deep reinforcement learning in autonomous separation assurance

    W. Guo, M. Brittain, and P. Wei, “Safety enhancement for deep reinforcement learning in autonomous separation assurance,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pp. 348–354, IEEE, 2021

  10. [10]

    Improving autonomous separation assurance through distributed reinforcement learning with attention networks

    M. W. Brittain, L. E. Alvarez, and K. Breeden, “Improving autonomous separation assurance through distributed reinforcement learning with attention networks,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 22857–22863, Mar. 2024

  11. [11]

    ACAS sXu: Robust decentralized detect and avoid for small unmanned aircraft systems

    L. E. Alvarez, I. Jessen, M. P. Owen, J. Silbermann, and P. Wood, “ACAS sXu: Robust decentralized detect and avoid for small unmanned aircraft systems,” in 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), pp. 1–9, IEEE, 2019

  12. [12]

    Scalable autonomous separation assurance with heterogeneous multi-agent reinforcement learning

    M. Brittain and P. Wei, “Scalable autonomous separation assurance with heterogeneous multi-agent reinforcement learning,” IEEE Transactions on Automation Science and Engineering, vol. 19, no. 4, pp. 2837–2848, 2022

  13. [13]

    FAA remote identification of unmanned aircraft

    Federal Aviation Administration, “FAA remote identification of unmanned aircraft,” 2020. Accessed: Aug 30, 2025

  14. [14]

    Multi-UAV conflict resolution with graph convolutional reinforcement learning

    R. Isufaj, M. Omeri, and M. A. Piera, “Multi-UAV conflict resolution with graph convolutional reinforcement learning,” Applied Sciences, vol. 12, no. 2, p. 610, 2022

  15. [15]

    Autonomous separation assurance with deep multi-agent reinforcement learning

    M. W. Brittain, X. Yang, and P. Wei, “Autonomous separation assurance with deep multi-agent reinforcement learning,” Journal of Aerospace Information Systems, vol. 18, no. 12, pp. 890–905, 2021

  16. [16]

    Comparing attention-based methods with long short-term memory for state encoding in reinforcement learning-based separation management

    D. Groot, J. Ellerbroek, and J. Hoekstra, “Comparing attention-based methods with long short-term memory for state encoding in reinforcement learning-based separation management,” Engineering Applications of Artificial Intelligence, vol. 159, p. 111592, 2025

  17. [17]

    3D RVO-enhanced multi-agent deep reinforcement learning for collision avoidance in urban structured airspace

    G. Zhong, Y. Liu, S. Du, F. Wang, J. Zhou, and H. Zhang, “3D RVO-enhanced multi-agent deep reinforcement learning for collision avoidance in urban structured airspace,” Aerospace Science and Technology, vol. 164, p. 110378, 2025

  18. [18]

    Physics informed deep reinforcement learning for aircraft conflict resolution

    P. Zhao and Y. Liu, “Physics informed deep reinforcement learning for aircraft conflict resolution,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 8288–8301, 2021

  19. [19]

    Asynchronous methods for deep reinforcement learning

    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), pp. 1928–1937, PMLR, 2016

  20. [20]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  21. [21]

    The surprising effectiveness of PPO in cooperative multi-agent games

    C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of PPO in cooperative multi-agent games,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 24611–24624, 2022

  22. [22]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015

  23. [23]

    Integrated conflict management for UAM with strategic demand capacity balancing and learning-based tactical deconfliction

    S. Chen, A. D. Evans, M. Brittain, and P. Wei, “Integrated conflict management for UAM with strategic demand capacity balancing and learning-based tactical deconfliction,” IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 8, pp. 10049–10061, 2024

  24. [24]

    Bluesky ATC simulator project: an open data and open source approach

    J. M. Hoekstra and J. Ellerbroek, “Bluesky ATC simulator project: an open data and open source approach,” in Proceedings of the 7th International Conference on Research in Air Transportation, vol. 131, p. 132, FAA/Eurocontrol, Washington, DC, USA, 2016

  25. [25]

    Fine-tuning large language models for cooperative tactical deconfliction of small unmanned aerial systems

    I. Sharifi, A. Zongo, and P. Wei, “Fine-tuning large language models for cooperative tactical deconfliction of small unmanned aerial systems,” arXiv preprint arXiv:2603.28561, 2026