Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies

Adam Andersson; Jacob Ljungberg; Jonas Lidman; Mika Persson; Samuel Sandelius

arXiv: 2512.09682 · v2 · submitted 2025-12-10 · eess.SY · cs.AI · cs.GT · cs.MA · cs.SY

Pith reviewed 2026-05-16 23:40 UTC · model grok-4.3

keywords: multi-agent reinforcement learning · UAV swarms · data relaying · scalability · decentralized control · deterministic games · shortest path

The pith

Two off-the-shelf MARL algorithms match a shortest-path baseline for small UAV data-relay swarms but encounter scalability problems as agent count grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a family of deterministic games that model the task of small and sparse UAV swarms delivering a critical data package to a fixed target location using decentralized control. It compares two standard MARL algorithms against a robust baseline that restricts motion and applies Dijkstra's algorithm to find shortest paths. Experiments show the MARL methods achieve competitive delivery performance when the number of agents is small, yet their effectiveness declines sharply as the swarm size increases. This game family is positioned as a controlled testbed for studying how multi-agent learning scales in coordination problems with sparse resources.

Core claim

A family of deterministic games is introduced to represent dynamic one-time delivery of critical data by small and sparse UAV swarms. Computational experiments demonstrate that two off-the-shelf MARL algorithms perform competitively with a Dijkstra-based baseline policy for small agent counts, but face significant scalability issues as the number of agents increases.

What carries the argument

The family of deterministic games that encode the UAV relaying task with position-based rewards and motion constraints for MARL scaling studies.

If this is right

The two tested off-the-shelf MARL algorithms can match the Dijkstra-based shortest-path baseline in small-scale decentralized UAV coordination.
Performance of standard MARL methods degrades as the number of agents in the swarm increases.
The baseline policy that restricts agent motion and applies Dijkstra's algorithm provides a strong reference for evaluating learned decentralized policies.
Public source code for the game family and experiments enables direct reproduction and extension of the scaling tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The model could be used to benchmark new MARL variants that incorporate explicit communication or hierarchical policies to improve scaling.
Similar game structures might transfer to other sparse multi-robot tasks such as distributed sensing or target tracking.
Results point toward the value of hybrid approaches that combine learned policies with classical planning for larger swarms.

Load-bearing premise

The introduced deterministic games capture the essential dynamics and constraints of real-world UAV data relaying well enough to serve as a valid testbed for MARL scaling behavior.

What would settle it

Physical UAV flight tests in an environment matching the game rules that measure whether delivery success rates for the MARL policies match the simulation trends across increasing swarm sizes.

Figures

Figures reproduced from arXiv: 2512.09682 by Adam Andersson, Jacob Ljungberg, Jonas Lidman, Mika Persson, Samuel Sandelius.

**Figure 2.** A scene with baseline trajectories and …

**Figure 4.** Values for the isotropic non-jammed scenario, the …

**Figure 3.** The budget for different R and K.

**Figure 5.** Illustration of the geometry behind the baseline …

**Figure 6.** Performance of MAPPO versus baseline, for the …

**Figure 7.** Performance of MADDPG versus baseline, for …

**Figure 9.** Rollout trajectories from the baseline (left) and MAPPO (right) for the directional transmission and jammed scenario with K = 7 agents and three initial states, one per row.

**Figure 8.** Rollout trajectories from the baseline (top), …

Original abstract

This work studies the application of Multi-Agent Reinforcement Learning (MARL) to decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for MARL scaling studies. A robust baseline policy is proposed which restricts agent motion and applies Dijkstra's shortest path algorithm. Computational experiment results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but face scalability issues as the number of agents increases. Source code and animations are available online at https://github.com/mikapersson/Information-Relaying.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper supplies a clean, reproducible benchmark of deterministic games for testing MARL scalability in a UAV data-relay task, with public code and a Dijkstra baseline that exposes a clear performance drop-off as the agent count rises.

This paper sets up a new family of deterministic games as a testbed for MARL scaling in small UAV swarms that must deliver one critical data packet. The core finding is that two off-the-shelf MARL algorithms match a simple baseline at low agent counts but degrade as the swarm grows, all within the supplied simulation environment.

The code and animations are public, which removes the usual reproducibility headaches. The authors also give an explicit baseline that restricts motion and runs Dijkstra's algorithm on the remaining graph, so the comparison is straightforward and falsifiable. That combination of a purpose-built benchmark and verifiable results is the main contribution. The experiments are run across varying agent numbers with independent trials, and the central claim rests on direct performance numbers rather than fitted parameters or self-referential predictions. The paper stays modest about scope, calling the games a model problem rather than a high-fidelity simulator.

One soft spot is the narrow scenario: everything is deterministic, with one-time delivery to a fixed point and no wind, sensor noise, or communication dropouts. That keeps the setup clean for scaling studies but leaves open whether the observed limits would appear under more realistic dynamics. The choice of only two MARL algorithms also means the scalability wall might be specific to those methods rather than universal.

This is useful reading for anyone working on MARL for multi-robot coordination or decentralized control who needs a shared testbed. It is not a finished application paper, but the benchmark itself is new and the empirical comparison is solid enough to merit referee time. I would send it to review.

Referee Report

0 major / 3 minor

Summary. The paper introduces a family of deterministic games modeling the problem of small, sparse UAV swarms relaying a single critical data packet to a known target location under decentralized control. It defines a Dijkstra-based baseline policy that restricts motion to a grid and computes shortest paths, then reports computational experiments comparing two off-the-shelf MARL algorithms (MAPPO and MADDPG) against this baseline across increasing agent counts. The central empirical finding is that the MARL methods match baseline performance at small scales but degrade in success rate and efficiency as the number of agents grows. Public code and animations are provided for verification.

Significance. If the results hold, the work supplies a clean, reproducible testbed specifically designed for MARL scaling studies in a deterministic multi-agent coordination setting. The explicit baseline, public implementation, and direct scaling curves constitute a useful benchmark contribution for the community studying decentralized UAV control and MARL limitations. The framing as a model problem rather than a high-fidelity simulator keeps the claims appropriately scoped.

minor comments (3)

§4 (Computational Experiments): full hyperparameter tables for both MARL algorithms and the precise environment configuration (grid size, communication range, reward shaping) are only partially disclosed; adding these would strengthen reproducibility of the reported scalability trends.
§3 (Game Definition): the transition from the general game family to the specific instances used in the experiments could be made more explicit, e.g., by listing the exact parameter values for each agent-count regime shown in the figures.
Figures 6 and 7 and associated text: variance or confidence intervals across random seeds are not reported; including them would clarify whether the observed degradation is statistically consistent.
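
As an illustration of the last comment, the following is a minimal Python sketch of how a per-agent-count mean and confidence interval across seeds could be computed. The function name `mean_with_ci` and the normal-approximation factor `z=1.96` are assumptions made here for illustration; they are not taken from the paper or its repository, and with only a handful of seeds a Student-t quantile would be the more careful choice.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    values: one performance number per random seed (hypothetical input).
    z: quantile of the standard normal; 1.96 corresponds to roughly 95%.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, z * std_err
```

Reporting `mean ± half_width` for each agent count would show whether the degradation at larger swarm sizes exceeds the seed-to-seed spread.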

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive and positive review. We are pleased that the manuscript is viewed as providing a clean, reproducible testbed for MARL scaling studies, with the explicit baseline, public code, and scaling curves recognized as useful contributions. The recommendation for minor revision is noted, and we will incorporate any editorial or minor clarifications in the revised version. No major technical concerns were raised that require substantive changes to the core claims or experiments.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines a family of deterministic games explicitly as a testbed for studying MARL scaling behavior, proposes a baseline policy that applies the standard Dijkstra shortest-path algorithm after restricting agent motion, and reports direct empirical results from independent simulation runs comparing two off-the-shelf MARL algorithms against that baseline. Performance claims (competitive at small agent counts, degradation at larger counts) are outputs of those runs rather than quantities fitted or defined in terms of the same data. No self-citations are invoked to justify uniqueness or load-bearing premises, no ansatz is smuggled, and no prediction reduces by construction to an input parameter. The central derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the definition of the game family and standard algorithmic components; no free parameters are fitted to produce the reported scaling result, and no new entities are postulated.

axioms (1)

standard math: Dijkstra's algorithm computes shortest paths in graphs with non-negative edge weights.
Invoked directly in the baseline policy construction.
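
For readers who want the axiom in concrete form, here is a minimal textbook sketch of Dijkstra's algorithm in Python using the standard-library `heapq` module. The adjacency-list format and the function name `dijkstra` are assumptions chosen for illustration; how the paper's baseline builds its graph from the restricted agent motions is not described on this page and is not reproduced here.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source in a graph with non-negative weights.

    graph: dict mapping node -> list of (neighbor, weight) pairs
           (a hypothetical format, not the paper's data structure).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            if w < 0:
                raise ValueError("Dijkstra requires non-negative edge weights")
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The explicit check on `w < 0` mirrors the non-negativity condition stated in the axiom; with negative weights the greedy extraction order is no longer guaranteed to be correct.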

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

The shared reward function of all agents is given by $r(x,a) = \mathrm{budget}(R,w;K) - c_{\delta p}\sum_{k=1}^{K}\|\delta p_k\|^2 - c_{\delta\phi}\sum_{k=1}^{K}|\delta\phi_k|^2$ … budget … is set as the discounted accumulated motion cost … solved … using the baseline policy

IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

A robust baseline policy is proposed which restricts agent motion and applies Dijkstra’s shortest path algorithm
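
The shared reward quoted in the first passage above can be sketched in Python as follows. This is a literal reading of the quoted expression only: the argument names, the assumption that each position displacement is a 2D vector, and the generic `discounted_sum` helper are all introduced here for illustration. The quoted passage elides how the budget is actually computed from the baseline policy, so the budget is simply passed in as a number.

```python
def shared_reward(budget, delta_p, delta_phi, c_dp, c_dphi):
    """budget - c_dp * sum_k ||dp_k||^2 - c_dphi * sum_k |dphi_k|^2.

    delta_p: per-agent position displacements as (dx, dy) pairs (assumed 2D).
    delta_phi: per-agent orientation changes in radians.
    c_dp, c_dphi: cost coefficients (hypothetical argument names).
    """
    position_cost = sum(dx * dx + dy * dy for dx, dy in delta_p)
    orientation_cost = sum(dphi * dphi for dphi in delta_phi)
    return budget - c_dp * position_cost - c_dphi * orientation_cost


def discounted_sum(costs, gamma):
    """Generic discounted accumulation: sum over t of gamma**t * costs[t]."""
    return sum(gamma ** t * c for t, c in enumerate(costs))
```

Because the reward is shared, every agent receives the same value, so a single agent's large motion lowers the return for the whole swarm.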

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
