
arxiv: 2605.11469 · v1 · submitted 2026-05-12 · 💻 cs.LG

Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe

Pith reviewed 2026-05-13 01:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-agent path finding · adversarial training · randomized smoothing · reinforcement learning · observation attacks · robust policies · POGEMA · decentralized MAPF

The pith

Adversarial training plus on-policy smoothness makes multi-agent pathfinding policies robust to observation perturbations with minimal clean performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Decentralized multi-agent path finding routes agents on a grid where each follows a shared neural policy trained with PPO from its own local view. A small perturbation to one agent's observation often changes its action, which then blocks neighbors and causes the entire team to fail. The paper introduces two recipes that keep the network and deployment loop unchanged yet train against worst-case perturbations. Adv-PPO alone lifts worst-case success from 2.5 percent to 59.2 percent at about one percentage point of clean cost. Adding MACER fine-tuning with a smoothness term derived from randomized smoothing certification raises worst-case success to 77.5 percent across seeds while staying within one percentage point of the original clean rate on 8x8 maps with four agents.
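The failure mechanism in the paragraph above, a small input change flipping one agent's action, can be illustrated on a toy linear policy. This is a minimal sketch under stated assumptions: the weights, the observation, and the helper names (`logits`, `argmax_action`, `worst_case_linf`) are hypothetical, and the sign-gradient step is a generic L∞-bounded attack, not the paper's attack procedure, which this summary does not specify.

```python
def logits(weights, obs):
    """Linear policy head: one logit per action (hypothetical stand-in)."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def argmax_action(weights, obs):
    """Deterministic deployment rule: pick the highest-scoring action."""
    scores = logits(weights, obs)
    return max(range(len(scores)), key=scores.__getitem__)


def worst_case_linf(weights, obs, eps):
    """Move every input coordinate by eps in the direction that shrinks the
    margin between the chosen action and the runner-up. For a linear policy
    this sign step is the exact worst case inside the L-infinity ball."""
    scores = logits(weights, obs)
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    top, runner_up = order[0], order[1]
    grad = [wt - wr for wt, wr in zip(weights[top], weights[runner_up])]
    return [x - eps * ((g > 0) - (g < 0)) for x, g in zip(obs, grad)]


# Illustrative numbers only: 2 actions, 3 observation features.
WEIGHTS = [[1.0, -0.5, 0.2], [0.8, 0.1, 0.0]]
OBS = [0.5, 0.2, 0.4]

clean_action = argmax_action(WEIGHTS, OBS)
attacked_obs = worst_case_linf(WEIGHTS, OBS, eps=0.15)
attacked_action = argmax_action(WEIGHTS, attacked_obs)
print(clean_action, attacked_action)
```

In an adversarial-training loop of the kind Adv-PPO is described as using, perturbed observations like `attacked_obs` would be fed to the policy during rollouts so that the update sees worst-case inputs; how the paper generates them is left open here.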

Core claim

A standard PPO policy for decentralized MAPF on POGEMA 8x8 maps with four agents reaches 95.8 percent success on clean observations but only 2.5 percent under the strongest attack. Adv-PPO, which trains against worst-case input perturbations and selects checkpoints by adversarial performance, recovers worst-case success to 59.2 percent at roughly one percentage point of clean cost. Fine-tuning the checkpoint with Adv-PPO+MACER, which adds a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing, further recovers worst-case success to 77.5 percent plus or minus 6.0 percent across three seeds at less than one percentage point of clean cost. The work

What carries the argument

Adv-PPO, which adversarially perturbs the policy's input during training and selects the checkpoint by worst-case performance, combined with MACER, an on-policy smoothness term whose gradient follows the certified radius of randomized smoothing on the policy wrapper.
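The "smoothed wrapper" and its certified radius can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: it uses the standard randomized-smoothing L2 radius, sigma / 2 * (Phi^-1(p_top) - Phi^-1(p_runner_up)), with plain Monte Carlo frequencies in place of the confidence bounds a real certificate requires, and `toy_policy`, `smoothed_vote`, and `certified_radius` are hypothetical names.

```python
import random
from statistics import NormalDist


def smoothed_vote(policy, obs, sigma, n_samples, rng):
    """Count the actions `policy` picks on Gaussian-noised copies of `obs`."""
    counts = {}
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in obs]
        action = policy(noisy)
        counts[action] = counts.get(action, 0) + 1
    return counts


def certified_radius(p_top, p_runner_up, sigma, clip=1e-6):
    """L2 radius sigma/2 * (Phi^-1(p_top) - Phi^-1(p_runner_up)), floored at 0.
    Probabilities are clipped away from 0 and 1 so inv_cdf stays defined."""
    inv = NormalDist().inv_cdf
    p_top = min(max(p_top, clip), 1.0 - clip)
    p_runner_up = min(max(p_runner_up, clip), 1.0 - clip)
    return max(0.0, 0.5 * sigma * (inv(p_top) - inv(p_runner_up)))


def toy_policy(obs):
    # Hypothetical stand-in for the argmax policy: action 1 if the first
    # feature is positive, else action 0.
    return 1 if obs[0] > 0 else 0


rng = random.Random(0)
n = 1000
counts = smoothed_vote(toy_policy, [0.3, 0.0], sigma=0.25, n_samples=n, rng=rng)
ranked = sorted(counts.values(), reverse=True)
p_top = ranked[0] / n
p_runner_up = ranked[1] / n if len(ranked) > 1 else 0.0
radius = certified_radius(p_top, p_runner_up, sigma=0.25)
print(counts, round(radius, 3))
```

The radius describes the majority-vote wrapper, not `toy_policy` itself, which is the same distinction the summary draws between the smoothed wrapper and the deployed argmax policy.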

If this is right

  • The same network architecture and argmax deployment loop can be retained with no added runtime cost.
  • Per-attack curves demonstrate improvement across varying perturbation strengths.
  • A certified action-stability check on the smoothed wrapper provides a quantitative sanity measure.
  • Side-by-side rollout visualizations show specific failure modes prevented inside individual environment instances.
  • The Adv-PPO+MACER worst-case result is reported across three independent seeds with a spread of ±6.0 percentage points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training pattern could apply to other decentralized multi-agent reinforcement learning settings where local observation noise triggers cascading coordination failures.
  • Physical robot teams facing sensor noise or interference might benefit from analogous adversarial-plus-smoothing training.
  • Direct robustness guarantees for the deployed policy would require extending certification beyond the smoothed wrapper.
  • Scaling the approach to larger grids or more agents would test whether the recovered robustness holds without increased clean cost.

Load-bearing premise

That perturbations used in training and the smoothness term, whose certification applies only to the smoothed wrapper, will protect the actual deployed argmax policy against real-world observation attacks.

What would settle it

Testing the final deployed argmax policy without the smoothing wrapper on the same POGEMA 8x8 four-agent maps under the attack model and finding success rates near the original 2.5 percent instead of 77.5 percent.
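The check described above reduces to a small aggregation over per-episode outcomes of the deployed argmax policy. The sketch below assumes a hypothetical data layout (lists of 0/1 episode results keyed by attack setting) and reports the three quantities named in the Figure 2 caption: clean success, mean attacked success, and the worst single attacked cell. The attack labels and outcome values are placeholders, not results from the paper.

```python
def success_rate(outcomes):
    """Fraction of episodes that succeeded, given a list of 0/1 results."""
    return sum(outcomes) / len(outcomes)


def summarize(clean_outcomes, attacked_outcomes):
    """attacked_outcomes maps an attack setting to its 0/1 episode results."""
    per_setting = {k: success_rate(v) for k, v in attacked_outcomes.items()}
    worst_setting = min(per_setting, key=per_setting.get)
    return {
        "clean": success_rate(clean_outcomes),
        "mean_attacked": sum(per_setting.values()) / len(per_setting),
        "worst_setting": worst_setting,
        "worst_case": per_setting[worst_setting],
    }


# Placeholder outcomes for illustration only (not the paper's data).
clean = [1, 1, 1, 0]
attacked = {
    ("ghost_wall", 0.05): [1, 1, 0, 1],
    ("ghost_wall", 0.15): [0, 1, 0, 0],
}
report = summarize(clean, attacked)
print(report)
```

Running this on rollouts collected without the smoothing wrapper, and comparing `worst_case` against the baseline's 2.5 percent, is one way to frame the settling experiment.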

Figures

Figures reproduced from arXiv: 2605.11469 by Riad Ahmed.

Figure 1: Six attack types visualized on the 8×8 grid at step 6 (ϵ=0.15). Top row: baseline PPO policy. Bottom row: our Adv-PPO+MACER policy. Each agent’s 5×5 field of view is shown with a dashed border. Inside the field of view, red fill indicates a ghost wall the policy incorrectly sees (obstacle channel hallucinated); blue fill indicates a ghost agent; amber fill indicates a phantom goal hint. Hatched cells mark …
Figure 2: Headline comparison on POGEMA 8×8, 4 agents, 30 episodes per attack setting. Each method reports clean success, the mean attacked success across 21 attack settings, and the worst single attacked cell. The two proposed methods (Ours-1 and Ours-2) keep clean performance and recover most of the worst-case loss; the two post-hoc rows show the negative result.
Figure 4: Certified action-stability curve from randomized …
Figure 6: Rollout storyboard on a single POGEMA episode (seed …
Original abstract

Decentralized multi-agent path finding (MAPF) routes a team of agents on a shared grid, each acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour, and the team jams. In this paper we present two training recipes that keep the same network and the same deployment loop, yet make the policy hold up under perturbed observations. The first recipe, Adv-PPO, trains the shared policy against worst-case perturbations of its own input and selects the checkpoint by performance under adversarial perturbation. The second recipe, Adv-PPO+MACER, fine-tunes that checkpoint with a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv-PPO recovers worst-case success to 59.2% at one percentage point of clean cost. Adv-PPO+MACER recovers it to 77.5% +/- 6.0% across three independent seeds at less than one percentage point of clean cost. We support these numbers with per-attack curves, a certified action-stability sanity check (which measures the smoothed-policy wrapper, not the deployed argmax policy), and side-by-side rollout storyboards that show the failure mode and the fix inside one environment instance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that two training recipes produce robust decentralized MAPF policies: Adv-PPO (adversarial training of a shared PPO policy against input perturbations) and Adv-PPO+MACER (fine-tuning with an on-policy smoothness term whose gradient follows the certified radius from randomized smoothing). On POGEMA 8x8 maps with four agents, the baseline PPO reaches 95.8% clean success but only 2.5% under the strongest attack; Adv-PPO recovers worst-case success to 59.2% at about one percentage point of clean cost; Adv-PPO+MACER reaches 77.5% ±6.0% (three seeds) at less than one percentage point of clean cost. Support includes per-attack curves, a certified action-stability check on the smoothed wrapper, and rollout visualizations.

Significance. If the empirical gains hold and robustness transfers to the deployed policy, the work offers a practical recipe for observation-robust MAPF with relevance to safety-critical multi-agent robotics. The combination of adversarial training plus certified smoothing, the multi-seed error bars, and the side-by-side storyboards are concrete strengths that would make the result useful if the central transfer assumption is validated.

major comments (2)
  1. Abstract: The certified action-stability sanity check is explicitly limited to the smoothed-policy wrapper, yet the deployed policy is the deterministic argmax over logits. No argument, theorem, or experiment is supplied showing that the certified radius or the on-policy smoothness term transfers robustness to the argmax policy under the same observation attacks; this link is load-bearing for the claim that Adv-PPO+MACER supplies a 'principled' defense.
  2. Results and experimental details: The reported recovery to 77.5% ±6.0% worst-case success depends on the specific attack generation procedure and the values chosen for the perturbation budget and smoothness coefficient, yet these are not fully specified. Without them the numerical gains cannot be reproduced or stress-tested, undermining the strength of the empirical claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions.

  1. Referee: Abstract: The certified action-stability sanity check is explicitly limited to the smoothed-policy wrapper, yet the deployed policy is the deterministic argmax over logits. No argument, theorem, or experiment is supplied showing that the certified radius or the on-policy smoothness term transfers robustness to the argmax policy under the same observation attacks; this link is load-bearing for the claim that Adv-PPO+MACER supplies a 'principled' defense.

    Authors: We acknowledge that the certified sanity check applies only to the smoothed wrapper, as already noted in the manuscript abstract. The on-policy smoothness term uses the certified radius to regularize the base policy logits; we expect this to improve robustness of the subsequent argmax policy, and the primary supporting evidence is the empirical worst-case success of the deployed policy. No formal transfer theorem is provided. We will revise the abstract and add a short discussion paragraph clarifying the scope of the sanity check and the intended (empirically supported) role of the smoothness term. revision: partial

  2. Referee: Results and experimental details: The reported recovery to 77.5% ±6.0% worst-case success depends on the specific attack generation procedure and the values chosen for the perturbation budget and smoothness coefficient, yet these are not fully specified. Without them the numerical gains cannot be reproduced or stress-tested, undermining the strength of the empirical claim.

    Authors: We agree that full reproducibility requires explicit specification of the attack procedure, perturbation budgets, and smoothness coefficient. These values appear in the experimental section and appendix but are not listed in a single, easily referenceable location. In the revision we will add a dedicated table (or expanded subsection) giving the exact attack optimizer, step count, step size, training and evaluation ε values, and the MACER smoothness coefficient. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training results on held-out scenarios


The paper reports empirical success rates from training Adv-PPO and Adv-PPO+MACER on POGEMA 8x8 maps, with numbers obtained by running the trained policies on test scenarios. The abstract explicitly flags that the certified action-stability check applies only to the smoothed wrapper and not the deployed argmax policy, but this is presented as a sanity check rather than a load-bearing derivation. No equation, parameter fit, or self-citation reduces the central numerical claims to a tautology or to the training inputs by construction. The results remain falsifiable on new maps and attacks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of adversarial training and randomized-smoothing-derived smoothness when applied to a shared MAPF policy, plus the assumption that the chosen perturbation budget and smoothness coefficient transfer to real attacks.

free parameters (2)
  • perturbation budget
    Strength of worst-case input perturbations used during Adv-PPO training; chosen to balance robustness and clean performance.
  • smoothness coefficient
    Weight of the on-policy smoothness term added in the MACER fine-tuning stage.
axioms (2)
  • domain assumption: Worst-case perturbations generated during training approximate the observation attacks that matter in deployment.
    Invoked in the design and checkpoint selection of Adv-PPO.
  • domain assumption: The certified radius from randomized smoothing on the smoothed wrapper provides meaningful protection for the deployed argmax policy.
    Used to motivate the MACER term and the action-stability sanity check.
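How the smoothness coefficient in the ledger above might enter the training loss can be sketched with a MACER-style hinge on the randomized-smoothing margin. This is an assumption-laden illustration: the function name `macer_style_penalty`, the parameters `gamma` (hinge threshold) and `lam` (smoothness coefficient), and the choice to treat the policy's own action as the label are all hypothetical, and the paper's exact term and values are not given in this summary.

```python
from statistics import NormalDist


def macer_style_penalty(action_probs, chosen, sigma, gamma, lam, clip=1e-6):
    """Hinge penalty lam * sigma/2 * max(0, gamma - margin), where
    margin = Phi^-1(p_chosen) - Phi^-1(p_best_other).

    action_probs: smoothed action probabilities (assumed to come from
                  averaging the policy over Gaussian-noised observations).
    chosen:       index of the action treated as the label (assumption:
                  the policy's own on-policy action).
    """
    inv = NormalDist().inv_cdf

    def clamp(p):
        # Keep probabilities strictly inside (0, 1) so inv_cdf is defined.
        return min(max(p, clip), 1.0 - clip)

    p_chosen = clamp(action_probs[chosen])
    p_other = clamp(max(p for i, p in enumerate(action_probs) if i != chosen))
    margin = inv(p_chosen) - inv(p_other)
    return lam * 0.5 * sigma * max(0.0, gamma - margin)


# Illustrative values only.
undecided = macer_style_penalty([0.5, 0.5], chosen=0, sigma=0.25, gamma=1.0, lam=2.0)
confident = macer_style_penalty([0.99, 0.01], chosen=0, sigma=0.25, gamma=1.0, lam=2.0)
print(undecided, confident)
```

The penalty is largest when the smoothed action distribution is undecided and vanishes once the margin exceeds `gamma`, so a larger `lam` trades clean-task gradient signal for a wider stable region around each observation.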

pith-pipeline@v0.9.0 · 5595 in / 1543 out tokens · 56044 ms · 2026-05-13T01:57:25.023388+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. Multi-Agent Pathfinding: Definitions, Variants, and Benchmarks. Symposium on Combinatorial Search (SoCS).
  2. Conflict-Based Search for Optimal Multi-Agent Pathfinding. Artificial Intelligence.
  3. Sartoretti, Guillaume and Kerr, Justin and Shi, Yunfei and Wagner, Glenn and Kumar, T. K. Satish and Koenig, Sven and Choset, Howie.
  4. Skrynnik, Alexey and Andreychuk, Anton and Nikiforov, Konstantin and Yakovlev, Konstantin and Panov, Aleksandr.
  5. Learn to Follow: Decentralized Lifelong Multi-Agent Pathfinding via Planning and Learning. Proceedings of the AAAI Conference on Artificial Intelligence.
  6. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  7. High-Dimensional Continuous Control Using Generalized Advantage Estimation. International Conference on Learning Representations (ICLR).
  8. Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. Advances in Neural Information Processing Systems (NeurIPS).
  9. Explaining and Harnessing Adversarial Examples. International Conference on Learning Representations (ICLR).
  10. Towards Deep Learning Models Resistant to Adversarial Attacks. International Conference on Learning Representations (ICLR).
  11. Theoretically Principled Trade-off between Robustness and Accuracy. Proceedings of the International Conference on Machine Learning (ICML).
  12. Robust Adversarial Reinforcement Learning. Proceedings of the International Conference on Machine Learning (ICML).
  13. Action Robust Reinforcement Learning and Applications in Continuous Control. Proceedings of the International Conference on Machine Learning (ICML).
  14. Adversarial Policies: Attacking Deep Reinforcement Learning. International Conference on Learning Representations (ICLR).
  15. Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-Free Attacks. Proceedings of the International Conference on Machine Learning (ICML).
  16. On the Robustness of Cooperative Multi-Agent Reinforcement Learning. IEEE Security and Privacy Workshops (SPW).
  17. Adversarial Attacks on Multi-Agent Deep Reinforcement Learning Models in Continuous Action Space. arXiv preprint arXiv:2203.03722.
  18. MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius. International Conference on Learning Representations (ICLR).
  19. Certified Adversarial Robustness via Randomized Smoothing. International Conference on Machine Learning (ICML).
  20. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.