Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe
Pith reviewed 2026-05-13 01:57 UTC · model grok-4.3
The pith
Adversarial training plus on-policy smoothness makes multi-agent pathfinding policies robust to observation perturbations with minimal clean performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A standard PPO policy for decentralized MAPF on POGEMA 8x8 maps with four agents reaches 95.8 percent success on clean observations but only 2.5 percent under the strongest attack. Adv-PPO, which trains against worst-case input perturbations and selects checkpoints by adversarial performance, recovers worst-case success to 59.2 percent at roughly one percentage point of clean cost. Fine-tuning the checkpoint with Adv-PPO+MACER, which adds a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing, further recovers worst-case success to 77.5 percent plus or minus 6.0 percent across three seeds at less than one percentage point of clean cost. The work
What carries the argument
Adv-PPO, which adversarially perturbs the policy's input during training and selects the checkpoint by worst-case performance, combined with MACER, an on-policy smoothness term whose gradient follows the certified radius of randomized smoothing on the policy wrapper.
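As a rough illustration of the adversarial half, the sketch below runs an L-infinity projected-gradient attack on the observation of a toy linear-softmax policy. The attack optimizer, budget, and step size used in the paper are not given in this summary, so `pgd_observation_attack`, `eps`, `step_size`, `n_steps`, the 16-dimensional observation, and the five-action head are all assumptions chosen for illustration, not the authors' configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def clean_action_loss(W, b, obs, action):
    """Cross-entropy of `action` under a toy linear-softmax policy (assumed model)."""
    return -np.log(softmax(W @ obs + b)[action])

def pgd_observation_attack(W, b, obs, eps, step_size, n_steps):
    """L-infinity PGD that pushes the policy away from its clean argmax action."""
    action = int(np.argmax(W @ obs + b))             # action chosen on the clean observation
    onehot = np.eye(W.shape[0])[action]
    adv = obs.copy()
    for _ in range(n_steps):
        probs = softmax(W @ adv + b)
        grad = W.T @ (probs - onehot)                # d(loss)/d(obs) for linear logits
        adv = adv + step_size * np.sign(grad)        # signed ascent step on the loss
        adv = np.clip(adv, obs - eps, obs + eps)     # project back into the eps-box
    return adv, action

# Hypothetical sizes and budget, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 16))
b = np.zeros(5)
obs = rng.normal(size=16)
adv_obs, clean_action = pgd_observation_attack(
    W, b, obs, eps=0.1, step_size=0.02, n_steps=10
)
```

In a training loop of this kind, the perturbed observation would stand in for the clean one when the policy is evaluated, and checkpoints would be scored on attacked episodes rather than clean ones.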
If this is right
- The same network architecture and argmax deployment loop can be retained with no added runtime cost.
- Per-attack curves demonstrate improvement across varying perturbation strengths.
- A certified action-stability check on the smoothed wrapper provides a quantitative sanity measure.
- Side-by-side rollout visualizations show specific failure modes prevented inside individual environment instances.
- The Adv-PPO+MACER result (77.5 percent plus or minus 6.0 percent) is reported across three independent seeds.
Where Pith is reading between the lines
- The same training pattern could apply to other decentralized multi-agent reinforcement learning settings where local observation noise triggers cascading coordination failures.
- Physical robot teams facing sensor noise or interference might benefit from analogous adversarial-plus-smoothing training.
- Direct robustness guarantees for the deployed policy would require extending certification beyond the smoothed wrapper.
- Scaling the approach to larger grids or more agents would test whether the recovered robustness holds without increased clean cost.
Load-bearing premise
That perturbations used in training and the smoothness term, whose certification applies only to the smoothed wrapper, will protect the actual deployed argmax policy against real-world observation attacks.
What would settle it
Testing the final deployed argmax policy without the smoothing wrapper on the same POGEMA 8x8 four-agent maps under the attack model and finding success rates near the original 2.5 percent instead of 77.5 percent.
Original abstract
Decentralized multi-agent path finding (MAPF) routes a team of agents on a shared grid, each acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour, and the team jams. In this paper we present two training recipes that keep the same network and the same deployment loop, yet make the policy hold up under perturbed observations. The first recipe, Adv-PPO, trains the shared policy against worst-case perturbations of its own input and selects the checkpoint by performance under adversarial perturbation. The second recipe, Adv-PPO+MACER, fine-tunes that checkpoint with a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv-PPO recovers worst-case success to 59.2% at one percentage point of clean cost. Adv-PPO+MACER recovers it to 77.5% +/- 6.0% across three independent seeds at less than one percentage point of clean cost. We support these numbers with per-attack curves, a certified action-stability sanity check (which measures the smoothed-policy wrapper, not the deployed argmax policy), and side-by-side rollout storyboards that show the failure mode and the fix inside one environment instance.
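The randomized-smoothing quantities mentioned in the abstract can be sketched as follows. The code takes a majority vote over the base policy's argmax action under Gaussian observation noise and computes the plug-in L2 radius, sigma / 2 * (Phi^-1(p_A) - Phi^-1(p_B)), known from the randomized-smoothing literature. The toy linear policy, `sigma`, and the sample count are assumptions made for illustration. The plug-in estimate uses raw vote frequencies, so it is not a statistically valid certificate; a real check would use confidence bounds on p_A and p_B.

```python
import numpy as np
from statistics import NormalDist

def smoothed_vote(W, b, obs, sigma, n_samples, rng):
    """Majority vote of the base argmax action under Gaussian observation noise."""
    noisy = obs + rng.normal(scale=sigma, size=(n_samples, obs.shape[0]))
    actions = np.argmax(noisy @ W.T + b, axis=1)       # base-policy action per noisy sample
    votes = np.bincount(actions, minlength=W.shape[0])
    return int(np.argmax(votes)), votes

def plug_in_radius(votes, sigma):
    """Plug-in L2 radius from the top-two vote frequencies (not a valid certificate)."""
    freqs = np.sort(votes / votes.sum())[::-1]
    # Clip so the inverse normal CDF stays finite when a frequency is 0 or 1.
    p_a, p_b = (float(np.clip(p, 1e-6, 1 - 1e-6)) for p in freqs[:2])
    inv = NormalDist().inv_cdf
    return 0.5 * sigma * (inv(p_a) - inv(p_b))

# Hypothetical toy policy and noise level.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 16))
b = np.zeros(5)
obs = rng.normal(size=16)
action, votes = smoothed_vote(W, b, obs, sigma=0.25, n_samples=500, rng=rng)
radius = plug_in_radius(votes, sigma=0.25)
```

The radius describes the noisy majority-vote wrapper. The plain argmax on a clean observation is a different function, which is the scope limitation the abstract itself flags.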
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that two training recipes produce robust decentralized MAPF policies: Adv-PPO (adversarial training of a shared PPO policy against input perturbations) and Adv-PPO+MACER (fine-tuning with an on-policy smoothness term whose gradient follows the certified radius from randomized smoothing). On POGEMA 8x8 maps with four agents, the baseline PPO reaches 95.8% clean success but only 2.5% under the strongest attack; Adv-PPO recovers worst-case success to 59.2% at one percentage point of clean cost; Adv-PPO+MACER reaches 77.5% ±6.0% (three seeds) at less than one percentage point of clean cost. Support includes per-attack curves, a certified action-stability check on the smoothed wrapper, and rollout visualizations.
Significance. If the empirical gains hold and robustness transfers to the deployed policy, the work offers a practical recipe for observation-robust MAPF with relevance to safety-critical multi-agent robotics. The combination of adversarial training with a smoothness term driven by the certified radius, the multi-seed error bars, and the side-by-side storyboards are concrete strengths that would make the result useful if the central transfer assumption is validated.
major comments (2)
- Abstract: The certified action-stability sanity check is explicitly limited to the smoothed-policy wrapper, yet the deployed policy is the deterministic argmax over logits. No argument, theorem, or experiment is supplied showing that the certified radius or the on-policy smoothness term transfers robustness to the argmax policy under the same observation attacks; this link is load-bearing for the claim that Adv-PPO+MACER supplies a 'principled' defense.
- Results and experimental details: The reported recovery to 77.5% ±6.0% worst-case success depends on the specific attack generation procedure and the values chosen for the perturbation budget and smoothness coefficient, yet these are not fully specified. Without them the numerical gains cannot be reproduced or stress-tested, undermining the strength of the empirical claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: Abstract: The certified action-stability sanity check is explicitly limited to the smoothed-policy wrapper, yet the deployed policy is the deterministic argmax over logits. No argument, theorem, or experiment is supplied showing that the certified radius or the on-policy smoothness term transfers robustness to the argmax policy under the same observation attacks; this link is load-bearing for the claim that Adv-PPO+MACER supplies a 'principled' defense.
  Authors: We acknowledge that the certified sanity check applies only to the smoothed wrapper, as already noted in the manuscript abstract. The on-policy smoothness term uses the certified radius to regularize the base policy logits; we expect this to improve robustness of the subsequent argmax policy, and the primary supporting evidence is the empirical worst-case success of the deployed policy. No formal transfer theorem is provided. We will revise the abstract and add a short discussion paragraph clarifying the scope of the sanity check and the intended (empirically supported) role of the smoothness term.
  revision: partial
- Referee: Results and experimental details: The reported recovery to 77.5% ±6.0% worst-case success depends on the specific attack generation procedure and the values chosen for the perturbation budget and smoothness coefficient, yet these are not fully specified. Without them the numerical gains cannot be reproduced or stress-tested, undermining the strength of the empirical claim.
  Authors: We agree that full reproducibility requires explicit specification of the attack procedure, perturbation budgets, and smoothness coefficient. These values appear in the experimental section and appendix but are not listed in a single, easily referenceable location. In the revision we will add a dedicated table (or expanded subsection) giving the exact attack optimizer, step count, step size, training and evaluation ε values, and the MACER smoothness coefficient.
  revision: yes
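One way to picture the promised table is as a small checklist object. The sketch below is purely hypothetical: the class name `RobustnessSettings` and its field names are invented here to mirror the items the authors list, and every value is left unset because the summary does not report them.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple

@dataclass
class RobustnessSettings:
    """Hypothetical checklist of the settings the rebuttal promises to tabulate."""
    attack_optimizer: Optional[str] = None        # not stated in this summary
    attack_steps: Optional[int] = None            # not stated in this summary
    attack_step_size: Optional[float] = None      # not stated in this summary
    train_eps: Optional[float] = None             # training perturbation budget
    eval_eps: Optional[Tuple[float, ...]] = None  # evaluation perturbation budgets
    macer_coefficient: Optional[float] = None     # weight of the smoothness term

    def missing(self):
        """Names of settings that are still unreported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

settings = RobustnessSettings()
unreported = settings.missing()
```

A reproduction attempt could refuse to run until `missing()` returns an empty list, which makes the referee's reproducibility concern mechanically checkable.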
Circularity Check
No circularity: empirical training results on held-out scenarios
full rationale
The paper reports empirical success rates from training Adv-PPO and Adv-PPO+MACER on POGEMA 8x8 maps, with numbers obtained by running the trained policies on test scenarios. The abstract explicitly flags that the certified action-stability check applies only to the smoothed wrapper and not the deployed argmax policy, but this is presented as a sanity check rather than a load-bearing derivation. No equation, parameter fit, or self-citation reduces the central numerical claims to a tautology or to the training inputs by construction. The results remain falsifiable on new maps and attacks.
Axiom & Free-Parameter Ledger
free parameters (2)
- perturbation budget
- smoothness coefficient
axioms (2)
- domain assumption: Worst-case perturbations generated during training approximate the observation attacks that matter in deployment.
- domain assumption: The certified radius from randomized smoothing on the smoothed wrapper provides meaningful protection for the deployed argmax policy.