Architecture Matters for Multi-Agent Security

Ben Hagag; Christian Schroeder de Witt; Sarah Scheffler; William L. Anderson

arXiv: 2604.23459 · v1 · submitted 2026-04-25 · cs.MA · cs.CR · cs.LG

Pith reviewed 2026-05-08 06:47 UTC · model grok-4.3

keywords: multi-agent systems · AI security · agent architecture · attack resistance · communication topology · agent roles · memory · empirical evaluation

The pith

Multi-agent AI architectures create attack vulnerabilities not present in single-agent systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests thirteen different ways to organize networks of AI agents across three environments to measure how design affects both task success and resistance to attacks. It separates attack progress into stages like planning refusal and full completion to isolate where failures occur. In most configurations, these networks prove easier to compromise than a single agent working alone, even when the multi-agent version matches or exceeds normal task accuracy. Attack success rates differ by as much as 3.8 times depending on how roles, communication, and memory are set up. The study concludes that no arrangement protects against every threat.

Core claim

Across three agentic environments and thirteen architectural configurations, multi-agent systems are more vulnerable than standalone agents in the majority of cases, with attack success rates varying by up to 3.8x at comparable or higher benign accuracy, and no single design is universally safer.
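As a rough illustration of how an "up to 3.8x" figure could be computed, the sketch below takes the ratio of the highest to the lowest attack success rate (ASR) among configurations whose benign accuracy is comparable to or above a baseline. The `ConfigResult` type, the `asr_spread` function, the `tolerance` threshold, and all numbers are assumptions made for illustration; the paper's exact definition of "comparable" is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigResult:
    name: str               # hypothetical configuration label
    attack_success: float   # attack success rate (ASR), in [0, 1]
    benign_accuracy: float  # benign task accuracy, in [0, 1]


def asr_spread(results, baseline_accuracy, tolerance=0.02):
    """Ratio of highest to lowest ASR among configurations whose benign
    accuracy is at least baseline_accuracy - tolerance (assumed criterion)."""
    eligible = [r for r in results
                if r.benign_accuracy >= baseline_accuracy - tolerance]
    if len(eligible) < 2:
        raise ValueError("need at least two comparable configurations")
    rates = [r.attack_success for r in eligible]
    if min(rates) == 0:
        raise ValueError("spread is undefined when the lowest ASR is zero")
    return max(rates) / min(rates)


# Made-up numbers for illustration only; they are not the paper's results.
demo = [
    ConfigResult("config_a", 0.10, 0.80),
    ConfigResult("config_b", 0.25, 0.82),
    ConfigResult("config_c", 0.40, 0.60),  # excluded: benign accuracy too low
]
spread = asr_spread(demo, baseline_accuracy=0.80)
```

Filtering on benign accuracy first matters: without it, a configuration that is both insecure and bad at the task would inflate the ratio.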

What carries the argument

Stagewise attack evaluations that track planning refusal, execution interception, partial harmful execution, and successful completion, applied while varying agent roles, communication topology, and memory.
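The four stages named above can be sketched as a simple per-trial classifier and tally. The stage names come from the paper's abstract; the decision criteria in `classify` (a refusal flag plus counts of harmful steps) and the trial tuple format are assumptions, since the paper's exact success and failure criteria are not reproduced here.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    PLANNING_REFUSAL = "planning refusal"
    EXECUTION_INTERCEPTION = "execution interception"
    PARTIAL_HARMFUL_EXECUTION = "partial harmful execution"
    SUCCESSFUL_COMPLETION = "successful completion"


def classify(refused_at_planning, harmful_steps_done, harmful_steps_total):
    """Map one attack trial to the stage where it ended (assumed criteria)."""
    if refused_at_planning:
        return Stage.PLANNING_REFUSAL
    if harmful_steps_done == 0:
        # Planning went ahead, but nothing harmful was actually executed.
        return Stage.EXECUTION_INTERCEPTION
    if harmful_steps_done < harmful_steps_total:
        return Stage.PARTIAL_HARMFUL_EXECUTION
    return Stage.SUCCESSFUL_COMPLETION


def stage_rates(trials):
    """Fraction of trials ending at each stage."""
    trials = list(trials)
    if not trials:
        raise ValueError("no trials")
    counts = Counter(classify(*t) for t in trials)
    return {stage: counts[stage] / len(trials) for stage in Stage}


# Hypothetical trial log: (refused_at_planning, steps_done, steps_total)
demo_trials = [(True, 0, 4), (False, 0, 4), (False, 2, 4), (False, 4, 4)]
rates = stage_rates(demo_trials)
```

Reporting all four rates rather than only the last one is what lets a reader see whether an architecture fails at planning or during execution.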

If this is right

Security evaluations of AI agents must include full system architecture rather than testing agents in isolation.
Attack resistance depends on specific choices for roles, topology, and memory, not just individual agent strength.
Developers cannot rely on any one multi-agent design as safe across different tasks or threats.
Further evaluations are needed that move beyond single-agent security properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems using multiple agents may require architecture-specific testing before deployment to avoid hidden attack surfaces.
The results raise the question of whether adding more agents tends to widen the security gap unless topology and memory are deliberately restricted.
Similar stagewise measurements could be applied to other coordination mechanisms, such as tool access or external memory stores, to map additional risks.

Load-bearing premise

The thirteen tested configurations and the attack scenarios chosen for the three environments stand in for the full range of security-relevant designs and real-world threats that multi-agent systems will encounter.

What would settle it

An experiment that applies the same stagewise attack measurements to a broader set of environments or attack methods and finds that single-agent systems are not less vulnerable overall, or that architecture choices no longer produce large differences in success rates, would falsify the central claim.
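One half of that test, the "majority of configurations" claim, reduces to a count against the single-agent baseline. The sketch below is a minimal version under assumptions: the function names, the strict `>` comparison, and the placeholder rates are illustrative and ignore statistical uncertainty.

```python
def fraction_more_vulnerable(mas_asr, single_agent_asr):
    """Share of multi-agent configurations whose ASR exceeds the baseline."""
    if not mas_asr:
        raise ValueError("no configurations")
    worse = sum(1 for rate in mas_asr.values() if rate > single_agent_asr)
    return worse / len(mas_asr)


def majority_claim_holds(mas_asr, single_agent_asr):
    """True when more than half of the configurations are more vulnerable."""
    return fraction_more_vulnerable(mas_asr, single_agent_asr) > 0.5


# Placeholder rates for illustration only; they are not the paper's results.
demo_asr = {"config_a": 0.30, "config_b": 0.12, "config_c": 0.08}
holds = majority_claim_holds(demo_asr, single_agent_asr=0.10)
```

A replication on new environments would falsify the majority claim if this check came back false with adequately sized trials.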

Figures

Figures reproduced from arXiv: 2604.23459 by Ben Hagag, Christian Schroeder de Witt, Sarah Scheffler, William L. Anderson.

**Figure 1.** Architecture flips refusal behavior on the same task. A single agent refuses the full harmful request at planning time (a); the same base model, reconfigured as a star-topology orchestrator dispatching to four BrowserART specialists (b), decomposes the task into atomic browser calls that each specialist executes without refusal. Only the architecture changes. This raises HT from 10.0% to 31.0% on the entir…

Original abstract

Multi-agent systems (MAS), composed of networks of two or more autonomous AI agents, have become increasingly popular in production deployments, yet introduce security risks that do not arise in single-agent settings. Even if individual agents exhibit robust security, architectural decisions governing their coordination can create attack surfaces that have not been systematically characterized. In this work, we present an empirical study of how MAS design decisions shape the tradeoff between task performance and attack resistance. Across three agentic environments (browser, desktop, and code) and 13 architectural configurations, we use stagewise evaluations that distinguish planning refusal, execution-stage interception, partial harmful execution, and successful attack completion to study three key design choices: (i) agent roles, which determine how authority and responsibility are allocated; (ii) communication topology, which shapes how and when agents interact; and (iii) memory, which determines the context and state visibility accessible to each agent. We find that multi-agent architectures are more vulnerable than standalone agents in the majority of configurations, with attack success rates varying by up to 3.8x at comparable or higher benign accuracy, and that no single design is universally safer. These results motivate the development of further evaluations that move beyond the security properties of a single agent.
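To make the three design axes concrete, here is a minimal sketch of a configuration record with roles, communication topology, and memory. The star layout with one orchestrator and four specialists follows the Figure 1 caption; the chain and fully connected topologies and the shared/private memory labels are taken from the simulated rebuttal further down. The `Architecture` class and helper functions are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass
from itertools import combinations


def star(agents):
    """First agent is the hub; every other agent links only to it."""
    hub, *spokes = agents
    return {frozenset((hub, s)) for s in spokes}


def chain(agents):
    """Each agent links to the next one in order."""
    return {frozenset(pair) for pair in zip(agents, agents[1:])}


def fully_connected(agents):
    """Every pair of agents can communicate directly."""
    return {frozenset(pair) for pair in combinations(agents, 2)}


TOPOLOGIES = {"star": star, "chain": chain, "fully_connected": fully_connected}


@dataclass(frozen=True)
class Architecture:
    roles: tuple   # one unique role label per agent, doubling as an agent id
    topology: str  # key into TOPOLOGIES
    memory: str    # "shared" or "private" (assumed labels)

    def links(self):
        return TOPOLOGIES[self.topology](list(self.roles))

    def can_talk(self, a, b):
        return frozenset((a, b)) in self.links()


demo = Architecture(
    roles=("orchestrator", "specialist_1", "specialist_2",
           "specialist_3", "specialist_4"),
    topology="star",
    memory="private",
)
```

In this star layout, `demo.can_talk("specialist_1", "specialist_2")` is false: specialists see only what the orchestrator sends them.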

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Architecture choices affect attack success in multi-agent systems by up to 3.8x, but the 13 configs and fixed attacks in three environments leave the broader claim under-supported.

The core finding is that multi-agent setups can be more vulnerable than single agents in most tested cases, with attack success rates swinging by a factor of 3.8 while benign accuracy stays comparable or better, and no design wins across the board. That matches the abstract's claim and comes from stagewise metrics that separate planning refusal from full execution success.

The work is new in running the same attack scenarios across role allocations, communication topologies, and memory choices in browser, desktop, and code environments. The stagewise lens is a practical addition because it shows where defenses actually fail rather than just reporting end-to-end success. That kind of breakdown is missing from most single-agent security papers and gives engineers something concrete to measure against when they pick a topology. The paper does a clean job of holding task performance roughly constant while varying the architecture, which lets the security differences stand out.

The main limitation is scope. Thirteen configurations are a start, but they may miss common patterns such as hierarchical planning or external tool calls that real deployments use. The attacks appear fixed rather than optimized against each architecture, so the 3.8x spread could shrink or reverse if adversaries adapt. Without the full methods, statistical tests, or raw data it is hard to judge whether the majority-vulnerability result holds after correcting for multiple comparisons or baseline selection. The environments are reasonable but narrow.

This paper is for teams already shipping or evaluating multi-agent systems who need a first cut at architecture-level risks. Security researchers looking for empirical baselines will find the numbers and the call for better evaluations useful. It is not yet a definitive map of the design space. The work deserves peer review because it surfaces a real gap with actual experiments rather than speculation. A referee can push for clearer protocols, more varied attacks, and checks on generalizability without discarding the contribution.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical study of security in multi-agent systems (MAS) versus single-agent baselines across three environments (browser, desktop, code) and 13 architectural configurations that vary agent roles, communication topology, and memory. Using stagewise metrics that separate planning refusal, execution interception, partial harmful execution, and full attack completion, the authors report that MAS designs are more vulnerable than standalone agents in the majority of tested configurations, with attack success rates (ASR) differing by up to 3.8x while maintaining comparable or higher benign accuracy, and that no single architecture is universally safer.

Significance. If the empirical scope is representative, the work is significant for establishing that architectural choices in MAS create distinct attack surfaces not captured by single-agent evaluations, thereby motivating architecture-aware security assessments. The use of multiple environments and stagewise metrics strengthens internal validity and provides concrete, falsifiable measurements of the performance-security tradeoff. The absence of machine-checked proofs or parameter-free derivations is expected for an empirical study, but the reproducible experimental design across 13 configs is a strength.

major comments (2)

[Experimental design] Experimental design section: The claim that MAS are more vulnerable 'in the majority of configurations' and that ASR varies by up to 3.8x is load-bearing for the central thesis, yet the manuscript provides no explicit coverage argument or sampling rationale for the 13 architectural variants (roles, topology, memory) or the chosen attack scenarios. Without this, it is unclear whether the majority finding and variation factor would persist under other common MAS patterns such as hierarchical planning or external tool integration.
[Results] Results and statistical analysis: The 3.8x ASR variation and 'majority' vulnerability statements require supporting details on trial counts, confidence intervals, exact baseline comparisons, and controls for post-hoc configuration selection; these are not fully reported, which directly affects verifiability of the quantitative claims.

minor comments (2)

[Evaluation methodology] The stagewise evaluation framework is clearly described but would benefit from an explicit diagram or table mapping each stage to the corresponding success/failure criteria.
[Attack scenarios] Some environment-specific attack descriptions could be expanded with pseudocode or example prompts to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study of multi-agent system security. We address each major comment below and will incorporate revisions to improve the clarity and verifiability of our claims.

Referee: [Experimental design] Experimental design section: The claim that MAS are more vulnerable 'in the majority of configurations' and that ASR varies by up to 3.8x is load-bearing for the central thesis, yet the manuscript provides no explicit coverage argument or sampling rationale for the 13 architectural variants (roles, topology, memory) or the chosen attack scenarios. Without this, it is unclear whether the majority finding and variation factor would persist under other common MAS patterns such as hierarchical planning or external tool integration.

Authors: We selected the 13 configurations to systematically vary the three core dimensions (roles, topology, and memory) across representative patterns drawn from existing MAS literature and production systems, including specialist/generalist role allocations, star/chain/fully-connected topologies, and shared/private memory setups. However, we acknowledge that an explicit coverage argument and sampling rationale were not detailed in the manuscript. In the revision, we will add a dedicated subsection to the Experimental Design section that justifies the choice of variants, maps them to common MAS patterns (including why hierarchical planning and external tool integration are partially covered via our role and topology variations), and discusses the attack scenarios as standard benchmarks from each environment. This will clarify the scope and support the generalizability of the majority-vulnerability and 3.8x variation findings.
Revision: yes
Referee: [Results] Results and statistical analysis: The 3.8x ASR variation and 'majority' vulnerability statements require supporting details on trial counts, confidence intervals, exact baseline comparisons, and controls for post-hoc configuration selection; these are not fully reported, which directly affects verifiability of the quantitative claims.

Authors: We agree that fuller statistical reporting is needed to support the quantitative claims. Our experiments used 100 independent trials per configuration-environment pair, with single-agent baselines run under identical conditions. In the revised manuscript, we will expand the Results section and add a supplementary table to report: (i) exact trial counts, (ii) bootstrap 95% confidence intervals for all ASR values, (iii) direct per-configuration comparisons to the single-agent baselines, and (iv) confirmation that the 13 configurations were pre-specified based on the three design axes rather than selected post-hoc. These additions will make the 3.8x variation and majority-vulnerability statements fully verifiable.
Revision: yes
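
The bootstrap 95% confidence intervals promised in the response above could be computed along the following lines. This is a minimal percentile-bootstrap sketch using only the standard library; the function name, resample count, seed, and the example outcome counts are assumptions, and the authors' actual procedure may differ.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 attack outcomes."""
    if not outcomes:
        raise ValueError("no outcomes")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    # Resample the trials with replacement and record each resample's ASR.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: 100 trials with 30 successful attacks (not from the paper).
demo_outcomes = [1] * 30 + [0] * 70
point = sum(demo_outcomes) / len(demo_outcomes)
low, high = bootstrap_ci(demo_outcomes)
```

With only 100 trials per cell the interval is wide, which is why per-configuration comparisons need the intervals and not just the point estimates.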

Circularity Check

0 steps flagged

No circularity: purely empirical measurements of attack rates

The paper conducts a direct empirical evaluation across 13 fixed architectural configurations and three environments, reporting measured attack success rates and benign accuracies as experimental outcomes. No equations, fitted parameters, predictions derived from prior author-defined quantities, or load-bearing self-citations appear in the derivation of the central claims. Results are obtained by running the described stagewise evaluations on the chosen setups; they do not reduce to any input by construction or via self-referential justification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical measurements rather than derivations. The study assumes the chosen attack vectors and environments capture relevant threats but introduces no new mathematical entities or fitted constants.

axioms (1)

Domain assumption: The stagewise attack scenarios used represent realistic threats that would arise in deployed multi-agent systems.
The paper relies on this to interpret the measured attack success rates as meaningful security differences.
