
arxiv: 2606.12918 · v1 · pith:CNMSZV6Pnew · submitted 2026-06-11 · 💻 cs.CR · cs.AI

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Pith reviewed 2026-06-27 06:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords multi-agent systems · red-teaming · Shapley values · collusion · adversarial attacks · hierarchical systems · system robustness · coordinated attacks

The pith

MAStrike uses Shapley value analysis to select agent coalitions for coordinated attacks that outperform heuristic red-teaming in hierarchical multi-agent systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for red-teaming hierarchical multi-agent systems by first measuring each agent's individual contribution to overall safety through Shapley values computed on task distributions. This measurement then directs the choice of which groups of agents to compromise together, enabling coordinated manipulations that bypass defenses more reliably than isolated perturbations. The work also introduces a benchmark with controllable environments across finance, software engineering, and CRM domains to test these attacks on systems built from frontier models. If the attribution step holds, it reveals coordination patterns and interaction structures that prior single-agent approaches overlook. The result is a closed-loop process that refines attacks by diagnosing which uncompromised agents block success.

Core claim

MAStrike performs the first agent-level Shapley value analysis for multi-agent systems to quantify each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, the framework identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis that attributes failure cases to uncompromised agents. Experiments across multiple frontier models demonstrate that MAStrike substantially outperforms heuristic baselines while uncovering non-trivial Shapley value distributions and higher-order interaction structures among agents.

What carries the argument

Agent-level Shapley value analysis, which quantifies each agent's marginal contribution to system robustness under task-specific distributions and guides coalition selection for attacks.
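
A minimal sketch of this attribution step, assuming the value function v(S) is the attack success rate (ASR) observed when the agents in coalition S are compromised, as the Figure 1 caption suggests. The agent names and the `toy_asr` function are hypothetical placeholders, not taken from the paper, and the full enumeration is only practical for small agent sets.

```python
from itertools import combinations
from math import factorial


def exact_shapley(agents, value):
    """Exact Shapley values by enumerating every coalition of the other agents."""
    n = len(agents)
    phi = {}
    for i in agents:
        others = [a for a in agents if a != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of agent i when it joins coalition s
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_asr(coalition):
    """Hypothetical ASR: the verifier alone helps a little, and the
    verifier plus the approver together help a lot."""
    asr = 0.0
    if "verifier" in coalition:
        asr += 0.2
    if {"verifier", "approver"} <= coalition:
        asr += 0.6
    return asr


agents = ["verifier", "approver", "logger"]  # hypothetical roles
phi = exact_shapley(agents, toy_asr)
for name, score in phi.items():
    print(f"{name}: {score:.3f}")
```

In this toy game the verifier receives about 0.5, the approver about 0.3, and the logger 0. The values sum to the full-coalition ASR of 0.8, which is the efficiency property that makes Shapley values usable as an attribution.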

If this is right

  • Vulnerable agent coalitions become identifiable through attribution rather than manual or random choice.
  • Coordinated, role-aware adversarial manipulations can be generated to exploit cross-agent interactions.
  • Iterative causal diagnosis allows refinement by targeting agents that block initial attack attempts.
  • Higher-order interaction structures among agents become visible for analysis beyond single-agent methods.
  • A standardized benchmark enables consistent evaluation of red-teaming approaches across topologies and domains.
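
The "interaction structures" mentioned in the list above can be illustrated with the standard pairwise Shapley interaction index from cooperative game theory. This is a sketch under assumptions: the value function is taken to be coalition-level ASR, the agent names and `toy_asr` are hypothetical, and the paper's own estimator may differ in detail.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(agents, value, i, j):
    """Pairwise Shapley interaction index for agents i and j (needs n >= 2)."""
    n = len(agents)
    others = [a for a in agents if a not in (i, j)]
    total = 0.0
    for size in range(n - 1):  # coalition sizes 0 .. n-2
        # Interaction weight |S|! (n - |S| - 2)! / (n - 1)!
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second difference: joint gain minus the two solo gains
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_asr(coalition):
    """Hypothetical ASR with a synergy between verifier and approver."""
    asr = 0.0
    if "verifier" in coalition:
        asr += 0.2
    if {"verifier", "approver"} <= coalition:
        asr += 0.6
    return asr


agents = ["verifier", "approver", "logger"]  # hypothetical roles
synergy = shapley_interaction(agents, toy_asr, "verifier", "approver")
no_synergy = shapley_interaction(agents, toy_asr, "verifier", "logger")
print(f"verifier-approver: {synergy:.3f}")
print(f"verifier-logger:   {no_synergy:.3f}")
```

A positive index means compromising the two agents together gains more than the sum of compromising each alone, which is the kind of signal a synergy-aware coalition choice would look for. A value near zero means the pair contributes independently.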

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenders could apply the same attribution technique to redistribute safety responsibilities and reduce the impact of any single coalition.
  • The benchmark environments could support testing of defensive mechanisms that detect or disrupt coordinated behaviors.
  • Similar marginal-contribution analysis might identify critical components in other distributed systems such as networked services or workflow pipelines.

Load-bearing premise

The premise is that Shapley value analysis on task-specific distributions accurately identifies the agents most responsible for system safety, and that this attribution reliably guides the discovery of effective collusive attack coalitions.

What would settle it

An ablation on the same MAS setups and models: if replacing Shapley-guided coalition selection with random or heuristic selection yields comparable attack success rates and failure patterns, the attribution step is not doing the work the paper claims for it.
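
One way such a comparison could be scored is a two-sided permutation test on per-task attack outcomes. The sketch below is illustrative only: the outcome lists are made-up placeholders (1 = attack succeeded, 0 = blocked), not results from the paper, and the function name and parameters are assumptions.

```python
import random


def permutation_test(outcomes_a, outcomes_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in success rates."""
    rng = random.Random(seed)
    n_a, n_b = len(outcomes_a), len(outcomes_b)
    observed = sum(outcomes_a) / n_a - sum(outcomes_b) / n_b
    pooled = list(outcomes_a) + list(outcomes_b)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign outcomes to the two strategies at random
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Placeholder outcomes, NOT data from the paper.
guided = [1] * 18 + [0] * 2        # hypothetical Shapley-guided selection
random_pick = [1] * 8 + [0] * 12   # hypothetical random selection

observed, p_value = permutation_test(guided, random_pick)
print(f"ASR difference: {observed:.2f}, p-value: {p_value:.4f}")
```

A large p-value on real runs would indicate that the two selection strategies are statistically indistinguishable, which is the outcome that would count against the premise.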

Figures

Figures reproduced from arXiv: 2606.12918 by Avni Kothari, Bo Li, Chejian Xu, Freddy Lecue, Jingyang Zhang, Sarah Tan, Wenbo Guo, Zhaorun Chen.

Figure 1: Overview of MAStrike. (Top) Agent-level Shapley value analysis: we generate red-teaming tasks, estimate coalition-level attack success rates (ASR), and compute Shapley values and interaction indices to quantify agent importance and inter-agent dependencies. (Bottom) Shapley-guided red-teaming optimization: given a malicious instruction, we leverage Shapley signals via task similarity, select a synergy-awar…
Figure 2: Illustrative examples of attacks against hierarchical MAS. Agent names are enclosed in angle brackets.
Figure 3: ASR (%) vs. coalition size k on Claude-Opus-4.7. MAStrike consistently outperforms baselines and scales effectively with k.
Accompanying text: The performance gap is particularly pronounced in the engineering and finance domains, where MAStrike frequently achieves ASR above 70–100%, suggesting that coordinated attacks can effectively bypass distributed safety mechanisms across multiple agents. Notably, AiTM fails almost en…
Figure 4: Shapley value distribution of agents in different domains. Agents with zero Shapley values are omitted.
Figure 5: Shapley interaction index in the engineering domain, where the ASR of composed agent pairs are…
Figure 6: Shapley interaction index.
Accompanying text (Appendix A, Additional details on sample-based estimation): Estimation from sampled coalitions. Equations (2) and (3) define the population Shapley value and interaction index, which require evaluating v_q on all 2^|A| coalitions. For domains with small attackable sets (|A| = 6 in finance), we enumerate the full power set and compute ϕ_i(q) and I_ij(q) exactly. For larger sets (engineering and…
Figure 7: Finance benign trajectory. The MAS independently verifies identity, device trust, account history, and…
Figure 8: Finance red-team trajectory. A malicious password-reset request succeeds after the compromised…
Figure 9: Engineering benign trajectory. Independent technical and management checks clear a normal bug-fix…
Figure 10: Engineering red-team trajectory. A request to remove SAST and secret-scan gates from the blocking…
Figure 11: CRM benign trajectory. The MAS creates an inbound Salesforce lead and sends an acknowledgment…
Figure 12: CRM red-team trajectory. An oversized refund succeeds when only the…
Figure 13: Finance MAS design.
Figure 14: Engineering MAS design.
Figure 15: CRM MAS design.
Accompanying text: … 200 CRM cases in the full benign benchmark. Following pilot calibration on the finance MAS, instances mix easy and hard contexts (e.g., busy-but-clean account histories, density-rich engineering repositories, and multi-stakeholder CRM accounts) so that the benchmark differentiates model capability rather than ceiling at uniform success. Benign instances are scored by an execution-based jud…
Original abstract

Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered as which agents are most responsible for system safety, and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We propose the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. GGuided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MAStrike, a closed-loop collusive red-teaming framework for hierarchical multi-agent systems. It proposes the first agent-level Shapley value analysis to quantify each agent's marginal contribution to system robustness under task-specific distributions, uses this to identify vulnerable coalitions, generates role-aware adversarial manipulations, and refines attacks via causal diagnosis of failures. The work also constructs a new MAS red-teaming benchmark spanning finance, software engineering, and CRM domains with varied topologies, and reports that MAStrike substantially outperforms heuristic baselines across MAS built on multiple frontier models while revealing non-trivial Shapley distributions and interaction structures.

Significance. If the experimental claims and Shapley attribution hold under scrutiny, the work would be significant for MAS security research by shifting from heuristic single-agent red-teaming to principled coalition discovery and attribution. The new benchmark and closed-loop refinement process could enable more systematic evaluation of distributed safety properties in deployed systems.

major comments (2)
  1. Abstract: The claim of 'extensive experiments' showing that MAStrike 'substantially outperforms heuristic baselines' is presented without any metrics, baselines, statistical significance tests, dataset sizes, or data-handling details, preventing assessment of whether the central outperformance result is supported.
  2. Abstract: The weakest assumption—that Shapley value analysis on task-specific distributions accurately identifies agents responsible for system safety and reliably guides collusive attack discovery—is stated without the value-function definition, coalition enumeration method, or any validation that the attribution step improves attack success over random or heuristic selection.
minor comments (1)
  1. Abstract: Typo 'GGuided' should be 'Guided'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential significance. We address the two major comments on the abstract below.

Point-by-point responses
  1. Referee: Abstract: The claim of 'extensive experiments' showing that MAStrike 'substantially outperforms heuristic baselines' is presented without any metrics, baselines, statistical significance tests, dataset sizes, or data-handling details, preventing assessment of whether the central outperformance result is supported.

    Authors: We agree that the abstract, constrained by length, does not include these specifics. The full manuscript provides them in Sections 4 (experimental setup, baselines, metrics, statistical tests) and 5 (results with dataset details and data handling). To address the concern directly, we will revise the abstract to reference the benchmark scale, domains, and that results include statistical validation. revision: yes

  2. Referee: Abstract: The weakest assumption—that Shapley value analysis on task-specific distributions accurately identifies agents responsible for system safety and reliably guides collusive attack discovery—is stated without the value-function definition, coalition enumeration method, or any validation that the attribution step improves attack success over random or heuristic selection.

    Authors: The value function, task-specific distributions, and coalition enumeration (exact for small sets, sampled otherwise) are formally defined in Section 3. Validation via ablations showing improvement over random/heuristic selection appears in Section 5. We will revise the abstract to briefly indicate the empirical validation of the Shapley-guided component. revision: yes
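
The "sampled otherwise" case in the response above can be illustrated with the standard permutation-sampling estimator for Shapley values. This is a sketch of that general technique, not a reconstruction of the paper's estimator, whose details are not visible here. The value function is assumed to be coalition-level ASR, and the agent names and `toy_asr` are hypothetical.

```python
import random


def sampled_shapley(agents, value, n_permutations=2000, seed=0):
    """Monte Carlo Shapley estimate from random agent orderings."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in agents}
    order = list(agents)
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for agent in order:
            # Credit each agent with its marginal gain in this ordering
            coalition = coalition | {agent}
            current = value(coalition)
            totals[agent] += current - previous
            previous = current
    return {a: t / n_permutations for a, t in totals.items()}


def toy_asr(coalition):
    """Hypothetical ASR with a synergy between verifier and approver."""
    asr = 0.0
    if "verifier" in coalition:
        asr += 0.2
    if {"verifier", "approver"} <= coalition:
        asr += 0.6
    return asr


agents = ["verifier", "approver", "logger"]  # hypothetical roles
estimate = sampled_shapley(agents, toy_asr)
for name, score in estimate.items():
    print(f"{name}: {score:.3f}")
```

Each sampled ordering costs one value-function evaluation per agent, so the budget grows linearly in the number of orderings rather than exponentially in the number of agents. In a real MAS every evaluation is itself a batch of attack runs, so the number of orderings would be the main cost knob.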

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper applies the standard Shapley value from cooperative game theory as an attribution tool to MAS robustness under task distributions, without re-deriving or redefining the value function in terms of its own outputs. Benchmark construction, coalition identification, and empirical outperformance versus heuristics are presented as experimental results rather than closed-form predictions or self-referential derivations. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description that would reduce the central claims to inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, methods sections, or implementation details from which to extract free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5811 in / 1022 out tokens · 16782 ms · 2026-06-27T06:39:10.128889+00:00 · methodology

