MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems
Pith reviewed 2026-06-27 06:39 UTC · model grok-4.3
The pith
MAStrike uses Shapley value analysis to select agent coalitions for coordinated attacks that outperform heuristic red-teaming in hierarchical multi-agent systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAStrike performs the first agent-level Shapley value analysis for multi-agent systems to quantify each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, the framework identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis that attributes failure cases to uncompromised agents. Experiments across multiple frontier models demonstrate that MAStrike substantially outperforms heuristic baselines while uncovering non-trivial Shapley value distributions and higher-order interaction structures among agents.
What carries the argument
Agent-level Shapley value analysis, which quantifies each agent's marginal contribution to system robustness under task-specific distributions and guides coalition selection for attacks.
If this is right
- Vulnerable agent coalitions become identifiable through attribution rather than manual or random choice.
- Coordinated, role-aware adversarial manipulations can be generated to exploit cross-agent interactions.
- Iterative causal diagnosis allows refinement by targeting agents that block initial attack attempts.
- Higher-order interaction structures among agents become visible for analysis beyond single-agent methods.
- A standardized benchmark enables consistent evaluation of red-teaming approaches across topologies and domains.
Where Pith is reading between the lines
- Defenders could apply the same attribution technique to redistribute safety responsibilities and reduce the impact of any single coalition.
- The benchmark environments could support testing of defensive mechanisms that detect or disrupt coordinated behaviors.
- Similar marginal-contribution analysis might identify critical components in other distributed systems such as networked services or workflow pipelines.
Load-bearing premise
Shapley value analysis on task-specific distributions accurately identifies the agents most responsible for system safety and that this attribution reliably guides the discovery of effective collusive attack coalitions.
What would settle it
Experiments on the same MAS setups and models where replacing Shapley-guided coalition selection with random or heuristic selection yields comparable attack success rates and failure patterns.
Figures
read the original abstract
Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered as which agents are most responsible for system safety, and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We propose the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. GGuided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MAStrike, a closed-loop collusive red-teaming framework for hierarchical multi-agent systems. It proposes the first agent-level Shapley value analysis to quantify each agent's marginal contribution to system robustness under task-specific distributions, uses this to identify vulnerable coalitions, generates role-aware adversarial manipulations, and refines attacks via causal diagnosis of failures. The work also constructs a new MAS red-teaming benchmark spanning finance, software engineering, and CRM domains with varied topologies, and reports that MAStrike substantially outperforms heuristic baselines across MAS built on multiple frontier models while revealing non-trivial Shapley distributions and interaction structures.
Significance. If the experimental claims and Shapley attribution hold under scrutiny, the work would be significant for MAS security research by shifting from heuristic single-agent red-teaming to principled coalition discovery and attribution. The new benchmark and closed-loop refinement process could enable more systematic evaluation of distributed safety properties in deployed systems.
major comments (2)
- Abstract: The claim of 'extensive experiments' showing that MAStrike 'substantially outperforms heuristic baselines' is presented without any metrics, baselines, statistical significance tests, dataset sizes, or data-handling details, preventing assessment of whether the central outperformance result is supported.
- Abstract: The weakest assumption—that Shapley value analysis on task-specific distributions accurately identifies agents responsible for system safety and reliably guides collusive attack discovery—is stated without the value-function definition, coalition enumeration method, or any validation that the attribution step improves attack success over random or heuristic selection.
minor comments (1)
- Abstract: Typo 'GGuided' should be 'Guided'.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the work's potential significance. We address the two major comments on the abstract below.
read point-by-point responses
-
Referee: Abstract: The claim of 'extensive experiments' showing that MAStrike 'substantially outperforms heuristic baselines' is presented without any metrics, baselines, statistical significance tests, dataset sizes, or data-handling details, preventing assessment of whether the central outperformance result is supported.
Authors: We agree that the abstract, constrained by length, does not include these specifics. The full manuscript provides them in Sections 4 (experimental setup, baselines, metrics, statistical tests) and 5 (results with dataset details and data handling). To address the concern directly, we will revise the abstract to reference the benchmark scale, domains, and that results include statistical validation. revision: yes
-
Referee: Abstract: The weakest assumption—that Shapley value analysis on task-specific distributions accurately identifies agents responsible for system safety and reliably guides collusive attack discovery—is stated without the value-function definition, coalition enumeration method, or any validation that the attribution step improves attack success over random or heuristic selection.
Authors: The value function, task-specific distributions, and coalition enumeration (exact for small sets, sampled otherwise) are formally defined in Section 3. Validation via ablations showing improvement over random/heuristic selection appears in Section 5. We will revise the abstract to briefly indicate the empirical validation of the Shapley-guided component. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper applies the standard Shapley value from cooperative game theory as an attribution tool to MAS robustness under task distributions, without re-deriving or redefining the value function in terms of its own outputs. Benchmark construction, coalition identification, and empirical outperformance versus heuristics are presented as experimental results rather than closed-form predictions or self-referential derivations. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description that would reduce the central claims to inputs by construction. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Syntaxshap: Syntax-aware explainability method for text generation
Kenza Amara, Rita Sevastjanova, and Mennatallah El-Assady. Syntaxshap: Syntax-aware explainability method for text generation. InACL, 2024
2024
-
[2]
Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate
Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Yang Wang. Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate. InEMNLP, 2024
2024
-
[3]
Agentharm: A benchmark for measuring harmfulness of llm agents, 2024
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents, 2024. InICLR, 2025
2024
-
[4]
Polynomial calculation of the shapley value based on sampling.Computers & Operations Research, 2009
Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling.Computers & Operations Research, 2009
2009
-
[5]
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. InNeurIPS, 2024
2024
-
[6]
Lira: A multi-agent framework for reliable and readable literature review generation
Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Seyed Amin Tabatabaei, Maarten de Rijke, and Xinyi Chen. Lira: A multi-agent framework for reliable and readable literature review generation. InAAAI, 2026. 10
2026
-
[7]
An axiomatic approach to the concept of interaction among players in cooperative games.International Journal of Game Theory, 1999
Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games.International Journal of Game Theory, 1999
1999
-
[8]
Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. InICML, 2024
2024
-
[9]
Large language model based multi-agents: A survey of progress and challenges
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. InIJCAI, 2024
2024
-
[10]
Multi- agent risks from advanced ai.arXiv preprint arXiv:2502.14143, 2025
Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenˇciak, et al. Multi- agent risks from advanced ai.arXiv preprint arXiv:2502.14143, 2025
-
[11]
Red-teaming llm multi-agent systems via communication attacks
Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming llm multi-agent systems via communication attacks. InACL, 2025
2025
-
[12]
MetaGPT: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. InICLR, 2024
2024
-
[13]
On the resilience of llm-based multi-agent collaboration with faulty agents
Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of llm-based multi-agent collaboration with faulty agents. InICML, 2024
2024
-
[14]
Towards efficient data valuation based on the shapley value.Proceedings of AISTATS 2019, 2019
Ruoxi Jia, David Dao, Boxin Wang, Frances A Hubis, Nezihe M Gürel, Nick Hynes, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value.Proceedings of AISTATS 2019, 2019
2019
-
[15]
Fashapley: Fast and approximated shapley based model pruning towards certifiably robust dnns.SaTML, 2022
Mintong Kang, Linyi Li, and Bo Li. Fashapley: Fast and approximated shapley based model pruning towards certifiably robust dnns.SaTML, 2022
2022
-
[16]
Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, and Tanuja Ganu. Tamas: Benchmarking adversarial risks in multi-agent llm systems.arXiv preprint arXiv:2511.05269, 2025
-
[17]
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
A unified approach to interpreting model predictions
Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017
2017
-
[19]
Communicative agents for software development
Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. InACL, 2024
2024
-
[20]
Identifying the risks of lm agents with an lm-emulated sandbox
Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox. InICLR, 2024
2024
-
[21]
Rollout-based shapley values for explainable cooperative multi-agent reinforcement learning
Franco Ruggeri, William Emanuelsson, Ahmad Terra, Rafia Inam, and Karl H Johansson. Rollout-based shapley values for explainable cooperative multi-agent reinforcement learning. InIEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), 2024
2024
-
[22]
A value for n-person games
Lloyd S Shapley et al. A value for n-person games. 1953
1953
-
[23]
Medagents: Large language models as collaborators for zero-shot medical reasoning
Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InACL, 2024. 11
2024
-
[24]
Yiling Tao, Xinran Zheng, Shuo Yang, Meiling Tao, and Xingjun Wang. Groupguard: A framework for modeling and defending collusive attacks in multi-agent systems.arXiv preprint arXiv:2603.13940, 2026
-
[25]
Shapley q-value: A local reward approach to solve global reward games
Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: A local reward approach to solve global reward games. InAAAI, 2020
2020
-
[26]
Tokenshapley: Token level context attribution with shapley value
Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T Wang, and Jian Du. Tokenshapley: Token level context attribution with shapley value. InACL, 2025
2025
-
[27]
Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety
Zaibin Zhang, Yongting Zhang, Lijun Li, Jing Shao, Hongzhi Gao, Yu Qiao, Lijun Wang, Huchuan Lu, and Feng Zhao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. InACL, 2024. 12 IV DT TR AH AR PC IV DT TR AH AR PC Finance 0.61 -0.06 -0.00 -0.02 -0.05 -0.01 0.02 -0.08 -0.09 0.02 0.09 -0...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.