MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Avni Kothari; Bo Li; Chejian Xu; Freddy Lecue; Jingyang Zhang; Sarah Tan; Wenbo Guo; Zhaorun Chen

arxiv: 2606.12918 · v1 · pith:CNMSZV6Pnew · submitted 2026-06-11 · 💻 cs.CR · cs.AI

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Chejian Xu , Zhaorun Chen , Jingyang Zhang , Freddy Lecue , Avni Kothari , Sarah Tan , Wenbo Guo , Bo Li This is my paper

Pith reviewed 2026-06-27 06:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords multi-agent systemsred-teamingShapley valuescollusionadversarial attackshierarchical systemssystem robustnesscoordinated attacks

0 comments

The pith

MAStrike uses Shapley value analysis to select agent coalitions for coordinated attacks that outperform heuristic red-teaming in hierarchical multi-agent systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for red-teaming hierarchical multi-agent systems by first measuring each agent's individual contribution to overall safety through Shapley values computed on task distributions. This measurement then directs the choice of which groups of agents to compromise together, enabling coordinated manipulations that bypass defenses more reliably than isolated perturbations. The work also introduces a benchmark with controllable environments across finance, software engineering, and CRM domains to test these attacks on systems built from frontier models. If the attribution step holds, it reveals coordination patterns and interaction structures that prior single-agent approaches overlook. The result is a closed-loop process that refines attacks by diagnosing which uncompromised agents block success.

Core claim

MAStrike performs the first agent-level Shapley value analysis for multi-agent systems to quantify each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, the framework identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis that attributes failure cases to uncompromised agents. Experiments across multiple frontier models demonstrate that MAStrike substantially outperforms heuristic baselines while uncovering non-trivial Shapley value distributions and higher-order interaction structures among agents.

What carries the argument

Agent-level Shapley value analysis, which quantifies each agent's marginal contribution to system robustness under task-specific distributions and guides coalition selection for attacks.

If this is right

Vulnerable agent coalitions become identifiable through attribution rather than manual or random choice.
Coordinated, role-aware adversarial manipulations can be generated to exploit cross-agent interactions.
Iterative causal diagnosis allows refinement by targeting agents that block initial attack attempts.
Higher-order interaction structures among agents become visible for analysis beyond single-agent methods.
A standardized benchmark enables consistent evaluation of red-teaming approaches across topologies and domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenders could apply the same attribution technique to redistribute safety responsibilities and reduce the impact of any single coalition.
The benchmark environments could support testing of defensive mechanisms that detect or disrupt coordinated behaviors.
Similar marginal-contribution analysis might identify critical components in other distributed systems such as networked services or workflow pipelines.

Load-bearing premise

Shapley value analysis on task-specific distributions accurately identifies the agents most responsible for system safety and that this attribution reliably guides the discovery of effective collusive attack coalitions.

What would settle it

Experiments on the same MAS setups and models where replacing Shapley-guided coalition selection with random or heuristic selection yields comparable attack success rates and failure patterns.

Figures

Figures reproduced from arXiv: 2606.12918 by Avni Kothari, Bo Li, Chejian Xu, Freddy Lecue, Jingyang Zhang, Sarah Tan, Wenbo Guo, Zhaorun Chen.

**Figure 1.** Figure 1: Overview of MASTRIKE. (Top) Agent-level Shapley value analysis: we generate red-teaming tasks, estimate coalition-level attack success rates (ASR), and compute Shapley values and interaction indices to quantify agent importance and inter-agent dependencies. (Bottom) Shapley-guided red-teaming optimization: given a malicious instruction, we leverage Shapley signals via task similarity, select a synergy-awar… view at source ↗

**Figure 2.** Figure 2: Illustrative examples of attacks against hierarchical MAS. Agent names are enclosed in angle brackets. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: ASR (%) vs. coalition size k on ClaudeOpus-4.7. MASTRIKE consistently outperforms baselines and scales effectively with k The performance gap is particularly pronounced in the engineering and finance domains, where MASTRIKE frequently achieves ASR above 70–100%, suggesting that coordinated attacks can effectively bypass distributed safety mechanisms across multiple agents. Notably, AiTM fails almost en… view at source ↗

**Figure 4.** Figure 4: Shapley value distribution of agents in different domains. Agents with zero Shapley values are omitted. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Shapley interaction index in the engineering domain, where the ASR of composed agent pairs are [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Shapley interaction index. A Additional details on sample-based estimation Estimation from sampled coalitions. Equations (2) and (3) define the population Shapley value and interaction index, which require evaluating vq on all 2 |A| coalitions. For domains with small attackable sets (|A| = 6 in finance), we enumerate the full power set and compute ϕi(q) and Iij (q) exactly. For larger sets (engineering and… view at source ↗

**Figure 7.** Figure 7: Finance benign trajectory. The MAS independently verifies identity, device trust, account history, and [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Finance red-team trajectory. A malicious password-reset request succeeds after the compromised [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Engineering benign trajectory. Independent technical and management checks clear a normal bug-fix [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Engineering red-team trajectory. A request to remove SAST and secret-scan gates from the blocking [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: CRM benign trajectory. The MAS creates an inbound Salesforce lead and sends an acknowledgment [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: CRM red-team trajectory. An oversized refund succeeds when only the [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Finance MAS design [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗

**Figure 14.** Figure 14: Engineering MAS design [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗

**Figure 15.** Figure 15: CRM MAS design. 200 CRM cases in the full benign benchmark. Following pilot calibration on the finance MAS, instances mix easy and hard contexts (e.g., busy-but-clean account histories, density-rich engineering repositories, and multi-stakeholder CRM accounts) so that the benchmark differentiates model capability rather than ceiling at uniform success. Benign instances are scored by an execution-based jud… view at source ↗

read the original abstract

Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered as which agents are most responsible for system safety, and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We propose the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. GGuided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAStrike's agent-level Shapley attribution for collusive red-teaming in MAS is a genuine first step, but the abstract gives almost no concrete evidence on how the value function or experiments actually work.

read the letter

The paper's core move is to compute Shapley values over agents in hierarchical MAS to rank their marginal impact on system robustness, then feed that ranking into a closed-loop process that finds attacking coalitions and refines them with causal diagnosis. That combination has not appeared before in the cited literature, and the new benchmark spanning finance, software engineering, and CRM topologies is a practical addition.

They do a clean job framing the problem: single-agent or heuristic red-teaming misses coordinated privilege escalation, and the Shapley step gives an explicit way to prioritize which agents to target. The claim that they uncover non-trivial interaction structures among agents is plausible and worth checking.

The soft spots are mostly in the missing mechanics. The abstract never defines the value function used for the Shapley calculation or how they approximate the 2^n coalitions, so it is impossible to tell whether the attribution is stable or just reflects the chosen task distribution. The outperformance statement is repeated but carries no numbers, baselines, or significance tests, which makes the soundness hard to assess from what is shown. If the full experiments rest on the same level of detail, a referee would need to see the exact coalition search algorithm and the failure attribution procedure before accepting the results.

This is for researchers already working on multi-agent security or red-teaming pipelines. Someone building evaluation suites for deployed agent systems could pick up the benchmark and the Shapley framing even if they end up changing the attack generation. It is not yet a finished method, but the direction is coherent enough that a serious editor should send it out for review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces MAStrike, a closed-loop collusive red-teaming framework for hierarchical multi-agent systems. It proposes the first agent-level Shapley value analysis to quantify each agent's marginal contribution to system robustness under task-specific distributions, uses this to identify vulnerable coalitions, generates role-aware adversarial manipulations, and refines attacks via causal diagnosis of failures. The work also constructs a new MAS red-teaming benchmark spanning finance, software engineering, and CRM domains with varied topologies, and reports that MAStrike substantially outperforms heuristic baselines across MAS built on multiple frontier models while revealing non-trivial Shapley distributions and interaction structures.

Significance. If the experimental claims and Shapley attribution hold under scrutiny, the work would be significant for MAS security research by shifting from heuristic single-agent red-teaming to principled coalition discovery and attribution. The new benchmark and closed-loop refinement process could enable more systematic evaluation of distributed safety properties in deployed systems.

major comments (2)

Abstract: The claim of 'extensive experiments' showing that MAStrike 'substantially outperforms heuristic baselines' is presented without any metrics, baselines, statistical significance tests, dataset sizes, or data-handling details, preventing assessment of whether the central outperformance result is supported.
Abstract: The weakest assumption—that Shapley value analysis on task-specific distributions accurately identifies agents responsible for system safety and reliably guides collusive attack discovery—is stated without the value-function definition, coalition enumeration method, or any validation that the attribution step improves attack success over random or heuristic selection.

minor comments (1)

Abstract: Typo 'GGuided' should be 'Guided'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential significance. We address the two major comments on the abstract below.

read point-by-point responses

Referee: Abstract: The claim of 'extensive experiments' showing that MAStrike 'substantially outperforms heuristic baselines' is presented without any metrics, baselines, statistical significance tests, dataset sizes, or data-handling details, preventing assessment of whether the central outperformance result is supported.

Authors: We agree that the abstract, constrained by length, does not include these specifics. The full manuscript provides them in Sections 4 (experimental setup, baselines, metrics, statistical tests) and 5 (results with dataset details and data handling). To address the concern directly, we will revise the abstract to reference the benchmark scale, domains, and that results include statistical validation. revision: yes
Referee: Abstract: The weakest assumption—that Shapley value analysis on task-specific distributions accurately identifies agents responsible for system safety and reliably guides collusive attack discovery—is stated without the value-function definition, coalition enumeration method, or any validation that the attribution step improves attack success over random or heuristic selection.

Authors: The value function, task-specific distributions, and coalition enumeration (exact for small sets, sampled otherwise) are formally defined in Section 3. Validation via ablations showing improvement over random/heuristic selection appears in Section 5. We will revise the abstract to briefly indicate the empirical validation of the Shapley-guided component. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper applies the standard Shapley value from cooperative game theory as an attribution tool to MAS robustness under task distributions, without re-deriving or redefining the value function in terms of its own outputs. Benchmark construction, coalition identification, and empirical outperformance versus heuristics are presented as experimental results rather than closed-form predictions or self-referential derivations. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description that would reduce the central claims to inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, methods sections, or implementation details from which to extract free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5811 in / 1022 out tokens · 16782 ms · 2026-06-27T06:39:10.128889+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 4 canonical work pages · 1 internal anchor

[1]

Syntaxshap: Syntax-aware explainability method for text generation

Kenza Amara, Rita Sevastjanova, and Mennatallah El-Assady. Syntaxshap: Syntax-aware explainability method for text generation. InACL, 2024

2024
[2]

Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate

Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Yang Wang. Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate. InEMNLP, 2024

2024
[3]

Agentharm: A benchmark for measuring harmfulness of llm agents, 2024

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents, 2024. InICLR, 2025

2024
[4]

Polynomial calculation of the shapley value based on sampling.Computers & Operations Research, 2009

Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling.Computers & Operations Research, 2009

2009
[5]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. InNeurIPS, 2024

2024
[6]

Lira: A multi-agent framework for reliable and readable literature review generation

Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Seyed Amin Tabatabaei, Maarten de Rijke, and Xinyi Chen. Lira: A multi-agent framework for reliable and readable literature review generation. InAAAI, 2026. 10

2026
[7]

An axiomatic approach to the concept of interaction among players in cooperative games.International Journal of Game Theory, 1999

Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games.International Journal of Game Theory, 1999

1999
[8]

Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. InICML, 2024

2024
[9]

Large language model based multi-agents: A survey of progress and challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. InIJCAI, 2024

2024
[10]

Multi- agent risks from advanced ai.arXiv preprint arXiv:2502.14143, 2025

Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenˇciak, et al. Multi- agent risks from advanced ai.arXiv preprint arXiv:2502.14143, 2025

work page arXiv 2025
[11]

Red-teaming llm multi-agent systems via communication attacks

Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming llm multi-agent systems via communication attacks. InACL, 2025

2025
[12]

MetaGPT: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. InICLR, 2024

2024
[13]

On the resilience of llm-based multi-agent collaboration with faulty agents

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of llm-based multi-agent collaboration with faulty agents. InICML, 2024

2024
[14]

Towards efficient data valuation based on the shapley value.Proceedings of AISTATS 2019, 2019

Ruoxi Jia, David Dao, Boxin Wang, Frances A Hubis, Nezihe M Gürel, Nick Hynes, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value.Proceedings of AISTATS 2019, 2019

2019
[15]

Fashapley: Fast and approximated shapley based model pruning towards certifiably robust dnns.SaTML, 2022

Mintong Kang, Linyi Li, and Bo Li. Fashapley: Fast and approximated shapley based model pruning towards certifiably robust dnns.SaTML, 2022

2022
[16]

Tamas: Benchmarking adversarial risks in multi-agent llm systems.arXiv preprint arXiv:2511.05269, 2025

Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, and Tanuja Ganu. Tamas: Benchmarking adversarial risks in multi-agent llm systems.arXiv preprint arXiv:2511.05269, 2025

work page arXiv 2025
[17]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017

2017
[19]

Communicative agents for software development

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. InACL, 2024

2024
[20]

Identifying the risks of lm agents with an lm-emulated sandbox

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox. InICLR, 2024

2024
[21]

Rollout-based shapley values for explainable cooperative multi-agent reinforcement learning

Franco Ruggeri, William Emanuelsson, Ahmad Terra, Rafia Inam, and Karl H Johansson. Rollout-based shapley values for explainable cooperative multi-agent reinforcement learning. InIEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), 2024

2024
[22]

A value for n-person games

Lloyd S Shapley et al. A value for n-person games. 1953

1953
[23]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InACL, 2024. 11

2024
[24]

Groupguard: A framework for modeling and defending collusive attacks in multi-agent systems.arXiv preprint arXiv:2603.13940, 2026

Yiling Tao, Xinran Zheng, Shuo Yang, Meiling Tao, and Xingjun Wang. Groupguard: A framework for modeling and defending collusive attacks in multi-agent systems.arXiv preprint arXiv:2603.13940, 2026

work page arXiv 2026
[25]

Shapley q-value: A local reward approach to solve global reward games

Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: A local reward approach to solve global reward games. InAAAI, 2020

2020
[26]

Tokenshapley: Token level context attribution with shapley value

Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T Wang, and Jian Du. Tokenshapley: Token level context attribution with shapley value. InACL, 2025

2025
[27]

Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety

Zaibin Zhang, Yongting Zhang, Lijun Li, Jing Shao, Hongzhi Gao, Yu Qiao, Lijun Wang, Huchuan Lu, and Feng Zhao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. InACL, 2024. 12 IV DT TR AH AR PC IV DT TR AH AR PC Finance 0.61 -0.06 -0.00 -0.02 -0.05 -0.01 0.02 -0.08 -0.09 0.02 0.09 -0...

2024

[1] [1]

Syntaxshap: Syntax-aware explainability method for text generation

Kenza Amara, Rita Sevastjanova, and Mennatallah El-Assady. Syntaxshap: Syntax-aware explainability method for text generation. InACL, 2024

2024

[2] [2]

Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate

Alfonso Amayuelas, Xianjun Yang, Antonis Antoniades, Wenyue Hua, Liangming Pan, and William Yang Wang. Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate. InEMNLP, 2024

2024

[3] [3]

Agentharm: A benchmark for measuring harmfulness of llm agents, 2024

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents, 2024. InICLR, 2025

2024

[4] [4]

Polynomial calculation of the shapley value based on sampling.Computers & Operations Research, 2009

Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling.Computers & Operations Research, 2009

2009

[5] [5]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. InNeurIPS, 2024

2024

[6] [6]

Lira: A multi-agent framework for reliable and readable literature review generation

Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Seyed Amin Tabatabaei, Maarten de Rijke, and Xinyi Chen. Lira: A multi-agent framework for reliable and readable literature review generation. InAAAI, 2026. 10

2026

[7] [7]

An axiomatic approach to the concept of interaction among players in cooperative games.International Journal of Game Theory, 1999

Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games.International Journal of Game Theory, 1999

1999

[8] [8]

Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. InICML, 2024

2024

[9] [9]

Large language model based multi-agents: A survey of progress and challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. InIJCAI, 2024

2024

[10] [10]

Multi- agent risks from advanced ai.arXiv preprint arXiv:2502.14143, 2025

Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenˇciak, et al. Multi- agent risks from advanced ai.arXiv preprint arXiv:2502.14143, 2025

work page arXiv 2025

[11] [11]

Red-teaming llm multi-agent systems via communication attacks

Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. Red-teaming llm multi-agent systems via communication attacks. InACL, 2025

2025

[12] [12]

MetaGPT: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. InICLR, 2024

2024

[13] [13]

On the resilience of llm-based multi-agent collaboration with faulty agents

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of llm-based multi-agent collaboration with faulty agents. InICML, 2024

2024

[14] [14]

Towards efficient data valuation based on the shapley value.Proceedings of AISTATS 2019, 2019

Ruoxi Jia, David Dao, Boxin Wang, Frances A Hubis, Nezihe M Gürel, Nick Hynes, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value.Proceedings of AISTATS 2019, 2019

2019

[15] [15]

Fashapley: Fast and approximated shapley based model pruning towards certifiably robust dnns.SaTML, 2022

Mintong Kang, Linyi Li, and Bo Li. Fashapley: Fast and approximated shapley based model pruning towards certifiably robust dnns.SaTML, 2022

2022

[16] [16]

Tamas: Benchmarking adversarial risks in multi-agent llm systems.arXiv preprint arXiv:2511.05269, 2025

Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, and Tanuja Ganu. Tamas: Benchmarking adversarial risks in multi-agent llm systems.arXiv preprint arXiv:2511.05269, 2025

work page arXiv 2025

[17] [17]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017

2017

[19] [19]

Communicative agents for software development

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. InACL, 2024

2024

[20] [20]

Identifying the risks of lm agents with an lm-emulated sandbox

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox. InICLR, 2024

2024

[21] [21]

Rollout-based shapley values for explainable cooperative multi-agent reinforcement learning

Franco Ruggeri, William Emanuelsson, Ahmad Terra, Rafia Inam, and Karl H Johansson. Rollout-based shapley values for explainable cooperative multi-agent reinforcement learning. InIEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), 2024

2024

[22] [22]

A value for n-person games

Lloyd S Shapley et al. A value for n-person games. 1953

1953

[23] [23]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InACL, 2024. 11

2024

[24] [24]

Groupguard: A framework for modeling and defending collusive attacks in multi-agent systems.arXiv preprint arXiv:2603.13940, 2026

Yiling Tao, Xinran Zheng, Shuo Yang, Meiling Tao, and Xingjun Wang. Groupguard: A framework for modeling and defending collusive attacks in multi-agent systems.arXiv preprint arXiv:2603.13940, 2026

work page arXiv 2026

[25] [25]

Shapley q-value: A local reward approach to solve global reward games

Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: A local reward approach to solve global reward games. InAAAI, 2020

2020

[26] [26]

Tokenshapley: Token level context attribution with shapley value

Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T Wang, and Jian Du. Tokenshapley: Token level context attribution with shapley value. InACL, 2025

2025

[27] [27]

Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety

Zaibin Zhang, Yongting Zhang, Lijun Li, Jing Shao, Hongzhi Gao, Yu Qiao, Lijun Wang, Huchuan Lu, and Feng Zhao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. InACL, 2024. 12 IV DT TR AH AR PC IV DT TR AH AR PC Finance 0.61 -0.06 -0.00 -0.02 -0.05 -0.01 0.02 -0.08 -0.09 0.02 0.09 -0...

2024