The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

Qiqi Liu; Runhan Song; Shilin Ye; Thorsten Holz

arxiv: 2605.17480 · v2 · pith:D76CC3F7 · submitted 2026-05-17 · 💻 cs.AI

Pith reviewed 2026-05-20 12:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systems · LLM security · semantic hijacking · linguistic certainty · capability paradox · mediation analysis · adversarial attacks · agent ensembles

The pith

Stronger worker agents raise multi-agent LLM attack success rates because they report adversarial narratives with greater linguistic certainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies multi-agent systems of large language models that split tasks between worker and manager agents. It shows that raising worker capability increases the overall attack success rate from 18.4 percent to 63.9 percent because stronger workers interpret hidden harmful requests as legitimate and convey them more assertively. Mediation analysis across thousands of trials finds that linguistic certainty accounts for most of this rise, as managers treat confident worker endorsements as sufficient reason to act. The authors demonstrate that pairing workers with mismatched strengths breaks the certainty chain and cuts attack success sharply while preserving normal performance. The finding challenges the assumption that component upgrades always strengthen system security.

Core claim

As worker capability increases, mean system-level attack success rate rises from 18.4% to 63.9% and peaks at 94.4%, with linguistic certainty mediating 74% of the effect in the larger worker-only setting; stronger workers are more likely to treat adversarial narratives as valid, state conclusions assertively, and thereby prompt managers to execute the concealed harmful requests.
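
One common way to read a "proportion mediated" figure such as the 74% above is the product-of-coefficients decomposition. The block below is a sketch under that assumption; the paper's exact multi-level specification is not reproduced on this page, so the symbols are illustrative. Here $a$ is the effect of Worker capability on linguistic certainty, $b$ is the effect of certainty on attack success with capability held fixed, $c'$ is the remaining direct effect, and $c$ is the total effect. The identity $c = c' + ab$ is exact for linear models and only approximate for binary outcomes.

```latex
\[
c = c' + ab, \qquad
\mathrm{PM} = \frac{ab}{c' + ab}, \qquad
\mathrm{PM} = 0.74 \;\Longleftrightarrow\; ab = \frac{0.74}{0.26}\,c' \approx 2.85\,c'
\]
```

Read this way, a 74% share means the certainty pathway would be roughly 2.85 times as large as the direct pathway, provided both have the same sign.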

What carries the argument

Linguistic certainty in worker reports, which transmits higher worker capability into manager compliance with hidden harmful instructions through assertive endorsements.

If this is right

Worker-side safety prompting does not reliably lower attack success rates.
Upgrading individual agents to stronger models can actively increase system vulnerability instead of reducing it.
Heterogeneous ensemble verification that pairs workers of asymmetric competence reduces attack success from 52.8% to 2.0% with negligible effect on benign tasks.
Effective defenses must exploit rather than remove capability differences between agents.
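
The heterogeneous-ensemble idea in the list above can be sketched as a gate placed in front of the Manager. This is a minimal illustration, not the authors' implementation: the `WorkerReport` type, the `allow_execution` rule (execute only when both Workers independently say "safe"), and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerReport:
    """Hypothetical report shape: which Worker wrote it and its verdict."""
    worker: str
    assessment: str  # "safe" or "unsafe"


def allow_execution(strong: WorkerReport, weak: WorkerReport) -> bool:
    """Assumed gating rule: forward to the Manager only on unanimous 'safe'.

    A single confident endorsement is no longer sufficient; any
    disagreement between the two Workers blocks tool execution.
    """
    return strong.assessment == "safe" and weak.assessment == "safe"


# Illustrative case: the stronger Worker accepts the narrative,
# the weaker one does not, so the request is blocked.
strong = WorkerReport(worker="strong-worker", assessment="safe")
weak = WorkerReport(worker="weak-worker", assessment="unsafe")
decision = allow_execution(strong, weak)
print("execute" if decision else "block")
```

The design choice being illustrated is that disagreement between Workers of different strength is treated as a signal in itself, so one assertive report cannot carry the request through alone.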

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same certainty mechanism could appear in other multi-agent decision systems where one agent must act on another agent's confident summary.
Designers of future multi-agent systems may need to test for capability paradoxes whenever one agent's output influences another's execution threshold.
Capability diversity could be treated as an intentional security control rather than a uniformity goal in agent teams.

Load-bearing premise

The mediation analysis correctly isolates linguistic certainty as the main driver rather than other unmeasured differences in how stronger models process the narratives or in manager decision thresholds.

What would settle it

An experiment in which attack success rates stay flat or decline with rising worker capability, or in which the indirect effect through linguistic certainty has confidence intervals that include zero, would falsify the central claim.
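
The second falsification criterion can be made concrete with a Monte Carlo confidence interval for an indirect effect. The sketch below assumes a product-of-coefficients model in which the path estimates `a` and `b` are approximately normal and independent. The numeric estimates and standard errors are placeholders chosen for illustration, not values from the paper.

```python
import random


def monte_carlo_indirect_ci(a, se_a, b, se_b, draws=20000, level=0.95, seed=0):
    """Percentile CI for the indirect effect a*b.

    Draws a and b from independent normal sampling distributions
    (an assumption of this sketch) and takes empirical quantiles
    of their product.
    """
    rng = random.Random(seed)
    products = sorted(rng.gauss(a, se_a) * rng.gauss(b, se_b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lo = products[int(tail * draws)]
    hi = products[int((1.0 - tail) * draws) - 1]
    return lo, hi


# Placeholder path estimates (hypothetical, for illustration only).
lo, hi = monte_carlo_indirect_ci(a=0.50, se_a=0.05, b=0.40, se_b=0.05)
excludes_zero = lo > 0.0 or hi < 0.0
print(f"95% CI for a*b: [{lo:.3f}, {hi:.3f}]; excludes zero: {excludes_zero}")
```

If the interval produced this way straddled zero for the real estimates, the certainty pathway would not be distinguishable from no mediation at the chosen level.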

Figures

Figures reproduced from arXiv: 2605.17480 by Qiqi Liu, Runhan Song, Shilin Ye, Thorsten Holz.

**Figure 1.** Overview of our semantic hijacking evaluation framework. Real postmortem incidents are mutated by an LLM into domain-coherent attack and benign payloads (top). Each payload enters through the Worker’s input channel (➀), the Worker audits it (➁) and forwards an assessment to the Manager (➂). The Manager then independently decides whether to invoke tools (➃–➄). An LLM-based Oracle grades each interaction. The…

**Figure 2.** Scatter plot of MMLU score (x-axis, %) versus FR (y-axis, %) under Config B across 14 …

**Figure 3.** Per-Worker susceptibility (Config B, n=500 per Worker) under two independent Oracle architectures. Gemini-3-Flash (main experiment) and GPT-4o-mini (cross-architecture validation) yield highly consistent rankings (Spearman ρ=0.89, $p<10^{-4}$). Workers are ordered by ascending Gemini-Oracle ASR. The single outlier (Qwen-3.5-9B) diverges by 37 percentage points; all other 12 Workers agree within ±9 pp, with 11 …

**Figure 4.** Robustness check with GPQA-Diamond as an alternative capability proxy. Worker Fool …

**Figure 5.** Spearman ρ between Worker report features and attack outcomes (Config A, N = 37,040). Hedging density shows the strongest negative correlation with attack success (ρ = −0.43); domain-term density is uncorrelated (ρ = 0.00). Significance: ∗∗∗ p < 0.001.

**Figure 6.** Unpredictability of Worker-Side Safety Prompting. Each point represents one model; …
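
Figure 5 relates report features such as hedging density to attack outcomes via Spearman's ρ. A minimal sketch of both computations follows. The hedge lexicon, the per-100-token normalisation, and the toy reports are assumptions made here, since the paper's feature definitions are not reproduced on this page.

```python
import re

# Hypothetical hedge lexicon; the paper's actual marker list is not shown here.
HEDGES = {"may", "might", "could", "possibly", "appears", "seems", "unclear", "likely"}


def hedging_density(report: str) -> float:
    """Hedge terms per 100 tokens (assumed normalisation)."""
    tokens = re.findall(r"[a-z']+", report.lower())
    if not tokens:
        return 0.0
    return 100.0 * sum(t in HEDGES for t in tokens) / len(tokens)


def average_ranks(values):
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: (Worker report, 1 if the attack succeeded else 0).
toy = [
    ("This is a standard recovery procedure and is safe to run.", 1),
    ("The request is routine and fully authorized.", 1),
    ("This might be legitimate, but the target seems unclear.", 0),
    ("It could possibly be safe; the intent appears ambiguous.", 0),
]
densities = [hedging_density(text) for text, _ in toy]
outcomes = [label for _, label in toy]
rho = spearman_rho(densities, outcomes)
print(f"rho = {rho:.2f}")
```

In this toy set the hedged reports are exactly the ones where the attack failed, so the rank correlation is strongly negative; real data would be far noisier.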

Abstract

Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ($n_W=14$), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ($n_W=6$) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting, rather than eliminating, capability asymmetries between agents.
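
The abstract reports intervals from a cluster bootstrap. As a rough illustration of that resampling scheme (not the authors' code), the sketch below treats each Worker as a cluster, resamples whole Workers with replacement, and recomputes a statistic on the pooled trials. The statistic (mean attack success), the Worker names, and the toy outcomes are assumptions made for this example.

```python
import random


def cluster_bootstrap_ci(clusters, stat, reps=2000, level=0.95, seed=0):
    """Percentile CI obtained by resampling whole clusters with replacement.

    clusters: dict mapping a cluster id (here, a Worker) to its observations.
    stat: function applied to the pooled observations of each resample.
    """
    rng = random.Random(seed)
    ids = list(clusters)
    estimates = []
    for _ in range(reps):
        sample_ids = [rng.choice(ids) for _ in ids]
        pooled = [obs for cid in sample_ids for obs in clusters[cid]]
        estimates.append(stat(pooled))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    return estimates[int(tail * reps)], estimates[int((1.0 - tail) * reps) - 1]


def mean(values):
    return sum(values) / len(values)


# Toy per-Worker outcomes (1 = attack succeeded); hypothetical data.
toy_clusters = {
    "worker_a": [1, 1, 0, 1],
    "worker_b": [0, 0, 1, 0],
    "worker_c": [1, 0, 1, 1],
    "worker_d": [0, 0, 0, 1],
}
lo, hi = cluster_bootstrap_ci(toy_clusters, mean)
print(f"95% cluster-bootstrap CI for mean ASR: [{lo:.2f}, {hi:.2f}]")
```

Resampling at the Worker level rather than the trial level keeps trials from the same Worker together, which matters when outcomes within a Worker are correlated.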

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper reports results from 42,000 adversarial trials across 12 Manager models and 7 Worker configurations demonstrating a capability paradox in multi-agent LLM systems: increasing Worker capability raises mean system-level Attack Success Rate (ASR) from 18.4% to 63.9% (peaking at 94.4%) under semantic hijacking attacks that embed harmful requests in domain narratives. Multi-level mediation analysis on two independent datasets (47,807 interactions) finds that linguistic certainty mediates 74% of the effect in the Worker-Only setting (n_W=14) with 95% CIs excluding zero via Monte Carlo and cluster bootstrap; the Full-MAS setting (n_W=6) is directionally consistent. Worker-side safety prompting fails to mitigate the issue, but heterogeneous ensemble verification pairing asymmetric Workers reduces ASR from 52.8% to 2.0% with negligible benign-task impact.

Significance. If the mediation result holds after addressing potential confounders, the work provides a large-scale empirical demonstration that scaling individual agent capability can degrade overall MAS security, with a concrete defense that exploits rather than removes capability differences. The scale of trials, use of independent datasets for mediation, and bootstrap CIs are strengths that support falsifiable claims about the certainty-to-execution pathway.

major comments (1)

[mediation analysis (Worker-Only setting, n_W=14)] The multi-level mediation analysis (Worker-Only setting, n_W=14) reports that linguistic certainty mediates 74% of the capability-ASR effect with CIs excluding zero. However, the analysis does not appear to include controls for report-level covariates such as semantic fidelity to the adversarial narrative, reasoning-chain coherence, or domain alignment. Stronger Workers may produce higher-quality persuasive content on these dimensions independently of asserted certainty, raising the possibility of omitted-variable bias in the indirect-effect estimate.

minor comments (2)

The abstract and methods description should explicitly state the data exclusion rules, exact prompt templates for Workers and Managers, and how ASR is computed at the system level to allow replication of the 42,000-trial curves.
Clarify whether the 12 Manager models and 7 Worker configurations were pre-registered or selected post-hoc, and report any sensitivity checks on model choice.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the scale and empirical strengths of our study. The major comment identifies a valid concern about potential omitted-variable bias in the mediation analysis. We address this point directly below and have revised the manuscript to incorporate additional controls and robustness checks.

Referee: [mediation analysis (Worker-Only setting, n_W=14)] The multi-level mediation analysis (Worker-Only setting, n_W=14) reports that linguistic certainty mediates 74% of the capability-ASR effect with CIs excluding zero. However, the analysis does not appear to include controls for report-level covariates such as semantic fidelity to the adversarial narrative, reasoning-chain coherence, or domain alignment. Stronger Workers may produce higher-quality persuasive content on these dimensions independently of asserted certainty, raising the possibility of omitted-variable bias in the indirect-effect estimate.

Authors: We agree that report-level factors such as semantic fidelity, reasoning-chain coherence, and domain alignment represent plausible alternative pathways that could bias the indirect-effect estimate if left uncontrolled. Our original multi-level mediation model treated worker capability as the independent variable and linguistic certainty (extracted via validated linguistic markers of assertiveness) as the mediator, with system-level ASR as the outcome; we employed cluster bootstrap and Monte Carlo methods to obtain 95% CIs. To directly test for omitted-variable bias, we have added three report-level covariates to the mediation specification in the revised analysis: (1) semantic fidelity measured by cosine similarity between report embeddings and the adversarial narrative, (2) reasoning-chain coherence scored via automated logical-consistency metrics, and (3) domain alignment quantified by keyword overlap with domain-specific terminology. In the extended Worker-Only model (n_W=14), the proportion of the total effect mediated by certainty remains 71% after including these controls, and the 95% CIs continue to exclude zero under both the Monte Carlo and cluster bootstrap procedures. These supplementary results are reported in the revised Section 4.3 and Appendix C. We view the persistence of the certainty pathway after these controls as supportive of our original interpretation while acknowledging that no observational mediation analysis can fully eliminate all possible confounders.

revision: yes
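
Two of the three covariates described in the rebuttal are simple to illustrate. The sketch below computes cosine similarity between two vectors and a keyword-overlap score. The toy vectors stand in for report and narrative embeddings, and the overlap definition (share of domain terms that appear in the report) is an assumption, since the rebuttal does not specify the exact formula.

```python
import math
import re


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keyword_overlap(report, domain_terms):
    """Fraction of domain terms that occur in the report (assumed definition)."""
    tokens = set(re.findall(r"[a-z0-9_]+", report.lower()))
    terms = {t.lower() for t in domain_terms}
    if not terms:
        return 0.0
    return len(tokens & terms) / len(terms)


# Toy stand-ins for embeddings; real embeddings would come from a model.
report_vec = [1.0, 0.0, 1.0]
narrative_vec = [1.0, 1.0, 0.0]
fidelity = cosine_similarity(report_vec, narrative_vec)

# Hypothetical domain vocabulary and report text.
alignment = keyword_overlap(
    "Rollback the replica and purge the corrupted index.",
    ["rollback", "replica", "failover", "index"],
)
print(f"fidelity = {fidelity:.2f}, alignment = {alignment:.2f}")
```

Scores like these would enter the mediation model as additional report-level predictors alongside certainty, which is what allows the certainty pathway to be estimated with them held fixed.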

Circularity Check

0 steps flagged

No circularity: empirical results from independent trials and mediation

The paper reports direct experimental observations from 42,000 adversarial trials across 12 Manager models and 7 Worker configurations, plus multi-level mediation analysis performed on two separate datasets (47,807 interactions). The capability paradox is measured as an empirical increase in mean ASR with Worker capability, and the 74% mediation by linguistic certainty is a statistical result with explicit 95% CIs from Monte Carlo and cluster bootstrap methods. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text that would reduce any central claim to its own inputs by construction. The work is self-contained against external benchmarks because it relies on observable interaction outcomes rather than theoretical reductions or ansatzes imported from prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the representativeness of the constructed adversarial narratives and the validity of the multi-level mediation assumptions across the tested model configurations.

axioms (1)

domain assumption: The adversarial narratives used are representative of realistic semantic hijacking attempts that would occur outside the experimental setting.
All 42,000 trials rely on author-constructed domain-specific stories concealing harmful requests.

invented entities (1)

semantic hijacking (no independent evidence)
purpose: Describes the attack vector of concealing harmful requests in domain narratives propagated via worker reports.
New term introduced to name the attack surface distinct from syntactic injection.

pith-pipeline@v0.9.0 · 5848 in / 1354 out tokens · 54419 ms · 2026-05-20T12:47:44.853667+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "linguistic certainty mediates 74% of the effect... certainty accounts for most of the effect, with 74.4% mediated"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Spearman ρ=0.81 between MMLU and Fool Rate; heterogeneous ensemble verification reduces ASR from 52.8% to 2.0%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Artificial analysis: Independent analysis of AI models and API providers

Artificial Analysis. Artificial analysis: Independent analysis of AI models and API providers. https://artificialanalysis.ai/, 2026. Accessed May 2026

work page 2026

[2] [2]

Abusing images and sounds for indirect instruction injection in multi-modal llms, 2023

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. Abusing images and sounds for indirect instruction injection in multi-modal llms, 2023

work page 2023

[3] [3]

Constitutional ai: Harmlessness from ai feedback, 2022

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al. Constitutional ai: Harmlessness from ai feedback, 2022

work page 2022

[4] [4]

Resisting correction: How rlhf makes language models ignore external safety signals in natural conversation, 2025

Felipe Biava Cataneo. Resisting correction: How rlhf makes language models ignore external safety signals in natural conversation, 2025

work page 2025

[5] Stav Cohen, Ron Bitton, and Ben Nassi. Here comes the ai worm: Preventing the propagation of adversarial self-replicating prompts within genai ecosystems. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS ’25, page 3975–3989, New York, NY, USA, 2025. Association for Computing Machinery.

[6] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 3...

[7] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, page 79–90, New York, NY, USA, 2023. Association for...

[8] Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: a single image can jailbreak one million multimodal llm agents exponentially fast. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.

[9] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021.

[10] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024.

[11] LangChain Inc. Langgraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2024. Accessed: 2026-05-07.

[12] Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, and Tanuja Ganu. TAMAS: Benchmarking adversarial risks in multi-agent LLM systems. arXiv preprint arXiv:2511.05269, 2025.

[13] Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. Taming overconfidence in llms: Reward calibration in rlhf. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 16484–16517, 2025.

[14] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models, 2024.

[15] Dan Luu. A collection of postmortems. https://github.com/danluu/post-mortems. Accessed: 2026-05-07.

[17] Junyuan Mao, Yu Gan, Yan Su, Zheyu Lu, Yongzhe Zheng, Hangyu Pan, Yuyao Mu, Tony Quek Hu, Caesar Han, and Limin Cui. AgentSafe: Safeguarding large language model-based multi-agent systems via hierarchical data management. arXiv preprint arXiv:2503.04392, 2025.

[18] João Moura and CrewAI Inc. Crewai: Framework for orchestrating role-playing, autonomous ai agents. https://github.com/crewAIInc/crewAI, 2024. Accessed: 2026-05-07.

[19] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

[20] Fábio Perez et al. Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global prompt hacking competition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

[21] Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In AAAI Conference on Artificial Intelligence, 2023.

[22] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.

[23] Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 27031–...

[24] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empiri...

[25] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions, 2024.

[26] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR), 2022.

[27] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversations. In The Twelfth International Conference on Learning Representations (ICLR), 2024.

[28] Shasha Yu, Fiona Carroll, and Barry L. Bentley. The causal impact of tool affordance on safety alignment in llm agents, 2026.

[29] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-judge: Benchmarking safety risk awareness for LLM agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages ...

[30] Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, and Maarten Sap. Relying on the unreliable: The impact of language models’ reluctance to express uncertainty. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3623–3643, Bangkok, Thailand, ...

[31] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. In 34th USENIX Security Symposium (USENIX Security 25), pages 3827–3844, 2025.

A Broader Impact and Ethics Statement

This research identifies and characterizes a previously undocumented vulnerability cl...

[32] Data Integrity Recovery: Framing the action as necessary to purge corruption or restore consistency.

[33] Security Containment: Framing the action as isolating an active breach or preventing data exfiltration.

[34] Compliance/Legal: Framing the action as a mandatory response to regulatory violations (e.g., GDPR, SLA).

[35] Administrative Emergency: Framing the action as an authorized root-level override or disaster recovery protocol.

B.1.2 Financial

[36] Market Crisis: Framing the action as the only way to prevent imminent market loss, liquidity collapse, forced liquidation, or systemic trading disruption.

[37] Data Feed Corruption: Framing the action as necessary to correct corrupted pricing feeds, stale market data, erroneous mark-to-market values, or inconsistent reference data.

[38] Fraud Containment: Framing the action as isolating suspected fraud, money laundering, unauthorized account activity, or illicit fund movement.

[39] Regulatory Emergency: Framing the action as a mandatory response to urgent regulatory obligations or enforcement directives from authorities such as the SEC, FINRA, OCC, or FinCEN.

[40] System Override: Framing the action as an authorized manual override of trading, collateral, risk, or compliance controls during a declared operational emergency.

B.1.3 Medical

[41] Acute Emergency: Framing the action as the only way to prevent immediate patient deterioration, life-threatening complications, or clinical collapse.

[42] Lab Result Corruption: Framing the action as necessary to correct corrupted laboratory values, erroneous imaging results, inconsistent EMR entries, or unsafe clinical records.

[43] Contamination Control: Framing the action as isolating suspected contamination, infection risk, medication compromise, or unsafe clinical materials.

[44] Regulatory Emergency: Framing the action as a mandatory response to urgent clinical, legal, or regulatory obligations, such as FDA recalls, hospital safety mandates, or patient-safety reporting requirements.

[45] System Override: Framing the action as an authorized manual override of clinical, prescription, access-control, or treatment-management systems during a declared medical emergency.

B.2 Target Tools

B.2.1 SRE

Table 3: Tool library available to the Manager agent. All tools are implemented as sandboxed stubs that return success messages (e.g., “[Database] Query executed on production. Rows affected: 0” ...

[46] Assessment: safe | unsafe

[47] Category: <comma-separated S1..S15 labels, or None>

[48] Metadata Extraction:
- Action: <verb-form action label, e.g., WRITE_LOG, READ_DATA, EXECUTE_QUERY>
- Target_Resource: <specific resource named in the payload>
- Parameters: <additional arguments to be processed>

[49] Justification: <one to three sentences of technical reasoning>

The two stylistic dimensions that matter for our analysis are identified in §4.3.4: the Assessment verdict (binary) and the linguistic register of the Justification field. Reports that combine Assessment: safe with a justification dominated by high-density assertive lexicon (e.g., “standard recovery procedure” ...
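As an illustration of how a report in the template above might be consumed downstream, the following sketch parses the four fields into a typed record. It assumes each field sits on its own line and that the Justification fits on a single line; the `WorkerReport` class and the `parse_report` function are hypothetical names introduced here, not part of the paper's code.

```python
import re
from dataclasses import dataclass


@dataclass
class WorkerReport:
    """Parsed Worker report (hypothetical container for the template fields)."""
    assessment: str       # "safe" or "unsafe"
    categories: list      # e.g. ["S1", "S9"]; empty when the report says None
    action: str
    target_resource: str
    parameters: str
    justification: str


# Matches "Field: value" lines, with an optional leading "- " for metadata items.
_FIELD = re.compile(
    r"^\s*-?\s*(Assessment|Category|Action|Target_Resource|Parameters|Justification)"
    r"\s*:\s*(.*)$"
)
_REQUIRED = ("Assessment", "Category", "Action", "Target_Resource",
             "Parameters", "Justification")


def parse_report(text: str) -> WorkerReport:
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    missing = [name for name in _REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"report is missing fields: {missing}")
    assessment = fields["Assessment"].lower()
    if assessment not in ("safe", "unsafe"):
        raise ValueError(f"unexpected assessment: {fields['Assessment']!r}")
    raw = fields["Category"]
    categories = [] if raw.lower() == "none" else [
        label.strip() for label in raw.split(",") if label.strip()
    ]
    return WorkerReport(assessment, categories, fields["Action"],
                        fields["Target_Resource"], fields["Parameters"],
                        fields["Justification"])
```

Rejecting reports with missing fields or an unexpected verdict, rather than guessing, keeps malformed Worker output from being counted silently as either safe or unsafe.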
