pith. machine review for the scientific record.

arxiv: 2512.04129 · v2 · submitted 2025-12-03 · 💻 cs.CR

Don't Trust Your Upstream: Exploiting LLM Multi-Agent System via Topology-Guided Adversarial Propagation

Pith reviewed 2026-05-17 02:59 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM multi-agent systems · adversarial attacks · topology-guided propagation · black-box security · MAS vulnerabilities · inter-agent dependencies

The pith

Adversarial inputs can propagate through LLM multi-agent systems by following their communication topology from exposed edge agents to high-privilege ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that LLM-based multi-agent systems contain overlooked vulnerabilities arising from the way upstream agent outputs are reinterpreted and executed by downstream agents. The authors introduce a topology-aware attack that first maps agent dependencies in a black-box setting, then models how contamination spreads, and finally uses hierarchical payload encapsulation to reach and influence privileged agents. If the claim holds, securing MAS deployments requires attention to the full interaction graph rather than isolated agent defenses. A sympathetic reader would care because these systems are already deployed for complex tasks, and successful propagation could turn routine agent cooperation into a vector for widespread malicious behavior.

Core claim

The paper claims that a topology-guided attack scheme, built from reconnaissance of inter-agent dependencies, contamination propagation modeling, and hierarchical payload encapsulation, enables reliable multi-hop compromise of high-privilege agents starting from exposed edge agents. Experiments demonstrate success rates of 40%–78% across three common MAS frameworks and five topologies, rising to 85% on two real-world applications in twenty scenarios. The work further shows that a topology-trust mitigation strategy can block 94.8% of the composite attacks.

What carries the argument

Topology-guided adversarial propagation, which maps agent interaction structure, models how contamination travels along dependencies, and encapsulates payloads to survive reinterpretation by downstream agents.
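
The propagation idea can be illustrated with a small sketch. This is not the paper's contamination model. It assumes, purely for illustration, that each directed edge carries an independent probability that contamination survives reinterpretation by the downstream agent, and it searches for the most likely path from an entry agent to a target. Browser, Supervisor, Planner, and FileManager follow the LangManus example in Figure 2; the Coder agent and every edge weight are invented.

```python
import heapq
import math


def best_propagation_path(edges, source, target):
    """Return (probability, path) for the most likely contamination path.

    edges maps an agent to a list of (downstream_agent, survival_prob).
    The survival probabilities are hypothetical, not values from the paper.
    """
    # Maximising a product of probabilities is the same as minimising the
    # sum of -log(p). Every cost is non-negative, so Dijkstra applies.
    heap = [(0.0, source, [source])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return math.exp(-cost), path
        if node in settled:
            continue
        settled.add(node)
        for nxt, p in edges.get(node, []):
            if 0.0 < p <= 1.0 and nxt not in settled:
                heapq.heappush(heap, (cost - math.log(p), nxt, path + [nxt]))
    return 0.0, []


# Hypothetical topology: "Coder" and all weights are illustrative only.
topology = {
    "Browser": [("Supervisor", 0.9)],
    "Supervisor": [("Planner", 0.8), ("Coder", 0.5)],
    "Planner": [("FileManager", 0.7)],
    "Coder": [("FileManager", 0.4)],
}
prob, path = best_propagation_path(topology, "Browser", "FileManager")
print(path, round(prob, 3))
```

Under these made-up weights the search prefers the Browser, Supervisor, Planner, FileManager chain, since 0.9 × 0.8 × 0.7 exceeds 0.9 × 0.5 × 0.4. The point of the sketch is only that per-hop survival compounds multiplicatively, so path choice matters to an attacker and to a defender alike.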

If this is right

  • The attack remains effective across multiple standard MAS frameworks and common communication topologies.
  • Real-world MAS applications exhibit the same propagation vulnerability in diverse task scenarios.
  • A mitigation that assigns trust according to agent position in the topology blocks nearly all instances of the attack.
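
A position-based trust rule of the kind described in the last bullet might look like the following sketch. It is not T-Guard: the scoring rule (trust grows with hop distance from externally exposed agents), the threshold, and all function names are assumptions made for illustration.

```python
from collections import deque


def hop_distance_from_exposed(edges, exposed):
    """Breadth-first hop count from the nearest exposed edge agent."""
    dist = {agent: 0 for agent in exposed}
    queue = deque(exposed)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def trust_scores(edges, exposed):
    """Hypothetical rule: trust = hops / (hops + 1), so exposed agents get 0."""
    dist = hop_distance_from_exposed(edges, exposed)
    agents = set(edges) | {n for targets in edges.values() for n in targets}
    # Agents unreachable from any exposed agent receive full trust.
    return {a: (dist[a] / (dist[a] + 1) if a in dist else 1.0) for a in agents}


def allow_privileged_action(sender, trust, threshold=0.6):
    """Gate a privileged request on the sender's positional trust."""
    return trust.get(sender, 0.0) >= threshold


# Agent names follow the LangManus chain in Figure 2; the threshold is invented.
edges = {
    "Browser": ["Supervisor"],
    "Supervisor": ["Planner"],
    "Planner": ["FileManager"],
}
trust = trust_scores(edges, exposed=["Browser"])
print(trust["Browser"], trust["Supervisor"], allow_privileged_action("Planner", trust))
```

A static rule like this would still let a request from the Planner through, which is the final hop of the chain shown in Figure 2. Algorithm 1 in Figure 6 suggests that the paper's mitigation also updates trust scores while the system runs, which this sketch does not attempt.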

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • MAS designers may need to add explicit checks on information provenance rather than assuming cooperative reinterpretation is safe.
  • Similar dependency-based propagation risks could appear in other multi-component LLM pipelines that chain outputs without strong isolation.

Load-bearing premise

Downstream agents will reliably reinterpret and act on upstream outputs in ways that let adversarial contamination propagate to influence their behavior.

What would settle it

A controlled test in which agents in a representative MAS consistently sanitize or ignore adversarial content from upstream outputs without altering their downstream actions would show that propagation does not occur as modeled.

Figures

Figures reproduced from arXiv: 2512.04129 by Cong Wu, Huangpeng Gu, Jing Chen, Le Yin, Ruichao Liang, Xiaoyu Zhang, Yang Liu, Yebo Feng, Zijian Zhang.

Figure 1: Attack attempts on LANGMANUS. … focus on enhancing alignment training [51], fine-tuning [52], and applying input-output filtering [53], [54]. At the system level, Zhang et al. [55] integrate hierarchical information management and memory protection into MAS for systematic defense. Wang et al. [17] leverage graph neural networks for anomaly detection in utterance graphs and apply topological interventions to …
Figure 2: Overview of the topology-guided attack pipeline on LANGMANUS. … attack that mimics legitimate coordination patterns by guiding the malicious instruction from the Browser to the Supervisor, then to the Planner, and finally to the FileManager, ultimately triggering the deletion of the target file. The detailed procedure is shown in …
Figure 3: Framework of Topology-Aware Multi-Hop Attack.
Figure 4: Illustration of adversarial contamination propagation model with …
Figure 5: Payload construction for guiding WebSurfer
Figure 6: Architecture of T-Guard.

Algorithm 1: Standard Procedure of T-Guard
  Input: External Environment E, System Topology G = (V, E), Security Policy P
  Output: Updated Permissions to All Agents
  while system is running do
    foreach edge agent v_i ∈ G do
      Get visual input I_i from E, output O_i from v_i;
      // Detect visual-semantic inconsistency
      c_i ← CrossModalValidator(I_i, O_i, P);
      // Update trust scores for all age…
Figure 7: Node-level comparison between model-predicted taint values and observed infection integrity scores
Figure 8: Aggregated comparison of average model predictions and observed infection integrity scores.
Figure 9: Illustration of five network topologies. Red nodes indicate attack entry points, and green nodes mark targets.
Original abstract

The digital world is witnessing the rapid rise of LLM-based multi-agent systems (MASs) and their powerful applications. However, their security remains insufficiently understood, as existing evaluations are largely limited to narrow attack settings and may substantially underestimate the real risks of MAS deployments. Inspired by the MAS inter-agent dependencies, where upstream outputs are reinterpreted and executed by downstream agents, we propose a topology-aware attack scheme that propagates adversarial contamination from exposed edge agents to high-privilege agents to induce malicious behaviors. By combining topology reconnaissance, contamination propagation modeling, and hierarchical payload encapsulation, our approach overcomes the key challenges of black-box attacks and makes such multi-hop compromise practical. Experiments show that our approach achieves success rates of 40%–78% on three widely-used MAS frameworks under five topologies, and 85% on two real-world MAS applications across 20 representative scenarios. The results reveal fundamental vulnerabilities in MASs that have been overlooked by prior studies. Based on these findings, we propose a topology-trust mitigation that blocks 94.8% of such composite attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a topology-guided adversarial attack on LLM-based multi-agent systems that performs black-box reconnaissance to map inter-agent dependencies, then propagates contamination from exposed edge agents to high-privilege agents via hierarchical payload encapsulation and modeled propagation. Experiments report success rates of 40–78% across three standard MAS frameworks (AutoGen, CrewAI, LangGraph) under five topologies and 85% success on two real-world MAS applications over 20 scenarios; a topology-trust mitigation is introduced that blocks 94.8% of the composite attacks.

Significance. If the experimental results prove robust, the work identifies a previously under-examined attack surface arising from the default reinterpretation of upstream outputs by downstream agents in MAS topologies. The evaluation on unmodified, widely deployed frameworks plus real applications supplies concrete evidence of practical risk and supplies a concrete mitigation, which could inform both defensive design and future security evaluations of agentic systems.

major comments (2)
  1. [Attack Model and Experimental Evaluation] The central success-rate claims (40–78% and 85%) rest on the assumption that downstream agents will reinterpret and execute upstream outputs without filtering or summarization. No ablation is reported that inserts even lightweight validation, safety alignment, or output sanitization at intermediate agents; such a test is required to establish whether the reported rates survive realistic MAS deployments.
  2. [Experimental Evaluation] The experimental section supplies no description of trial counts, statistical tests, success criteria, or data-exclusion rules. Without these elements the numerical results cannot be evaluated for reproducibility or for support of the claim that the attack overcomes black-box challenges.
minor comments (1)
  1. [Abstract and §4] The abstract and evaluation sections refer to “five topologies” and “20 representative scenarios” without enumerating them or justifying their selection; an explicit list or appendix would improve clarity.
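
The second major comment turns on how much a reported success rate means without a trial count. A minimal sketch of that point, assuming a standard Wilson score interval: the trial counts below are hypothetical, since the review notes that the manuscript does not report them, and the function name is illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Hypothetical trial counts: the per-configuration numbers are not given.
for n in (20, 100):
    low, high = wilson_interval(round(0.78 * n), n)
    print(n, round(low, 3), round(high, 3))
```

The interval around a rate near 78% is markedly wider at 20 trials than at 100, which is why the same headline percentage can support quite different conclusions depending on the number of runs behind it.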

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and indicate the changes planned for the revised manuscript.

  1. Referee: [Attack Model and Experimental Evaluation] The central success-rate claims (40–78% and 85%) rest on the assumption that downstream agents will reinterpret and execute upstream outputs without filtering or summarization. No ablation is reported that inserts even lightweight validation, safety alignment, or output sanitization at intermediate agents; such a test is required to establish whether the reported rates survive realistic MAS deployments.

    Authors: We agree that the reported success rates assume direct reinterpretation of upstream outputs by downstream agents, which reflects the default behavior in the evaluated MAS frameworks but may not hold in deployments with added safeguards. In the revised manuscript we will add a new ablation study that inserts lightweight validation, safety alignment, and output sanitization at intermediate agents. This will quantify how the attack success rates change under more realistic filtering conditions and clarify the scope of the vulnerability. revision: yes

  2. Referee: [Experimental Evaluation] The experimental section supplies no description of trial counts, statistical tests, success criteria, or data-exclusion rules. Without these elements the numerical results cannot be evaluated for reproducibility or for support of the claim that the attack overcomes black-box challenges.

    Authors: We acknowledge that the experimental methodology section lacked sufficient detail on reproducibility. In the revised manuscript we will expand the relevant sections to report the exact number of trials conducted for each configuration, the statistical tests and confidence intervals used, precise definitions of success criteria for each task and topology, and any data-exclusion rules applied. These additions will allow independent evaluation of the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct experiments

full rationale

The paper presents a topology-aware attack on LLM multi-agent systems via reconnaissance, propagation modeling, and payload encapsulation, validated through empirical success rates (40-78% on frameworks, 85% on applications) rather than any mathematical derivation or fitted parameters. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text; the central claims are tested on unmodified standard MAS frameworks (AutoGen, CrewAI, LangGraph) and real-world scenarios, making the work self-contained against external benchmarks. This matches the reader's assessment of minimal circularity risk.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MAS inter-agent dependencies enable propagation of adversarial content and on the practical feasibility of black-box topology mapping; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Upstream outputs are reinterpreted and executed by downstream agents in MAS.
    This dependency is stated as the inspiration for the attack propagation mechanism.

pith-pipeline@v0.9.0 · 5510 in / 1197 out tokens · 46913 ms · 2026-05-17T02:59:40.358228+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Conjunctive Prompt Attacks in Multi-Agent LLM Systems

    cs.MA · 2026-04 · unverdicted · novelty 7.0

    Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Chateda: A large language model powered autonomous agent for eda,

    H. Wu, Z. He, X. Zhang, X. Yao, S. Zheng, H. Zheng, and B. Yu, “Chateda: A large language model powered autonomous agent for eda,” Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 43, 2024

  2. [2]

    PentestGPT: Evaluating and harnessing large language models for automated penetration testing,

    G. Deng, Y. Liu, V. Mayoral-Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzger, and S. Rass, “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” in Proceedings of the USENIX Security Symposium (USENIX Security), 2024

  3. [3]

    A large language model-based multi-agent manufacturing system for intelligent shopfloors,

    Z. Zhao, D. Tang, C. Liu, L. Wang, Z. Zhang, H. Zhu, K. Chen, Q. Nie, and Y. Ji, “A large language model-based multi-agent manufacturing system for intelligent shopfloors,” Advanced Engineering Informatics, vol. 69, 2026

  4. [4]

    Protagents: protein discovery via large language model multi-agent collaborations combining physics and machine learning,

    A. Ghafarollahi and M. J. Buehler, “Protagents: protein discovery via large language model multi-agent collaborations combining physics and machine learning,” Digital Discovery, vol. 3, 2024

  5. [5]

    Mdagents: An adaptive collaboration of llms for medical decision-making,

    Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, and H. W. Park, “Mdagents: An adaptive collaboration of llms for medical decision-making,” Advances in Neural Information Processing Systems, vol. 37, 2024

  6. [6]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,

    E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, and F. Tramèr, “Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,” in Proceedings of the Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS), 2024

  7. [7]

    Dissecting adversarial robustness of multimodal LM agents,

    C. H. Wu, R. R. Shah, J. Y. Koh, R. Salakhutdinov, D. Fried, and A. Raghunathan, “Dissecting adversarial robustness of multimodal LM agents,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025

  8. [8]

    Attacking vision-language computer agents via pop-ups,

    Y. Zhang, T. Yu, and D. Yang, “Attacking vision-language computer agents via pop-ups,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  9. [9]

    NetSafe: Exploring the topological safety of multi-agent system,

    M. Yu, S. Wang, G. Zhang, J. Mao, C. Yin, Q. Liu, K. Wang, Q. Wen, and Y. Wang, “NetSafe: Exploring the topological safety of multi-agent system,” in Proceedings of the Findings of the Association for Computational Linguistics (Findings of ACL), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds., 2025

  10. [10]

    On the resilience of LLM-based multi-agent collaboration with faulty agents,

    J. tse Huang, J. Zhou, T. Jin, X. Zhou, Z. Chen, W. Wang, Y. Yuan, M. Lyu, and M. Sap, “On the resilience of LLM-based multi-agent collaboration with faulty agents,” in Proceedings of the International Conference on Machine Learning (ICML), 2025

  11. [11]

    Red-teaming llm multi-agent systems via communication attacks,

    P. He, Y. Lin, S. Dong, H. Xu, Y. Xing, and H. Liu, “Red-teaming llm multi-agent systems via communication attacks,” in Proceedings of the Findings of the Association for Computational Linguistics (Findings of ACL), 2025

  12. [12]

    Ip leakage attacks targeting llm-based multi-agent systems,

    L. Wang, W. Wang, S. Wang, Z. Li, Z. Ji, Z. Lyu, D. Wu, and S.-C. Cheung, “Ip leakage attacks targeting llm-based multi-agent systems,” arXiv preprint arXiv:2505.12442, 2025

  13. [13]

    Automating prompt leakage attacks on large language models using agentic approach,

    T. Sternak, D. Runje, D. Granoša, and C. Wang, “Automating prompt leakage attacks on large language models using agentic approach,” arXiv preprint arXiv:2502.12630, 2025

  14. [14]

    Mirage in the eyes: hallucination attack on multi-modal large language models with only attention sink,

    Y. Wang, M. Zhang, J. Sun, C. Wang, M. Yang, H. Xue, J. Tao, R. Duan, and J. Liu, “Mirage in the eyes: hallucination attack on multi-modal large language models with only attention sink,” in Proceedings of the USENIX Conference on Security Symposium (USENIX Security), 2025

  15. [15]

    Web fraud attacks against llm-driven multi-agent systems,

    D. Kong, H. Peng, Y. Zhang, L. Zhao, Z. Xu, S. Lin, C. Lin, and M. Han, “Web fraud attacks against llm-driven multi-agent systems,” arXiv preprint arXiv:2509.01211, 2025

  16. [16]

    Denial-of-service poisoning attacks against large language models,

    K. Gao, T. Pang, C. Du, Y. Yang, S.-T. Xia, and M. Lin, “Denial-of-service poisoning attacks against large language models,” arXiv preprint arXiv:2410.10760, 2024

  17. [17]

    G-safeguard: A topology-guided security lens and treatment on llm-based multi-agent systems,

    S. Wang, G. Zhang, M. Yu, G. Wan, F. Meng, C. Guo, K. Wang, and Y. Wang, “G-safeguard: A topology-guided security lens and treatment on llm-based multi-agent systems,” in Proceedings of the Findings of the Association for Computational Linguistics (Findings of ACL), 2025

  18. [18]

    A practical memory injection attack against llm agents,

    S. Dong, S. Xu, P. He, Y. Li, J. Tang, T. Liu, H. Liu, and Z. Xiang, “A practical memory injection attack against llm agents,” arXiv preprint arXiv:2503.03704, 2025

  19. [19]

    Agentpoison: red-teaming llm agents via poisoning memory or knowledge bases,

    Z. Chen, Z. Xiang, C. Xiao, D. Song, and B. Li, “Agentpoison: red-teaming llm agents via poisoning memory or knowledge bases,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2024

  20. [20]

    Unveiling privacy risks in LLM agent memory,

    B. Wang, W. He, S. Zeng, Z. Xiang, Y. Xing, J. Tang, and P. He, “Unveiling privacy risks in LLM agent memory,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds., 2025

  21. [21]

    Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors,

    W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian et al., “Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023

  22. [22]

    Tptu-v2: Boosting task planning and tool usage of large language model-based agents in real-world industry systems,

    Y. Kong, J. Ruan, Y. Chen, B. Zhang, T. Bao, S. Shiwei, X. Hu, H. Mao, Z. Li, X. Zeng et al., “Tptu-v2: Boosting task planning and tool usage of large language model-based agents in real-world industry systems,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  23. [23]

    Toolformer: language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessí, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: language models can teach themselves to use tools,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2023

  24. [24]

    Masrouter: Learning to route llms for multi-agent systems,

    Y. Yue, G. Zhang, B. Liu, G. Wan, K. Wang, D. Cheng, and Y. Qi, “Masrouter: Learning to route llms for multi-agent systems,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  25. [25]

    Metagpt: The multi-agent framework,

    MetaGPT, “Metagpt: The multi-agent framework,” 2025. [Online]. Available: https://www.deepwisdom.ai/

  26. [26]

    Langmanus,

    Darwin-lfl, “Langmanus,” 2025. [Online]. Available: https://github.com/Darwin-lfl/langmanus?tab=readme-ov-file

  27. [27]

    Autogen: Enabling next-gen llm applications via multi-agent conversations,

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu et al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” in Proceedings of the First Conference on Language Modeling (COLM), 2024

  28. [28]

    Camel,

    CAMEL-AI, “Camel,” 2025. [Online]. Available: https://www.camel-ai.org/

  29. [29]

    Cut the crap: An economical communication pipeline for LLM-based multi-agent systems,

    G. Zhang, Y. Yue, Z. Li, S. Yun, G. Wan, K. Wang, D. Cheng, J. X. Yu, and T. Chen, “Cut the crap: An economical communication pipeline for LLM-based multi-agent systems,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025

  30. [30]

    Multi-agent design: Optimizing agents with better prompts and topologies,

    H. Zhou, X. Wan, R. Sun, H. Palangi, S. Iqbal, I. Vulić, A. Korhonen, and S. Ö. Arık, “Multi-agent design: Optimizing agents with better prompts and topologies,” arXiv preprint arXiv:2502.02533, 2025

  31. [31]

    Agentdropout: Dynamic agent elimination for token-efficient and high-performance llm-based multi-agent collaboration,

    Z. Wang, Y. Wang, X. Liu, L. Ding, M. Zhang, J. Liu, and M. Zhang, “Agentdropout: Dynamic agent elimination for token-efficient and high-performance llm-based multi-agent collaboration,” 2025.

  32. [32]

    Topological structure learning should be a research priority for llm-based multi-agent systems,

    J. Yang, M. Zhang, Y. Jin, H. Chen, Q. Wen, L. Lin, Y. He, W. Xu, J. Evans, and J. Wang, “Topological structure learning should be a research priority for llm-based multi-agent systems,” arXiv preprint arXiv:2505.22467, 2025

  33. [33]

    Understanding the information propagation effects of communication topologies in LLM-based multi-agent systems,

    X. Shen, Y. Liu, Y. Dai, Y. Wang, R. Miao, Y. Tan, S. Pan, and X. Wang, “Understanding the information propagation effects of communication topologies in LLM-based multi-agent systems,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds., 2025

  34. [34]

    Agentoccam: A simple yet strong baseline for LLM-based web agents,

    K. Yang, Y. Liu, S. Chaudhary, R. Fakoor, P. Chaudhari, G. Karypis, and H. Rangwala, “Agentoccam: A simple yet strong baseline for LLM-based web agents,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025

  35. [35]

    EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE,

    Z. Liao, L. Mo, C. Xu, M. Kang, J. Zhang, C. Xiao, Y. Tian, B. Li, and H. Sun, “EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025

  36. [36]

    Agrail: A lifelong agent guardrail with effective and adaptive safety detection,

    W. Luo, S. Dai, X. Liu, S. Banerjee, H. Sun, M. Chen, and C. Xiao, “Agrail: A lifelong agent guardrail with effective and adaptive safety detection,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  37. [37]

    Agent smith: a single image can jailbreak one million multimodal llm agents exponentially fast,

    X. Gu, X. Zheng, T. Pang, C. Du, Q. Liu, Y. Wang, J. Jiang, and M. Lin, “Agent smith: a single image can jailbreak one million multimodal llm agents exponentially fast,” in Proceedings of the International Conference on Machine Learning (ICML), 2024

  38. [38]

    Agentpoison: Red-teaming LLM agents via poisoning memory or knowledge bases,

    Z. Chen, Z. Xiang, C. Xiao, D. Song, and B. Li, “Agentpoison: Red-teaming LLM agents via poisoning memory or knowledge bases,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2024

  39. [39]

    Caution for the environment: Multimodal LLM agents are susceptible to environmental distractions,

    X. Ma, Y. Wang, Y. Yao, T. Yuan, A. Zhang, Z. Zhang, and H. Zhao, “Caution for the environment: Multimodal LLM agents are susceptible to environmental distractions,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds., 2025

  40. [40]

    Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack,

    M. Russinovich, A. Salem, and R. Eldan, “Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack,” in Proceedings of the USENIX Security Symposium (USENIX Security), 2025

  41. [41]

    "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models,

    X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “"do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

  42. [42]

    Osworld: benchmarking multimodal agents for open-ended tasks in real computer environments,

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu, “Osworld: benchmarking multimodal agents for open-ended tasks in real computer environments,” in Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2024

  43. [43]

    Infecting LLM agents via generalizable adversarial attack,

    W. Yu, K. Hu, T. Pang, C. Du, M. Lin, and M. Fredrikson, “Infecting LLM agents via generalizable adversarial attack,” in Proceedings of the Conference on Neural Information Processing Systems Workshop (NeurIPS Workshop), 2025

  44. [44]

    Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate,

    A. Amayuelas, X. Yang, A. Antoniades, W. Hua, L. Pan, and W. Y. Wang, “Multiagent collaboration attack: Investigating adversarial attacks in large language model collaborations via debate,” in Proceedings of the Findings of the Association for Computational Linguistics (EMNLP), 2024

  45. [45]

    Flooding spread of manipulated knowledge in llm-based multi-agent communities,

    T. Ju, Y. Wang, X. Ma, P. Cheng, H. Zhao, Y. Wang, L. Liu, J. Xie, Z. Zhang, and G. Liu, “Flooding spread of manipulated knowledge in llm-based multi-agent communities,” arXiv preprint arXiv:2407.07791, 2024

  46. [46]

    Gpt-4v(ision),

    OpenAI, “Gpt-4v(ision),” 2025. [Online]. Available: https://openai.com/index/gpt-4v-system-card/

  47. [47]

    Qwen-vl-max,

    Aliyun, “Qwen-vl-max,” 2025. [Online]. Available: https://modelscope.cn/studios/qwen/Qwen-VL-Max

  48. [48]

    Doubao-vision-pro,

    Volcengine, “Doubao-vision-pro,” 2025. [Online]. Available: https://www.volcengine.com/product/doubao

  49. [49]

    Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

    A. Fourney, G. Bansal, H. Mozannar, C. Tan, E. Salinas, E. Zhu, F. Niedtner, G. Proebsting, G. Bassman, J. Gerrits, J. Alber, P. Chang, R. Loynd, R. West, V. Dibia, A. Awadallah, E. Kamar, R. Hosn, and S. Amershi, “Magentic-one: A generalist multi-agent system for solving complex tasks,” arXiv preprint arXiv:2411.04468, 2024

  50. [50]

    Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation,

    camel-ai.org, “Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation,” 2025. [Online]. Available: https://github.com/camel-ai/owl

  51. [51]

    LIMA: Less is more for alignment,

    C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy, “LIMA: Less is more for alignment,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2023

  52. [52]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Proceedings of the Conference on Neural Information P...

  53. [53]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2023

  54. [54]

    Defending large language models against jailbreak attacks through chain of thought prompting,

    Y. Cao, N. Gu, X. Shen, D. Yang, and X. Zhang, “Defending large language models against jailbreak attacks through chain of thought prompting,” in Proceedings of the International Conference on Networking and Network Applications (NaNA), 2024

  55. [55]

    Agentsafe: Safeguarding large language model-based multi-agent systems via hierarchical data management,

    J. Mao, F. Meng, Y. Duan, M. Yu, X. Jia, J. Fang, Y. Liang, K. Wang, and Q. Wen, “Agentsafe: Safeguarding large language model-based multi-agent systems via hierarchical data management,” arXiv preprint arXiv:2503.04392, 2025

  56. [56]

    Pleak: Prompt leaking attacks against large language model applications,

    B. Hui, H. Yuan, N. Gong, P. Burlina, and Y. Cao, “Pleak: Prompt leaking attacks against large language model applications,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

  57. [57]

    Novel universal bypass for all major llms,

    C. McCauley, K. Yeung, J. Martin, and K. Schulz, “Novel universal bypass for all major llms,” 2025. [Online]. Available: https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

  58. [58]

    Enhancing jailbreak attacks on llms via persona prompts,

    Z. Zhang, P. Zhao, D. Ye, and H. Wang, “Enhancing jailbreak attacks on llms via persona prompts,” arXiv preprint arXiv:2507.22171, 2025

  59. [59]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023

  60. [60]

    A study of model iterations of fitts’ law and its application to human–computer interactions,

    H. Xiao, Y. Sun, Z. Duan, Y. Huo, J. Liu, M. Luo, Y. Li, and Y. Zhang, “A study of model iterations of fitts’ law and its application to human–computer interactions,” Applied Sciences, vol. 14, 2024

  61. [61]

    Gpt-4o,

    OpenAI, “Gpt-4o,” 2025. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o

  62. [62]

    claude-3-7-sonnet,

    Anthropic, “claude-3-7-sonnet,” 2025. [Online]. Available: https://www.anthropic.com/news/claude-3-7-sonnet

  63. [63]

    deepseek-r1,

    Deepseek, “deepseek-r1,” 2025. [Online]. Available: https://api-docs.deepseek.com/news/news250528

  64. [64]

    playwright-mcp,

    Microsoft, “playwright-mcp,” 2025. [Online]. Available: https://github.com/microsoft/playwright-mcp

  65. [65]

    filesystem-mcp,

    mark3labs, “filesystem-mcp,” 2025. [Online]. Available: https://github.com/mark3labs/mcp-filesystem-server

  66. [66]

    SimCSE: Simple contrastive learning of sentence embeddings,

    T. Gao, X. Yao, and D. Chen, “SimCSE: Simple contrastive learning of sentence embeddings,” in Empirical Methods in Natural Language Processing (EMNLP), 2021

  67. [67]

    Vpi-bench: Visual prompt injection attacks for computer-use agents,

    T. Cao, B. Lim, Y. Liu, Y. Sui, Y. Li, S. Deng, L. Lu, N. Oo, S. Yan, and B. Hooi, “Vpi-bench: Visual prompt injection attacks for computer-use agents,” arXiv preprint arXiv:2506.02456, 2025.

Appendix A. Ethics Considerations. Research Scope and Ethical Boundaries. This research strictly adheres to ethical standards for security and AI system evaluation....