pith. machine review for the scientific record.

arxiv: 2605.08763 · v1 · submitted 2026-05-09 · 💻 cs.CR

Recognition: no theorem link

When LLMs Team Up: A Coordinated Attack Framework for Automated Cyber Intrusions

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:28 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM agents · multi-agent systems · cyber intrusions · CTF tasks · role-based coordination · automated attacks · knowledge promotion

The pith

Dividing LLM agents into five specialized roles with strict coordination improves success on automated cyber intrusion tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that single LLM agents compress too many steps into one context during intrusion workflows, leading to drift and error spread. CAESAR splits the work across five distinct agent roles linked by a round-limited protocol, a shared knowledge store, workspace isolation, and validator checks before promoting facts. On 25 CTF tasks the structured team finishes more challenges than a matched single-agent setup and shows less variation in outcomes, with the biggest lift on exploits that chain several steps. The authors also note that role hand-offs and promotion events create observable traces that could help track such agent groups.

Core claim

CAESAR decomposes intrusion-style workflows into five typed roles coordinated by a bounded round protocol, persistent knowledge base, per-round workspace, validator-gated knowledge promotion, and capability-token write isolation, yielding higher task success rates and lower performance variance than single-agent baselines under equal budgets and tool access.
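Read mechanically, the coordination primitives in that claim compose into a simple loop. The sketch below is illustrative only: the role labels, the validation rule, and the data shapes are assumptions for this review, not the paper's implementation.

```python
import secrets
from dataclasses import dataclass, field

# Illustrative role labels; the paper defines five typed roles, but these
# particular names are assumptions of this sketch.
ROLES = ["recon", "strategist", "executor", "validator", "reporter"]

@dataclass
class KnowledgeBase:
    """Persistent store: facts enter only through validator-gated promotion."""
    facts: dict = field(default_factory=dict)
    write_tokens: set = field(default_factory=set)

    def issue_token(self) -> str:
        token = secrets.token_hex(8)
        self.write_tokens.add(token)
        return token

    def promote(self, token: str, key: str, value: str, validated: bool) -> bool:
        # Capability-token write isolation: only a current token holder may
        # write, and only validated candidates are promoted.
        if token not in self.write_tokens or not validated:
            return False
        self.facts[key] = value
        return True

def run_round(kb: KnowledgeBase, round_no: int, max_rounds: int,
              candidates: dict, validate) -> bool:
    """One bounded round: worker-role findings pass the validator before
    entering the shared knowledge base; the workspace lives one round."""
    if round_no >= max_rounds:      # bounded round protocol
        return False
    token = kb.issue_token()        # write capability scoped to this round
    workspace = dict(candidates)    # per-round workspace, discarded below
    for key, value in workspace.items():
        kb.promote(token, key, value, validated=validate(key, value))
    kb.write_tokens.discard(token)  # revoke the capability at round end
    return True
```

Under this reading, an unvalidated finding never reaches the persistent store, which is exactly the error-filtering property the referee asks the authors to disentangle from coordination per se.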

What carries the argument

CAESAR's five-role decomposition coordinated via bounded round protocol with validator-gated knowledge promotion and capability-token isolation.

Load-bearing premise

The five typed roles, bounded round protocol, validator-gated knowledge promotion, and capability-token isolation sufficiently model the information flow and error modes of real multi-stage intrusion workflows without adding coordination costs that offset the reported gains.

What would settle it

Re-running the single-agent baseline with the same cumulative tool-call budget and context length spread across multiple simulated rounds as the multi-agent case, then measuring whether the success-rate gap closes.
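A minimal harness for that control might look as follows; the agent callables and budget accounting are hypothetical stand-ins, since the paper's harness is not reproduced here.

```python
def multi_agent_budget(n_agents: int, rounds: int, calls_per_round: int) -> int:
    """Cumulative tool-call budget the multi-agent run consumes."""
    return n_agents * rounds * calls_per_round

def spread_budget(total: int, rounds: int) -> list:
    """Spread the same cumulative budget across the single-agent control's
    simulated rounds; any remainder lands in the final round."""
    per_round, remainder = divmod(total, rounds)
    schedule = [per_round] * rounds
    schedule[-1] += remainder
    return schedule

def success_gap(tasks, multi_agent, single_agent, budget: int) -> float:
    """Mean paired success gap under one shared budget; positive values
    favor the multi-agent configuration."""
    pairs = [(multi_agent(t, budget), single_agent(t, budget)) for t in tasks]
    return sum(m - s for m, s in pairs) / len(pairs)
```

If the gap measured this way stays positive, the advantage is attributable to the coordination structure rather than to the extra context-resets and calls the team enjoys.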

Figures

Figures reproduced from arXiv: 2605.08763 by Congcong Zhu, Minfeng Qi, Qin Wang, Tianqing Zhu, Wanlei Zhou, Zijie Xu.

Figure 1. Diagram of the LLM-based multi-agent attack framework, with the five-stage intrusion lifecycle. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

Figure 2. Category-level performance comparison in terms of [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

Figure 3. Task-level completion rate across all challenges. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

Figure 4. Agent performance evolution across Reverse, Pwn, Crypto, Misc, and Web challenges. Each subplot reports round-by [PITH_FULL_IMAGE:figures/full_fig_p009_4.png]

Figure 5. Aggregate comparison of multi-agent vs. single-agent performance across different backends. Bar height indicates the [PITH_FULL_IMAGE:figures/full_fig_p010_5.png]

Figure 6. End-to-end execution trace of the jumpjump challenge discussed in Section VI, with direct probing being easily recognized by a persona operating under a disclosure-refusal policy. From Round 4 the Strategist's plans shift toward indirect elicitation and the two Executors diversify conversational framing; partial extraction [PITH_FULL_IMAGE:figures/full_fig_p013_6.png]
Original abstract

Automated intrusion-style workflows require LLM agents to reason over partial observations, tool outputs, and executable artifacts under bounded budgets. A single LLM instance often compresses evidence extraction, planning, execution, and validation into one context, which increases the risk of context drift and error propagation. Existing LLM-based multi-agent systems support general collaboration, but they do not explicitly model the role boundaries, artifact provenance, and cost constraints that characterize multi-stage intrusion workflows. This paper presents CAESAR, a coordinated multi-agent framework for controlled analysis of LLM-agent behavior in intrusion-style tasks. CAESAR decomposes the workflow into five typed roles and coordinates them through a bounded round protocol with a persistent knowledge base, a per-round workspace, validator-gated knowledge promotion, and capability-token write isolation. We evaluate CAESAR on 25 CTF tasks across five categories and four LLM backends. Compared with a single-agent baseline under matched budgets and tool access, CAESAR improves task success and reduces performance variance, with larger gains on tasks requiring multi-step exploit composition. A secondary simulated interactional-security study suggests that the role structure can transfer beyond code-native surfaces. The results indicate that role transitions, artifact provenance, and knowledge-promotion events provide useful structural signals for monitoring coordinated LLM-agent behavior beyond individual prompt and output inspection. The dataset, implementation, and evaluation logs are released at https://github.com/Xu-Qiu/CMAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CAESAR, a coordinated multi-agent framework for LLM agents performing automated cyber intrusion tasks. It structures the workflow using five distinct roles coordinated through a bounded round protocol, a persistent knowledge base, per-round workspace, validator-gated knowledge promotion, and capability-token write isolation. The framework is evaluated on 25 CTF tasks spanning five categories and four LLM backends. The central empirical result is that CAESAR achieves higher task success rates and lower performance variance compared to a single-agent baseline with matched budgets and tool access, with particularly pronounced benefits on tasks requiring multi-step exploit composition. The authors also conduct a secondary study on simulated interactional security and release the dataset, code, and logs.

Significance. Should the baseline comparison prove robust upon clarification, this paper offers a valuable contribution to the study of multi-agent LLM systems in cybersecurity by providing a domain-specific coordination protocol that emphasizes role boundaries, artifact provenance, and structured information flow. The empirical evaluation across multiple backends and task types, combined with the public release of implementation and evaluation logs, enhances reproducibility and allows for further analysis of coordination signals such as role transitions and knowledge-promotion events. This could inform both the design of more reliable automated intrusion tools and monitoring techniques for detecting coordinated LLM behaviors.

major comments (2)
  1. [§4 (Evaluation) and §3 (Framework)] The single-agent baseline is stated to use matched budgets and tool access (abstract and §4), but the manuscript does not specify whether the baseline incorporates mechanisms analogous to CAESAR's validator-gated knowledge promotion or capability-token write isolation (described in §3). If the baseline lacks these controls, the reported improvements in success rate and variance reduction may be due to internal error filtering rather than the benefits of multi-role coordination, which is load-bearing for the central claim.
  2. [Abstract and §4.1] The abstract and §4.1 report improvements across 25 tasks without statistical details such as significance tests, error bars, or per-task variance measures. The post-hoc emphasis on larger gains for multi-step tasks raises a selection-bias concern; the methods should clarify whether task categorization was pre-specified or if uniform evaluation was applied to all tasks.
minor comments (2)
  1. [Abstract] The secondary simulated interactional-security study is mentioned only briefly in the abstract; providing its methodology, results, and relation to the main CTF evaluation in the main text would better support the transferability claim.
  2. [§5 (Discussion)] The discussion of monitoring signals (role transitions, artifact provenance, knowledge-promotion events) is promising but would benefit from concrete quantitative metrics or examples showing how these signals differ from single-agent outputs.
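One way to make the second minor comment concrete: treat an execution log as a sequence of (role, action) events and extract role-transition counts and promotion frequency as features; a single-agent trace has a degenerate transition profile by construction. The event schema below is an assumption of this review, not the paper's log format.

```python
from collections import Counter

def trace_signals(events):
    """Illustrative monitoring features from an agent event trace.
    Assumed event format: (actor_role, action) tuples."""
    transitions = Counter()
    promotions = 0
    prev_role = None
    for role, action in events:
        if prev_role is not None and role != prev_role:
            transitions[(prev_role, role)] += 1   # role hand-off observed
        if action == "promote":
            promotions += 1                        # knowledge-promotion event
        prev_role = role
    return {"role_transitions": dict(transitions), "promotions": promotions}
```

Features of this kind are what the discussion's claim about "structural signals" would need to be quantified against single-agent baselines.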

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline the revisions we will incorporate.

Point-by-point responses
  1. Referee: [§4 (Evaluation) and §3 (Framework)] The single-agent baseline is stated to use matched budgets and tool access (abstract and §4), but the manuscript does not specify whether the baseline incorporates mechanisms analogous to CAESAR's validator-gated knowledge promotion or capability-token write isolation (described in §3). If the baseline lacks these controls, the reported improvements in success rate and variance reduction may be due to internal error filtering rather than the benefits of multi-role coordination, which is load-bearing for the central claim.

    Authors: The single-agent baseline is implemented as one LLM instance executing the full workflow sequentially in a unified context with identical tool access and budget limits, but without role decomposition or any inter-agent coordination primitives. Validator-gated promotion and capability-token isolation are therefore absent by design, as they presuppose multiple specialized agents exchanging artifacts. We contend the gains derive from explicit role boundaries and provenance tracking that mitigate context drift, rather than generic filtering. We will revise §4 to provide an explicit side-by-side description of the baseline implementation and add a short discussion distinguishing coordination benefits from internal validation. revision: yes

  2. Referee: [Abstract and §4.1] The abstract and §4.1 report improvements across 25 tasks without statistical details such as significance tests, error bars, or per-task variance measures. The post-hoc emphasis on larger gains for multi-step tasks raises a selection-bias concern; the methods should clarify whether task categorization was pre-specified or if uniform evaluation was applied to all tasks.

    Authors: We will add error bars (standard deviation over repeated runs), a per-task success table in the appendix, and statistical tests (Wilcoxon signed-rank for paired success rates) to §4.1 and, space permitting, the abstract. Task categories follow the pre-defined CTF taxonomy used in the benchmark suite; the multi-step versus single-step distinction was defined a priori from the minimum number of sequential exploit steps stated in each task description. We will state this pre-specification explicitly in the methods and report aggregate results across all 25 tasks, treating the multi-step subgroup analysis as secondary. revision: yes
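For reference, the promised paired test is the Wilcoxon signed-rank statistic; `scipy.stats.wilcoxon` is the practical route (it also supplies the p-value), but a dependency-free sketch of the statistic itself, using the common conventions of dropping zero differences and averaging tied ranks, is:

```python
def wilcoxon_signed_rank(xs, ys):
    """Wilcoxon signed-rank statistic W for paired samples.
    Zero differences are dropped; tied |differences| get averaged ranks.
    Returns (W, n) where W = min(W+, W-) and n is the effective sample size."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend over a run of tied absolute differences
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j + 2) / 2  # ranks are 1-based: mean of i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus), n
```

With only 25 paired tasks the exact null distribution should be used for the p-value rather than the normal approximation, which the promised revision would need to state.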

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation

Full rationale

The paper advances no derivation chain, equations, or first-principles predictions. Its central claim rests on direct experimental comparison of the CAESAR multi-agent framework against a single-agent baseline under matched budgets and tool access across 25 CTF tasks. No parameters are fitted and then re-labeled as predictions, no self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The design elements (roles, validator gates, capability tokens) are presented as engineering choices whose effects are measured externally rather than defined into the outcome by construction. The central result is therefore checked against external benchmarks rather than established by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on domain assumptions about LLM role specialization and the value of explicit provenance tracking; the abstract introduces no free parameters, and the only invented entity is the CAESAR protocol itself.

axioms (1)
  • domain assumption: LLM agents can be reliably assigned and constrained to distinct roles (planner, executor, validator, etc.) without rapid context drift when given bounded rounds and validator gates.
    Invoked by the design of the five-role decomposition and coordination protocol.
invented entities (1)
  • CAESAR five-role coordination protocol with validator-gated knowledge promotion (no independent evidence)
    purpose: To structure multi-agent LLM behavior for intrusion workflows and provide monitorable signals
    Newly defined in this work; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5567 in / 1395 out tokens · 64096 ms · 2026-05-12T03:28:08.759913+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages
