
arxiv: 2606.31227 · v1 · pith:SWRWPKI7new · submitted 2026-06-30 · 💻 cs.CR

Securing the AI Agent: A Unified Framework for Multi-Layer Agent Red Teaming

Pith reviewed 2026-07-01 05:45 UTC · model grok-4.3

classification 💻 cs.CR
keywords AI agent red teaming · multi-layer security · supply chain auditing · MCP server auditing · jailbreak testing · agent behavior evaluation · infrastructure vulnerability rules

The pith

AI-Infra-Guard matches distinct red teaming methods to each of four AI agent attack layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that AI agent attack surfaces divide into infrastructure, protocol/tool, agent behavior, and model layers, with no single detection method suiting all. It presents an open-source framework that pairs a method with each layer: deterministic rule checks over components and vulnerabilities for infrastructure, LLM auditing of MCP servers and skill packages for the protocol/tool layer, multi-turn black-box testing for behavior, and a jailbreak harness for models. The authors describe it as, to their knowledge, the only open-source framework to span all four layers, including supply-chain auditing of agent skills. A reader would care because agents now run on expanding open infrastructure where existing tools leave gaps at one or more layers. The work positions the framework as a practical base for shared agent security development.

Core claim

The attack surface of an AI agent is stratified across infrastructure, protocol/tool, agent behavior, and model layers, and matching a detection paradigm to each layer—deterministic rules for the first, LLM-driven auditing for the second, multi-turn black-box testing for the third, and a jailbreak harness for the fourth—produces comprehensive coverage that includes supply-chain risks in agent skills and that no prior open-source framework achieves.

What carries the argument

Layer-paradigm matching, which assigns rule-based detection to infrastructure components, LLM auditing to protocols and skills, multi-turn testing to behavior, and specialized operators to models.
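The pairing can be pictured as a lookup from layer to paradigm. The following is a minimal sketch, assuming a plain dictionary dispatch; the `Layer` enum, the `PARADIGM_BY_LAYER` table, and the `paradigm_for` helper are illustrative names and are not taken from the AI-Infra-Guard codebase.

```python
from enum import Enum


class Layer(Enum):
    """The four layers named in the paper's abstract."""
    INFRASTRUCTURE = "infrastructure"
    PROTOCOL_TOOL = "protocol/tool"
    AGENT_BEHAVIOR = "agent behavior"
    MODEL = "model"


# Hypothetical table paraphrasing the pairing described in the abstract.
PARADIGM_BY_LAYER = {
    Layer.INFRASTRUCTURE: "deterministic rule matching",
    Layer.PROTOCOL_TOOL: "LLM-driven agentic auditing",
    Layer.AGENT_BEHAVIOR: "multi-turn black-box red teaming",
    Layer.MODEL: "jailbreak harness",
}


def paradigm_for(layer: Layer) -> str:
    """Return the detection paradigm paired with the given layer."""
    return PARADIGM_BY_LAYER[layer]


if __name__ == "__main__":
    for layer in Layer:
        print(f"{layer.value}: {paradigm_for(layer)}")
```

A table like this makes the design choice explicit: adding a fifth layer would mean adding both an enum member and a paradigm entry, which is one way to probe whether the four-layer division holds.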

If this is right

  • Enables systematic auditing of MCP servers and agent-skill packages that extend current agents.
  • Applies 1,400+ vulnerability rules to 75+ AI infrastructure components.
  • Supports at least 26 attack operators across sixteen datasets for model-layer testing.
  • Supplies a shared open-source base for further agent security tooling.
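To illustrate what deterministic rule matching over components can look like, here is a minimal sketch, assuming each rule names a component and the first version in which the issue is fixed. The `Rule` shape, the rule IDs, and the component names are hypothetical and do not reflect the framework's actual rule format.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '0.4.2' into (0, 4, 2)."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Rule:
    rule_id: str    # hypothetical identifier, not a real advisory
    component: str  # component the rule applies to
    fixed_in: str   # first version that is no longer affected


def match_rules(component: str, version: str, rules: list[Rule]) -> list[str]:
    """Return IDs of rules for this component whose fix is newer than `version`."""
    current = parse_version(version)
    return [
        rule.rule_id
        for rule in rules
        if rule.component == component and current < parse_version(rule.fixed_in)
    ]


# Made-up rules for illustration only.
RULES = [
    Rule("EXAMPLE-0001", "example-server", "1.2.0"),
    Rule("EXAMPLE-0002", "example-server", "1.0.5"),
    Rule("EXAMPLE-0003", "other-tool", "2.0.0"),
]

if __name__ == "__main__":
    print(match_rules("example-server", "1.1.0", RULES))  # ['EXAMPLE-0001']
```

The same input always yields the same findings, which is the property the authors' rebuttal cites when arguing that deterministic rules suit the infrastructure layer.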

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use could make supply-chain checks on skill packages a standard step before agent deployment.
  • The stratification may highlight missing coverage in emerging agent platforms that add new extension mechanisms.
  • Community extensions could test whether the four-layer division remains stable as new protocols appear.

Load-bearing premise

The attack surface of an AI agent divides usefully into four layers and each layer requires its own distinct detection method that is both necessary and sufficient.

What would settle it

An attack on a deployed AI agent that evades all four layer-specific methods simultaneously, or a single detection approach that covers every layer without measurable loss of effectiveness.
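The first of these conditions can be phrased as a check over per-layer detection results. The sketch below assumes such results are available as booleans; the attack names and outcomes are invented for illustration and are not data from the paper.

```python
LAYERS = ("infrastructure", "protocol/tool", "agent behavior", "model")


def evades_all_layers(detections: dict[str, bool]) -> bool:
    """True when no layer-specific method flagged the attack."""
    return not any(detections.get(layer, False) for layer in LAYERS)


# Hypothetical results: attack name -> whether each layer's method flagged it.
results = {
    "attack-a": {"infrastructure": True, "protocol/tool": False,
                 "agent behavior": False, "model": False},
    "attack-b": {"infrastructure": False, "protocol/tool": False,
                 "agent behavior": False, "model": False},
}

# An attack flagged by no layer would count as a counterexample to full coverage.
counterexamples = [name for name, hits in results.items() if evades_all_layers(hits)]
print(counterexamples)  # ['attack-b']
```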

Original abstract

The fast growth of open-source AI infrastructure, from model serving engines and agent platforms to the Model Context Protocol (MCP) ecosystem and the language models themselves, has outpaced the security tooling available to defend it. We present AI-Infra-Guard, an open-source framework that organizes AI red teaming around a single observation: the attack surface of an AI agent is stratified across layers (infrastructure, protocol/tool, agent behavior, and model), and no single detection paradigm fits all of them. The framework therefore matches a paradigm to each layer, from deterministic rule matching over 75+ AI components and 1,400+ vulnerability rules, through LLM-driven agentic auditing of MCP servers and agent-skill packages and multi-turn black-box agent red teaming, to a jailbreak harness with 26+ attack operators over sixteen datasets. To our knowledge it is the only open-source framework to span all of these, including supply-chain auditing of the agent skills that increasingly extend AI agents. We release AI-Infra-Guard as open source so that layer-paradigm matching can serve as a practical foundation for agent security and a shared base for the community to build on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents AI-Infra-Guard, an open-source framework for multi-layer red teaming of AI agents. It organizes red teaming around the observation that an AI agent's attack surface is stratified into four layers (infrastructure, protocol/tool, agent behavior, and model), with a distinct detection paradigm matched to each: deterministic rule matching over 75+ components and 1,400+ vulnerability rules for infrastructure; LLM-driven agentic auditing of MCP servers and agent-skill packages for protocol/tool; multi-turn black-box agent red teaming for agent behavior; and a jailbreak harness with 26+ attack operators over sixteen datasets for the model layer. The work claims this is the only open-source framework spanning all layers, including supply-chain auditing of agent skills, and releases the framework to serve as a practical foundation for agent security.

Significance. If the four-layer stratification and paradigm matching can be shown to be exhaustive and effective, the framework could provide a structured, practical base for AI agent security that addresses supply-chain risks in agent skills. The open-source release is a strength that could enable community extensions. However, the manuscript contains no evaluation data, benchmarks, attack coverage analysis, or comparisons, so the practical significance cannot be assessed from the provided text.

major comments (2)
  1. [Abstract] The core claim that 'the attack surface of an AI agent is stratified across layers ... and no single detection paradigm fits all of them' is presented as a single observation with no supporting argument, mapping of known attacks to the four layers, coverage analysis, or comparison showing why alternatives (e.g., unified LLM-based detection) are inferior. This justification is load-bearing for the framework design and uniqueness assertion.
  2. [Abstract] The uniqueness claim ('To our knowledge it is the only open-source framework to span all of these, including supply-chain auditing') is made without citation context, comparison to prior red-teaming tools, or evidence, which directly supports the central contribution statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding the abstract. We address each major comment below with explanations drawn from the manuscript's design rationale and propose targeted revisions to improve clarity and support for the claims.

  1. Referee: [Abstract] The core claim that 'the attack surface of an AI agent is stratified across layers ... and no single detection paradigm fits all of them' is presented as a single observation with no supporting argument, mapping of known attacks to the four layers, coverage analysis, or comparison showing why alternatives (e.g., unified LLM-based detection) are inferior. This justification is load-bearing for the framework design and uniqueness assertion.

    Authors: The stratification is motivated by the distinct technical characteristics of each layer, which necessitate paradigm matching: infrastructure (static code/config across 75+ components) requires deterministic rules for precision and low false positives; protocol/tool (MCP servers and skill packages) requires LLM-driven agentic auditing for dynamic analysis; agent behavior requires multi-turn black-box testing to capture sequential interactions; and model requires a specialized jailbreak harness with 26+ operators. The manuscript maps representative attacks to layers in the layer-specific sections and argues that a unified LLM paradigm would lack the determinism needed for infrastructure and the coverage for supply-chain risks. We will revise the abstract to include a concise sentence referencing these distinctions. revision: yes

  2. Referee: [Abstract] The uniqueness claim ('To our knowledge it is the only open-source framework to span all of these, including supply-chain auditing') is made without citation context, comparison to prior red-teaming tools, or evidence, which directly supports the central contribution statement.

    Authors: The claim rests on the observation that prior open-source tools address at most one or two layers without supply-chain auditing of agent skills. While 'to our knowledge' is standard phrasing, we agree that explicit context would strengthen the statement. We will revise the abstract to reference the scope of existing tools and expand the related-work discussion with comparisons. revision: partial

Circularity Check

0 steps flagged

No circularity: framework description with asserted organizing principle, not derived result


The paper is a framework presentation that organizes red teaming around a stated observation about four-layer stratification of the attack surface. No equations, fitted parameters, predictions, or first-principles derivations exist that could reduce to inputs by construction. The central claim is an organizing assertion rather than a result obtained from the framework itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. This matches the default expectation of no significant circularity for descriptive framework papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that AI-agent attack surfaces are stratified into four fixed layers and that each layer requires a distinct detection paradigm; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The attack surface of an AI agent is stratified across infrastructure, protocol/tool, agent behavior, and model layers.
    This partitioning is the organizing principle of the entire framework and is stated without further justification in the abstract.

pith-pipeline@v0.9.1-grok · 5771 in / 1303 out tokens · 23608 ms · 2026-07-01T05:45:25.270446+00:00 · methodology


A. Contributions and Acknowledgments

AI-Infra-Guard is developed by Tencent Zhuque Lab. Table 7 lists the core members and their contributions. An asterisk (∗) denotes a mem...