Securing the AI Agent: A Unified Framework for Multi-Layer Agent Red Teaming
Pith reviewed 2026-07-01 05:45 UTC · model grok-4.3
The pith
AI-Infra-Guard matches distinct red teaming methods to each of four AI agent attack layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The attack surface of an AI agent is stratified across four layers: infrastructure, protocol/tool, agent behavior, and model. Matching a detection paradigm to each layer (deterministic rules for infrastructure, LLM-driven auditing for protocol/tool, multi-turn black-box testing for agent behavior, and a jailbreak harness for the model) produces comprehensive coverage, including supply-chain risks in agent skills, that no prior open-source framework achieves.
What carries the argument
Layer-paradigm matching, which assigns rule-based detection to infrastructure components, LLM auditing to protocols and skills, multi-turn testing to behavior, and specialized operators to models.
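The matching can be pictured as a lookup from layer to paradigm. The following is a minimal sketch, not AI-Infra-Guard's actual code: the `Layer` enum, `PARADIGM_BY_LAYER`, `plan_assessment`, and the target names are hypothetical and introduced only for illustration; the layer and paradigm labels are taken from the paper's abstract.

```python
from enum import Enum


class Layer(Enum):
    """The four attack-surface layers named in the paper's abstract."""
    INFRASTRUCTURE = "infrastructure"
    PROTOCOL_TOOL = "protocol/tool"
    AGENT_BEHAVIOR = "agent behavior"
    MODEL = "model"


# Hypothetical lookup table: one detection paradigm per layer.
PARADIGM_BY_LAYER = {
    Layer.INFRASTRUCTURE: "deterministic rule matching",
    Layer.PROTOCOL_TOOL: "LLM-driven agentic auditing",
    Layer.AGENT_BEHAVIOR: "multi-turn black-box red teaming",
    Layer.MODEL: "jailbreak harness",
}


def plan_assessment(targets):
    """Pair each target name with the paradigm matched to its layer.

    `targets` maps a target name (str) to the Layer it belongs to.
    Returns a list of (name, paradigm) tuples sorted by target name.
    """
    ordered = sorted(targets.items(), key=lambda item: item[0])
    return [(name, PARADIGM_BY_LAYER[layer]) for name, layer in ordered]


if __name__ == "__main__":
    # Target names are made up for the example.
    plan = plan_assessment({
        "serving-engine": Layer.INFRASTRUCTURE,
        "mcp-server": Layer.PROTOCOL_TOOL,
        "chat-model": Layer.MODEL,
    })
    for name, paradigm in plan:
        print(f"{name}: {paradigm}")
```

Keeping the mapping in a single table makes the load-bearing assumption explicit: every target must be assigned to exactly one layer before a paradigm can be chosen.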
If this is right
- Enables systematic auditing of MCP servers and agent-skill packages that extend current agents.
- Applies more than 1,400 vulnerability rules to more than 75 AI infrastructure components.
- Supports 26+ attack operators across sixteen datasets for model-layer testing.
- Supplies a shared open-source base for further agent security tooling.
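To illustrate what deterministic rule matching over infrastructure components can look like, here is a minimal sketch. It assumes a rule is a component name plus the first fixed version; this page does not describe AI-Infra-Guard's real rule format, so the `Rule` class, the rule IDs, and the component names below are all hypothetical.

```python
from dataclasses import dataclass


def parse_version(text):
    """Turn a dotted version such as '1.4.2' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: `component` is affected in every version below `fixed_in`."""
    rule_id: str
    component: str
    fixed_in: str

    def matches(self, component, version):
        return (
            component == self.component
            and parse_version(version) < parse_version(self.fixed_in)
        )


# Made-up rules; real rule sets would be loaded from files.
RULES = [
    Rule("EXAMPLE-0001", "example-serving-engine", "1.4.2"),
    Rule("EXAMPLE-0002", "example-agent-platform", "0.9.0"),
]


def scan(component, version, rules=RULES):
    """Return the IDs of all rules that flag this component version."""
    return [rule.rule_id for rule in rules if rule.matches(component, version)]


if __name__ == "__main__":
    print(scan("example-serving-engine", "1.4.1"))   # flagged by one rule
    print(scan("example-serving-engine", "1.10.0"))  # not flagged
```

Comparing versions as integer tuples rather than strings is what keeps the check deterministic and correct for cases such as 1.10.0 versus 1.4.2, where a plain string comparison would give the wrong order.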
Where Pith is reading between the lines
- Widespread use could make supply-chain checks on skill packages a standard step before agent deployment.
- The stratification may highlight missing coverage in emerging agent platforms that add new extension mechanisms.
- Community extensions could test whether the four-layer division remains stable as new protocols appear.
Load-bearing premise
The attack surface of an AI agent divides usefully into four layers and each layer requires its own distinct detection method that is both necessary and sufficient.
What would settle it
An attack on a deployed AI agent that evades all four layer-specific methods simultaneously, or a single detection approach that covers every layer without measurable loss of effectiveness.
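Both settling conditions can be phrased as simple checks over recorded results. The sketch below is illustrative only: the function names, the per-layer result format, and the `tolerance` used to stand in for "measurable loss" are assumptions, not anything defined by the paper.

```python
LAYERS = ("infrastructure", "protocol/tool", "agent behavior", "model")


def evades_all_layers(detected_by_layer):
    """True if an attack was missed by the detection method of every layer.

    `detected_by_layer` maps each layer name to True (detected) or False (missed).
    """
    missing = [layer for layer in LAYERS if layer not in detected_by_layer]
    if missing:
        raise ValueError(f"no result recorded for: {missing}")
    return not any(detected_by_layer[layer] for layer in LAYERS)


def unified_is_competitive(unified_rate, specialised_rate, tolerance=0.0):
    """True if one unified approach matches the layer-specific methods everywhere.

    Both arguments map layer name -> detection rate in [0, 1]. `tolerance` is an
    assumed threshold for what counts as a measurable loss.
    """
    return all(
        unified_rate[layer] >= specialised_rate[layer] - tolerance
        for layer in LAYERS
    )


if __name__ == "__main__":
    # Invented example data, not results from the paper.
    print(evades_all_layers({layer: False for layer in LAYERS}))
    print(unified_is_competitive(
        {layer: 0.5 for layer in LAYERS},
        {layer: 0.9 for layer in LAYERS},
        tolerance=0.05,
    ))
```

Either function returning True on real data would count against the premise that four distinct methods are both necessary and sufficient.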
Original abstract
The fast growth of open-source AI infrastructure, from model serving engines and agent platforms to the Model Context Protocol (MCP) ecosystem and the language models themselves, has outpaced the security tooling available to defend it. We present AI-Infra-Guard, an open-source framework that organizes AI red teaming around a single observation: the attack surface of an AI agent is stratified across layers (infrastructure, protocol/tool, agent behavior, and model), and no single detection paradigm fits all of them. The framework therefore matches a paradigm to each layer, from deterministic rule matching over 75+ AI components and 1,400+ vulnerability rules, through LLM-driven agentic auditing of MCP servers and agent-skill packages and multi-turn black-box agent red teaming, to a jailbreak harness with 26+ attack operators over sixteen datasets. To our knowledge it is the only open-source framework to span all of these, including supply-chain auditing of the agent skills that increasingly extend AI agents. We release AI-Infra-Guard as open source so that layer-paradigm matching can serve as a practical foundation for agent security and a shared base for the community to build on.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AI-Infra-Guard, an open-source framework for multi-layer red teaming of AI agents. It organizes red teaming around the observation that an AI agent's attack surface is stratified into four layers (infrastructure, protocol/tool, agent behavior, and model), with a distinct detection paradigm matched to each: deterministic rule matching over 75+ components and 1,400+ vulnerability rules for infrastructure; LLM-driven agentic auditing of MCP servers and agent-skill packages for protocol/tool; multi-turn black-box agent red teaming for agent behavior; and a jailbreak harness with 26+ attack operators over sixteen datasets for the model layer. The work claims this is the only open-source framework spanning all layers, including supply-chain auditing of agent skills, and releases the framework to serve as a practical foundation for agent security.
Significance. If the four-layer stratification and paradigm matching can be shown to be exhaustive and effective, the framework could provide a structured, practical base for AI agent security that addresses supply-chain risks in agent skills. The open-source release is a strength that could enable community extensions. However, the manuscript contains no evaluation data, benchmarks, attack coverage analysis, or comparisons, so the practical significance cannot be assessed from the provided text.
major comments (2)
- [Abstract] The core claim that 'the attack surface of an AI agent is stratified across layers ... and no single detection paradigm fits all of them' is presented as a single observation with no supporting argument, no mapping of known attacks to the four layers, no coverage analysis, and no comparison showing why alternatives (e.g., unified LLM-based detection) are inferior. This justification is load-bearing for the framework design and the uniqueness assertion.
- [Abstract] The uniqueness claim ('To our knowledge it is the only open-source framework to span all of these, including supply-chain auditing') is made without citation context, comparison to prior red-teaming tools, or evidence, even though it directly supports the central contribution statement.
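The mapping of attacks to layers that the first comment asks for could be tabulated along the following lines. This is a hedged sketch: the helper names are hypothetical, and the four example attacks and their layer assignments are illustrative readings of the abstract, not a mapping published in the paper.

```python
LAYERS = ("infrastructure", "protocol/tool", "agent behavior", "model")


def coverage_by_layer(attack_to_layers):
    """Invert an attack -> layers mapping into layer -> sorted list of attacks."""
    coverage = {layer: [] for layer in LAYERS}
    for attack in sorted(attack_to_layers):
        for layer in attack_to_layers[attack]:
            if layer not in coverage:
                raise ValueError(f"unknown layer: {layer!r}")
            coverage[layer].append(attack)
    return coverage


def uncovered_layers(coverage):
    """Layers to which no attack has been mapped."""
    return [layer for layer in LAYERS if not coverage[layer]]


def cross_layer_attacks(attack_to_layers):
    """Attacks assigned to more than one layer, which strain a strict stratification."""
    return sorted(a for a, layers in attack_to_layers.items() if len(set(layers)) > 1)


if __name__ == "__main__":
    # Illustrative assignments only.
    mapping = {
        "vulnerable serving-engine version": ["infrastructure"],
        "malicious agent-skill package": ["protocol/tool"],
        "multi-turn manipulation": ["agent behavior"],
        "jailbreak prompt": ["model"],
    }
    table = coverage_by_layer(mapping)
    for layer in LAYERS:
        print(f"{layer}: {table[layer]}")
    print("uncovered:", uncovered_layers(table))
    print("cross-layer:", cross_layer_attacks(mapping))
```

A table like this would let a reader see at a glance whether any layer lacks mapped attacks and whether any attack resists assignment to a single layer.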
Simulated Author's Rebuttal
We thank the referee for the constructive comments regarding the abstract. We address each major comment below with explanations drawn from the manuscript's design rationale and propose targeted revisions to improve clarity and support for the claims.
Point-by-point responses
- Referee: [Abstract] The core claim that 'the attack surface of an AI agent is stratified across layers ... and no single detection paradigm fits all of them' is presented as a single observation with no supporting argument, no mapping of known attacks to the four layers, no coverage analysis, and no comparison showing why alternatives (e.g., unified LLM-based detection) are inferior. This justification is load-bearing for the framework design and the uniqueness assertion.
  Authors: The stratification is motivated by the distinct technical characteristics of each layer, which necessitate paradigm matching: infrastructure (static code/config across 75+ components) requires deterministic rules for precision and low false positives; protocol/tool (MCP servers and skill packages) requires LLM-driven agentic auditing for dynamic analysis; agent behavior requires multi-turn black-box testing to capture sequential interactions; and model requires a specialized jailbreak harness with 26+ operators. The manuscript maps representative attacks to layers in the layer-specific sections and argues that a unified LLM paradigm would lack the determinism needed for infrastructure and the coverage for supply-chain risks. We will revise the abstract to include a concise sentence referencing these distinctions.
  Revision: yes
- Referee: [Abstract] The uniqueness claim ('To our knowledge it is the only open-source framework to span all of these, including supply-chain auditing') is made without citation context, comparison to prior red-teaming tools, or evidence, even though it directly supports the central contribution statement.
  Authors: The claim rests on the observation that prior open-source tools address at most one or two layers without supply-chain auditing of agent skills. While 'to our knowledge' is standard phrasing, we agree that explicit context would strengthen the statement. We will revise the abstract to reference the scope of existing tools and expand the related-work discussion with comparisons.
  Revision: partial
Circularity Check
No circularity: framework description with asserted organizing principle, not derived result
The paper is a framework presentation that organizes red teaming around a stated observation about four-layer stratification of the attack surface. No equations, fitted parameters, predictions, or first-principles derivations exist that could reduce to inputs by construction. The central claim is an organizing assertion rather than a result obtained from the framework itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. This matches the default expectation of no significant circularity for descriptive framework papers.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The attack surface of an AI agent is stratified across infrastructure, protocol/tool, agent behavior, and model layers.
A. Contributions and Acknowledgments
AI-Infra-Guard is developed by Tencent Zhuque Lab. Table 7 lists the core members and their contributions. An asterisk (∗) denotes a mem...