ADR: An Agentic Detection System for Enterprise Agentic AI Security

Baris Ozbas; Caroline Van; Chenning Li; Justin Xu; KK Sriramadhesikan; Manxue Li; Ming Zhang; Mohammad Alizadeh; Olivia Liu; Pan Hu

arxiv: 2605.17380 · v1 · pith:ZPHME2XNnew · submitted 2026-05-17 · 💻 cs.AI · cs.CR· cs.LG

ADR: An Agentic Detection System for Enterprise Agentic AI Security

Chenning Li , Pan Hu , Justin Xu , Baris Ozbas , Olivia Liu , Caroline Van , Manxue Li , Wei Zhou

show 4 more authors

Mohammad Alizadeh Pengyu Zhang KK Sriramadhesikan Ming Zhang

This is my paper

Pith reviewed 2026-05-20 13:14 UTC · model grok-4.3

classification 💻 cs.AI cs.CRcs.LG

keywords agentic AI securityModel Context ProtocolAI agent detectionenterprise securityprompt injectionred teamingcredential exposuretwo-tier detection

0 comments

The pith

ADR detects attacks on enterprise AI agents by capturing reasoning and causal chains that standard tools miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ADR as a production system for securing AI agents that operate through the Model Context Protocol. It targets three problems: existing tools cannot see the prompts, reasoning steps, or intent-to-action links inside agents; static rules fail to handle new attack methods; and running large models on every action costs too much at scale. ADR solves this with a sensor that records detailed agent telemetry, an explorer that runs systematic red teaming to create hard test cases, and a two-tier detector that uses fast checks followed by deeper context-aware analysis when needed. Real deployment data and new benchmarks show the system finds credential exposures and other attacks while keeping false positives low.

Core claim

ADR combines high-fidelity agentic telemetry from the Sensor with systematic pre-deployment red teaming in the Explorer and scalable two-tier detection in the Detector to address observability gaps, lack of robustness, and high inference costs, delivering reliable security for MCP-based AI agents as shown by sustained production use and benchmark results.

What carries the argument

The ADR Sensor, which records high-fidelity agentic telemetry consisting of prompts, reasoning steps, and causal chains that connect agent intent to execution actions.

If this is right

Enterprises gain the ability to add a shift-left prevention layer that catches issues before deployment with measured precision.
Detection generalizes across attack techniques and enterprise contexts instead of relying on pre-defined static rules.
Large-scale operation becomes feasible because the two-tier detector avoids running expensive inference on every session.
Hundreds of credential exposures across multiple categories can be uncovered in live environments.
A new benchmark set enables consistent comparison of agent security methods beyond the three baselines tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same telemetry-plus-red-teaming pattern could apply to agent systems that use protocols other than MCP.
ADR-Bench could become a standard test suite for measuring how well security tools generalize to new agent behaviors.
Early integration of such detection during agent development might reduce the volume of issues that reach production.
The production metrics suggest that observability improvements can drive measurable reductions in credential-related risks at company scale.

Load-bearing premise

The telemetry gathered by the Sensor contains enough detail on agent reasoning and intent-to-action links for the Explorer and Detector to identify attacks across different techniques without missing key connections.

What would settle it

An attack that reaches execution on a deployed AI agent while producing no detectable signal in the recorded prompts, reasoning steps, or causal chains, resulting in a missed detection by the ADR system.

Figures

Figures reproduced from arXiv: 2605.17380 by Baris Ozbas, Caroline Van, Chenning Li, Justin Xu, KK Sriramadhesikan, Manxue Li, Ming Zhang, Mohammad Alizadeh, Olivia Liu, Pan Hu, Pengyu Zhang, Wei Zhou.

**Figure 1.** Figure 1: Enterprise Demands and Our Contributions for securing agentic AI. (1) Enterprise Observability via the ADR Sensor, (2) Reliable Detection via the online ADR Detector and offline ADR Explorer, and (3) Enterprise Benchmarking via ADR-Bench. outcomes (e.g., file writes or API calls), the ADR Sensor records the full causal chain of prompts, reasoning steps, tool invocations, and environmental context, thus cl… view at source ↗

**Figure 3.** Figure 3: Detailed ADR System Architecture. Left: ADR Sensor deployment showing telemetry collection from enterprise endpoints including MCP server interactions, tool executions, and environmental context. Center: Two-tier online detection pipeline with Tier 1 triage processing high-volume event streams and Tier 2 analysis performing deep contextual reasoning on flagged events. Right: Offline EAS engine showing th… view at source ↗

**Figure 4.** Figure 4: ADR Sensor Architecture. The sensor parses local caches of agentic tools (Cursor, Cline, Claude Code) to reconstruct complete agent sessions, capturing user prompts, agent reasoning, MCP tool calls, and environmental context. Telemetry is forwarded to backend systems for detection analysis. Our approach. We address these challenges with three design principles ( [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Benchmark Composition and Characteristics. (a) MCP servers span 14 categories with 81.2% benign and 18.8% malicious servers. (b) Benign servers provide more tools (median: 7) than malicious ones (median: 3). (c) Tasks invoke an average of 28.5 MCP tool calls, demonstrating realistic agentic workflows. (d) Malicious tasks cover 5 threat tactics with emphasis on Initial Access & Execution. along with convers… view at source ↗

**Figure 6.** Figure 6: Overall Performance on ADR-Bench. (a) Detection rates by threat tactic show ADR’s strong coverage across attack types. (b) Latency CDF demonstrates consistently lower tail latency. (c) Cost–Recall trade-off illustrates flexible operating points via triage. (d) Confusion matrix metrics highlight ADR’s balanced precision and recall. 6 REAL-WORLD DEPLOYMENT We deployed ADR across corporate MacBook endpoints (… view at source ↗

**Figure 7.** Figure 7: Ablation study results. (a) 40.7% of tasks are handled by triage alone at minimal cost. (b) Cost and latency scale. (c) Removing triage introduces false positives and costs more. (d) Removing source code and threat intel MCP servers drops recall. advertently shared outside the enterprise network, posing significant security risks. In response, we initiated credential rotation with the owning teams and dep… view at source ↗

**Figure 8.** Figure 8: Daily volume of MCP host sessions across enterprise endpoints [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Attack graph of the Agent Flayer incident illustrating indirect prompt injection via Jira–Cursor integration. 7 RELATED WORKS Benchmarks for Agent Security. Several benchmarks have been developed for agentic AI security including ToolEmu (Ruan et al., 2023), AgentDojo (Debenedetti et al., 2024), AgentHarm (Andriushchenko et al., 2024), AgentSafetyBench (Zhang et al., 2024b), and AgentSecurityBench (Zhang … view at source ↗

read the original abstract

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ADR gives a concrete deployed system for agentic AI security with Uber-scale numbers and a new benchmark, but the production precision claims rest on opaque internal labeling that isn't explained.

read the letter

The core contribution is a working production system called ADR that adds agent-specific telemetry to catch issues standard EDR tools miss. It combines a sensor for prompts and reasoning chains, an explorer for generating hard test cases, and a two-tier detector that keeps costs down. They ran it at Uber for ten months across thousands of hosts and daily sessions, reporting hundreds of credential exposures found and a shift-left prevention layer at 97.2% precision. They also release ADR-Bench with 302 tasks across 17 techniques and show it beating three baselines on F1 while keeping zero false positives on their set and low alarms on AgentDojo.

Referee Report

3 major / 2 minor

Summary. The paper presents the Agentic AI Detection and Response (ADR) system as the first large-scale production-proven framework for securing AI agents via the Model Context Protocol (MCP). It identifies challenges in observability, robustness, and detection cost, addressed through three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for red teaming and hard-example generation, and the ADR Detector for two-tier scalable detection. The central claims include a ten-month deployment at Uber reaching 7,200 hosts and 10,000 daily sessions, detection of hundreds of credential exposures across 26 categories with 97.2% precision (206 credentials), and introduction of the ADR-Bench benchmark (302 tasks, 17 techniques, 133 MCP servers) where ADR achieves zero false positives and 67% attack detection, outperforming baselines by 2-4x in F1-score; it also reports perfect detection on AgentDojo with only three false alarms.

Significance. If the production and benchmark results hold after methodological clarification, this would represent a significant contribution as the first detailed enterprise-scale system for agentic AI security, directly addressing EDR limitations in observing reasoning and causal chains. The release of ADR-Bench and the reported real-world metrics (including shift-left prevention) provide concrete, deployable advances that could influence both research and industry practices in AI agent safety.

major comments (3)

Production Evaluation section: the headline claims of 97.2% precision, 206 detected credentials, and reliable detection over ten months with 7,200 hosts and 10,000 daily sessions rest on undisclosed ground-truth labeling, sampling strategy, and false-negative estimation methods in the live Uber environment; without these details the metrics cannot be independently assessed as reflecting the Sensor-Explorer-Detector pipeline rather than Uber-specific logging or post-hoc filtering.
ADR-Bench Evaluation: the 302-task benchmark reports zero false positives and 67% detection (outperforming ALRPHFS, GuardAgent, LlamaFirewall by 2-4x F1), yet provides no description of task construction, labeling process, or controls for selection effects, leaving the generalization claims across 17 techniques and 133 MCP servers only partially supported.
System Design, ADR Sensor subsection: the assumption that the Sensor captures sufficient high-fidelity telemetry (prompts, reasoning, causal chains) to enable downstream generalization is load-bearing for the robustness claim but is not validated by ablation studies or explicit evidence that missing intent-to-execution links do not cause critical detection failures.

minor comments (2)

The abstract and evaluation sections should explicitly state whether the zero false positives on ADR-Bench and the three false alarms on AgentDojo refer to held-out test data or include training-set results.
Figure and table captions in the evaluation sections could be expanded to include exact definitions of precision, F1, and how false positives were measured in both production and benchmark settings.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed review. The comments identify key areas where additional methodological transparency would strengthen the manuscript. We address each major comment below and indicate the revisions planned for the next version.

read point-by-point responses

Referee: Production Evaluation section: the headline claims of 97.2% precision, 206 detected credentials, and reliable detection over ten months with 7,200 hosts and 10,000 daily sessions rest on undisclosed ground-truth labeling, sampling strategy, and false-negative estimation methods in the live Uber environment; without these details the metrics cannot be independently assessed as reflecting the Sensor-Explorer-Detector pipeline rather than Uber-specific logging or post-hoc filtering.

Authors: We agree that further detail on the production evaluation methodology is warranted. In the revised manuscript we will expand the Production Evaluation section with a description of the ground-truth process (automated pattern matching followed by security-analyst review), the representative sampling approach across sessions, and the periodic red-teaming protocol used to bound false-negative rates. Certain enterprise-specific implementation thresholds will remain summarized for confidentiality reasons, but the core labeling and estimation procedures will be made explicit so readers can assess the contribution of the ADR pipeline. revision: yes
Referee: ADR-Bench Evaluation: the 302-task benchmark reports zero false positives and 67% detection (outperforming ALRPHFS, GuardAgent, LlamaFirewall by 2-4x F1), yet provides no description of task construction, labeling process, or controls for selection effects, leaving the generalization claims across 17 techniques and 133 MCP servers only partially supported.

Authors: We accept that the benchmark construction details require elaboration. The revised version will add a dedicated subsection describing how the 302 tasks were assembled from real MCP interactions and synthetic attacks spanning the 17 techniques, the expert labeling protocol employed, and the diversity-sampling controls used to ensure coverage across the 133 MCP servers. These additions will directly support the generalization claims. revision: yes
Referee: System Design, ADR Sensor subsection: the assumption that the Sensor captures sufficient high-fidelity telemetry (prompts, reasoning, causal chains) to enable downstream generalization is load-bearing for the robustness claim but is not validated by ablation studies or explicit evidence that missing intent-to-execution links do not cause critical detection failures.

Authors: The Sensor is engineered to record prompts, reasoning traces, and execution events through the MCP interface. While the original submission did not include dedicated ablations, we will add an ablation study in the revised manuscript quantifying detection performance as a function of telemetry completeness. The two-tier Detector and Explorer components are explicitly designed to compensate for any residual gaps; production results over ten months showed no critical failures traceable to missing causal links. We will clarify this mitigation strategy in the System Design section. revision: partial

standing simulated objections not resolved

Certain quantitative parameters of the live Uber false-negative estimation protocol cannot be disclosed in full detail without revealing proprietary operational security practices.

Circularity Check

0 steps flagged

No circularity: empirical system performance and benchmark results do not reduce to self-defined inputs or fitted parameters

full rationale

The paper introduces the ADR system with Sensor, Explorer, and Detector components to address observability, robustness, and cost challenges in securing AI agents. It reports production metrics from a ten-month Uber deployment (7,200 hosts, 10k daily sessions, 97.2% precision, 206 credentials) and benchmark results on ADR-Bench (zero false positives, 67% attack detection) and AgentDojo (all attacks detected with three false alarms). These are presented as direct empirical outcomes from deployment logs and task evaluations, not as predictions derived from internal equations, parameter fits renamed as forecasts, or self-citation chains. No load-bearing steps in the abstract or description equate outputs to inputs by construction, such as defining a metric in terms of itself or smuggling an ansatz via prior self-work. The derivation chain is self-contained against external benchmarks and real-world logs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on domain assumptions about the observability provided by MCP and the coverage of the 17 attack techniques used in red teaming; no explicit free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption The Model Context Protocol exposes agent reasoning, prompts, and causal chains at a level sufficient for high-fidelity detection.
Invoked when describing the ADR Sensor as addressing limited observability.
domain assumption Systematic pre-deployment red teaming with 17 techniques generates representative hard examples that generalize to production attacks.
Underlies the ADR Explorer component and the claim of robustness.

pith-pipeline@v0.9.0 · 5884 in / 1236 out tokens · 49198 ms · 2026-05-20T13:14:09.019367+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-tier online detection pipeline with Tier 1 triage ... Tier 2 analysis performing deep contextual reasoning
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

evolutionary algorithm ... fitness function F=ε×σ×τ^α

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 10 internal anchors

[1]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Agent-safetybench: Evaluating the safety of llm agents , author=. arXiv preprint arXiv:2412.14470 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

2025 , eprint=

Systematic Analysis of MCP Security , author=. 2025 , eprint=

work page 2025
[3]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Identifying the risks of lm agents with an lm-emulated sandbox , author=. arXiv preprint arXiv:2309.15817 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Advances in Neural Information Processing Systems , volume=

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=

work page
[5]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents , author=. arXiv preprint arXiv:2410.02644 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

2025 , eprint=

Simplified and Secure MCP Gateways for Enterprise AI Integration , author=. 2025 , eprint=

work page 2025
[7]

Beyond the protocol: Unveiling attack vectors in the model context protocol ecosystem.arXiv preprint arXiv:2506.02040, 2025

Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem , author=. arXiv preprint arXiv:2506.02040 , year=

work page arXiv
[8]

arXiv preprint arXiv:2506.15253 , year=

RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments , author=. arXiv preprint arXiv:2506.15253 , year=

work page arXiv
[9]

2025 , eprint=

MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System , author=. 2025 , eprint=

work page 2025
[10]

2025 , eprint=

MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits , author=. 2025 , eprint=

work page 2025
[11]

2025 , eprint=

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions , author=. 2025 , eprint=

work page 2025
[12]

ACM Computing Surveys , volume=

MITRE ATT&CK: State of the art and way forward , author=. ACM Computing Surveys , volume=. 2024 , publisher=

work page 2024
[13]

2025 , howpublished =

work page 2025
[14]

2025 , howpublished =

AgentFlayer: When a Jira Ticket Can Steal Your Secrets , author =. 2025 , howpublished =

work page 2025
[15]

2025 , howpublished =

GitHub MCP Exploited: Accessing private repositories via MCP , author =. 2025 , howpublished =

work page 2025
[16]

2025 , howpublished =

WhatsApp MCP Exploited: Exfiltrating your message history via MCP , author =. 2025 , howpublished =

work page 2025
[17]

2025 , howpublished =

Invariant Labs Exposes Novel Prompt Injection Attack Vulnerabilities, ``Toxic Flows,'' in Agentic Systems & MCP Servers , author =. 2025 , howpublished =

work page 2025
[18]

2025 , howpublished =

Why a Classic MCP Server Vulnerability Can Undermine Your Entire AI Agent , author =. 2025 , howpublished =

work page 2025
[19]

2025 , howpublished =

Critical RCE Vulnerability in mcp-remote: CVE-2025-6514 Threatens LLM Clients , author =. 2025 , howpublished =

work page 2025
[20]

2025 , howpublished =

MCP Security Notification: Tool Poisoning Attacks , author =. 2025 , howpublished =

work page 2025
[21]

2025 , howpublished =

The Security Risks of Model Context Protocol (MCP) , author =. 2025 , howpublished =

work page 2025
[22]

2025 , howpublished =

Deep Dive MCP and A2A Attack Vectors for AI Agents , author =. 2025 , howpublished =

work page 2025
[23]

2025 , howpublished =

Is your AI safe? Threat analysis of MCP (Model Context Protocol) , author =. 2025 , howpublished =

work page 2025
[24]

2025 , howpublished =

Cato CTRL. 2025 , howpublished =

work page 2025
[25]

2025 , howpublished =

Plug, Play, and Prey: The security risks of the Model Context Protocol , author =. 2025 , howpublished =

work page 2025
[26]

2025 , howpublished =

OWASP LLM Risk 06:2025 -- Excessive Agency , author =. 2025 , howpublished =

work page 2025
[27]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. arXiv preprint arXiv:2401.05566 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents , author=. arXiv preprint arXiv:2410.09024 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Mcp-guard: A defense framework for model context protocol integrity in large language model applications.arXiv preprint arXiv:2508.10991, 2025

MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications , author=. arXiv preprint arXiv:2508.10991 , year=

work page arXiv
[30]

arXiv preprint arXiv:2505.03574 , year=

Llamafirewall: An open source guardrail system for building secure ai agents , author=. arXiv preprint arXiv:2505.03574 , year=

work page arXiv
[31]

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning , author=. arXiv preprint arXiv:2406.09187 , year=

work page internal anchor Pith review arXiv
[32]

arXiv preprint arXiv:2502.11448 , year=

Agrail: A lifelong agent guardrail with effective and adaptive safety detection , author=. arXiv preprint arXiv:2502.11448 , year=

work page arXiv
[33]

arXiv preprint arXiv:2503.22738 , year=

Shieldagent: Shielding agents via verifiable safety policy reasoning , author=. arXiv preprint arXiv:2503.22738 , year=

work page arXiv
[34]

arXiv preprint arXiv:2505.19260 , year=

ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense , author=. arXiv preprint arXiv:2505.19260 , year=

work page arXiv
[35]

arXiv preprint arXiv:2407.16667 , year=

Redagent: Red teaming large language models with context-aware autonomous language agent , author=. arXiv preprint arXiv:2407.16667 , year=

work page arXiv
[36]

Advances in Neural Information Processing Systems , volume=

Ali-agent: Assessing llms' alignment with human values via agent-based evaluation , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

Advances in Neural Information Processing Systems , volume=

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[38]

Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms.arXiv preprint arXiv:2410.05295, 2024

Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms , author=. arXiv preprint arXiv:2410.05295 , year=

work page arXiv
[39]

arXiv preprint arXiv:2503.15754 , year=

Autoredteamer: Autonomous red teaming with lifelong attack integration , author=. arXiv preprint arXiv:2503.15754 , year=

work page arXiv
[40]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Jailbreaking chatgpt via prompt engineering: An empirical study , author=. arXiv preprint arXiv:2305.13860 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Applied Sciences , volume=

Open sesame! universal black-box jailbreaking of large language models , author=. Applied Sciences , volume=. 2024 , publisher=

work page 2024
[42]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Advances in Neural Information Processing Systems , volume=

When llm meets drl: Advancing jailbreaking efficiency via drl-guided search , author=. Advances in Neural Information Processing Systems , volume=

work page
[44]

N e M o Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

Rebedea, Traian and Dinu, Razvan and Sreedhar, Makesh Narsimhan and Parisien, Christopher and Cohen, Jonathan. N e M o Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. doi:10.18653/v1/2023.emnlp-demo.40

work page doi:10.18653/v1/2023.emnlp-demo.40 2023
[45]

2025 , howpublished =

Amazon Bedrock Guardrails , author =. 2025 , howpublished =

work page 2025
[46]

2025 , howpublished =

Guardrails API , author =. 2025 , howpublished =

work page 2025
[47]

2025 , howpublished =

AI-Infra-Guard , author =. 2025 , howpublished =

work page 2025
[48]

Constitutional AI: Harmlessness from AI Feedback

Constitutional ai: Harmlessness from ai feedback , author=. arXiv preprint arXiv:2212.08073 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[49]

Securing AI Agents with Information-Flow Control

Securing AI Agents with Information-Flow Control , author=. arXiv preprint arXiv:2505.23643 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

2025 , howpublished =

ALRPHFS codebase , author =. 2025 , howpublished =

work page 2025
[51]

2025 , howpublished =

GuardAgent codebase , author =. 2025 , howpublished =

work page 2025
[52]

2025 , howpublished =

LlamaFirewall codebase , author =. 2025 , howpublished =

work page 2025
[53]

2025 , howpublished =

MCP.so Marketplace , author =. 2025 , howpublished =

work page 2025
[54]

2025 , howpublished =

Model Context Protocol Repositories , author =. 2025 , howpublished =

work page 2025

[1] [1]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Agent-safetybench: Evaluating the safety of llm agents , author=. arXiv preprint arXiv:2412.14470 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

2025 , eprint=

Systematic Analysis of MCP Security , author=. 2025 , eprint=

work page 2025

[3] [3]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Identifying the risks of lm agents with an lm-emulated sandbox , author=. arXiv preprint arXiv:2309.15817 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Advances in Neural Information Processing Systems , volume=

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=

work page

[5] [5]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents , author=. arXiv preprint arXiv:2410.02644 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

2025 , eprint=

Simplified and Secure MCP Gateways for Enterprise AI Integration , author=. 2025 , eprint=

work page 2025

[7] [7]

Beyond the protocol: Unveiling attack vectors in the model context protocol ecosystem.arXiv preprint arXiv:2506.02040, 2025

Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem , author=. arXiv preprint arXiv:2506.02040 , year=

work page arXiv

[8] [8]

arXiv preprint arXiv:2506.15253 , year=

RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments , author=. arXiv preprint arXiv:2506.15253 , year=

work page arXiv

[9] [9]

2025 , eprint=

MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System , author=. 2025 , eprint=

work page 2025

[10] [10]

2025 , eprint=

MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits , author=. 2025 , eprint=

work page 2025

[11] [11]

2025 , eprint=

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions , author=. 2025 , eprint=

work page 2025

[12] [12]

ACM Computing Surveys , volume=

MITRE ATT&CK: State of the art and way forward , author=. ACM Computing Surveys , volume=. 2024 , publisher=

work page 2024

[13] [13]

2025 , howpublished =

work page 2025

[14] [14]

2025 , howpublished =

AgentFlayer: When a Jira Ticket Can Steal Your Secrets , author =. 2025 , howpublished =

work page 2025

[15] [15]

2025 , howpublished =

GitHub MCP Exploited: Accessing private repositories via MCP , author =. 2025 , howpublished =

work page 2025

[16] [16]

2025 , howpublished =

WhatsApp MCP Exploited: Exfiltrating your message history via MCP , author =. 2025 , howpublished =

work page 2025

[17] [17]

2025 , howpublished =

Invariant Labs Exposes Novel Prompt Injection Attack Vulnerabilities, ``Toxic Flows,'' in Agentic Systems & MCP Servers , author =. 2025 , howpublished =

work page 2025

[18] [18]

2025 , howpublished =

Why a Classic MCP Server Vulnerability Can Undermine Your Entire AI Agent , author =. 2025 , howpublished =

work page 2025

[19] [19]

2025 , howpublished =

Critical RCE Vulnerability in mcp-remote: CVE-2025-6514 Threatens LLM Clients , author =. 2025 , howpublished =

work page 2025

[20] [20]

2025 , howpublished =

MCP Security Notification: Tool Poisoning Attacks , author =. 2025 , howpublished =

work page 2025

[21] [21]

2025 , howpublished =

The Security Risks of Model Context Protocol (MCP) , author =. 2025 , howpublished =

work page 2025

[22] [22]

2025 , howpublished =

Deep Dive MCP and A2A Attack Vectors for AI Agents , author =. 2025 , howpublished =

work page 2025

[23] [23]

2025 , howpublished =

Is your AI safe? Threat analysis of MCP (Model Context Protocol) , author =. 2025 , howpublished =

work page 2025

[24] [24]

2025 , howpublished =

Cato CTRL. 2025 , howpublished =

work page 2025

[25] [25]

2025 , howpublished =

Plug, Play, and Prey: The security risks of the Model Context Protocol , author =. 2025 , howpublished =

work page 2025

[26] [26]

2025 , howpublished =

OWASP LLM Risk 06:2025 -- Excessive Agency , author =. 2025 , howpublished =

work page 2025

[27] [27]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. arXiv preprint arXiv:2401.05566 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents , author=. arXiv preprint arXiv:2410.09024 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Mcp-guard: A defense framework for model context protocol integrity in large language model applications.arXiv preprint arXiv:2508.10991, 2025

MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications , author=. arXiv preprint arXiv:2508.10991 , year=

work page arXiv

[30] [30]

arXiv preprint arXiv:2505.03574 , year=

Llamafirewall: An open source guardrail system for building secure ai agents , author=. arXiv preprint arXiv:2505.03574 , year=

work page arXiv

[31] [31]

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning , author=. arXiv preprint arXiv:2406.09187 , year=

work page internal anchor Pith review arXiv

[32] [32]

arXiv preprint arXiv:2502.11448 , year=

Agrail: A lifelong agent guardrail with effective and adaptive safety detection , author=. arXiv preprint arXiv:2502.11448 , year=

work page arXiv

[33] [33]

arXiv preprint arXiv:2503.22738 , year=

Shieldagent: Shielding agents via verifiable safety policy reasoning , author=. arXiv preprint arXiv:2503.22738 , year=

work page arXiv

[34] [34]

arXiv preprint arXiv:2505.19260 , year=

ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense , author=. arXiv preprint arXiv:2505.19260 , year=

work page arXiv

[35] [35]

arXiv preprint arXiv:2407.16667 , year=

Redagent: Red teaming large language models with context-aware autonomous language agent , author=. arXiv preprint arXiv:2407.16667 , year=

work page arXiv

[36] [36]

Advances in Neural Information Processing Systems , volume=

Ali-agent: Assessing llms' alignment with human values via agent-based evaluation , author=. Advances in Neural Information Processing Systems , volume=

work page

[37] [37]

Advances in Neural Information Processing Systems , volume=

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[38] [38]

Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms.arXiv preprint arXiv:2410.05295, 2024

Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms , author=. arXiv preprint arXiv:2410.05295 , year=

work page arXiv

[39] [39]

arXiv preprint arXiv:2503.15754 , year=

Autoredteamer: Autonomous red teaming with lifelong attack integration , author=. arXiv preprint arXiv:2503.15754 , year=

work page arXiv

[40] [40]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Jailbreaking chatgpt via prompt engineering: An empirical study , author=. arXiv preprint arXiv:2305.13860 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Applied Sciences , volume=

Open sesame! universal black-box jailbreaking of large language models , author=. Applied Sciences , volume=. 2024 , publisher=

work page 2024

[42] [42]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Advances in Neural Information Processing Systems , volume=

When llm meets drl: Advancing jailbreaking efficiency via drl-guided search , author=. Advances in Neural Information Processing Systems , volume=

work page

[44] [44]

N e M o Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

Rebedea, Traian and Dinu, Razvan and Sreedhar, Makesh Narsimhan and Parisien, Christopher and Cohen, Jonathan. N e M o Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. doi:10.18653/v1/2023.emnlp-demo.40

work page doi:10.18653/v1/2023.emnlp-demo.40 2023

[45] [45]

2025 , howpublished =

Amazon Bedrock Guardrails , author =. 2025 , howpublished =

work page 2025

[46] [46]

2025 , howpublished =

Guardrails API , author =. 2025 , howpublished =

work page 2025

[47] [47]

2025 , howpublished =

AI-Infra-Guard , author =. 2025 , howpublished =

work page 2025

[48] [48]

Constitutional AI: Harmlessness from AI Feedback

Constitutional ai: Harmlessness from ai feedback , author=. arXiv preprint arXiv:2212.08073 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[49] [49]

Securing AI Agents with Information-Flow Control

Securing AI Agents with Information-Flow Control , author=. arXiv preprint arXiv:2505.23643 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

2025 , howpublished =

ALRPHFS codebase , author =. 2025 , howpublished =

work page 2025

[51] [51]

2025 , howpublished =

GuardAgent codebase , author =. 2025 , howpublished =

work page 2025

[52] [52]

2025 , howpublished =

LlamaFirewall codebase , author =. 2025 , howpublished =

work page 2025

[53] [53]

2025 , howpublished =

MCP.so Marketplace , author =. 2025 , howpublished =

work page 2025

[54] [54]

2025 , howpublished =

Model Context Protocol Repositories , author =. 2025 , howpublished =

work page 2025