Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Mixed citation behavior. Most common role is background (62%).
abstract
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.
representative citing papers
Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
citing papers explorer
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents
AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.
-
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails
Attackers can force LLM guardrails into extended reasoning loops via optimized payloads, causing 13-63x token amplification and up to 148x latency in agent systems.
-
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
Introduces a stakeholder-centric benchmark showing current web agents fail all tested prompt injection objectives, with failures falling into stealthy parasitism, misaligned disruption, or compounded failure modes.
-
SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems
SMSR is the first defense with a certified robustness bound against multi-session memory poisoning in persistent LLM agents, combining HMAC provenance signing with randomized ablation and verdict-based voting.
-
Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents
Introduces brain-prompt injection attacks on BCI-to-LLM agent pipelines and defines a Route-Safety Audit Contract with separation theorem, C3 decomposition, and split-conformal calibration to bound false-accept rates on EEG data.
-
Towards Demystifying and Repairing LLM-in-the-Loop Vulnerabilities
Authors create LLMCVE dataset of LLM-in-the-loop vulnerabilities and demonstrate that agent-based repair methods achieve low success rates on them, particularly prompt injections at 28.57% Pass@1.
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
-
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
-
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.
-
Toward a Principled Framework for Agent Safety Measurement
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
-
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Formal Policy Enforcement for Real-World Agentic Systems
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
-
Understanding and Evaluating Claw-like Agent Security Through a Computer-Systems Lens
The paper introduces SafeClawArena, a 406-task benchmark evaluating security failures in three Claw-like agent platforms across skill supply-chain, state exploitation, data flow, and prompt injection surfaces.
-
RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems
RIFT-Bench is a graph representation-driven methodology for dynamic red-teaming that enables unified evaluations across diverse agentic AI architectures, demonstrated on 45 systems.
-
Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents
AuthGraph aligns an execution provenance graph with a clean authorization graph to detect parameter-source deviations from user intent, reducing attack success rates to 1-2% on AgentDojo and AgentDyn while retaining most task utility.
-
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.
-
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection
LivePI benchmark reports indirect prompt injection success rates of 10.7-29.6% across five models on seven input surfaces and shows a two-layer defense blocking all malicious completions while preserving utility.
-
Web Agents Should Adopt the Plan-Then-Execute Paradigm
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
-
Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks are insufficient.
-
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
-
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
-
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
-
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
-
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
-
SUDP: Secret-Use Delegation Protocol for Agentic Systems
SUDP is a three-party protocol in which an agent proposes an operation, the user issues a fresh grant, and a custodian executes it, satisfying seven security properties for bounded secret use without reusable authority transfer.
-
Security Considerations for Multi-agent Systems
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
-
Tracking Capabilities for Safer Agents
AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.
-
Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents
A data-centric survey finds that only information-flow control covers compositional and cross-session leakage in LLM agents and that no single benchmark tests an agent across all its data surfaces under one policy.
-
Think Twice Before You Act: Protecting LLM Agents Against Tool Description Poisoning via Isolated Planning
Tool-Guard uses isolated planning to quarantine suspicious tools, reducing tool description poisoning attacks on LLM agents while preserving task utility on AgentDojo and ASB benchmarks.
-
Order Is Not Control
Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.
-
Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs
A survey that maps safety risks in personalized LLMs, introduces a unified taxonomy, and highlights three structural inadequacies in existing research on user-invariant safety, isolated techniques, and short-term evaluations.
-
ADR: An Agentic Detection System for Enterprise Agentic AI Security
ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.
-
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
-
SafeAgent: A Runtime Protection Architecture for Agentic Systems
SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.
-
SoK: Security of Autonomous LLM Agents in Agentic Commerce
The paper systematizes security for LLM agents in agentic commerce into five threat dimensions, identifies 12 cross-layer attack vectors, and proposes a layered defense architecture.
-
DRAFT: Task Decoupled Latent Reasoning for Agent Safety
DRAFT decouples agent safety judgment into latent extraction and reasoning stages, raising average benchmark accuracy from 63.27% to 91.18%.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
-
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
-
AI-Driven Security Alert Screening and Alert Fatigue Mitigation in Security Operations Centers: A Comprehensive Survey
A literature survey synthesizes 119 studies on AI-driven alert screening into a four-stage taxonomy of filtering, triage, correlation, and generative augmentation while identifying gaps in deployment realism and robustness.