hub Canonical reference

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

· 2024 · cs.CR · arXiv 2406.13352

Canonical reference. 71% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner.. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 dataset 3 method 2 baseline 1

citation-polarity summary

background 12 use dataset 2 baseline 1 extend 1 use method 1

representative citing papers

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Do Coding Agents Understand Least-Privilege Authorization?

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

cs.CR · 2026-04-20 · unverdicted · novelty 7.0 · 2 refs

The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

cs.LG · 2025-10-10 · conditional · novelty 7.0

Adaptive attackers using optimization techniques bypass 12 recent LLM defenses with >90% success, showing that prior robustness claims relied on weak evaluations.

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.

Securing LLM Agents Need Intent-to-Execution Integrity

cs.CR · 2026-05-16 · conditional · novelty 6.0

The paper defines intent-to-execution integrity as the conjunction of Tool Integrity, Instruction Integrity, Judgment Integrity, and Data Flow Integrity, arguing that existing LLM agent defenses provide only partial coverage of these properties.

Web Agents Should Adopt the Plan-Then-Execute Paradigm

cs.CR · 2026-05-14 · unverdicted · novelty 6.0

Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

cs.CR · 2026-05-13 · conditional · novelty 6.0

AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

cs.CR · 2026-05-13 · conditional · novelty 6.0

Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.

SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

cs.CR · 2026-05-07 · unverdicted · novelty 6.0

SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.

LoopTrap: Termination Poisoning Attacks on LLM Agents

cs.CR · 2026-05-07 · unverdicted · novelty 6.0

LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

cs.CR · 2026-05-03 · unverdicted · novelty 6.0

The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.

Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

cs.CR · 2026-05-01 · unverdicted · novelty 6.0

Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.

Alignment Contracts for Agentic Security Systems

cs.CR · 2026-04-30 · conditional · novelty 6.0

Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.

An AI Agent Execution Environment to Safeguard User Data

cs.CR · 2026-04-21 · unverdicted · novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

cs.CR · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

Arbiter-K is a governance-first architecture that turns probabilistic agent reasoning into discrete instructions with runtime taint propagation to block unsafe actions, reporting 76-95% interception rates and a 92.79% gain over baseline policies on two test systems.

citing papers explorer

Showing 35 of 35 citing papers.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 14 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 9 · internal anchor
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
Do Coding Agents Understand Least-Privilege Authorization? cs.CR · 2026-05-14 · unverdicted · none · ref 33 · internal anchor
Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection cs.CR · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates cs.AI · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 141 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection cs.CR · 2026-04-20 · unverdicted · none · ref 5 · 2 links · internal anchor
The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections cs.LG · 2025-10-10 · conditional · none · ref 1 · internal anchor
Adaptive attackers using optimization techniques bypass 12 recent LLM defenses with >90% success, showing that prior robustness claims relied on weak evaluations.
An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments cs.CR · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.
Securing LLM Agents Need Intent-to-Execution Integrity cs.CR · 2026-05-16 · conditional · none · ref 27 · internal anchor
The paper defines intent-to-execution integrity as the conjunction of Tool Integrity, Instruction Integrity, Judgment Integrity, and Data Flow Integrity, arguing that existing LLM agent defenses provide only partial coverage of these properties.
Web Agents Should Adopt the Plan-Then-Execute Paradigm cs.CR · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills cs.CR · 2026-05-13 · conditional · none · ref 3 · internal anchor
AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents cs.CR · 2026-05-13 · conditional · none · ref 4 · internal anchor
Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation cs.AI · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents cs.CR · 2026-05-10 · unverdicted · none · ref 13 · internal anchor
AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills cs.CR · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.
LoopTrap: Termination Poisoning Attacks on LLM Agents cs.CR · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration cs.CR · 2026-05-03 · unverdicted · none · ref 15 · internal anchor
The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis cs.CR · 2026-05-01 · unverdicted · none · ref 7 · internal anchor
Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
Alignment Contracts for Agentic Security Systems cs.CR · 2026-04-30 · conditional · full · ref 16 · internal anchor
Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents cs.CR · 2026-04-24 · unverdicted · none · ref 4 · internal anchor
RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 14 · internal anchor
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers cs.CR · 2026-04-20 · unverdicted · none · ref 4 · 2 links · internal anchor
Arbiter-K is a governance-first architecture that turns probabilistic agent reasoning into discrete instructions with runtime taint propagation to block unsafe actions, reporting 76-95% interception rates and a 92.79% gain over baseline policies on two test systems.
Policy-Invisible Violations in LLM-Based Agents cs.AI · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection cs.CR · 2026-04-13 · unverdicted · none · ref 10 · 2 links · internal anchor
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts cs.CR · 2026-04-07 · unverdicted · none · ref 8 · internal anchor
Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw cs.CR · 2026-04-06 · conditional · none · ref 3 · internal anchor
Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
Agent-Sentry: Bounding LLM Agents via Execution Provenance cs.CR · 2026-03-24 · unverdicted · none · ref 7 · internal anchor
Agent-Sentry bounds LLM agent executions via structural provenance classification, sensitive-value allowlists, and selective LLM judgment, blocking 94.3% of injections while allowing 95.1% of benign actions on AgentDojo and AgentDyn.
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI cs.CL · 2026-03-16 · unverdicted · none · ref 2 · internal anchor
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
Peering Behind the Shield: Guardrail Identification in Large Language Models cs.CR · 2025-02-03 · unverdicted · none · ref 13 · internal anchor
AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems cs.MA · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation cs.CR · 2026-05-07 · unverdicted · none · ref 23 · internal anchor
A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing cs.CR · 2026-04-22 · unverdicted · none · ref 37 · internal anchor
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges cs.AI · 2025-10-27 · unverdicted · none · ref 192 · internal anchor
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer