hub

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang · 2024 · cs.CL · arXiv 2403.02691

26 Pith papers cite this work. Polarity classification is still indexing.

26 Pith papers citing it

open full Pith review browse 26 citing papers arXiv PDF

abstract

Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 baseline 2

citation-polarity summary

baseline 2 background 1 support 1

representative citing papers

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

cs.CR · 2026-05-11 · unverdicted · novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.

Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

cs.CR · 2026-04-05 · unverdicted · novelty 7.0

The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

cs.CR · 2026-04-03 · accept · novelty 7.0

Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

cs.CR · 2026-03-30 · conditional · novelty 7.0

Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.

Formal Policy Enforcement for Real-World Agentic Systems

cs.CR · 2026-02-18 · unverdicted · novelty 7.0

FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.

Prompt Injection Attack to Tool Selection in LLM Agents

cs.CR · 2025-04-28 · conditional · novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

Web Agents Should Adopt the Plan-Then-Execute Paradigm

cs.CR · 2026-05-14 · unverdicted · novelty 6.0

Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

cs.CR · 2026-05-01 · unverdicted · novelty 6.0

Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.

How Adversarial Environments Mislead Agentic AI?

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

cs.MA · 2026-04-20 · unverdicted · novelty 6.0

QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

cs.CR · 2026-04-11 · unverdicted · novelty 6.0

PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

cs.CR · 2026-04-06 · conditional · novelty 6.0

Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

cs.LG · 2024-10-11 · accept · novelty 6.0

AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

cs.MA · 2026-05-11 · unverdicted · novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

cs.CR · 2026-05-07 · unverdicted · novelty 5.0

A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

cs.CR · 2026-04-22 · unverdicted · novelty 5.0

Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.

citing papers explorer

Showing 26 of 26 citing papers.

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 77 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 71 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection cs.CR · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck cs.CR · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
Green Shielding: A User-Centric Approach Towards Trustworthy AI cs.CL · 2026-04-27 · unverdicted · none · ref 33 · internal anchor
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 93 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems cs.SE · 2026-04-07 · unverdicted · none · ref 64 · internal anchor
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents cs.CR · 2026-04-05 · unverdicted · none · ref 30 · internal anchor
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study cs.CR · 2026-04-03 · accept · none · ref 63 · internal anchor
Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers cs.CR · 2026-03-30 · conditional · none · ref 5 · internal anchor
Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
Formal Policy Enforcement for Real-World Agentic Systems cs.CR · 2026-02-18 · unverdicted · none · ref 67 · internal anchor
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
Prompt Injection Attack to Tool Selection in LLM Agents cs.CR · 2025-04-28 · conditional · none · ref 63 · internal anchor
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
Web Agents Should Adopt the Plan-Then-Execute Paradigm cs.CR · 2026-05-14 · unverdicted · none · ref 35 · internal anchor
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis cs.CR · 2026-05-01 · unverdicted · none · ref 48 · internal anchor
Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance cs.MA · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification cs.CR · 2026-04-11 · unverdicted · none · ref 10 · internal anchor
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw cs.CR · 2026-04-06 · conditional · none · ref 20 · internal anchor
Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 81 · internal anchor
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents cs.LG · 2024-10-11 · accept · none · ref 33 · internal anchor
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems cs.MA · 2026-05-11 · unverdicted · none · ref 42 · internal anchor
Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation cs.CR · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing cs.CR · 2026-04-22 · unverdicted · none · ref 38 · internal anchor
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.
SafeAgent: A Runtime Protection Architecture for Agentic Systems cs.AI · 2026-04-19 · unverdicted · none · ref 20 · internal anchor
SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.
A Periodic Space of Distributed Computing: Vision & Framework cs.DC · 2026-04-14 · unverdicted · none · ref 84 · internal anchor
A periodic framework is proposed to characterize, compare, and predict behaviors across distributed computing solutions by mapping system properties in a structured space inspired by the chemical periodic table.

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer