Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
hub Mixed citations
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
Mixed citation behavior. Most common role is background (50%).
abstract
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
The paper introduces Consent Integrity as the property that actions shown for approval must be rendered by a trusted mediator from the real boundary action over an unspoofable path and bound to execution, with uninspectable actions surfaced rather than silently approved.
Controlled experiments on GPT-4o-mini and Claude Haiku show indirect prompt injection success in ReAct agents decays sharply with injection depth, varies with payload framing, and remains stable across turn budgets.
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
LangChain, LlamaIndex, and Stripe Agent Toolkit default to capability gating without deterministic per-call value authorization, while the introduced ScopeGate enforces scope, authorization, money ceiling, idempotency, and default deny to block unauthorized actions.
Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.
Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.
Item Response Theory enables adaptive and fixed-subset item selection that reduces safety benchmark costs by 80-99.9% while preserving high correlation with full rankings.
Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
OTora is a two-stage framework that generates insertion-aware adversarial triggers and ICL-guided genetic payloads to induce reasoning-level denial-of-service in tool-augmented LLM agents across multiple backbones while preserving task correctness.
citing papers explorer
-
FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.