hub Mixed citations

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, Emre Kiciman · 2024 · cs.CR · arXiv 2403.14720

Mixed citation behavior. Most common role is background (62%).

42 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 42 citing papers arXiv PDF

abstract

Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 method 1 other 1

citation-polarity summary

background 5 support 1 unclear 1 use method 1

representative citing papers

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows

cs.CR · 2026-05-07 · unverdicted · novelty 8.0

Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving over 0.91 F1 on threat detection in evaluation.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.

AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents

cs.CR · 2026-06-13 · unverdicted · novelty 7.0

AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.

IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

IterInject is a new iterative attack method that uses outcome diagnosis and history-conditioned optimization to create indirect prompt injections, outperforming baselines on AgentDojo and InjectAgent and achieving full success on 5 of 9 targets against a production coding agent.

No More, No Less: Task Alignment in Terminal Agents

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

Toward a Principled Framework for Agent Safety Measurement

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

cs.CR · 2026-04-27 · unverdicted · novelty 7.0

AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

cs.CR · 2026-04-20 · unverdicted · novelty 7.0 · 2 refs

The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.

Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

cs.CR · 2026-04-05 · unverdicted · novelty 7.0

The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

cs.CR · 2026-03-30 · conditional · novelty 7.0

Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.

Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense

cs.CR · 2026-06-29 · unverdicted · novelty 6.0

Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.

Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents

cs.AI · 2026-06-21 · unverdicted · novelty 6.0

Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

PI-Hunter automates red-teaming of LLM agents by generating and iteratively evolving source-aware test cases to induce retrieval of embedded malicious instructions from external environments.

Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs

cs.CR · 2026-06-09 · unverdicted · novelty 6.0

GT-MCP coordinates three LLM agents via a trust function and rollback to bound contextual drift and block adversarial injections in multi-turn interactions.

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

cs.CR · 2026-06-06 · unverdicted · novelty 6.0

POISE is a stealthy skill-poisoning attack achieving 89.3% ASR on Skill-Inject by blending a compressed trigger into contextually appropriate positions in skill bodies, outperforming YAML and random-placement baselines while evading static scanners.

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.

PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.

The Surface You Test Is Not the Surface That Breaks

cs.CR · 2026-05-28 · unverdicted · novelty 6.0

Prompt injection vulnerability in tool-augmented LLMs is a model-surface interaction rather than a fixed channel property; the same payload inverts success rates across models, and adaptive attack rate exceeds single-surface baselines by 9.1 pp on average.

citing papers explorer

Showing 42 of 42 citing papers.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution cs.CR · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts cs.CR · 2026-05-09 · unverdicted · none · ref 43 · 3 links · internal anchor
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows cs.CR · 2026-05-07 · unverdicted · none · ref 25 · internal anchor
Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving over 0.91 F1 on threat detection in evaluation.
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems cs.MA · 2024-10-09 · unverdicted · none · ref 60 · internal anchor
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 19 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA cs.AI · 2026-06-30 · unverdicted · none · ref 14 · internal anchor
Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.
AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents cs.CR · 2026-06-13 · unverdicted · none · ref 5 · internal anchor
AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.
IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization cs.LG · 2026-05-23 · unverdicted · none · ref 2 · internal anchor
IterInject is a new iterative attack method that uses outcome diagnosis and history-conditioned optimization to create indirect prompt injections, outperforming baselines on AgentDojo and InjectAgent and achieving full success on 5 of 9 targets against a production coding agent.
No More, No Less: Task Alignment in Terminal Agents cs.LG · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection cs.CR · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Toward a Principled Framework for Agent Safety Measurement cs.CR · 2026-05-02 · unverdicted · none · ref 8 · internal anchor
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization cs.CR · 2026-04-27 · unverdicted · none · ref 3 · internal anchor
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection cs.CR · 2026-04-20 · unverdicted · none · ref 8 · 2 links · internal anchor
The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents cs.CR · 2026-04-05 · unverdicted · none · ref 13 · internal anchor
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers cs.CR · 2026-03-30 · conditional · none · ref 9 · internal anchor
Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense cs.CR · 2026-06-29 · unverdicted · none · ref 50 · internal anchor
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents cs.AI · 2026-06-21 · unverdicted · none · ref 18 · internal anchor
Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.
PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections cs.CR · 2026-06-10 · unverdicted · none · ref 30 · internal anchor
PI-Hunter automates red-teaming of LLM agents by generating and iteratively evolving source-aware test cases to induce retrieval of embedded malicious instructions from external environments.
Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs cs.CR · 2026-06-09 · unverdicted · none · ref 24 · internal anchor
GT-MCP coordinates three LLM agents via a trust function and rollback to bound contextual drift and block adversarial injections in multi-turn interactions.
POISE: Position-Aware Undetectable Skill Injection on LLM Agents cs.CR · 2026-06-06 · unverdicted · none · ref 38 · internal anchor
POISE is a stealthy skill-poisoning attack achieving 89.3% ASR on Skill-Inject by blending a compressed trigger into contextually appropriate positions in skill bodies, outperforming YAML and random-placement baselines while evading static scanners.
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models cs.AI · 2026-06-05 · unverdicted · none · ref 5 · internal anchor
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents cs.CR · 2026-06-02 · unverdicted · none · ref 7 · internal anchor
Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.
PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations cs.CR · 2026-06-02 · unverdicted · none · ref 8 · internal anchor
PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.
The Surface You Test Is Not the Surface That Breaks cs.CR · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
Prompt injection vulnerability in tool-augmented LLMs is a model-surface interaction rather than a fixed channel property; the same payload inverts success rates across models, and adaptive attack rate exceeds single-surface baselines by 9.1 pp on average.
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection cs.CR · 2026-05-18 · unverdicted · none · ref 28 · 2 links · internal anchor
LivePI benchmark reports indirect prompt injection success rates of 10.7-29.6% across five models on seven input surfaces and shows a two-layer defense blocking all malicious completions while preserving utility.
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents cs.CR · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection cs.CR · 2026-05-05 · unverdicted · none · ref 131 · internal anchor
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration cs.CR · 2026-05-03 · unverdicted · none · ref 34 · 2 links · internal anchor
The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis cs.CR · 2026-05-01 · unverdicted · none · ref 13 · internal anchor
Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing cs.CR · 2026-04-26 · unverdicted · none · ref 4 · internal anchor
Spore extracts private data from LLM memory with one query in black-box mode or ranked tokens in gray-box, outperforming prior attacks while bypassing defenses.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 24 · internal anchor
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 38 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks cs.CL · 2026-04-20 · unverdicted · none · ref 28 · internal anchor
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents cs.CL · 2025-09-26 · conditional · none · ref 3 · internal anchor
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem cs.CR · 2025-06-17 · unverdicted · none · ref 13 · internal anchor
Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction cs.CR · 2025-04-29 · unverdicted · none · ref 16 · internal anchor
The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.
Progent: Securing AI Agents with Privilege Control cs.CR · 2025-04-16 · unverdicted · none · ref 25 · internal anchor
Progent introduces a privilege-control framework for AI agents that uses LLM-generated symbolic rules over tools, SMT-solver-enforced monotonic updates, and deterministic checks to reduce attack success rates on AgentDojo and ASB benchmarks.
FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents cs.CL · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
FinHarness adds inline query and tool monitors plus an adaptive cascade verifier to finance LLM agents, cutting attack success rate on FinVault from 38.3% to 15.0% while keeping benign approvals near 39% and using 4.7x fewer advanced LLM calls.
Evaluation of Prompt Injection Defenses in Large Language Models cs.CR · 2026-04-26 · unverdicted · none · ref 10 · 2 links · internal anchor
Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
Assessing Automated Prompt Injection Attacks in Agentic Environments cs.CR · 2026-06-09 · unverdicted · none · ref 17 · internal anchor
Black-box optimization outperforms gradient-based methods for prompt injection on LLM agents, with success depending on attacker model strength and limited transfer from small to frontier models.
Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs cs.CR · 2026-03-29 · unverdicted · none · ref 10 · 2 links · internal anchor
A domain-specific multi-layer safeguard for educational LLM tutors achieves zero false positives on benign tasks while providing measurable resistance to prompt injection, with explicit trade-offs versus existing guardrails on latency and attack bypass.
Security Considerations for Artificial Intelligence Agents cs.LG · 2026-03-12 · unverdicted · none · ref 17 · internal anchor
Frontier AI agents introduce new confidentiality, integrity, and availability risks through changed assumptions on code-data separation and authority boundaries, requiring layered defenses like sandboxing and policy enforcement.

Defending Against Indirect Prompt Injection Attacks With Spotlighting

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer