Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.
hub Canonical reference
Ignore Previous Prompt: Attack Techniques For Language Models
Canonical reference. 81% of citing Pith papers cite this work as background.
abstract
Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demon
co-cited works
representative citing papers
Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.
JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
ContextLeak is the first empirical framework to audit worst-case information leakage in private in-context learning by inserting identifiable canary tokens and measuring their presence in model outputs.
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
Controlled experiments on GPT-4o-mini and Claude Haiku show indirect prompt injection success in ReAct agents decays sharply with injection depth, varies with payload framing, and remains stable across turn budgets.
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection and serving stability.
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.
Adaptive attackers using optimization techniques bypass 12 recent LLM defenses with >90% success, showing that prior robustness claims relied on weak evaluations.
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
citing papers explorer
No citing papers match the current filters.