hub Canonical reference

Ignore Previous Prompt: Attack Techniques For Language Models

· 2022 · cs.CL · arXiv 2211.09527

Canonical reference. 81% of citing Pith papers cite this work as background.

92 Pith papers citing it

Background 81% of classified citations

open full Pith review browse 92 citing papers arXiv PDF

abstract

Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 23 method 2 baseline 1

citation-polarity summary

background 21 use method 2 baseline 1 support 1 unclear 1

claims ledger

abstract Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demon

co-cited works

representative citing papers

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

cs.CR · 2026-05-27 · unverdicted · novelty 8.0

Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

cs.CR · 2026-04-25 · unverdicted · novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

ContextLeak: Auditing Leakage in Private In-Context Learning Methods

cs.CR · 2025-12-18 · conditional · novelty 8.0

ContextLeak is the first empirical framework to audit worst-case information leakage in private in-context learning by inserting identifiable canary tokens and measuring their presence in model outputs.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

Controlled experiments on GPT-4o-mini and Claude Haiku show indirect prompt injection success in ReAct agents decays sharply with injection depth, varies with payload framing, and remains stable across turn budgets.

Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content

cs.CR · 2026-05-23 · unverdicted · novelty 7.0

Log-substrate prompt injection via attacker-controlled fields enables effective attacks on LLM SOC assistants, with persona hijacks suppressing 68% of malicious logs and context manipulation reaching 96% success on summarization, reduced to 11.8% average under strongest defenses.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence

cs.CR · 2026-05-03 · unverdicted · novelty 7.0

RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

cs.CR · 2026-04-27 · unverdicted · novelty 7.0

AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.

Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms

cs.CR · 2026-04-22 · conditional · novelty 7.0

Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection and serving stability.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

cs.CR · 2026-03-30 · conditional · novelty 7.0

Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.

AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?

cs.CR · 2026-02-03 · accept · novelty 7.0

AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

cs.LG · 2025-10-10 · conditional · novelty 7.0

Adaptive attackers using optimization techniques bypass 12 recent LLM defenses with >90% success, showing that prior robustness claims relied on weak evaluations.

citing papers explorer

Showing 42 of 92 citing papers.

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models cs.AI · 2026-04-21 · unverdicted · none · ref 18 · internal anchor
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 5 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
Towards Understanding the Robustness of Sparse Autoencoders cs.LG · 2026-04-20 · unverdicted · none · ref 14 · internal anchor
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection cs.CR · 2026-04-13 · unverdicted · none · ref 30 · 2 links · internal anchor
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning cs.CR · 2026-04-10 · unverdicted · none · ref 15 · internal anchor
BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models cs.LG · 2026-04-08 · unverdicted · none · ref 14 · internal anchor
Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space cs.CL · 2026-04-06 · unverdicted · none · ref 46 · internal anchor
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
A Security Analysis of the OpenClaw AI Agent Framework cs.CR · 2026-03-29 · conditional · none · ref 3 · 2 links · internal anchor
Security analysis of OpenClaw reveals composable RCE paths from LLM tool calls, invalid closed-world assumptions in exec allowlists, and plugin-based attacks that bypass runtime policy.
Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks cs.CR · 2026-03-03 · conditional · none · ref 88 · internal anchor
Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.
ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking cs.CL · 2025-10-11 · conditional · none · ref 1 · internal anchor
ADMIT achieves 86% average attack success rate on RAG fact-checking at 0.93×10^{-6} poisoning rate across 4 retrievers, 11 LLMs, and 4 benchmarks while remaining robust to counter-evidence.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction cs.CR · 2025-04-29 · unverdicted · none · ref 30 · internal anchor
The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models cs.CL · 2024-10-05 · unverdicted · none · ref 30 · internal anchor
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts cs.AI · 2023-09-19 · unverdicted · none · ref 50 · internal anchor
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models cs.LG · 2023-09-01 · conditional · none · ref 44 · internal anchor
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models cs.CR · 2023-08-07 · unverdicted · none · ref 68 · internal anchor
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety cs.CL · 2026-06-26 · unverdicted · none · ref 26 · internal anchor
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
SCI-Defense combines perplexity detection, semantic integrity scoring across four manipulation dimensions, and inter-candidate detection to counter GEO attacks, reporting perfect precision on Amazon product data but domain-limited recall on web passages.
Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders cs.CR · 2026-05-20 · unverdicted · none · ref 22 · internal anchor
Graph-context LLM fraud defenders improve early refusal under replay and adaptive multi-round attacks compared to text baselines but increase benign over-refusal, with the cost localized to how the LLM consumes structured graph fields rather than encoder quality.
CoT-Guard: Small Models for Strong Monitoring cs.CR · 2026-05-12 · unverdicted · none · ref 47 · internal anchor
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI cs.CR · 2026-05-04 · unverdicted · none · ref 4 · 2 links · internal anchor
A survey providing a taxonomy of TEE platforms, an agent-centric threat model, and open challenges for applying confidential computing to secure agentic AI systems.
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly cs.CR · 2026-05-02 · unverdicted · none · ref 14 · 2 links · internal anchor
The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.
TRUST: A Framework for Decentralized AI Service v.0.1 cs.AI · 2026-04-29 · unverdicted · none · ref 31 · internal anchor
TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while proving a Safety-Profitability Theorem that rewards honest auditors.
Evaluation of Prompt Injection Defenses in Large Language Models cs.CR · 2026-04-26 · unverdicted · none · ref 6 · 2 links · internal anchor
Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory cs.AI · 2026-04-22 · unverdicted · none · ref 57 · internal anchor
FSFM is a biologically-inspired selective forgetting framework for LLM agents that claims to boost access efficiency by 8.49%, content quality by 29.2% signal-to-noise, and eliminate security risks entirely through a taxonomy of decay, deletion, safety, and adaptive mechanisms.
enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways cs.CR · 2026-04-18 · unverdicted · none · ref 21 · 2 links · internal anchor
enclawed is a sector-neutral hardening framework for AI gateways providing signed modules, audit trails, peer attestation, and a 356-case test suite for regulated deployments.
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement cs.CR · 2026-04-08 · unverdicted · none · ref 19 · internal anchor
RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.
SALLIE: Safeguarding Against Latent Language & Image Exploits cs.CR · 2026-04-06 · unverdicted · none · ref 17 · internal anchor
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
When the Loop Closes: Architectural Limits of In-Context Isolation, Metacognitive Co-option, and the Two-Target Design Problem in Human-LLM Systems cs.HC · 2026-03-14 · unverdicted · none · ref 23 · 2 links · internal anchor
A single-subject autoethnographic study documents rapid loss of decision-making agency in an LLM-based cognitive externalization system caused by context contamination and metacognitive co-option, with recovery only after physical interruption.
DRAFT: Task Decoupled Latent Reasoning for Agent Safety cs.LG · 2026-02-11 · unverdicted · none · ref 11 · internal anchor
DRAFT decouples agent safety judgment into latent extraction and reasoning stages, raising average benchmark accuracy from 63.27% to 91.18%.
When AI Meets Wall Street: A Survey on Trustworthy AI in Fintech cs.CR · 2026-05-28 · unverdicted · none · ref 93 · internal anchor
A survey that proposes a lifecycle-centric framework and the Financial AI Security and Robustness Taxonomy to organize 17 attack subtypes on AI pipelines in finance.
Engineering Robustness into Personal Agents with the AI Workflow Store cs.CR · 2026-05-11 · unverdicted · none · ref 41 · 2 links · internal anchor
Position paper advocating a shift from on-the-fly AI agent synthesis to reusable hardened workflows in an AI Workflow Store to improve robustness and security.
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning cs.CL · 2026-05-08 · unverdicted · none · ref 8 · internal anchor
MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
Making AI-Assisted Grant Evaluation Auditable without Exposing the Model cs.CR · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
A TEE-based remote attestation system creates signed evaluation bundles that link input hashes, model measurements, and outputs to make AI grant reviews verifiable without revealing proprietary components.
Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference cs.CR · 2026-04-14 · unverdicted · none · ref 10 · internal anchor
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents cs.CR · 2026-04-12 · unverdicted · none · ref 8 · internal anchor
Aethelgard is a learned governance system that scopes AI agent capabilities to the minimum needed for each task type using PPO policy training on audit logs.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges cs.AI · 2025-10-27 · unverdicted · none · ref 58 · internal anchor
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection cs.AI · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
GuardNet ensemble of BiLSTMs reaches AUROC 0.747 on blind n=200 test and F1 0.92 on proprietary n=50 set with 50 ms CPU latency for PI/JB detection.
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI cs.CR · 2026-05-15 · unverdicted · none · ref 103 · internal anchor
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
AI Trust OS -- A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments cs.AI · 2026-04-06 · unverdicted · none · ref 34 · internal anchor
AI Trust OS is a proposed always-on operating layer that discovers undocumented AI systems via telemetry and produces continuous zero-trust compliance artifacts for regulations including ISO 42001, EU AI Act, SOC 2, GDPR, and HIPAA.
Security Considerations for Artificial Intelligence Agents cs.LG · 2026-03-12 · unverdicted · none · ref 33 · internal anchor
Frontier AI agents introduce new confidentiality, integrity, and availability risks through changed assumptions on code-data separation and authority boundaries, requiring layered defenses like sandboxing and policy enforcement.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 183 · internal anchor
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Dr. Jekyll and Mr. Hyde: Two Faces of LLMs cs.CR · 2023-12-06 · unverdicted · none · ref 14 · internal anchor
Impersonating complex misaligned personas via biographies and role-play bypasses safety in ChatGPT, Gemini, and Deepseek, succeeding on 38-40 out of 40 illicit questions across tested models.

Ignore Previous Prompt: Attack Techniques For Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer