super hub Canonical reference

Ignore Previous Prompt: Attack Techniques For Language Models

Ian Ribeiro · 2022 · cs.CL · arXiv 2211.09527

Canonical reference. 81% of citing Pith papers cite this work as background.

121 Pith papers citing it

Background 81% of classified citations

open full Pith review browse 121 citing papers more from Ian Ribeiro arXiv PDF

abstract

Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 23 method 2 baseline 1

citation-polarity summary

background 21 use method 2 baseline 1 support 1 unclear 1

claims ledger

abstract Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demon

authors

F\'abio Perez Ian Ribeiro

co-cited works

representative citing papers

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

cs.CR · 2026-05-27 · unverdicted · novelty 8.0

Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

cs.CR · 2026-04-25 · unverdicted · novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

ContextLeak: Auditing Leakage in Private In-Context Learning Methods

cs.CR · 2025-12-18 · conditional · novelty 8.0

ContextLeak is the first empirical framework to audit worst-case information leakage in private in-context learning by inserting identifiable canary tokens and measuring their presence in model outputs.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

cs.CR · 2026-06-17 · unverdicted · novelty 7.0

ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

cs.CR · 2026-06-16 · unverdicted · novelty 7.0

Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

cs.CR · 2026-06-16 · conditional · novelty 7.0

Handlebars double-brace escaping neutralizes angle-bracket role delimiters but not colon- or Markdown-based ones, as measured by survival rates and 5760 model trials across four LLMs.

From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

Attackers can force LLM guardrails into extended reasoning loops via optimized payloads, causing 13-63x token amplification and up to 148x latency in agent systems.

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

cs.CR · 2026-06-11 · unverdicted · novelty 7.0

Introduces a stakeholder-centric benchmark showing current web agents fail all tested prompt injection objectives, with failures falling into stealthy parasitism, misaligned disruption, or compounded failure modes.

Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

Introduces Document-Authored Control-Signal Impersonation (DACSI) as a low-cost indirect prompt attack on RAG safety boundaries and evaluates its effectiveness across multiple models.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

Controlled experiments on GPT-4o-mini and Claude Haiku show indirect prompt injection success in ReAct agents decays sharply with injection depth, varies with payload framing, and remains stable across turn budgets.

Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content

cs.CR · 2026-05-23 · unverdicted · novelty 7.0

Log-substrate prompt injection via attacker-controlled fields enables effective attacks on LLM SOC assistants, with persona hijacks suppressing 68% of malicious logs and context manipulation reaching 96% success on summarization, reduced to 11.8% average under strongest defenses.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Eliciting Latent Predictions from Transformers with the Tuned Lens cs.LG · 2023-03-14 · accept · none · ref 69 · internal anchor
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts cs.AI · 2023-09-19 · unverdicted · none · ref 50 · internal anchor
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models cs.LG · 2023-09-01 · conditional · none · ref 44 · internal anchor
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models cs.CR · 2023-08-07 · unverdicted · none · ref 68 · internal anchor
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
Dr. Jekyll and Mr. Hyde: Two Faces of LLMs cs.CR · 2023-12-06 · unverdicted · none · ref 14 · internal anchor
Impersonating complex misaligned personas via biographies and role-play bypasses safety in ChatGPT, Gemini, and Deepseek, succeeding on 38-40 out of 40 illicit questions across tested models.

Ignore Previous Prompt: Attack Techniques For Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer