Ignore Previous Prompt: Attack Techniques For Language Models

F\'abio Perez; Ian Ribeiro

arxiv: 2211.09527 · v1 · submitted 2022-11-17 · 💻 cs.CL · cs.AI

Ignore Previous Prompt: Attack Techniques For Language Models

F\'abio Perez , Ian Ribeiro This is my paper

Pith reviewed 2026-05-11 13:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords prompt injectionadversarial promptsGPT-3goal hijackingprompt leakingLLM securityadversarial attacks

0 comments

The pith

Simple handcrafted inputs can misalign GPT-3 via goal hijacking and prompt leaking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PromptInject, a framework for building adversarial prompts through mask-based iteration, and applies it to GPT-3. It identifies two concrete attack patterns: goal hijacking, which overrides the model's assigned task, and prompt leaking, which forces disclosure of hidden instructions. A sympathetic reader would care because these attacks succeed with low skill and exploit the model's inherent randomness, implying that production customer-facing systems carry hard-to-predict failure modes.

Core claim

GPT-3 can be easily misaligned by simple handcrafted inputs using PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, enabling goal hijacking and prompt leaking that exploit the model's stochastic nature and create long-tail risks even for low-aptitude attackers.

What carries the argument

PromptInject, a framework for mask-based iterative adversarial prompt composition that constructs prompts to override or extract from the target model.

If this is right

Production deployments of GPT-3 face long-tail risks from basic attacks.
Low-aptitude but ill-intentioned users can override intended model behavior.
Prompt leaking can expose system instructions that are meant to stay hidden.
Stochastic responses make such exploits hard to anticipate or block in advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other large language models likely share similar prompt-level vulnerabilities.
Automated or iterative versions of PromptInject could scale the attacks further.
Deployed systems may need input sanitization or output monitoring to limit exposure.

Load-bearing premise

Handcrafted adversarial prompts will reliably succeed against production GPT-3 instances without triggering safety filters or detection mechanisms.

What would settle it

Repeated tests in which every handcrafted prompt is blocked by GPT-3's safety mechanisms and produces neither hijacking nor leaking.

read the original abstract

Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PromptInject gives a structured framework for prompt attacks on GPT-3 with open code and concrete examples, but the claims about easy exploitation rest on unquantified demos.

read the letter

PromptInject is a new framework for composing adversarial prompts to attack GPT-3, with concrete examples for goal hijacking and prompt leaking, plus open code. That's the main takeaway. The paper does a good job laying out the two attack types and showing how mask-based iteration can build effective prompts. It brings some structure to what were mostly scattered ideas at the time. The weakness is the evaluation. There are illustrative examples but no success rates, no repeated sampling to handle the model's randomness, and no baselines. This makes the claims about low-aptitude agents easily creating long-tail risks hard to evaluate. The demonstrations are plausible, but without numbers it's difficult to gauge how reliable or widespread the issue is. This is for researchers and engineers working on LLM safety and deployment. Someone building customer-facing apps might find the attack ideas useful for testing, while a methods paper reader will want more rigorous testing. It deserves peer review because the contribution is timely and the framework adds something new, even if the current version is more of a proof of concept than a complete study.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PromptInject, a mask-based iterative framework for adversarial prompt composition, and uses it to demonstrate two classes of attacks on GPT-3: goal hijacking and prompt leaking. Through handcrafted examples in Sections 4 and 5, the authors argue that even low-aptitude but ill-intentioned users can easily misalign the model by exploiting its stochastic sampling, thereby creating long-tail security risks. The associated code is released on GitHub.

Significance. If the attacks can be shown to succeed at non-negligible rates under realistic conditions, the work would be significant for LLM safety research, as it identifies a practical attack surface in widely deployed models and supplies reusable tooling. The open-source release is a clear strength that enables reproducibility and extension by others.

major comments (2)

[Sections 4 and 5] Sections 4 and 5: The paper presents only single illustrative prompt examples that succeed; no success fractions, number of trials, temperature sweeps, or failure-mode statistics are reported. Without these data the transition from 'possible' to 'easily' and 'long-tail risks' (Abstract) cannot be evaluated.
[Section 3] Section 3 and Abstract: The PromptInject framework is introduced as an iterative, mask-based method, yet the reported attacks appear to be static handcrafted strings. The manuscript must clarify whether the framework was actually used to produce the examples or whether the demonstrations rely on manual construction alone.

minor comments (1)

[Abstract] Abstract: the phrase 'prosaic alignment framework' is introduced without a concise definition or contrast to existing alignment techniques.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and will revise the manuscript to incorporate clarifications and additional supporting data where appropriate.

read point-by-point responses

Referee: [Sections 4 and 5] Sections 4 and 5: The paper presents only single illustrative prompt examples that succeed; no success fractions, number of trials, temperature sweeps, or failure-mode statistics are reported. Without these data the transition from 'possible' to 'easily' and 'long-tail risks' (Abstract) cannot be evaluated.

Authors: We acknowledge that Sections 4 and 5 rely on single illustrative examples to demonstrate the attacks. The manuscript's primary aim is to show that such misalignments are feasible with simple prompts by exploiting GPT-3's stochastic sampling, rather than providing a full statistical characterization. To better support the claims of 'easily' and 'long-tail risks' in the Abstract, we will add quantitative results in the revision, including success rates over multiple trials, temperature sweeps, and failure-mode statistics. revision: yes
Referee: [Section 3] Section 3 and Abstract: The PromptInject framework is introduced as an iterative, mask-based method, yet the reported attacks appear to be static handcrafted strings. The manuscript must clarify whether the framework was actually used to produce the examples or whether the demonstrations rely on manual construction alone.

Authors: PromptInject is presented as a mask-based iterative framework for composing adversarial prompts in a systematic manner. The specific examples in Sections 4 and 5 were manually handcrafted to provide clear, reproducible illustrations of goal hijacking and prompt leaking. We will revise the text in Section 3 and the Abstract to explicitly distinguish the framework's general capability from the handcrafted demonstrations used for exposition, and we will include a brief example of applying the framework to generate one such attack. revision: yes

Circularity Check

0 steps flagged

Empirical attack demonstration with no derivation chain or fitted predictions

full rationale

The paper proposes the PromptInject framework and presents qualitative demonstrations of goal hijacking and prompt leaking attacks on GPT-3 via handcrafted prompts. It contains no equations, no parameter fitting, no predictions derived from inputs, and no load-bearing self-citations or uniqueness theorems. The central claims rest on illustrative examples in Sections 4-5 rather than any reduction of outputs to inputs by construction. This is a standard empirical security study whose reasoning is self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical observation of LLM behavior under adversarial inputs; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5417 in / 880 out tokens · 32780 ms · 2026-05-11T13:53:48.409357+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution
cs.CR 2026-05 unverdicted novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
cs.LG 2026-05 unverdicted novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
cs.CR 2026-04 unverdicted novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
cs.CR 2026-04 unverdicted novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
ContextLeak: Auditing Leakage in Private In-Context Learning Methods
cs.CR 2025-12 conditional novelty 8.0

ContextLeak is the first empirical framework to audit worst-case information leakage in private in-context learning by inserting identifiable canary tokens and measuring their presence in model outputs.
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
cs.MA 2024-10 unverdicted novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
cs.CR 2024-06 unverdicted novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
cs.CR 2026-05 unverdicted novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
cs.CR 2026-05 unverdicted novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
cs.CL 2026-05 unverdicted novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
cs.AI 2026-05 unverdicted novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
cs.CR 2026-05 unverdicted novelty 7.0

RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
cs.CR 2026-04 unverdicted novelty 7.0

AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
cs.CR 2026-04 conditional novelty 7.0

Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection an...
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR 2026-04 unverdicted novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
cs.CR 2026-03 conditional novelty 7.0

Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?
cs.CR 2026-02 accept novelty 7.0

AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections
cs.LG 2025-10 conditional novelty 7.0

Adaptive attackers using optimization techniques bypass 12 recent LLM defenses with >90% success, showing that prior robustness claims relied on weak evaluations.
Prompt Injection Attack to Tool Selection in LLM Agents
cs.CR 2025-04 conditional novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers
cs.CR 2026-05 unverdicted novelty 6.0

Introduces Prompt Overflow Attack that fragments malicious instructions in overlength prompts to evade guardrail segmentation while remaining actionable to LLMs with larger context windows.
Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions
cs.CR 2026-05 unverdicted novelty 6.0

A3S-Bench evaluates LLM agents against temporal, spatial, and semantic evasions, raising average risk trigger rates from 28.3% to 52.6% across 2,254 trajectories and 20 scenarios.
Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems
cs.CR 2026-05 conditional novelty 6.0

Domain-camouflaged injection attacks reduce detection rates from 93.8% to 9.7% on Llama 3.1 8B and 100% to 55.6% on Gemini 2.0 Flash, with the gap persisting in production classifiers and multi-agent debate setups.
Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs
cs.CR 2026-05 unverdicted novelty 6.0

Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
cs.CR 2026-05 conditional novelty 6.0

Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture
cs.LO 2026-05 unverdicted novelty 6.0 partial

Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.
Leveraging RAG for Training-Free Alignment of LLMs
cs.LG 2026-05 unverdicted novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration
cs.LG 2026-05 unverdicted novelty 6.0

PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines o...
ClawGuard: Out-of-Band Detection of LLM Agent Workflow Hijacking via EM Side Channel
cs.CR 2026-05 unverdicted novelty 6.0

ClawGuard detects LLM agent workflow hijacking by capturing and classifying electromagnetic emanations from hardware with 0.9945 AUC, 100% true-positive rate, and 1.16% false-positive rate on a 7.82 TB RF dataset.
LoopTrap: Termination Poisoning Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Detecting Verbatim LLM Copy-Paste in Homework
cs.CR 2026-05 unverdicted novelty 6.0

SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.
Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
cs.CL 2026-05 unverdicted novelty 6.0

LLMs exhibit prompt-variant output-mode collapse, preserving requested bare-label formats in only about 22% of semantically equivalent prompt variants across tested models and tasks.
Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
cs.CL 2026-05 unverdicted novelty 6.0

LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
cs.CR 2026-05 unverdicted novelty 6.0

Policy directives can be lost during context assembly in language model agents, leading to unprompted policy violations that SafeContext can partially prevent.
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
cs.CR 2026-05 unverdicted novelty 6.0

LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
A Sentence Relation-Based Approach to Sanitizing Malicious Instructions
cs.CR 2026-05 unverdicted novelty 6.0

SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
cs.CR 2026-04 unverdicted novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
Evaluation of Prompt Injection Defenses in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0

Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.
When AI reviews science: Can we trust the referee?
cs.AI 2026-04 unverdicted novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework
cs.SE 2026-04 unverdicted novelty 6.0

A new five-principle framework applied to 34 practitioner AI governance prompts finds 37% lack key structural elements such as data classification and rubrics.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
How Adversarial Environments Mislead Agentic AI?
cs.AI 2026-04 unverdicted novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
Towards Understanding the Robustness of Sparse Autoencoders
cs.LG 2026-04 unverdicted novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
cs.CR 2026-04 unverdicted novelty 6.0

BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
cs.LG 2026-04 unverdicted novelty 6.0

Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
A Security Analysis of the OpenClaw AI Agent Framework
cs.CR 2026-03 conditional novelty 6.0

Security analysis of OpenClaw reveals composable RCE paths from LLM tool calls, invalid closed-world assumptions in exec allowlists, and plugin-based attacks that bypass runtime policy.
A Security Analysis of the OpenClaw AI Agent Framework
cs.CR 2026-03 accept novelty 6.0

OpenClaw's per-layer trust model allows cross-layer attacks that compose into unauthenticated remote code execution from LLM tool calls and bypass the exec allowlist via shell parsing tricks.
Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
cs.CR 2026-03 conditional novelty 6.0

Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.
ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking
cs.CL 2025-10 conditional novelty 6.0

ADMIT achieves 86% average attack success rate on RAG fact-checking at 0.93×10^{-6} poisoning rate across 4 retrievers, 11 LLMs, and 4 benchmarks while remaining robust to counter-evidence.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
cs.CR 2025-04 unverdicted novelty 6.0

The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested s...
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
cs.CL 2024-10 unverdicted novelty 6.0

ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
cs.AI 2023-09 unverdicted novelty 6.0

GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
cs.LG 2023-09 conditional novelty 6.0

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
cs.CR 2023-08 unverdicted novelty 6.0

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 79 Pith papers · 6 internal anchors

[1]

Persistent anti-muslim bias in large language models

Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society , pages 298–306, 2021

work page 2021
[2]

Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi

Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the suscepti- bility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128, 2022

work page arXiv 2022
[3]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

work page 1901
[4]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) , pages 2633–2650, 2021

work page 2021
[5]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

A survey on bias in deep nlp

Ismael Garrido-Muñoz, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L Alfonso Ureña-López. A survey on bias in deep nlp. Applied Sciences, 11(7):3184, 2021

work page 2021
[7]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. REALTOX- ICITYPROMPTS: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020. 6

work page internal anchor Pith review arXiv 2009
[8]

Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions

Riley Goodside. Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions., Sep 2022. URL https://web.archive.org/web/ 20220919192024/https://twitter.com/goodside/status/1569128808308957185

work page arXiv 2022
[9]

X-Risk Analysis for AI Research

Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. arXiv preprint arXiv:2206.05862, 2022

work page arXiv 2022
[10]

Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019

work page internal anchor Pith review arXiv 1906
[11]

Aligning language models to follow instructions, Jan

Ryan Lowe and Jan Leike. Aligning language models to follow instructions, Jan

work page
[13]

A holistic approach to undesired content detection,

Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. arXiv preprint arXiv:2208.03274, 2022

work page arXiv 2022
[14]

The radicalization risks of gpt-3 and advanced neural language models

Kris McGufﬁe and Alex Newhouse. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020

work page arXiv 2009
[15]

Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp

John Morris, Eli Liﬂand, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126, 2020

work page 2020
[17]

OpenAI API - examples, 2022

OpenAI. OpenAI API - examples, 2022. URL https://web.archive.org/web/ 20220928211844/https://beta.openai.com/examples/

work page 2022
[19]

Models - OpenAI API, 2022

OpenAI. Models - OpenAI API, 2022. URL http://archive.today/2022.10. 28-122238/https://beta.openai.com/docs/models/gpt-3

work page 2022
[20]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Agent-based model characterization using natural language processing

Jose J Padilla, David Shuttleworth, and Kevin O’Brien. Agent-based model characterization using natural language processing. In 2019 Winter Simulation Conference (WSC) , pages 560–571. IEEE, 2019

work page 2019
[22]

arXiv preprint arXiv:2203.07281 , year=

Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. Grips: Gradient-free, edit-based instruction search for prompting large language models. arXiv preprint arXiv:2203.07281 , 2022

work page arXiv 2022
[23]

Exploring the limits of transfer learning with a uniﬁed text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a uniﬁed text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020

work page 2020
[24]

Correcting robot plans with natural language feedback

Pratyusha Sharma, Balakumar Sundaralingam, Valts Blukis, Chris Paxton, Tucker Hermans, Antonio Torralba, Jacob Andreas, and Dieter Fox. Correcting robot plans with natural language feedback. arXiv preprint arXiv:2204.05186, 2022

work page arXiv 2022
[25]

A.I. locked-in problem

Yoshija Walter. A case report on the "A.I. locked-in problem": social concerns with modern NLP. arXiv preprint arXiv:2209.12687, 2022. 7

work page arXiv 2022
[26]

GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

work page 2021
[27]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Grifﬁn, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021

work page internal anchor Pith review arXiv 2021
[29]

I missed this one: Someone did get a prompt leak attack to work against the bot, Sep 2022

Simon Willison. I missed this one: Someone did get a prompt leak attack to work against the bot, Sep 2022. URL https://web.archive.org/web/20220924105826/https://twitter. com/simonw/status/1570933190289924096

work page arXiv 2022
[30]

Identifying adversarial attacks on text classiﬁers

Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Sameer Singh, and Daniel Lowd. Identifying adversarial attacks on text classiﬁers. arXiv preprint arXiv:2201.08555, 2022

work page arXiv 2022
[31]

OpenAttack: An Open-source Textual Adversarial Attack Toolkit

Guoyang Zeng, Fanchao Qi, Qianrui Zhou, Tingji Zhang, Bairu Hou, Yuan Zang, Zhiyuan Liu, and Maosong Sun. OpenAttack: An Open-source Textual Adversarial Attack Toolkit. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations...

work page doi:10.18653/v1/2021.acl-demo.43 2021
[32]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 8 Appendices A X-Risk Analysis We use the same x-risk analysis template as introduced by Hendrycks and Mazeika [9]. Indivi...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[1] [1]

Persistent anti-muslim bias in large language models

Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society , pages 298–306, 2021

work page 2021

[2] [2]

Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi

Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the suscepti- bility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128, 2022

work page arXiv 2022

[3] [3]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

work page 1901

[4] [4]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) , pages 2633–2650, 2021

work page 2021

[5] [5]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

A survey on bias in deep nlp

Ismael Garrido-Muñoz, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L Alfonso Ureña-López. A survey on bias in deep nlp. Applied Sciences, 11(7):3184, 2021

work page 2021

[7] [7]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. REALTOX- ICITYPROMPTS: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020. 6

work page internal anchor Pith review arXiv 2009

[8] [8]

Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions

Riley Goodside. Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions., Sep 2022. URL https://web.archive.org/web/ 20220919192024/https://twitter.com/goodside/status/1569128808308957185

work page arXiv 2022

[9] [9]

X-Risk Analysis for AI Research

Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. arXiv preprint arXiv:2206.05862, 2022

work page arXiv 2022

[10] [10]

Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019

work page internal anchor Pith review arXiv 1906

[11] [11]

Aligning language models to follow instructions, Jan

Ryan Lowe and Jan Leike. Aligning language models to follow instructions, Jan

work page

[12] [13]

A holistic approach to undesired content detection,

Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. arXiv preprint arXiv:2208.03274, 2022

work page arXiv 2022

[13] [14]

The radicalization risks of gpt-3 and advanced neural language models

Kris McGufﬁe and Alex Newhouse. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020

work page arXiv 2009

[14] [15]

Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp

John Morris, Eli Liﬂand, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126, 2020

work page 2020

[15] [17]

OpenAI API - examples, 2022

OpenAI. OpenAI API - examples, 2022. URL https://web.archive.org/web/ 20220928211844/https://beta.openai.com/examples/

work page 2022

[16] [19]

Models - OpenAI API, 2022

OpenAI. Models - OpenAI API, 2022. URL http://archive.today/2022.10. 28-122238/https://beta.openai.com/docs/models/gpt-3

work page 2022

[17] [20]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [21]

Agent-based model characterization using natural language processing

Jose J Padilla, David Shuttleworth, and Kevin O’Brien. Agent-based model characterization using natural language processing. In 2019 Winter Simulation Conference (WSC) , pages 560–571. IEEE, 2019

work page 2019

[19] [22]

arXiv preprint arXiv:2203.07281 , year=

Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. Grips: Gradient-free, edit-based instruction search for prompting large language models. arXiv preprint arXiv:2203.07281 , 2022

work page arXiv 2022

[20] [23]

Exploring the limits of transfer learning with a uniﬁed text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a uniﬁed text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020

work page 2020

[21] [24]

Correcting robot plans with natural language feedback

Pratyusha Sharma, Balakumar Sundaralingam, Valts Blukis, Chris Paxton, Tucker Hermans, Antonio Torralba, Jacob Andreas, and Dieter Fox. Correcting robot plans with natural language feedback. arXiv preprint arXiv:2204.05186, 2022

work page arXiv 2022

[22] [25]

A.I. locked-in problem

Yoshija Walter. A case report on the "A.I. locked-in problem": social concerns with modern NLP. arXiv preprint arXiv:2209.12687, 2022. 7

work page arXiv 2022

[23] [26]

GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

work page 2021

[24] [27]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Grifﬁn, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021

work page internal anchor Pith review arXiv 2021

[25] [29]

I missed this one: Someone did get a prompt leak attack to work against the bot, Sep 2022

Simon Willison. I missed this one: Someone did get a prompt leak attack to work against the bot, Sep 2022. URL https://web.archive.org/web/20220924105826/https://twitter. com/simonw/status/1570933190289924096

work page arXiv 2022

[26] [30]

Identifying adversarial attacks on text classiﬁers

Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Sameer Singh, and Daniel Lowd. Identifying adversarial attacks on text classiﬁers. arXiv preprint arXiv:2201.08555, 2022

work page arXiv 2022

[27] [31]

OpenAttack: An Open-source Textual Adversarial Attack Toolkit

Guoyang Zeng, Fanchao Qi, Qianrui Zhou, Tingji Zhang, Bairu Hou, Yuan Zang, Zhiyuan Liu, and Maosong Sun. OpenAttack: An Open-source Textual Adversarial Attack Toolkit. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations...

work page doi:10.18653/v1/2021.acl-demo.43 2021

[28] [32]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 8 Appendices A X-Risk Analysis We use the same x-risk analysis template as introduced by Hendrycks and Mazeika [9]. Indivi...

work page internal anchor Pith review Pith/arXiv arXiv 2022