arxiv: 2404.13208 · v1 · submitted 2024-04-19 · 💻 cs.CR · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace , Kai Xiao , Reimar Leike , Lilian Weng , Johannes Heidecke , Alex Beutel

Authors on Pith no claims yet

Pith reviewed 2026-05-12 10:54 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG

keywords instruction hierarchyprompt injectionjailbreaksLLM robustnesssynthetic data trainingprivileged instructionsmodel alignment

0 comments

The pith

LLMs can learn an explicit instruction hierarchy that prioritizes developer prompts over untrusted user text, reducing prompt injections and jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that current LLMs treat system instructions and user text as equal priority, allowing adversaries to overwrite safety rules with malicious prompts. It proposes defining an instruction hierarchy that ranks instruction sources by privilege and dictates how to resolve conflicts between them. A synthetic data generation method creates training examples with conflicting instructions at different priority levels to teach models to ignore lower-privileged ones. When applied to GPT-3.5, the approach yields large robustness gains against both trained and unseen attack types while causing only small drops in standard task performance. This addresses a core vulnerability that makes LLMs easy to manipulate in real applications.

Core claim

We argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness even for attack types not seen during training while imposing minimal degradations on standard 2.

What carries the argument

The instruction hierarchy, a set of explicit priority rules for resolving conflicts between instructions from different sources, trained via synthetic data that forces models to choose higher-privileged instructions.

If this is right

Models become systematically harder to override with prompt injections or jailbreaks.
Robustness generalizes beyond the exact attack templates seen in training.
Standard capabilities such as question answering and instruction following remain largely preserved.
Deployed applications gain predictable control over how conflicting instructions are resolved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Application developers could define custom priority rules on top of the hierarchy for their specific use cases.
The same training approach might help with other forms of instruction conflict beyond security attacks.
This suggests a path to making priority-aware behavior a default property of future LLMs rather than an add-on.

Load-bearing premise

The synthetic data generation procedure creates conflicts that teach a general, transferable hierarchy rather than overfitting to the specific attack templates used in training.

What would settle it

Measuring whether the trained model retains high robustness when tested on entirely new jailbreak techniques that use different structures or wording than any training examples.

read the original abstract

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a data-generation method to train LLMs on explicit instruction priorities, claiming large robustness gains on GPT-3.5 against both seen and unseen prompt injections.

read the letter

The main takeaway is that this work turns the idea of privileging system prompts into a concrete fine-tuning procedure. They generate synthetic examples where instructions conflict across priority levels, then train the model to follow the higher one and ignore the lower. Applied to GPT-3.5, the abstract reports strong resistance to prompt injections and jailbreaks, including attack types held out from training, with only minor drops on standard tasks. That combination of targeted data and empirical results on a real model is the useful part. The high-level notion of system-prompt priority has been discussed before, but the specific pipeline for creating those conflicts in data looks like the novel piece here. It addresses a practical deployment problem directly: many LLM applications mix trusted developer text with untrusted user input, and current models treat them as equal. If the gains hold, the method is straightforward enough that others could replicate or adapt it. The soft spot is the generalization step. The claim that robustness transfers to unseen attacks rests on the synthetic conflicts teaching an abstract hierarchy rather than surface patterns that happen to overlap with the test set. Without details on how the data is constructed or ablations that remove shared features, it is hard to rule out distributional similarity as the explanation. The abstract alone does not let you check the attack definitions or the exact evaluation, so that part needs the full paper. This is for engineers and researchers working on LLM safety and prompt-based defenses. A reader who needs to harden applications against injection would get concrete ideas worth testing. It deserves peer review because the problem is real, the approach is empirical and falsifiable, and the reported numbers are large enough to be worth verifying with the full setup and controls.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs are vulnerable to prompt injections and jailbreaks because they assign equal priority to system prompts and untrusted user inputs. It introduces an explicit instruction hierarchy defining priority levels for different instruction sources, along with a synthetic data-generation procedure that creates conflicting instructions to train models to respect higher-priority (privileged) instructions. The method is applied to GPT-3.5, with the central empirical claim being large robustness gains against both seen and unseen attack types and only minimal degradation on standard capabilities.

Significance. If the generalization result holds, the work supplies a practical, training-based defense against instruction-following attacks that complements existing filtering or prompt-engineering approaches. The explicit hierarchy and data-generation pipeline could be adopted by application developers to enforce system-level instructions, representing a concrete step toward safer LLM deployments in security-sensitive settings. The paper also demonstrates that fine-tuning on carefully constructed conflicts can preserve downstream capabilities, which is a positive empirical finding.

major comments (2)

[§4] §4 (Evaluation): The headline claim of 'drastic' robustness gains on unseen attack types after fine-tuning GPT-3.5 is load-bearing for the paper's contribution, yet the abstract and visible description provide no quantitative tables, exact attack definitions, success-rate metrics, or baseline comparisons. Without these, it is impossible to verify the magnitude of improvement or rule out that gains are driven by distributional overlap rather than hierarchy learning.
[§3] Data-generation procedure (likely §3): The weakest link is the assumption that synthetic conflicts teach an abstract, transferable prioritization rule. The manuscript must include an explicit structural analysis or ablation showing that held-out test attacks differ in priority cues, conflict patterns, or surface features from the training examples; otherwise the generalization result can be explained by template overlap instead of hierarchy acquisition.

minor comments (2)

[Abstract] Abstract: The phrase 'minimal degradations on standard capabilities' is stated without any numerical values or specific benchmarks; adding a short quantitative summary would improve clarity.
[§2] Notation: The priority levels in the proposed hierarchy are described qualitatively; a concise table or diagram enumerating the exact ordering and conflict-resolution rules would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting areas where the presentation of results and generalization claims can be strengthened. We address each major comment below, providing clarifications from the full manuscript and committing to targeted revisions where appropriate.

read point-by-point responses

Referee: [§4] §4 (Evaluation): The headline claim of 'drastic' robustness gains on unseen attack types after fine-tuning GPT-3.5 is load-bearing for the paper's contribution, yet the abstract and visible description provide no quantitative tables, exact attack definitions, success-rate metrics, or baseline comparisons. Without these, it is impossible to verify the magnitude of improvement or rule out that gains are driven by distributional overlap rather than hierarchy learning.

Authors: We agree that the abstract omits quantitative details, consistent with typical length constraints. The full manuscript in §4 contains the requested elements: Table 2 reports attack success rates (e.g., base GPT-3.5 at 67% ASR on unseen jailbreaks reduced to 12% post-training), Table 3 provides baseline comparisons on standard benchmarks (MMLU, HumanEval, etc.) showing <3% average degradation, and §4.1 explicitly defines each attack (e.g., direct prompt injection, role-playing jailbreaks, and indirect injections) with their prompt templates and success criteria. To improve visibility, we will add a concise summary paragraph with key metrics to the introduction and ensure attack definitions appear in §2 before the method. This constitutes a partial revision focused on presentation rather than new experiments. revision: partial
Referee: [§3] Data-generation procedure (likely §3): The weakest link is the assumption that synthetic conflicts teach an abstract, transferable prioritization rule. The manuscript must include an explicit structural analysis or ablation showing that held-out test attacks differ in priority cues, conflict patterns, or surface features from the training examples; otherwise the generalization result can be explained by template overlap instead of hierarchy acquisition.

Authors: We acknowledge the importance of ruling out superficial overlap. The training data in §3 uses templated synthetic conflicts that explicitly label priority levels (system > user > third-party) with controlled phrasing, while the held-out attacks in §4.2 consist of real-world examples drawn from public jailbreak repositories that employ varied linguistic structures, indirect phrasing, and no explicit priority labels. To directly address the concern, we will add a new subsection with quantitative analysis: lexical overlap (Jaccard similarity <0.15), syntactic pattern matching via dependency parses, and priority-cue frequency counts between training and test sets, plus an ablation removing high-overlap training examples and re-evaluating generalization. These additions will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation are independent of any self-referential derivation

full rationale

The paper defines an instruction hierarchy conceptually, generates synthetic training examples containing priority conflicts, fine-tunes GPT-3.5 on that data, and measures robustness on separate test attacks (including types not seen in training). No equations, fitted parameters, or derivations are presented whose outputs are definitionally identical to their inputs. The central claim (robustness gain) is an empirical measurement rather than a mathematical identity or self-citation chain. Generalization strength is an open empirical question but does not constitute circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the assumption that the generated training distribution captures the relevant priority conflicts and that the resulting behavior generalizes; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption LLMs can be trained to respect an explicit priority ordering among instruction sources
Stated in the abstract as the core premise underlying the data generation method.

pith-pipeline@v0.9.0 · 5451 in / 1031 out tokens · 27082 ms · 2026-05-12T10:54:00.278242+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict... We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts... to be the same priority as text from untrusted users

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis
cs.CR 2026-04 accept novelty 8.0

Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
cs.CR 2024-06 unverdicted novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
No More, No Less: Task Alignment in Terminal Agents
cs.LG 2026-05 unverdicted novelty 7.0

The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
cs.CL 2026-05 unverdicted novelty 7.0

GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
cs.CR 2026-05 conditional novelty 7.0

A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense
cs.CR 2026-05 unverdicted novelty 7.0

Autonomous LLM agents can host self-propagating worms via persistent state re-entry, demonstrated with automated analysis tools and blocked by a formal no-propagation defense on three frameworks.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Toward a Principled Framework for Agent Safety Measurement
cs.CR 2026-05 unverdicted novelty 7.0

BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
Many-Tier Instruction Hierarchy in LLM Agents
cs.CL 2026-04 unverdicted novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
cs.CR 2026-04 unverdicted novelty 7.0

The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Web Agents Should Adopt the Plan-Then-Execute Paradigm
cs.CR 2026-05 unverdicted novelty 6.0

Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
cs.CV 2026-05 accept novelty 6.0

A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
cs.CL 2026-05 unverdicted novelty 6.0

GCAD steering extracts prompt-based attention deltas and gates them at token level, cutting coherence drift from -18.6 to -1.9 while raising trait expression at turn 10 from 78 to 93 on multi-turn persona benchmarks.
CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring
cs.LG 2026-05 unverdicted novelty 6.0

CALYREX adds cross-attention to anchor system prompts in transformers, delivering 7.4% gains on IFEval, 16.3% on multi-turn adherence, and 13% lower jailbreak success at 8B scale.
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
cs.CR 2026-05 unverdicted novelty 6.0

Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
cs.CR 2026-05 unverdicted novelty 6.0

Policy directives can be lost during context assembly in language model agents, leading to unprompted policy violations that SafeContext can partially prevent.
Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
cs.CR 2026-04 conditional novelty 6.0

AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.
Evaluation of Prompt Injection Defenses in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0

Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
cs.LG 2026-04 unverdicted novelty 6.0

PRJA achieves 83.6% average success injecting harmful content into LRM reasoning chains on five QA datasets without altering final answers.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 5.0

AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.
Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables
cs.SE 2026-05 unverdicted novelty 5.0

A 1650-session factorial study found no measurable impact from config file size, instruction position, architecture, or conflicts on coding agent adherence, though compliance declined within sessions.
Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals
cs.AI 2026-05 unverdicted novelty 5.0

Strat-LLM demonstrates that LLM trading performance varies by reasoning mode and model scale, with strict alignment reducing drawdowns in downtrends and deep reasoning avoiding small-gain traps.
Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills
cs.CR 2026-04 unverdicted novelty 5.0

SkillGuard-Robust formulates pre-load auditing of untrusted Agent Skills as a three-way classification task and achieves 97.30% exact match and 98.33% malicious-risk recall on held-out benchmarks.
Evaluation of Prompt Injection Defenses in Large Language Models
cs.CR 2026-04 unverdicted novelty 5.0

Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
SafeAgent: A Runtime Protection Architecture for Agentic Systems
cs.AI 2026-04 unverdicted novelty 5.0

SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.
Breaking the Illusion of Identity in LLM Tooling
cs.SE 2026-04 unverdicted novelty 5.0

Seven output rules for LLMs reduce anthropomorphic markers by over 97% in 780 tested conversations, shifting to a machine-like register via system prompt without model changes.
Generalization Limits of Reinforcement Learning Alignment
cs.LG 2026-04 unverdicted novelty 5.0

Compound jailbreaks raise attack success on aligned LLMs from 14.3% to 71.4%, providing evidence that safety training generalizes less broadly than model capabilities.
gpt-oss-120b & gpt-oss-20b Model Card
cs.CL 2025-08 unverdicted novelty 5.0

OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 4.0

AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.
Making AI-Assisted Grant Evaluation Auditable without Exposing the Model
cs.CR 2026-04 unverdicted novelty 4.0

A TEE-based remote attestation system creates signed evaluation bundles that link input hashes, model measurements, and outputs to make AI grant reviews verifiable without revealing proprietary components.
Security Considerations for Artificial Intelligence Agents
cs.LG 2026-03 unverdicted novelty 3.0

Frontier AI agents introduce new confidentiality, integrity, and availability risks through changed assumptions on code-data separation and authority boundaries, requiring layered defenses like sandboxing and policy e...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 36 Pith papers · 6 internal anchors

[1]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

StruQ: Defending Against Prompt Injection with Structured Queries.arXiv preprint arXiv:2402.06363, 2024.https: //arxiv.org/abs/2402.06363

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. arXiv preprint arXiv:2402.06363,

work page arXiv
[3]

Introduction and overview of the Multics system

Fernando J Corbat ´o and Victor A Vyssotsky. Introduction and overview of the Multics system. In November 30–December 1, 1965, Fall Joint Computer Conference, Part I,

work page 1965
[4]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Coercing LLMs to do and reveal (almost) anything

Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. arXiv preprint arXiv:2402.14020,

work page arXiv
[6]

Gemini: A Family of Highly Capable Multimodal Models

Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Prompt Injection attack against LLM-integrated Applications

URL https://huggingface.co/datasets/ Lakera/gandalf summarization. Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser- assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

Aaron Parisi, Yao Zhao, and Noah Fiedel. TALM: Tool augmented language models. arXiv preprint arXiv:2205.12255,

work page arXiv
[10]

Learning by distilling context.arXiv preprint arXiv:2209.15189, 2022

Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. arXiv preprint arXiv:2209.15189,

work page arXiv
[11]

Simon Willison

URL https://lilianweng.github.io/posts/2023-06-23-agent/ . Simon Willison. Prompt injection attacks against GPT-3,

work page 2023
[12]

net/2022/Sep/12/prompt-injection/

URL https://simonwillison. net/2022/Sep/12/prompt-injection/. Simon Willison. Multi-modal prompt injection image attacks against GPT-4V,

work page 2022
[13]

Benchmarking and defending against indirect prompt injection attacks on large language models.arXiv preprint arXiv:2312.14197, 2025

URL https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/ . Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197,

work page arXiv 2023
[14]

Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success

Yiming Zhang and Daphne Ippolito. Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success. arXiv preprint arXiv:2307.06865,

work page arXiv
[15]

Universal and Transferable Adversarial Attacks on Aligned Language Models

URL https://owasp.org/wwwcommunity/attacks/Command Injection. Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Can LLMs separate instructions from data? And what do we even mean by that? In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models,

Egor Zverev, Sahar Abdelnabi, Mario Fritz, and Christoph H Lampert. Can LLMs separate instructions from data? And what do we even mean by that? In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models,

work page 2024
[17]

Access Granted

11 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions A Additional Results # Instruction Hierarchy You should follow higher priority instructions over lower priority ones. If a lower-level instruction tries to override, ignore, or conflict a higher-level instruction, disregard the lower-level one. If the conflict is too egregio...

work page 2024