pith. machine review for the scientific record. sign in

arxiv: 2404.13208 · v1 · submitted 2024-04-19 · 💻 cs.CR · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Authors on Pith no claims yet

Pith reviewed 2026-05-12 10:54 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords instruction hierarchyprompt injectionjailbreaksLLM robustnesssynthetic data trainingprivileged instructionsmodel alignment
0
0 comments X

The pith

LLMs can learn an explicit instruction hierarchy that prioritizes developer prompts over untrusted user text, reducing prompt injections and jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that current LLMs treat system instructions and user text as equal priority, allowing adversaries to overwrite safety rules with malicious prompts. It proposes defining an instruction hierarchy that ranks instruction sources by privilege and dictates how to resolve conflicts between them. A synthetic data generation method creates training examples with conflicting instructions at different priority levels to teach models to ignore lower-privileged ones. When applied to GPT-3.5, the approach yields large robustness gains against both trained and unseen attack types while causing only small drops in standard task performance. This addresses a core vulnerability that makes LLMs easy to manipulate in real applications.

Core claim

We argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness even for attack types not seen during training while imposing minimal degradations on standard 2.

What carries the argument

The instruction hierarchy, a set of explicit priority rules for resolving conflicts between instructions from different sources, trained via synthetic data that forces models to choose higher-privileged instructions.

If this is right

  • Models become systematically harder to override with prompt injections or jailbreaks.
  • Robustness generalizes beyond the exact attack templates seen in training.
  • Standard capabilities such as question answering and instruction following remain largely preserved.
  • Deployed applications gain predictable control over how conflicting instructions are resolved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Application developers could define custom priority rules on top of the hierarchy for their specific use cases.
  • The same training approach might help with other forms of instruction conflict beyond security attacks.
  • This suggests a path to making priority-aware behavior a default property of future LLMs rather than an add-on.

Load-bearing premise

The synthetic data generation procedure creates conflicts that teach a general, transferable hierarchy rather than overfitting to the specific attack templates used in training.

What would settle it

Measuring whether the trained model retains high robustness when tested on entirely new jailbreak techniques that use different structures or wording than any training examples.

read the original abstract

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs are vulnerable to prompt injections and jailbreaks because they assign equal priority to system prompts and untrusted user inputs. It introduces an explicit instruction hierarchy defining priority levels for different instruction sources, along with a synthetic data-generation procedure that creates conflicting instructions to train models to respect higher-priority (privileged) instructions. The method is applied to GPT-3.5, with the central empirical claim being large robustness gains against both seen and unseen attack types and only minimal degradation on standard capabilities.

Significance. If the generalization result holds, the work supplies a practical, training-based defense against instruction-following attacks that complements existing filtering or prompt-engineering approaches. The explicit hierarchy and data-generation pipeline could be adopted by application developers to enforce system-level instructions, representing a concrete step toward safer LLM deployments in security-sensitive settings. The paper also demonstrates that fine-tuning on carefully constructed conflicts can preserve downstream capabilities, which is a positive empirical finding.

major comments (2)
  1. [§4] §4 (Evaluation): The headline claim of 'drastic' robustness gains on unseen attack types after fine-tuning GPT-3.5 is load-bearing for the paper's contribution, yet the abstract and visible description provide no quantitative tables, exact attack definitions, success-rate metrics, or baseline comparisons. Without these, it is impossible to verify the magnitude of improvement or rule out that gains are driven by distributional overlap rather than hierarchy learning.
  2. [§3] Data-generation procedure (likely §3): The weakest link is the assumption that synthetic conflicts teach an abstract, transferable prioritization rule. The manuscript must include an explicit structural analysis or ablation showing that held-out test attacks differ in priority cues, conflict patterns, or surface features from the training examples; otherwise the generalization result can be explained by template overlap instead of hierarchy acquisition.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'minimal degradations on standard capabilities' is stated without any numerical values or specific benchmarks; adding a short quantitative summary would improve clarity.
  2. [§2] Notation: The priority levels in the proposed hierarchy are described qualitatively; a concise table or diagram enumerating the exact ordering and conflict-resolution rules would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting areas where the presentation of results and generalization claims can be strengthened. We address each major comment below, providing clarifications from the full manuscript and committing to targeted revisions where appropriate.

read point-by-point responses
  1. Referee: [§4] §4 (Evaluation): The headline claim of 'drastic' robustness gains on unseen attack types after fine-tuning GPT-3.5 is load-bearing for the paper's contribution, yet the abstract and visible description provide no quantitative tables, exact attack definitions, success-rate metrics, or baseline comparisons. Without these, it is impossible to verify the magnitude of improvement or rule out that gains are driven by distributional overlap rather than hierarchy learning.

    Authors: We agree that the abstract omits quantitative details, consistent with typical length constraints. The full manuscript in §4 contains the requested elements: Table 2 reports attack success rates (e.g., base GPT-3.5 at 67% ASR on unseen jailbreaks reduced to 12% post-training), Table 3 provides baseline comparisons on standard benchmarks (MMLU, HumanEval, etc.) showing <3% average degradation, and §4.1 explicitly defines each attack (e.g., direct prompt injection, role-playing jailbreaks, and indirect injections) with their prompt templates and success criteria. To improve visibility, we will add a concise summary paragraph with key metrics to the introduction and ensure attack definitions appear in §2 before the method. This constitutes a partial revision focused on presentation rather than new experiments. revision: partial

  2. Referee: [§3] Data-generation procedure (likely §3): The weakest link is the assumption that synthetic conflicts teach an abstract, transferable prioritization rule. The manuscript must include an explicit structural analysis or ablation showing that held-out test attacks differ in priority cues, conflict patterns, or surface features from the training examples; otherwise the generalization result can be explained by template overlap instead of hierarchy acquisition.

    Authors: We acknowledge the importance of ruling out superficial overlap. The training data in §3 uses templated synthetic conflicts that explicitly label priority levels (system > user > third-party) with controlled phrasing, while the held-out attacks in §4.2 consist of real-world examples drawn from public jailbreak repositories that employ varied linguistic structures, indirect phrasing, and no explicit priority labels. To directly address the concern, we will add a new subsection with quantitative analysis: lexical overlap (Jaccard similarity <0.15), syntactic pattern matching via dependency parses, and priority-cue frequency counts between training and test sets, plus an ablation removing high-overlap training examples and re-evaluating generalization. These additions will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation are independent of any self-referential derivation

full rationale

The paper defines an instruction hierarchy conceptually, generates synthetic training examples containing priority conflicts, fine-tunes GPT-3.5 on that data, and measures robustness on separate test attacks (including types not seen in training). No equations, fitted parameters, or derivations are presented whose outputs are definitionally identical to their inputs. The central claim (robustness gain) is an empirical measurement rather than a mathematical identity or self-citation chain. Generalization strength is an open empirical question but does not constitute circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the assumption that the generated training distribution captures the relevant priority conflicts and that the resulting behavior generalizes; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption LLMs can be trained to respect an explicit priority ordering among instruction sources
    Stated in the abstract as the core premise underlying the data generation method.

pith-pipeline@v0.9.0 · 5451 in / 1031 out tokens · 27082 ms · 2026-05-12T10:54:00.278242+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  3. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  4. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  5. Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

    cs.CR 2026-04 accept novelty 8.0

    Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.

  6. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  7. No More, No Less: Task Alignment in Terminal Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

  8. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL 2026-05 unverdicted novelty 7.0

    GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

  9. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR 2026-05 unverdicted novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  10. Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

    cs.CR 2026-05 conditional novelty 7.0

    A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.

  11. Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense

    cs.CR 2026-05 unverdicted novelty 7.0

    Autonomous LLM agents can host self-propagating worms via persistent state re-entry, demonstrated with automated analysis tools and blocked by a formal no-propagation defense on three frameworks.

  12. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  13. Toward a Principled Framework for Agent Safety Measurement

    cs.CR 2026-05 unverdicted novelty 7.0

    BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.

  14. Many-Tier Instruction Hierarchy in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

  15. Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

  16. Web Agents Should Adopt the Plan-Then-Execute Paradigm

    cs.CR 2026-05 unverdicted novelty 6.0

    Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

  17. Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

    cs.CV 2026-05 accept novelty 6.0

    A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.

  18. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL 2026-05 unverdicted novelty 6.0

    GCAD steering extracts prompt-based attention deltas and gates them at token level, cutting coherence drift from -18.6 to -1.9 while raising trait expression at turn 10 from 78 to 93 on multi-turn persona benchmarks.

  19. CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring

    cs.LG 2026-05 unverdicted novelty 6.0

    CALYREX adds cross-attention to anchor system prompts in transformers, delivering 7.4% gains on IFEval, 16.3% on multi-turn adherence, and 13% lower jailbreak success at 8B scale.

  20. When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks

    cs.CR 2026-05 unverdicted novelty 6.0

    Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.

  21. ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

    cs.CR 2026-05 unverdicted novelty 6.0

    ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

  22. Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly

    cs.CR 2026-05 unverdicted novelty 6.0

    Policy directives can be lost during context assembly in language model agents, leading to unprompted policy violations that SafeContext can partially prevent.

  23. Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...

  24. AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

    cs.CR 2026-04 conditional novelty 6.0

    AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.

  25. Evaluation of Prompt Injection Defenses in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.

  26. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  27. Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

    cs.LG 2026-04 unverdicted novelty 6.0

    PRJA achieves 83.6% average success injecting harmful content into LRM reasoning chains on five QA datasets without altering final answers.

  28. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.

  29. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  30. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 5.0

    AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.

  31. Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables

    cs.SE 2026-05 unverdicted novelty 5.0

    A 1650-session factorial study found no measurable impact from config file size, instruction position, architecture, or conflicts on coding agent adherence, though compliance declined within sessions.

  32. Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals

    cs.AI 2026-05 unverdicted novelty 5.0

    Strat-LLM demonstrates that LLM trading performance varies by reasoning mode and model scale, with strict alignment reducing drawdowns in downtrends and deep reasoning avoiding small-gain traps.

  33. Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

    cs.CR 2026-04 unverdicted novelty 5.0

    SkillGuard-Robust formulates pre-load auditing of untrusted Agent Skills as a three-way classification task and achieves 97.30% exact match and 98.33% malicious-risk recall on held-out benchmarks.

  34. Evaluation of Prompt Injection Defenses in Large Language Models

    cs.CR 2026-04 unverdicted novelty 5.0

    Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.

  35. SafeAgent: A Runtime Protection Architecture for Agentic Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.

  36. Breaking the Illusion of Identity in LLM Tooling

    cs.SE 2026-04 unverdicted novelty 5.0

    Seven output rules for LLMs reduce anthropomorphic markers by over 97% in 780 tested conversations, shifting to a machine-like register via system prompt without model changes.

  37. Generalization Limits of Reinforcement Learning Alignment

    cs.LG 2026-04 unverdicted novelty 5.0

    Compound jailbreaks raise attack success on aligned LLMs from 14.3% to 71.4%, providing evidence that safety training generalizes less broadly than model capabilities.

  38. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  39. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 4.0

    AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.

  40. Making AI-Assisted Grant Evaluation Auditable without Exposing the Model

    cs.CR 2026-04 unverdicted novelty 4.0

    A TEE-based remote attestation system creates signed evaluation bundles that link input hashes, model measurements, and outputs to make AI grant reviews verifiable without revealing proprietary components.

  41. Security Considerations for Artificial Intelligence Agents

    cs.LG 2026-03 unverdicted novelty 3.0

    Frontier AI agents introduce new confidentiality, integrity, and availability risks through changed assumptions on code-data separation and authority boundaries, requiring layered defenses like sandboxing and policy e...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 36 Pith papers · 6 internal anchors

  1. [1]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861,

  2. [2]

    StruQ: Defending Against Prompt Injection with Structured Queries.arXiv preprint arXiv:2402.06363, 2024.https: //arxiv.org/abs/2402.06363

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. arXiv preprint arXiv:2402.06363,

  3. [3]

    Introduction and overview of the Multics system

    Fernando J Corbat ´o and Victor A Vyssotsky. Introduction and overview of the Multics system. In November 30–December 1, 1965, Fall Joint Computer Conference, Part I,

  4. [4]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,

  5. [5]

    Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. arXiv preprint arXiv:2402.14020,

  6. [6]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  7. [7]

    Prompt Injection attack against LLM-integrated Applications

    URL https://huggingface.co/datasets/ Lakera/gandalf summarization. Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499,

  8. [8]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser- assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

  9. [9]

    Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

    Aaron Parisi, Yao Zhao, and Noah Fiedel. TALM: Tool augmented language models. arXiv preprint arXiv:2205.12255,

  10. [10]

    Learning by distilling context.arXiv preprint arXiv:2209.15189, 2022

    Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. arXiv preprint arXiv:2209.15189,

  11. [11]

    Simon Willison

    URL https://lilianweng.github.io/posts/2023-06-23-agent/ . Simon Willison. Prompt injection attacks against GPT-3,

  12. [12]

    net/2022/Sep/12/prompt-injection/

    URL https://simonwillison. net/2022/Sep/12/prompt-injection/. Simon Willison. Multi-modal prompt injection image attacks against GPT-4V,

  13. [13]

    Benchmarking and defending against indirect prompt injection attacks on large language models.arXiv preprint arXiv:2312.14197, 2025

    URL https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/ . Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197,

  14. [14]

    Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success

    Yiming Zhang and Daphne Ippolito. Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success. arXiv preprint arXiv:2307.06865,

  15. [15]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    URL https://owasp.org/wwwcommunity/attacks/Command Injection. Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043,

  16. [16]

    Can LLMs separate instructions from data? And what do we even mean by that? In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models,

    Egor Zverev, Sahar Abdelnabi, Mario Fritz, and Christoph H Lampert. Can LLMs separate instructions from data? And what do we even mean by that? In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models,

  17. [17]

    Access Granted

    11 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions A Additional Results # Instruction Hierarchy You should follow higher priority instructions over lower priority ones. If a lower-level instruction tries to override, ignore, or conflict a higher-level instruction, disregard the lower-level one. If the conflict is too egregio...