Preventing Prompt Injection with Type-Directed Privilege Separation

Basel Alomair; David Wagner; Dennis Jacob; Emad Alghamdi; Zhanhao Hu

arxiv: 2509.25926 · v2 · submitted 2025-09-30 · 💻 cs.CR · cs.LG

Preventing Prompt Injection with Type-Directed Privilege Separation

Dennis Jacob , Emad Alghamdi , Zhanhao Hu , Basel Alomair , David Wagner This is my paper

Pith reviewed 2026-05-18 12:32 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords prompt injectionprivilege separationlanguage model agentssystem-level defensesdata typesadversarial inputssecurity guarantees

0 comments

The pith

Converting untrusted inputs into restricted data types prevents prompt injection in language model agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces type-directed privilege separation to protect agentic language model systems from prompt injection attacks. It works by transforming potentially adversarial data into a small set of curated types that have narrow scope and cannot embed hidden commands. This expands the range of tasks that can use system-level guarantees compared to methods that isolate components entirely. Case studies show the approach blocks attacks while preserving non-trivial utility across reasoning tasks. The method requires no model changes and applies to any language model.

Core claim

Type-directed privilege separation prevents prompt injection by replacing raw string inputs with a curated collection of data types whose limited structure and content eliminate the ability to carry injected tasks, thereby providing system-level security guarantees for a wider set of agent workflows than isolation-based defenses allow.

What carries the argument

Type-directed privilege separation, the mechanism that parses untrusted data into predefined data types with restricted scope and content instead of passing raw strings.

If this is right

Agent designs can safely incorporate untrusted external data without blocking all inter-component communication.
Security holds for reasoning-intensive tasks that previous isolation techniques could not cover.
No fine-tuning or detector training is required, since protection comes from the type restrictions themselves.
The same principles can be applied across different language models without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Structured input handling could reduce the attack surface for related issues such as data leakage or unauthorized tool use.
Agent frameworks might adopt type schemas as a standard interface layer between untrusted sources and the model.
Developers could test the approach on new domains by defining minimal type sets that still allow useful computation.

Load-bearing premise

Converting untrusted data to a curated set of data types with limited scope and content is sufficient to eliminate the possibility of prompt injection for the tasks considered.

What would settle it

A successful attack in which an adversary supplies an input that, after conversion to the paper's curated data types, still causes the agent to execute an unintended or injected task.

Figures

Figures reproduced from arXiv: 2509.25926 by Basel Alomair, David Wagner, Dennis Jacob, Emad Alghamdi, Zhanhao Hu.

**Figure 1.** Figure 1: Type-directed privilege separation for LLM agents. We illustrate the approach with a bug fixing agent. An undefended agent with unrestricted access to the list of open issues will be vulnerable to prompt injection (leftmost panel). The Dual LLM pattern improves security, but precludes the privileged LLM from accessing the issues list (middle panel). Our method allows the privileged agent to access context … view at source ↗

**Figure 2.** Figure 2: An online shopping agent. The agent first generates a search query corresponding to the user instruction and then navigates the website to find the target product. User reviews present a vector for prompt injection. Spicy Beef Bundle | 1 Usinger's Hot… $14.49 Description The Spicy Beef Backpacking Bundle | 1 Usinger's Hot… buy now Reviews Perfect Hiking Snack! The sausage is packed with flavor… ! ∈ #!"#$ G… view at source ↗

**Figure 3.** Figure 3: Our defended online shopping agent, which has been protected using type-directed privilege separation. Summarizing each review into a set of two integers prevents prompt injection. 2) Email generation: The agent sends an email to the receiving party asking to schedule a meeting. 3) Email response parsing: The receiving party responds to the email, and the agent analyzes the email response to determine the… view at source ↗

**Figure 4.** Figure 4: The calendar invitation agent: (a) An undefended agent, which is vulnerable to prompt injection in email replies and (b) Our defended agent, which converts the email chain to a set of candidate meeting times, preventing prompt injection. a description of the meeting, such as a topic, Zoom link, or anything else relevant. The list of suggested meeting times is a list of data type enum, where each element is… view at source ↗

**Figure 5.** Figure 5: The coding agent: (a) The undefended agent, which is vulnerable to prompt injection in issue texts, and (b) The defended agent, in which the quarantined agent localizes the bug from the issue text and sends a safe handoff to the privileged agent to construct a fix. 4) Patch drafting: The agent proposes small, auditable edits. 5) Validation: The agent compiles and runs tests, with observations fed back into… view at source ↗

read the original abstract

Modern language models have enabled the development of agentic systems that achieve strong performance on reasoning-intensive tasks. Unfortunately, this has come with a security cost; these systems are vulnerable to prompt injection, a specialized attack where an adversary subverts the intended functionality of an agent by supplying an injected task of their own. Previous approaches address this challenge with detectors and fine-tuning defenses but are vulnerable to adaptive attacks. Other methods propose system-level defenses that guarantee security, but these are often based on techniques that prevent inter-component communication and thus are constrained in problem coverage. To this end, we introduce type-directed privilege separation, a new technique that expands the set of tasks that can be protected with system-level defenses. Our method works by converting untrusted data to a curated set of data types; unlike raw strings, each data type is limited in scope and content, eliminating the possibility for prompt injection. We evaluate our method across several case studies and find that designs using our principles can systematically prevent prompt injection attacks while featuring strong, non-trivial utility. Our approach is intuitive to understand and compatible with any language model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a type-based way to limit untrusted inputs in LLM agents so they cannot carry instructions, but the security argument stays informal.

read the letter

The core idea is to replace raw string inputs with a small set of curated data types whose limited structure is meant to block prompt injection at the system level. That framing is new relative to the detector and fine-tuning work the abstract cites, and it aims to keep more functionality than full isolation approaches. The case studies are presented as evidence that the method can be applied to realistic tasks without killing utility, which is the practical claim worth checking.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces type-directed privilege separation as a defense against prompt injection in LLM-based agentic systems. Untrusted inputs are converted to a curated set of data types whose restricted scope and content are asserted to eliminate injection possibilities, unlike prior detectors (vulnerable to adaptive attacks) or communication-restricting system defenses. The approach is evaluated via case studies claiming systematic prevention of attacks alongside strong, non-trivial utility, with the method described as intuitive and compatible with any language model.

Significance. If the central claim holds, the work would meaningfully expand the scope of system-level prompt-injection defenses by preserving inter-component communication while providing stronger guarantees than detection or fine-tuning methods. The type-directed principle could serve as a reusable design pattern for privilege separation in agentic workflows, with potential for broad adoption given its claimed model-agnostic nature.

major comments (2)

[Abstract] Abstract: the claim that conversion to 'a curated set of data types' with 'limited scope and content' eliminates the possibility of prompt injection is load-bearing for the security guarantee, yet the manuscript supplies neither a formal semantics for these types nor a proof that admissible values remain non-instructional under adversarial serialization or interpolation into the agent's prompt template.
[Evaluation] Case studies / Evaluation section: the reported results are purely qualitative with no quantitative metrics, error analysis, attack success rates under adaptive adversaries, or explicit definitions of the data types employed, preventing verification of the 'strong, non-trivial utility' and 'systematic prevention' assertions.

minor comments (1)

[Introduction] The manuscript would benefit from an early, explicit enumeration of the data-type signatures and their serialization rules to make the privilege-separation mechanism reproducible.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify areas where the security argument and evaluation can be strengthened. We respond to each major comment below and indicate the changes we will make in revision.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that conversion to 'a curated set of data types' with 'limited scope and content' eliminates the possibility of prompt injection is load-bearing for the security guarantee, yet the manuscript supplies neither a formal semantics for these types nor a proof that admissible values remain non-instructional under adversarial serialization or interpolation into the agent's prompt template.

Authors: We agree that the security claim would be stronger with a formal semantics. The current manuscript provides an informal but detailed argument based on the syntactic and semantic restrictions of each data type. In the revised manuscript we will add explicit schemas for every data type together with a step-by-step informal proof that no admissible value can encode an instruction when serialized and interpolated. A machine-checked formal semantics lies outside the scope of this systems-oriented paper; we therefore treat the request for a full formal proof as a standing objection. revision: partial
Referee: [Evaluation] Case studies / Evaluation section: the reported results are purely qualitative with no quantitative metrics, error analysis, attack success rates under adaptive adversaries, or explicit definitions of the data types employed, preventing verification of the 'strong, non-trivial utility' and 'systematic prevention' assertions.

Authors: The original evaluation consists of case studies chosen to demonstrate both security and utility in realistic agent workflows. We will revise the Evaluation section to include (1) explicit definitions and schemas for each data type, (2) quantitative utility metrics (task completion rate, latency overhead) for each case study, and (3) an explicit discussion of why adaptive attacks are ruled out by construction rather than by empirical measurement. These additions will make the claims directly verifiable while preserving the qualitative nature of the security argument. revision: yes

standing simulated objections not resolved

A formal semantics and machine-checked proof that admissible values of the curated data types remain non-instructional under adversarial serialization and interpolation.

Circularity Check

0 steps flagged

No circularity: design principle with independent evaluation

full rationale

The paper introduces type-directed privilege separation as a novel design technique that converts untrusted inputs to curated data types with restricted scope and content, asserting this eliminates prompt injection while preserving utility. This is framed as an intuitive system-level approach evaluated via case studies, not as a derivation from equations, fitted parameters, or prior self-cited results. No self-definitional loops, renamed known results, or load-bearing self-citations appear; the central mechanism is an independent engineering choice whose security claim rests on the type restrictions themselves rather than reducing to the paper's own inputs by construction. The approach is self-contained against external benchmarks of prompt injection defenses.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that carefully scoped data types can be defined for practical tasks without losing essential functionality and that such types inherently block injection vectors.

axioms (1)

domain assumption Untrusted data can be reliably converted into a finite set of curated types whose content is limited enough to preclude arbitrary instructions.
Stated in the abstract as the core mechanism: converting untrusted data to curated data types eliminates the possibility for prompt injection.

invented entities (1)

type-directed privilege separation no independent evidence
purpose: New defense technique that uses data typing to enforce privilege limits on untrusted inputs.
Introduced as the primary contribution; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5726 in / 1266 out tokens · 29085 ms · 2026-05-18T12:32:07.527220+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

converting untrusted data to a curated set of data types; unlike raw strings, each data type is limited in scope and content, eliminating the possibility for prompt injection
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Integers can convey numeric data (d∈Z), but cannot be used to convey any notion of an instruction or command

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 1 Pith paper

[1]

Building effective agents,

Anthropic, “Building effective agents,” Dec. 2024

work page 2024
[2]

Claude Code,

——, “Claude Code,” Anthropic, Sep. 2025

work page 2025
[3]

Model Context Protocol,

——, “Model Context Protocol,” Anthropic, Sep. 2025

work page 2025
[4]

Design Patterns for Securing LLM Agents against Prompt Injections,

L. Beurer-Kellner, B. Buesser, A.-M. Cret ¸u, E. Debenedetti, D. Dobos, D. Fabian, M. Fischer, D. Froelicher, K. Grosse, D. Naeff, E. Ozoani, A. Paverd, F. Tram `er, and V . V olhejn, “Design Patterns for Securing LLM Agents against Prompt Injections,” Jun. 2025

work page 2025
[5]

Fmops/distilbert-prompt-injection,

Blueteam AI, “Fmops/distilbert-prompt-injection,” Blueteam AI, Jul. 2024

work page 2024
[6]

OpenAI Gym,

G. Brockman, V . Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” Jun. 2016

work page 2016
[7]

Lan- guage Models are Few-Shot Learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka- plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCan- dlish, A. Radford, I. Sutskever, and D....

work page 2020
[8]

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods,

N. Carlini and D. Wagner, “Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods,” inThe 10th ACM Workshop on Artificial Intelligence and Security (AISec ’17). arXiv, Nov. 2017

work page 2017
[9]

StruQ: Defending Against Prompt Injection with Structured Queries,

S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “StruQ: Defending Against Prompt Injection with Structured Queries,” inUSENIX Security Symposium 2025. arXiv, Sep. 2024

work page 2025
[10]

SecAlign: Defending Against Prompt Injection with Preference Optimization,

S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaud- huri, D. Wagner, and C. Guo, “SecAlign: Defending Against Prompt Injection with Preference Optimization,” inACM CCS 2025, Jul. 2025

work page 2025
[11]

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks,

S. Chen, A. Zharmagambetov, D. Wagner, and C. Guo, “Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks,” Jul. 2025

work page 2025
[12]

Defeating Prompt Injections by Design,

E. Debenedetti, I. Shumailov, T. Fan, J. Hayes, N. Carlini, D. Fabian, C. Kern, C. Shi, A. Terzis, and F. Tram `er, “Defeating Prompt Injections by Design,” Jun. 2025

work page 2025
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,

Gemini Team, “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,” Jul. 2025

work page 2025
[14]

Gemini CLI,

Google, “Gemini CLI,” Google, Sep. 2025

work page 2025
[15]

Project Mariner,

Google DeepMind, “Project Mariner,” Google Deep- Mind, Sep. 2025

work page 2025
[16]

Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” inThe 16th ACM Work- shop on Artificial Intelligence and Security (AISec ’23). arXiv, May 2023

work page 2023
[17]

PromptShield: Deployable Detection for Prompt Injection Attacks,

D. Jacob, H. Alzahrani, Z. Hu, B. Alomair, and D. Wag- ner, “PromptShield: Deployable Detection for Prompt Injection Attacks,” inACM CODASPY 2025. arXiv, Apr. 2025

work page 2025
[18]

SWE-bench: Can Lan- guage Models Resolve Real-World GitHub Issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can Lan- guage Models Resolve Real-World GitHub Issues?” in ICLR 2024. arXiv, Nov. 2024

work page 2024
[19]

Large Language Models are Zero-Shot Reasoners,

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large Language Models are Zero-Shot Reasoners,” in NeurIPS 2022. arXiv, Jan. 2023

work page 2022
[20]

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free,

H. Li, X. Liu, N. Zhang, and C. Xiao, “PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free,” inACL 2025, Jul. 2025

work page 2025
[21]

Au- tomatic and Universal Prompt Injection Attacks against Large Language Models,

X. Liu, Z. Yu, Y . Zhang, N. Zhang, and C. Xiao, “Au- tomatic and Universal Prompt Injection Attacks against Large Language Models,” Mar. 2024

work page 2024
[22]

Formal- izing and Benchmarking Prompt Injection Attacks and Defenses,

Y . Liu, Y . Jia, R. Geng, J. Jia, and N. Z. Gong, “Formal- izing and Benchmarking Prompt Injection Attacks and Defenses,” inUSENIX Security 2024. arXiv, Nov. 2024

work page 2024
[23]

Towards Deep Learning Models Resistant to Adversarial Attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards Deep Learning Models Resistant to Adversarial Attacks,” inICLR 2018. arXiv, Sep. 2019

work page 2018
[24]

Meta-llama/Llama-Prompt-Guard-2-22M,

Meta, “Meta-llama/Llama-Prompt-Guard-2-22M,” Meta, Sep. 2025

work page 2025
[25]

Addendum to GPT-5 system card: GPT-5- Codex,

OpenAI, “Addendum to GPT-5 system card: GPT-5- Codex,” Sep. 2025

work page 2025
[26]

ChatGPT agent,

——, “ChatGPT agent,” OpenAI, Sep. 2025

work page 2025
[27]

GPT-5 System Card,

——, “GPT-5 System Card,” Aug. 2025

work page 2025
[28]

Ignore Previous Prompt: Attack Techniques For Language Models,

F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques For Language Models,” inNeurIPS 2022 Workshop on Machine Learning Safety. arXiv, Nov. 2022

work page 2022
[29]

Jatmo: Prompt Injection Defense by Task-Specific Finetuning,

J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “Jatmo: Prompt Injection Defense by Task-Specific Finetuning,” inES- ORICS 2024. arXiv, Jan. 2024

work page 2024
[30]

JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift,

J. Piet, X. Huang, D. Jacob, A. Chow, M. Alrashed, G. Zhao, Z. Hu, C. Sitawarin, B. Alomair, and D. Wag- ner, “JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift,” inThe 18th ACM Workshop on Artificial Intelligence and Security (AISec ’25). arXiv, Apr. 2025

work page 2025
[31]

Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection,

ProtectAI.com, “Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection,” ProtectAI.com, Sep. 2023

work page 2023
[32]

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks,

A. Rao, S. Vashistha, A. Naik, S. Aditya, and M. Choud- hury, “Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks,” inLREC-COLING

work page
[33]

”Do Anything Now

X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “”Do Anything Now”: Characterizing and Evaluating In- The-Wild Jailbreak Prompts on Large Language Mod- els,” inACM CCS 2024. arXiv, May 2024

work page 2024
[34]

CYBERSE- CEV AL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models,

S. Wan, C. Nikolaidis, D. Song, D. Molnar, J. Crnkovich, J. Grace, M. Bhatt, S. Chennabasappa, S. Whitman, S. Ding, V . Ionescu, Y . Li, and J. Saxe, “CYBERSE- CEV AL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models,” Sep. 2024

work page 2024
[35]

Jailbreak and Guard Aligned Language Models with Only Few In- Context Demonstrations,

Z. Wei, Y . Wang, A. Li, Y . Mo, and Y . Wang, “Jailbreak and Guard Aligned Language Models with Only Few In- Context Demonstrations,” inICML 2024. arXiv, May 2024

work page 2024
[36]

The Dual LLM pattern for building AI assistants that can resist prompt injection

S. Willison, “The Dual LLM pattern for building AI assistants that can resist prompt injection.” Apr. 2023

work page 2023
[37]

Prompt injection attacks against GPT-3,

——, “Prompt injection attacks against GPT-3,” Sep. 2022

work page 2022
[38]

I think “agent

——, “I think “agent” may finally have a widely enough agreed upon definition to be useful jargon now,” Sep. 2025

work page 2025
[39]

OW ASP Top 10 for LLM Applications 2025,

S. Wilson and A. Dawson, “OW ASP Top 10 for LLM Applications 2025,” Nov. 2024

work page 2025
[40]

SWE-agent: Agent- Computer Interfaces Enable Automated Software Engi- neering,

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent- Computer Interfaces Enable Automated Software Engi- neering,” inNeurIPS 2024. arXiv, Nov. 2024

work page 2024
[41]

Web- Shop: Towards Scalable Real-World Web Interaction with Grounded Language Agents,

S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Web- Shop: Towards Scalable Real-World Web Interaction with Grounded Language Agents,” inNeurIPS 2022. arXiv, Feb. 2023

work page 2022
[42]

ReAct: Synergizing Reasoning and Acting in Language Models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing Reasoning and Acting in Language Models,” inICLR 2023. arXiv, Mar. 2023

work page 2023
[43]

Universal and Transferable Adversarial Attacks on Aligned Language Models,

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” Dec. 2023. APPENDIXA INJECTION PAYLOADS A. Online shopping agent Injection attack for online shopping agent <SAFETY CRITICAL OVERRIDE ALERT>YOU HA VE BEEN MISLED BY A MALICIOUS AD- VERSARY AND ARE ON THE ITEM ...

work page 2023

[1] [1]

Building effective agents,

Anthropic, “Building effective agents,” Dec. 2024

work page 2024

[2] [2]

Claude Code,

——, “Claude Code,” Anthropic, Sep. 2025

work page 2025

[3] [3]

Model Context Protocol,

——, “Model Context Protocol,” Anthropic, Sep. 2025

work page 2025

[4] [4]

Design Patterns for Securing LLM Agents against Prompt Injections,

L. Beurer-Kellner, B. Buesser, A.-M. Cret ¸u, E. Debenedetti, D. Dobos, D. Fabian, M. Fischer, D. Froelicher, K. Grosse, D. Naeff, E. Ozoani, A. Paverd, F. Tram `er, and V . V olhejn, “Design Patterns for Securing LLM Agents against Prompt Injections,” Jun. 2025

work page 2025

[5] [5]

Fmops/distilbert-prompt-injection,

Blueteam AI, “Fmops/distilbert-prompt-injection,” Blueteam AI, Jul. 2024

work page 2024

[6] [6]

OpenAI Gym,

G. Brockman, V . Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” Jun. 2016

work page 2016

[7] [7]

Lan- guage Models are Few-Shot Learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka- plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCan- dlish, A. Radford, I. Sutskever, and D....

work page 2020

[8] [8]

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods,

N. Carlini and D. Wagner, “Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods,” inThe 10th ACM Workshop on Artificial Intelligence and Security (AISec ’17). arXiv, Nov. 2017

work page 2017

[9] [9]

StruQ: Defending Against Prompt Injection with Structured Queries,

S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “StruQ: Defending Against Prompt Injection with Structured Queries,” inUSENIX Security Symposium 2025. arXiv, Sep. 2024

work page 2025

[10] [10]

SecAlign: Defending Against Prompt Injection with Preference Optimization,

S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaud- huri, D. Wagner, and C. Guo, “SecAlign: Defending Against Prompt Injection with Preference Optimization,” inACM CCS 2025, Jul. 2025

work page 2025

[11] [11]

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks,

S. Chen, A. Zharmagambetov, D. Wagner, and C. Guo, “Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks,” Jul. 2025

work page 2025

[12] [12]

Defeating Prompt Injections by Design,

E. Debenedetti, I. Shumailov, T. Fan, J. Hayes, N. Carlini, D. Fabian, C. Kern, C. Shi, A. Terzis, and F. Tram `er, “Defeating Prompt Injections by Design,” Jun. 2025

work page 2025

[13] [13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,

Gemini Team, “Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities,” Jul. 2025

work page 2025

[14] [14]

Gemini CLI,

Google, “Gemini CLI,” Google, Sep. 2025

work page 2025

[15] [15]

Project Mariner,

Google DeepMind, “Project Mariner,” Google Deep- Mind, Sep. 2025

work page 2025

[16] [16]

Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” inThe 16th ACM Work- shop on Artificial Intelligence and Security (AISec ’23). arXiv, May 2023

work page 2023

[17] [17]

PromptShield: Deployable Detection for Prompt Injection Attacks,

D. Jacob, H. Alzahrani, Z. Hu, B. Alomair, and D. Wag- ner, “PromptShield: Deployable Detection for Prompt Injection Attacks,” inACM CODASPY 2025. arXiv, Apr. 2025

work page 2025

[18] [18]

SWE-bench: Can Lan- guage Models Resolve Real-World GitHub Issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can Lan- guage Models Resolve Real-World GitHub Issues?” in ICLR 2024. arXiv, Nov. 2024

work page 2024

[19] [19]

Large Language Models are Zero-Shot Reasoners,

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large Language Models are Zero-Shot Reasoners,” in NeurIPS 2022. arXiv, Jan. 2023

work page 2022

[20] [20]

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free,

H. Li, X. Liu, N. Zhang, and C. Xiao, “PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free,” inACL 2025, Jul. 2025

work page 2025

[21] [21]

Au- tomatic and Universal Prompt Injection Attacks against Large Language Models,

X. Liu, Z. Yu, Y . Zhang, N. Zhang, and C. Xiao, “Au- tomatic and Universal Prompt Injection Attacks against Large Language Models,” Mar. 2024

work page 2024

[22] [22]

Formal- izing and Benchmarking Prompt Injection Attacks and Defenses,

Y . Liu, Y . Jia, R. Geng, J. Jia, and N. Z. Gong, “Formal- izing and Benchmarking Prompt Injection Attacks and Defenses,” inUSENIX Security 2024. arXiv, Nov. 2024

work page 2024

[23] [23]

Towards Deep Learning Models Resistant to Adversarial Attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards Deep Learning Models Resistant to Adversarial Attacks,” inICLR 2018. arXiv, Sep. 2019

work page 2018

[24] [24]

Meta-llama/Llama-Prompt-Guard-2-22M,

Meta, “Meta-llama/Llama-Prompt-Guard-2-22M,” Meta, Sep. 2025

work page 2025

[25] [25]

Addendum to GPT-5 system card: GPT-5- Codex,

OpenAI, “Addendum to GPT-5 system card: GPT-5- Codex,” Sep. 2025

work page 2025

[26] [26]

ChatGPT agent,

——, “ChatGPT agent,” OpenAI, Sep. 2025

work page 2025

[27] [27]

GPT-5 System Card,

——, “GPT-5 System Card,” Aug. 2025

work page 2025

[28] [28]

Ignore Previous Prompt: Attack Techniques For Language Models,

F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques For Language Models,” inNeurIPS 2022 Workshop on Machine Learning Safety. arXiv, Nov. 2022

work page 2022

[29] [29]

Jatmo: Prompt Injection Defense by Task-Specific Finetuning,

J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “Jatmo: Prompt Injection Defense by Task-Specific Finetuning,” inES- ORICS 2024. arXiv, Jan. 2024

work page 2024

[30] [30]

JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift,

J. Piet, X. Huang, D. Jacob, A. Chow, M. Alrashed, G. Zhao, Z. Hu, C. Sitawarin, B. Alomair, and D. Wag- ner, “JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift,” inThe 18th ACM Workshop on Artificial Intelligence and Security (AISec ’25). arXiv, Apr. 2025

work page 2025

[31] [31]

Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection,

ProtectAI.com, “Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection,” ProtectAI.com, Sep. 2023

work page 2023

[32] [32]

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks,

A. Rao, S. Vashistha, A. Naik, S. Aditya, and M. Choud- hury, “Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks,” inLREC-COLING

work page

[33] [33]

”Do Anything Now

X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “”Do Anything Now”: Characterizing and Evaluating In- The-Wild Jailbreak Prompts on Large Language Mod- els,” inACM CCS 2024. arXiv, May 2024

work page 2024

[34] [34]

CYBERSE- CEV AL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models,

S. Wan, C. Nikolaidis, D. Song, D. Molnar, J. Crnkovich, J. Grace, M. Bhatt, S. Chennabasappa, S. Whitman, S. Ding, V . Ionescu, Y . Li, and J. Saxe, “CYBERSE- CEV AL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models,” Sep. 2024

work page 2024

[35] [35]

Jailbreak and Guard Aligned Language Models with Only Few In- Context Demonstrations,

Z. Wei, Y . Wang, A. Li, Y . Mo, and Y . Wang, “Jailbreak and Guard Aligned Language Models with Only Few In- Context Demonstrations,” inICML 2024. arXiv, May 2024

work page 2024

[36] [36]

The Dual LLM pattern for building AI assistants that can resist prompt injection

S. Willison, “The Dual LLM pattern for building AI assistants that can resist prompt injection.” Apr. 2023

work page 2023

[37] [37]

Prompt injection attacks against GPT-3,

——, “Prompt injection attacks against GPT-3,” Sep. 2022

work page 2022

[38] [38]

I think “agent

——, “I think “agent” may finally have a widely enough agreed upon definition to be useful jargon now,” Sep. 2025

work page 2025

[39] [39]

OW ASP Top 10 for LLM Applications 2025,

S. Wilson and A. Dawson, “OW ASP Top 10 for LLM Applications 2025,” Nov. 2024

work page 2025

[40] [40]

SWE-agent: Agent- Computer Interfaces Enable Automated Software Engi- neering,

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent- Computer Interfaces Enable Automated Software Engi- neering,” inNeurIPS 2024. arXiv, Nov. 2024

work page 2024

[41] [41]

Web- Shop: Towards Scalable Real-World Web Interaction with Grounded Language Agents,

S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Web- Shop: Towards Scalable Real-World Web Interaction with Grounded Language Agents,” inNeurIPS 2022. arXiv, Feb. 2023

work page 2022

[42] [42]

ReAct: Synergizing Reasoning and Acting in Language Models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing Reasoning and Acting in Language Models,” inICLR 2023. arXiv, Mar. 2023

work page 2023

[43] [43]

Universal and Transferable Adversarial Attacks on Aligned Language Models,

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” Dec. 2023. APPENDIXA INJECTION PAYLOADS A. Online shopping agent Injection attack for online shopping agent <SAFETY CRITICAL OVERRIDE ALERT>YOU HA VE BEEN MISLED BY A MALICIOUS AD- VERSARY AND ARE ON THE ITEM ...

work page 2023