Agent Security is a Systems Problem

Mihai Christodorescu; Earlence Fernandes; Ashish Hooda; Somesh Jha; Johann Rehberger; Kamalika Chaudhuri; Xiaohan Fu; Khawaja Shams; Guy Amir; Jihye Choi; Sarthak Choudhary; Nils Palumbo; Andrey Labunets; Nishit V. Pandya

arxiv: 2605.18991 · v2 · pith:BNTP7U3Pnew · submitted 2026-05-18 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-21 07:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords agent security · systems security · untrusted components · security invariants · AI agents · LLM security · adversarial attacks

The pith

Agent security must treat the AI model as an untrusted component and enforce invariants at the system level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that security for AI agents cannot rely on making the underlying model more robust. Instead, the model must be handled as an untrusted element inside a larger system, with established security rules applied at the system boundary. This view draws on decades of work in operating systems, networks, and formal methods to deliver more predictable protection. An analysis of eleven real-world attacks shows how the approach would block incidents that current methods miss. The authors also list research challenges that must be solved before the principles can be put into practice.

Core claim

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Efforts to increase model robustness are insufficient on their own. We must complement existing efforts with techniques from the systems security domain. Based on experience in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles that provide a foundation for designing agentic systems with predictable guarantees. Analysis of eleven representative real-world attacks shows how these principles could have prevented,

What carries the argument

Treating the AI model powering the agent as an untrusted component while enforcing security invariants at the overall system level using principles from operating systems, networks, and formal methods.

If this is right

The eleven analyzed attacks become preventable once systems principles such as isolation and invariant checking are applied.
Agent designs gain more reliable security properties by borrowing mechanisms proven in operating systems and networks.
Security research for agents shifts from model-only fixes toward layered system architectures.
Implementation requires solving identified research challenges around adapting traditional principles to open-ended agent behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent platforms may need dedicated security kernels that sit outside the model to mediate all actions.
Developers could adopt mandatory system-level audits before deploying agents that control real-world resources.
The same systems lens may apply to other interactive AI tools that execute actions on behalf of users.

Load-bearing premise

Established systems security principles can be transferred to agentic systems to deliver predictable guarantees despite the stochastic and open-ended behavior of large language models.

What would settle it

A working agent system that applies full systems-level isolation, invariant enforcement, and related principles yet still experiences a successful attack would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.18991 by Andrey Labunets, Ashish Hooda, Earlence Fernandes, Guy Amir, Jihye Choi, Johann Rehberger, Kamalika Chaudhuri, Khawaja Shams, Mihai Christodorescu, Nils Palumbo, Nishit V. Pandya, Sarthak Choudhary, Somesh Jha, Xiaohan Fu.

**Figure 1.** Standard security architecture consists of requests and responses that cross a security boundary between an ...

**Figure 2.** OpenAI Operator exploit flow: a prompt-injected GitHub issues page can steer the agent into an authenticated ...

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Agent security should treat the model as untrusted with system-level enforcement, though adapting systems techniques to LLM variability needs more concrete work.

Colleague,

This paper's main point is that agent security is best handled by treating the underlying AI model as untrusted and enforcing invariants through system architecture instead of focusing only on making the model more robust.

The work does well by surveying eleven real-world attacks and mapping them to systems security principles like isolation, least privilege, and formal verification. It shows how these could block issues such as unauthorized actions or data leaks in agent setups. The authors' experience in cybersecurity gives the analysis credibility, and listing open challenges at the end is a solid move.

Where it falls short is in addressing the practical transfer. The attacks are analyzed after the fact, but there's no example of how a reference monitor or similar mechanism would actually work when the LLM can generate varying responses to the same query. The concern about non-determinism affecting invariant enforcement is fair and not fully tackled here. It's more of a position than a worked-out solution.

This is useful for anyone thinking about securing production agents in enterprise settings or for systems researchers looking at AI applications. A reader interested in the intersection of ML and security would find the attack examples and principle list valuable for sparking ideas.

I'd send it for peer review. The position is worth putting in front of referees to refine the argument and see what concrete next steps emerge.

Referee Report

2 major / 2 minor

Summary. The paper argues that agent security must be treated as a systems problem: the underlying AI model should be viewed as an untrusted component, with security invariants enforced at the system level rather than through model robustness alone. Drawing on experience in operating systems, networks, formal methods, and adversarial ML, the authors articulate core principles from decades of systems security research, analyze eleven real-world attacks to illustrate how these principles could have prevented them, and identify open research challenges for realizing the approach in agentic systems.

Significance. If the central claim holds, the work provides a timely reframing that could guide more robust designs for LLM-based agents by integrating established systems techniques. The retrospective analysis of eleven attacks supplies concrete grounding and useful examples. However, the absence of a concrete mechanism demonstrating invariant enforcement across stochastic LLM outputs limits the strength of the 'predictable guarantees' assertion.

major comments (2)

[Attack analysis] Attack analysis section: The discussion shows that the eleven attacks succeeded but provides only retrospective commentary; it does not exhibit a specific systems mechanism (e.g., reference monitor or sandbox) that would still enforce the claimed invariants when the LLM produces differing outputs for the same input across multiple samples.
[Core principles] Principles articulation: The claim that systems security techniques yield predictable guarantees rests on an unverified extrapolation from deterministic domains; the manuscript does not address how non-deterministic LLM behavior affects enforcement of invariants, which is load-bearing for the central position.

minor comments (2)

[Abstract] The abstract could more explicitly name the core principles being proposed rather than referring to them generically.
[Attack analysis] A table summarizing the eleven attacks, the violated invariant, and the relevant systems principle would improve readability and make the evidence easier to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the manuscript's arguments can be clarified. We address each major comment below and have made targeted revisions to strengthen the presentation of the position without altering the core claims.

Referee: [Attack analysis] Attack analysis section: The discussion shows that the eleven attacks succeeded but provides only retrospective commentary; it does not exhibit a specific systems mechanism (e.g., reference monitor or sandbox) that would still enforce the claimed invariants when the LLM produces differing outputs for the same input across multiple samples.

Authors: The attack analysis section is intentionally retrospective, using real incidents to show how violations of systems principles enabled the attacks and how adherence to those principles could have prevented them. This serves as grounding for the position rather than an implementation claim. We agree that additional elaboration on enforcement under stochastic outputs would strengthen the discussion of predictable guarantees. In revision, we have added a paragraph to the attack analysis that describes how a reference monitor could mediate all external actions (e.g., tool invocations) by applying static policy checks and capability restrictions, independent of any particular LLM output or its variability across samples. This mechanism would reject or constrain actions that violate invariants even if the model produces inconsistent proposals.

revision: yes

Referee: [Core principles] Principles articulation: The claim that systems security techniques yield predictable guarantees rests on an unverified extrapolation from deterministic domains; the manuscript does not address how non-deterministic LLM behavior affects enforcement of invariants, which is load-bearing for the central position.

Authors: The referee correctly notes that non-determinism is central to the argument. The principles are designed such that enforcement occurs at the system boundary via mechanisms that do not depend on the internal consistency or determinism of the untrusted model component. For instance, a reference monitor or sandbox enforces invariants by inspecting and controlling observable actions and resource accesses, which remain subject to policy regardless of output variation. We have revised the principles section to include an explicit discussion of this point, explaining that the orthogonality between the mediator and the LLM allows invariants to be maintained even when outputs differ across runs. We also emphasize that realizing full predictability remains an open research challenge, consistent with the manuscript's existing identification of implementation gaps.

revision: yes
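
The mediation idea in the two responses above can be illustrated with a small sketch. This is not the paper's mechanism or any real agent framework's API: the `Action` type, the `POLICY` table, the `/workspace/` directory, and the `api.example.com` host are all hypothetical names chosen for illustration. The point it tries to show is narrow: the allow/deny decision is a function of the proposed action alone, so it gives the same answer however the model's outputs vary across samples.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Action:
    """A tool invocation proposed by the untrusted model (hypothetical type)."""
    tool: str
    target: str


def _file_in_workspace(target: str) -> bool:
    # Normalize first so "/workspace/../etc/passwd" cannot slip through.
    return posixpath.normpath(target).startswith("/workspace/")


def _host_is_allowed(target: str) -> bool:
    parts = urlsplit(target)
    return parts.scheme == "https" and parts.hostname == "api.example.com"


# Hypothetical static policy: tool name -> predicate over the target.
POLICY = {
    "read_file": _file_in_workspace,
    "http_get": _host_is_allowed,
}


def is_allowed(action: Action) -> bool:
    """Deny by default; the result depends only on the proposed action."""
    check = POLICY.get(action.tool)
    return check is not None and check(action.target)


def mediate(action: Action, execute):
    """Run `execute` only if the policy permits the action."""
    if not is_allowed(action):
        raise PermissionError(f"blocked: {action.tool} -> {action.target}")
    return execute(action)


# Four differing proposals, as might be sampled for the same prompt.
samples = [
    Action("read_file", "/workspace/notes.txt"),
    Action("read_file", "/workspace/../etc/passwd"),
    Action("http_get", "https://evil.example.net/?q=secret"),
    Action("shell", "rm -rf /"),
]
decisions = [is_allowed(a) for a in samples]
print(decisions)  # [True, False, False, False]
```

A sketch like this only covers invariants that can be stated over a single observable action. Policies that depend on where data came from, or on sequences of actions, would need more machinery, which is in line with the open challenges the rebuttal acknowledges.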

Circularity Check

0 steps flagged

No circularity; position paper draws on external systems literature

The manuscript is a position paper whose central claim—that agent security requires treating the LLM as an untrusted component and enforcing invariants at the system level—is justified by reference to decades of independent systems-security research in OS, networks, and formal methods rather than any internal derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are presented that reduce to quantities defined by the authors themselves; the eleven-attack analysis is retrospective evidence, not a constructed forecast. The argument therefore remains self-contained against external benchmarks and exhibits no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests on the domain assumption that systems security techniques transfer to stochastic AI agents; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption: Systems security principles from operating systems, networks, and formal methods provide a foundation for designing agentic systems with predictable guarantees.
Invoked in the abstract as the basis for the core principles and attack analysis.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 9 internal anchors

[1]

AWS Identity and Access Management (IAM).https://aws.amazon.com/iam/

Amazon Web Services. AWS Identity and Access Management (IAM).https://aws.amazon.com/iam/. Accessed: 2026

work page 2026
[2]

Agent skills—Claude API docs, 2026

Anthropic. Agent skills—Claude API docs, 2026. Accessed: 2026-02-05. URL:https://platform.claude.com/ docs/en/agents-and-tools/agent-skills/overview

work page 2026
[3]

Next-generation constitutional classifiers: More efficient protection against universal jailbreaks

Anthropic. Next-generation constitutional classifiers: More efficient protection against universal jailbreaks. https://www.anthropic.com/research/next-generation-constitutional-classifiers, January 2026

work page 2026
[4]

Poisoning fine-tuning datasets of constitutional classifiers.https://alignment.anthropic.com/2026/ backdooring-classifiers/, April 2026

Anthropic. Poisoning fine-tuning datasets of constitutional classifiers.https://alignment.anthropic.com/2026/ backdooring-classifiers/, April 2026

work page 2026
[5]

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples, 2018. URL:https://arxiv.org/abs/1802.00420, arXiv: 1802.00420. 9

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Information flows in causal networks.Advances in complex systems, 11(01):17–41, 2008

Nihat Ay and Daniel Polani. Information flows in causal networks.Advances in complex systems, 11(01):17–41, 2008

work page 2008
[7]

Introducing Guardrails: The Contextual Security Layer for the Agentic Era.https://invariantlabs.ai/blog/ guardrails, April 2025

Luca Beurer-Kellner, Marc Fischer, Hemang Sarkar, Kristian Bonde Nielsen, Marco Milanta, and Aleksei Kudrin- skii. Introducing Guardrails: The Contextual Security Layer for the Agentic Era.https://invariantlabs.ai/blog/ guardrails, April 2025. Accessed: 2026-02-05

work page 2025
[8]

StruQ: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. In34th USENIX Security Symposium (USENIX Security 25), pages 2383–2400, 2025. URL: https://www.usenix.org/conference/usenixsecurity25/presentation/chen-sizhe

work page 2025
[9]

Defending against prompt injection with a few defensive tokens, 2025

Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, and David Wagner. Defending against prompt injection with a few defensive tokens, 2025. URL:https://arxiv.org/abs/2507.07974,arXiv:2507.07974

work page arXiv 2025
[10]

SecAlign: Defending Against Prompt Injection with Preference Optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. SecAlign: Defending Against Prompt Injection with Preference Optimization. InProceedings of the ACM Conference on Computer and Communications Security (CCS), 2025

work page 2025
[11]

Meta secalign: A secure foundation llm against prompt injection attacks

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks, 2026. URL:https://arxiv.org/abs/2507.02735,arXiv:2507.02735

work page arXiv 2026
[12]

ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning,

Zhaorun Chen, Mintong Kang, and Bo Li. ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning,

work page
[13]

URL:https://arxiv.org/abs/2503.22738,arXiv:2503.22738

work page arXiv
[14]

How Not to Detect Prompt Injections with an LLM

Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, and Somesh Jha. How Not to Detect Prompt Injections with an LLM. InProceedings of the 18th ACM Workshop on Artificial Intelligence and Security, pages 218–229, 2025

work page 2025
[15]

Securing AI Agents with Information-Flow Control

Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Securing AI Agents with Information-Flow Control, 2025. URL:https://arxiv.org/abs/2505.23643,arXiv:2505.23643

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Joseph W. Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks, Kesha Hietala, Eleftherios Ioannidis, John Kastner, Anwar Mamat, Darin McAdams, Matt McCutchen, Neha Rungta, Emina Torlak, and Andrew M. Wells. Cedar: A new language for expressive, fast, safe, and analyzable authorization. Proc. ACM Program. Lang., 8(OOPSLA1), April ...

work page doi:10.1145/3649835 2024
[17]

Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems

Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In2016 IEEE symposium on security and privacy (SP), pages 598–617. IEEE, 2016

work page 2016
[18]

Defeating Prompt Injections by Design

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating Prompt Injections by Design, 2025. URL: https://arxiv.org/abs/2503.18813,arXiv:2503.18813

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Dorothy E. Denning. A Lattice Model of Secure Information Flow.Commun. ACM, 19(5):236–243, May 1976. doi:10.1145/360051.360056

work page doi:10.1145/360051.360056 1976
[20]

Binder, a Logic-based Security Language

John DeTreville. Binder, a Logic-based Security Language. InProceedings 2002 IEEE Symposium on Security and Privacy, pages 105–113. IEEE, 2002

work page 2002
[21]

Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N

William Enck, Peter Gilbert, Seungyeop Han, Vasant Tendulkar, Byung-Gon Chun, Landon P. Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N. Sheth. TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones.ACM Trans. Comput. Syst., 32(2), June 2014.doi:10.1145/2619091

work page doi:10.1145/2619091 2014
[22]

Causal abstractions of neural networks.Advances in neural information processing systems, 34:9574–9586, 2021

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks.Advances in neural information processing systems, 34:9574–9586, 2021

work page 2021
[23]

CAPSEM: Contextual Agent Privacy and Security Manager.https://capsem.org/, 2026

Google. CAPSEM: Contextual Agent Privacy and Security Manager.https://capsem.org/, 2026. Accessed: 2026-02-05. 10

work page 2026
[24]

Identity and Access Management (IAM).https://cloud.google.com/iam/

Google Cloud. Identity and Access Management (IAM).https://cloud.google.com/iam/. Accessed: 2026

work page 2026
[25]

Not what You’ve Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what You’ve Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,

work page
[26]

URL:https://arxiv.org/abs/2302.12173,arXiv:2302.12173

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Quantifying causal influences

Dominik Janzing, David Balduzzi, Moritz Grosse-Wentrup, and Bernhard Schölkopf. Quantifying causal influences. 2013

work page 2013
[28]

A critical evaluation of defenses against prompt injection attacks

Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. A critical evaluation of defenses against prompt injection attacks. In34th USENIX Security Symposium (USENIX Security 25), 2025

work page 2025
[29]

InConference on Empirical Methods in Natural Language Processing

Sam Johnson, Viet Pham, and Thai Le. Manipulating llm web agents with indirect prompt injection attack via html accessibility tree.arXiv preprint arXiv:2507.14799, 2025

work page arXiv 2025
[30]

Optimizing agent planning for security and autonomy

Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, and Santiago Zanella-Beguelin. Optimizing agent planning for security and autonomy. InThe Fourteenth International Conference on Learning Representations, 2026. URL:https://openreview.net/forum?id=g0aVCDY3gS

work page 2026
[31]

ACE: A Security Architecture for LLM-Integrated App Systems

Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, and Cristina Nita-Rotaru. ACE: A Security Architecture for LLM-Integrated App Systems. InProceedings of the Network and Distributed System Security Symposium (NDSS), 2026

work page 2026
[32]

In: IEEE Symposium on Security and Privacy (S&P)

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks. In2025 IEEE Symposium on Security and Privacy (SP), pages 2190–2208, 2025.doi:10.1109/SP61157.2025.00250

work page doi:10.1109/sp61157.2025.00250 2025
[33]

Prp: Propagating universal perturbations to attack large language model guard-rails, 2024

Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, and Atul Prakash. Prp: Propagating universal perturbations to attack large language model guard-rails, 2024. URL: https://arxiv.org/abs/2402.15911,arXiv:2402.15911

work page arXiv 2024
[34]

ceLLMate: Sandboxing Browser AI Agents,

Luoxi Meng, Henry Feng, Ilia Shumailov, and Earlence Fernandes. ceLLMate: Sandboxing Browser AI Agents,

work page
[35]

URL:https://arxiv.org/abs/2512.12594,arXiv:2512.12594

work page arXiv
[36]

Azure Policy Documentation.https://learn.microsoft.com/en-us/azure/governance/policy/

Microsoft. Azure Policy Documentation.https://learn.microsoft.com/en-us/azure/governance/policy/. Ac- cessed: 2026

work page 2026
[37]

Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, and Long T. Le. VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation, 2025. URL:https:// arxiv.org/abs/2510.05156,arXiv:2510.05156

work page arXiv 2025
[38]

Measuring information leakage using generalized gain functions

S Alvim M’rio, Kostas Chatzikokolakis, Catuscia Palamidessi, and Geoffrey Smith. Measuring information leakage using generalized gain functions. In2012 IEEE 25th Computer Security Foundations Symposium, pages 265–279. IEEE, 2012

work page 2012
[39]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V . Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr. The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections, 2025. UR...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Automatically hardening web applications using precise tainting

Anh Nguyen-Tuong, Salvatore Guarnieri, Doug Greene, Jeff Shirley, and David Evans. Automatically hardening web applications using precise tainting. InIFIP International Information Security Conference, pages 295–307. Springer, 2005

work page 2005
[41]

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

NVIDIA. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. https://github.com/NVIDIA/NeMo-Guardrails, 2023. Accessed: 2026-02-03. URL:https://github.com/ NVIDIA/NeMo-Guardrails

work page 2023
[42]

ClawHub, the skill dock for sharp agents, 2026

OpenClaw. ClawHub, the skill dock for sharp agents, 2026. Accessed: 2026-02-05. URL:https://clawhub.ai/. 11

work page 2026
[43]

Formal Policy Enforcement for Real-World Agentic Systems

Nils Palumbo, Sarthak Choudhary, Jihye Choi, Guy Amir, Prasad Chalasani, and Somesh Jha. Formal policy enforcement for real-world agentic systems, 2026. URL:https://arxiv.org/abs/2602.16708, arXiv:2602.16708

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

Validating mechanistic interpretations: An axiomatic approach

Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S Pasareanu, and Somesh Jha. Validating mechanistic interpretations: An axiomatic approach. InInternational Conference on Machine Learning, pages 47509–47544. PMLR, 2025

work page 2025
[45]

Pandya, Andrey Labunets, Sicun Gao, and Earlence Fernandes

Nishit V . Pandya, Andrey Labunets, Sicun Gao, and Earlence Fernandes. May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks, 2025. URL:https://arxiv.org/ abs/2507.07417,arXiv:2507.07417

work page arXiv 2025
[46]

Defending against injection attacks through context-sensitive string evaluation

Tadeusz Pietraszek and Chris Vanden Berghe. Defending against injection attacks through context-sensitive string evaluation. InInternational Workshop on Recent Advances in Intrusion Detection, pages 124–145. Springer, 2005

work page 2005
[47]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Breaking Instruction Hierarchy in OpenAI’s gpt-4o-mini.https://embracethered. com/blog/posts/2024/chatgpt-gpt-4o-mini-instruction-hierarchie-bypasses/, July 2024. Accessed: 2025-11- 03

work page 2024
[48]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). DeepSeek AI: From prompt injection to account takeover.https://embracethered. com/blog/posts/2024/deepseek-ai-prompt-injection-to-xss-and-account-takeover/, 2024. Accessed on 2025- 09-05

work page 2024
[49]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Google Gemini: Planting Instructions For Delayed Automatic Tool Invocation, feb

work page
[50]

URL:https://embracethered.com/blog/posts/2024/llm-context-pollution-and-delayed-automated-tool- invocation/

work page 2024
[51]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Microsoft Copilot: From Prompt Injection to Exfiltration of Personal Informa- tion.https://embracethered.com/blog/posts/2024/m365-copilot-prompt-injection-tool-invocation-and-data- exfil-using-ascii-smuggling/, 2024. Accessed on 2025-09-05

work page 2024
[52]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Spyware Injection Into Your ChatGPT’s Long-Term Memory (SpAIware).https:// embracethered.com/blog/posts/2024/chatgpt-macos-app-persistent-data-exfiltration/, 2024. Accessed on 2025-09-05

work page 2024
[53]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Terminal DiLLMas—Prompt Injection in the Terminal via ANSI Sequences.https:// embracethered.com/blog/posts/2024/terminal-dillmas-prompt-injection-ansi-sequences/, 2024. Accessed on 2025-09-05

work page 2024
[54]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). AI ClickFix: Hijacking Computer-Use Agents Using ClickFix. Blog post, May

work page
[55]

URL:https://embracethered.com/blog/posts/2025/ai-clickfix-ttp-claude/

work page 2025
[56]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). AMP–Agents that Modify System Configuration and Escape.https://embracethered. com/blog/posts/2025/amp-agents-that-modify-system-configuration-and-escape/, 2025. Accessed on 2025- 09-05

work page 2025
[57]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). ChatGPT Operator prompt injection exploits.https://embracethered.com/blog/ posts/2025/chatgpt-operator-prompt-injection-exploits/, 2025. Accessed on 2025-09-05

work page 2025
[58]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Claude Code: Data Exfiltration with DNS (CVE-2025-55284). Blog post, August

work page 2025
[59]

URL:https://embracethered.com/blog/posts/2025/claude-code-exfiltration-via-dns-requests/

work page 2025
[60]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Devin AI Kill Chain—Exposing Ports Leading to RCE and file Exfiltration. https://embracethered.com/blog/posts/2025/devin-ai-kill-chain-exposing-ports/, 2025. Accessed on 2025-09- 05

work page 2025
[61]

Rehberger (wunderwuzzi)

J. Rehberger (wunderwuzzi). Devin can leak your secrets—Prompt Injection Leads to Exfiltration.https:// embracethered.com/blog/posts/2025/devin-can-leak-your-secrets/, 2025. Accessed on 2025-09-05. 12

work page 2025
[62]

User- driven access control: Rethinking permission granting in modern operating systems

Franziska Roesner, Tadayoshi Kohno, Alexander Moshchuk, Bryan Parno, Helen J Wang, and Crispin Cowan. User- driven access control: Rethinking permission granting in modern operating systems. In2012 IEEE Symposium on Security and Privacy, pages 224–238. IEEE, 2012

work page 2012
[63]

Declassification: Dimensions and principles.Journal of Computer Security, 17(5):517–548, 2009

Andrei Sabelfeld and David Sands. Declassification: Dimensions and principles.Journal of Computer Security, 17(5):517–548, 2009

work page 2009
[64]

Saltzer and M.D

J.H. Saltzer and M.D. Schroeder. The Protection of Information in Computer Systems.Proceedings of the IEEE, 63(9):1278–1308, 1975.doi:10.1109/PROC.1975.9939

work page doi:10.1109/proc.1975.9939 1975
[65]

Trojan-speak: Bypassing constitutional classifiers with no jailbreak tax via adversarial finetuning, 2026

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, and Jerry Wei. Trojan-speak: Bypassing constitutional classifiers with no jailbreak tax via adversarial finetuning, 2026. URL:https://arxiv.org/abs/2603.29038, arXiv:2603. 29038

[66]

Hovav Shacham. The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86). In Proceedings of the 14th ACM Conference on Computer and Communications Security, CCS ’07, pages 552–561, New York, NY, USA, 2007. Association for Computing Machinery. doi:10.1145/1315245.1315313

[67]

Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, and Dawn Song. Progent: Programmable Privilege Control for LLM Agents, 2025. URL: https://arxiv.org/abs/2504.11703, arXiv:2504.11703

[68]

Marina Simakov. AgentFlayer: When a Jira Ticket Can Steal Your Secrets. https://labs.zenity.io/p/when-a-jira-ticket-can-steal-your-secrets, August 2025. Accessed: 2025-09-17.

[69]

Geoffrey Smith. On the foundations of quantitative information flow. In International Conference on Foundations of Software Science and Computational Structures, pages 288–302. Springer, 2009.

[70]

Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, and Alina Oprea. Muzzle: Adaptive agentic red-teaming of web agents against indirect prompt injection attacks. arXiv preprint arXiv:2602.09222, 2026.

[71]

Georgios Syros, Anshuman Suri, Jacob Ginesin, Cristina Nita-Rotaru, and Alina Oprea. SAGA: A Security Architecture for Governing AI Agentic Systems. In Network and Distributed System Security Symposium (NDSS), 2026.

[72]

Trishita Tiwari, Suchin Gururangan, Chuan Guo, Weizhe Hua, Sanjay Kariyappa, Udit Gupta, Wenjie Xiong, Kiwan Maeng, Hsien-Hsin S. Lee, and G. Edward Suh. Information flow control in machine learning through modular model architecture. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, USA, 2024. USENIX Association. URL: https://www....

[73]

Udbhav Tiwari and Meredith Whittaker. AI Agent, AI Spy. 39th Chaos Communication Congress, Congress Center Hamburg, Hamburg, Germany. URL: https://media.ccc.de/v/39c3-ai-agent-ai-spy

[74]

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions, 2024. URL: https://arxiv.org/abs/2404.13208, arXiv:2404.13208

[75]

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. AgentVigil: Generic black-box red-teaming for indirect prompt injection against LLM agents. arXiv preprint arXiv:2505.05849, 2025.

[76]

Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy, 2025. URL: https://arxiv.org/abs/2410.09102, arXiv:2410.09102

[77]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models, 2023. URL: https://arxiv.org/abs/2307.15043, arXiv:2307.15043

[78]

Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H. Lampert. Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? In The Thirteenth International Conference on Learning Representations (ICLR), 2025. URL: https://iclr.cc/virtual/2024/23872.

A Attacks on Agentic Systems: Additional Case Studies

DeepSeek AI Acc...
