arxiv: 2605.18991 · v2 · pith:BNTP7U3Pnew · submitted 2026-05-18 · 💻 cs.CR · cs.AI

Agent Security is a Systems Problem

Pith reviewed 2026-05-21 07:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords agent security · systems security · untrusted components · security invariants · AI agents · LLM security · adversarial attacks

The pith

Agent security must treat the AI model as an untrusted component and enforce invariants at the system level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that security for AI agents cannot rely solely on making the underlying model more robust. Instead, the model must be handled as an untrusted element inside a larger system, with established security rules enforced at the system boundary. This view draws on decades of work in operating systems, networks, and formal methods to deliver more predictable protection. The authors analyze eleven real-world attacks and discuss how the approach, if realized, could have prevented them. They also list research challenges that must be solved before the principles can be put into practice.

Core claim

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Efforts to increase model robustness are insufficient on their own. We must complement existing efforts with techniques from the systems security domain. Based on experience in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles that provide a foundation for designing agentic systems with predictable guarantees. Analysis of eleven representative real-world attacks shows how these principles, if realized, could have prevented them.

What carries the argument

Treating the AI model powering the agent as an untrusted component while enforcing security invariants at the overall system level using principles from operating systems, networks, and formal methods.

If this is right

  • The eleven analyzed attacks become preventable once systems principles such as isolation and invariant checking are applied.
  • Agent designs gain more reliable security properties by borrowing mechanisms proven in operating systems and networks.
  • Security research for agents shifts from model-only fixes toward layered system architectures.
  • Implementation requires solving identified research challenges around adapting traditional principles to open-ended agent behavior.
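
The "invariant checking" named in the first bullet can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the paper's mechanism: the `Session` class, the tool names, and the invariant itself ("once untrusted content has entered the context, no sensitive action may run") are all hypothetical choices made for this example. The check lives in the system wrapper around tool calls and never inspects the model.

```python
from dataclasses import dataclass, field

# Hypothetical tool names. Which tools count as untrusted sources or
# sensitive sinks is an assumption made for this sketch.
UNTRUSTED_SOURCES = {"fetch_web_page", "read_issue_comments"}
SENSITIVE_SINKS = {"send_email", "post_authenticated_request"}


class InvariantViolation(Exception):
    """Raised when a requested tool call would break the session invariant."""


@dataclass
class Session:
    tainted: bool = False
    log: list = field(default_factory=list)

    def call_tool(self, name: str) -> str:
        # Invariant: no sensitive sink runs after untrusted content was read.
        if name in SENSITIVE_SINKS and self.tainted:
            raise InvariantViolation(f"{name} blocked: context is tainted")
        if name in UNTRUSTED_SOURCES:
            self.tainted = True
        self.log.append(name)
        return f"{name}: ok"
```

In this sketch a session can call `send_email` before it has read any untrusted page, but the same call raises `InvariantViolation` after `fetch_web_page` has run. The rule is deliberately coarse; a practical system would likely need finer-grained labels, which is in line with the research challenges the last bullet mentions.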

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent platforms may need dedicated security kernels that sit outside the model to mediate all actions.
  • Developers could adopt mandatory system-level audits before deploying agents that control real-world resources.
  • The same systems lens may apply to other interactive AI tools that execute actions on behalf of users.

Load-bearing premise

Established systems security principles can be transferred to agentic systems to deliver predictable guarantees despite the stochastic and open-ended behavior of large language models.

What would settle it

A working agent system that applies full systems-level isolation, invariant enforcement, and related principles yet still experiences a successful attack would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.18991 by Andrey Labunets, Ashish Hooda, Earlence Fernandes, Guy Amir, Jihye Choi, Johann Rehberger, Kamalika Chaudhuri, Khawaja Shams, Mihai Christodorescu, Nils Palumbo, Nishit V. Pandya, Sarthak Choudhary, Somesh Jha, Xiaohan Fu.

Figure 1: Standard security architecture consists of requests and responses that cross a security boundary between an …
Figure 2: OpenAI Operator exploit flow: a prompt-injected GitHub issues page can steer the agent into an authenticated …

Original abstract

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that agent security must be treated as a systems problem: the underlying AI model should be viewed as an untrusted component, with security invariants enforced at the system level rather than through model robustness alone. Drawing on experience in operating systems, networks, formal methods, and adversarial ML, the authors articulate core principles from decades of systems security research, analyze eleven real-world attacks to illustrate how these principles could have prevented them, and identify open research challenges for realizing the approach in agentic systems.

Significance. If the central claim holds, the work provides a timely reframing that could guide more robust designs for LLM-based agents by integrating established systems techniques. The retrospective analysis of eleven attacks supplies concrete grounding and useful examples. However, the absence of a concrete mechanism demonstrating invariant enforcement across stochastic LLM outputs limits the strength of the 'predictable guarantees' assertion.

major comments (2)
  1. [Attack analysis] Attack analysis section: The discussion shows that the eleven attacks succeeded but provides only retrospective commentary; it does not exhibit a specific systems mechanism (e.g., reference monitor or sandbox) that would still enforce the claimed invariants when the LLM produces differing outputs for the same input across multiple samples.
  2. [Core principles] Principles articulation: The claim that systems security techniques yield predictable guarantees rests on an unverified extrapolation from deterministic domains; the manuscript does not address how non-deterministic LLM behavior affects enforcement of invariants, which is load-bearing for the central position.
minor comments (2)
  1. [Abstract] The abstract could more explicitly name the core principles being proposed rather than referring to them generically.
  2. [Attack analysis] A table summarizing the eleven attacks, the violated invariant, and the relevant systems principle would improve readability and make the evidence easier to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the manuscript's arguments can be clarified. We address each major comment below and have made targeted revisions to strengthen the presentation of the position without altering the core claims.

point-by-point responses
  1. Referee: [Attack analysis] Attack analysis section: The discussion shows that the eleven attacks succeeded but provides only retrospective commentary; it does not exhibit a specific systems mechanism (e.g., reference monitor or sandbox) that would still enforce the claimed invariants when the LLM produces differing outputs for the same input across multiple samples.

    Authors: The attack analysis section is intentionally retrospective, using real incidents to show how violations of systems principles enabled the attacks and how adherence to those principles could have prevented them. This serves as grounding for the position rather than an implementation claim. We agree that additional elaboration on enforcement under stochastic outputs would strengthen the discussion of predictable guarantees. In revision, we have added a paragraph to the attack analysis that describes how a reference monitor could mediate all external actions (e.g., tool invocations) by applying static policy checks and capability restrictions, independent of any particular LLM output or its variability across samples. This mechanism would reject or constrain actions that violate invariants even if the model produces inconsistent proposals. revision: yes

  2. Referee: [Core principles] Principles articulation: The claim that systems security techniques yield predictable guarantees rests on an unverified extrapolation from deterministic domains; the manuscript does not address how non-deterministic LLM behavior affects enforcement of invariants, which is load-bearing for the central position.

    Authors: The referee correctly notes that non-determinism is central to the argument. The principles are designed such that enforcement occurs at the system boundary via mechanisms that do not depend on the internal consistency or determinism of the untrusted model component. For instance, a reference monitor or sandbox enforces invariants by inspecting and controlling observable actions and resource accesses, which remain subject to policy regardless of output variation. We have revised the principles section to include an explicit discussion of this point, explaining that the orthogonality between the mediator and the LLM allows invariants to be maintained even when outputs differ across runs. We also emphasize that realizing full predictability remains an open research challenge, consistent with the manuscript's existing identification of implementation gaps. revision: yes
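
The mediation idea in the two responses above can be sketched in a few lines. The Python below is a hypothetical illustration, not the authors' implementation: `Action`, `mediate`, `stochastic_model`, the capability set, the allowed host, and the workspace path are all invented for this example, and the "model" is a random stub standing in for an LLM whose proposals differ across samples. The point it shows is narrow: the mediator's decision is a function of the proposed action only, so the invariant "no action outside capabilities and policy is executed" holds for every sample.

```python
import random
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Action:
    """A tool invocation proposed by the untrusted model."""
    tool: str
    arg: str


# Hypothetical capability set and static policy for one task.
CAPABILITIES = frozenset({"http_get", "read_file"})
ALLOWED_HOSTS = frozenset({"api.example.com"})
WORKSPACE = "/workspace/"


def mediate(action: Action) -> bool:
    """Allow an action only if it is within capabilities and policy."""
    if action.tool not in CAPABILITIES:
        return False  # capability restriction
    if action.tool == "http_get":
        return urlparse(action.arg).hostname in ALLOWED_HOSTS
    if action.tool == "read_file":
        return action.arg.startswith(WORKSPACE) and ".." not in action.arg
    return False  # default deny


# Stand-in for an LLM: the same prompt can yield different proposals.
CANDIDATES = [
    Action("http_get", "https://api.example.com/v1/items"),
    Action("http_get", "https://attacker.example.net/collect"),
    Action("read_file", "/workspace/notes.txt"),
    Action("read_file", "/workspace/../etc/passwd"),
    Action("shell", "curl https://attacker.example.net"),
]


def stochastic_model(prompt: str, rng: random.Random) -> Action:
    return rng.choice(CANDIDATES)


def run_agent(prompt: str, samples: int, seed: int) -> list:
    """Sample the model repeatedly; execute only mediated actions."""
    rng = random.Random(seed)
    executed = []
    for _ in range(samples):
        proposal = stochastic_model(prompt, rng)
        if mediate(proposal):
            executed.append(proposal)
    return executed
```

Calling `run_agent("summarize the issue", samples=500, seed=7)` returns only actions that pass `mediate`, however the stub's choices vary. The sketch says nothing about whether the policy itself is adequate or how to write policies for open-ended tasks, which the rebuttal leaves as an open research challenge.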

Circularity Check

0 steps flagged

No circularity; position paper draws on external systems literature

full rationale

The manuscript is a position paper whose central claim—that agent security requires treating the LLM as an untrusted component and enforcing invariants at the system level—is justified by reference to decades of independent systems-security research in OS, networks, and formal methods rather than any internal derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are presented that reduce to quantities defined by the authors themselves; the eleven-attack analysis is retrospective evidence, not a constructed forecast. The argument therefore remains self-contained against external benchmarks and exhibits no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests on the domain assumption that systems security techniques transfer to stochastic AI agents; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Systems security principles from operating systems, networks, and formal methods provide a foundation for designing agentic systems with predictable guarantees.
    Invoked in the abstract as the basis for the core principles and attack analysis.

pith-pipeline@v0.9.0 · 5727 in / 1146 out tokens · 21845 ms · 2026-05-21T07:39:38.099915+00:00 · methodology
