Agent Security is a Systems Problem
Pith reviewed 2026-05-21 07:39 UTC · model grok-4.3
The pith
Agent security must treat the AI model as an untrusted component and enforce invariants at the system level.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Efforts to increase model robustness are insufficient on their own. We must complement existing efforts with techniques from the systems security domain. Based on experience in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles that provide a foundation for designing agentic systems with predictable guarantees. Analysis of eleven representative real-world attacks shows how these principles could have prevented them.
What carries the argument
Treating the AI model powering the agent as an untrusted component while enforcing security invariants at the overall system level using principles from operating systems, networks, and formal methods.
If this is right
- The eleven analyzed attacks become preventable once systems principles such as isolation and invariant checking are applied.
- Agent designs gain more reliable security properties by borrowing mechanisms proven in operating systems and networks.
- Security research for agents shifts from model-only fixes toward layered system architectures.
- Implementation requires solving identified research challenges around adapting traditional principles to open-ended agent behavior.
Where Pith is reading between the lines
- Agent platforms may need dedicated security kernels that sit outside the model to mediate all actions.
- Developers could adopt mandatory system-level audits before deploying agents that control real-world resources.
- The same systems lens may apply to other interactive AI tools that execute actions on behalf of users.
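The "security kernel" idea in the first point above can be sketched as a small reference monitor. This is an illustrative sketch only, not a mechanism taken from the paper: the names `ToolCall`, `Capability`, and `ReferenceMonitor`, the tool names, and the `/workspace` prefix are all hypothetical. It assumes every action the model proposes reaches the system as a structured tool call, so the monitor can decide on the call alone without trusting the model.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ToolCall:
    """An action proposed by the untrusted model (hypothetical shape)."""
    tool: str
    target: str


@dataclass(frozen=True)
class Capability:
    """Permission to use one tool on targets under one path prefix."""
    tool: str
    prefix: str

    def permits(self, call: ToolCall) -> bool:
        if call.tool != self.tool:
            return False
        target = PurePosixPath(call.target)
        # Reject relative paths and any ".." component instead of resolving them.
        if not target.is_absolute() or ".." in target.parts:
            return False
        return target.is_relative_to(PurePosixPath(self.prefix))


class ReferenceMonitor:
    """Mediates every tool call; sits outside the model."""

    def __init__(self, capabilities):
        self._capabilities = frozenset(capabilities)
        self.audit_log = []  # (call, allowed) pairs, kept for later review

    def authorize(self, call: ToolCall) -> bool:
        allowed = any(cap.permits(call) for cap in self._capabilities)
        self.audit_log.append((call, allowed))
        return allowed

    def execute(self, call: ToolCall, tools: dict):
        # Tools are only reachable through this method (complete mediation).
        if not self.authorize(call):
            raise PermissionError(f"denied: {call.tool} on {call.target}")
        return tools[call.tool](call.target)


monitor = ReferenceMonitor([Capability("read_file", "/workspace")])
tools = {"read_file": lambda path: f"contents of {path}"}  # stand-in tool
```

The monitor is default-deny: a call is executed only if some explicitly granted capability covers it. Whether such a policy is expressive enough for open-ended agent tasks is one of the research challenges the paper itself flags.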
Load-bearing premise
Established systems security principles can be transferred to agentic systems to deliver predictable guarantees despite the stochastic and open-ended behavior of large language models.
What would settle it
A working agent system that fully applies system-level isolation, invariant enforcement, and the related principles, yet still suffers a successful attack, would disprove the central claim.
Original abstract
We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that agent security must be treated as a systems problem: the underlying AI model should be viewed as an untrusted component, with security invariants enforced at the system level rather than through model robustness alone. Drawing on experience in operating systems, networks, formal methods, and adversarial ML, the authors articulate core principles from decades of systems security research, analyze eleven real-world attacks to illustrate how these principles could have prevented them, and identify open research challenges for realizing the approach in agentic systems.
Significance. If the central claim holds, the work provides a timely reframing that could guide more robust designs for LLM-based agents by integrating established systems techniques. The retrospective analysis of eleven attacks supplies concrete grounding and useful examples. However, the absence of a concrete mechanism demonstrating invariant enforcement across stochastic LLM outputs limits the strength of the 'predictable guarantees' assertion.
major comments (2)
- [Attack analysis] Attack analysis section: The discussion shows that the eleven attacks succeeded but provides only retrospective commentary; it does not exhibit a specific systems mechanism (e.g., reference monitor or sandbox) that would still enforce the claimed invariants when the LLM produces differing outputs for the same input across multiple samples.
- [Core principles] Principles articulation: The claim that systems security techniques yield predictable guarantees rests on an unverified extrapolation from deterministic domains; the manuscript does not address how non-deterministic LLM behavior affects enforcement of invariants, which is load-bearing for the central position.
minor comments (2)
- [Abstract] The abstract could more explicitly name the core principles being proposed rather than referring to them generically.
- [Attack analysis] A table summarizing the eleven attacks, the violated invariant, and the relevant systems principle would improve readability and make the evidence easier to evaluate.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where the manuscript's arguments can be clarified. We address each major comment below and have made targeted revisions to strengthen the presentation of the position without altering the core claims.
Point-by-point responses
-
Referee: [Attack analysis] Attack analysis section: The discussion shows that the eleven attacks succeeded but provides only retrospective commentary; it does not exhibit a specific systems mechanism (e.g., reference monitor or sandbox) that would still enforce the claimed invariants when the LLM produces differing outputs for the same input across multiple samples.
Authors: The attack analysis section is intentionally retrospective, using real incidents to show how violations of systems principles enabled the attacks and how adherence to those principles could have prevented them. This serves as grounding for the position rather than an implementation claim. We agree that additional elaboration on enforcement under stochastic outputs would strengthen the discussion of predictable guarantees. In revision, we have added a paragraph to the attack analysis that describes how a reference monitor could mediate all external actions (e.g., tool invocations) by applying static policy checks and capability restrictions, independent of any particular LLM output or its variability across samples. This mechanism would reject or constrain actions that violate invariants even if the model produces inconsistent proposals.
Revision: yes
-
Referee: [Core principles] Principles articulation: The claim that systems security techniques yield predictable guarantees rests on an unverified extrapolation from deterministic domains; the manuscript does not address how non-deterministic LLM behavior affects enforcement of invariants, which is load-bearing for the central position.
Authors: The referee correctly notes that non-determinism is central to the argument. The principles are designed such that enforcement occurs at the system boundary via mechanisms that do not depend on the internal consistency or determinism of the untrusted model component. For instance, a reference monitor or sandbox enforces invariants by inspecting and controlling observable actions and resource accesses, which remain subject to policy regardless of output variation. We have revised the principles section to include an explicit discussion of this point, explaining that the orthogonality between the mediator and the LLM allows invariants to be maintained even when outputs differ across runs. We also emphasize that realizing full predictability remains an open research challenge, consistent with the manuscript's existing identification of implementation gaps.
Revision: yes
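The orthogonality argument in the responses above can be made concrete with a small simulation. This is a hedged sketch, not code from the paper or the rebuttal: `untrusted_model` is a random stand-in for an LLM, and the tool names, host names, and `ALLOWED_HOSTS` policy are invented for illustration. The point it shows is narrow: if the policy decision is a function of the proposed action only, the set of executed actions satisfies the invariant on every run, however the proposals vary.

```python
import random

# Hypothetical static policy: the only host the agent may contact.
ALLOWED_HOSTS = frozenset({"api.internal.example"})


def untrusted_model(rng: random.Random):
    """Stand-in for an LLM: the same prompt yields different proposals."""
    return rng.choice([
        ("http_get", "api.internal.example"),
        ("http_get", "attacker.example"),
        ("delete_repo", "api.internal.example"),
    ])


def mediate(action) -> bool:
    """Decide from the proposed action alone, not from how it was produced."""
    tool, host = action
    return tool == "http_get" and host in ALLOWED_HOSTS


def run_agent(seed: int, steps: int = 50):
    """Sample proposals and split them into executed and denied actions."""
    rng = random.Random(seed)
    executed, denied = [], []
    for _ in range(steps):
        action = untrusted_model(rng)
        (executed if mediate(action) else denied).append(action)
    return executed, denied


def invariant_holds(executed) -> bool:
    """Invariant: every executed action is an http_get to an allowed host."""
    return all(tool == "http_get" and host in ALLOWED_HOSTS
               for tool, host in executed)
```

Calling `run_agent` with different seeds gives different proposal sequences, yet `invariant_holds` is true for each. The sketch says nothing about whether the policy itself is correct or complete, which is where the open research challenges noted in the response remain.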
Circularity Check
No circularity; position paper draws on external systems literature
Full rationale
The manuscript is a position paper whose central claim—that agent security requires treating the LLM as an untrusted component and enforcing invariants at the system level—is justified by reference to decades of independent systems-security research in OS, networks, and formal methods rather than any internal derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are presented that reduce to quantities defined by the authors themselves; the eleven-attack analysis is retrospective evidence, not a constructed forecast. The argument therefore remains self-contained against external benchmarks and exhibits no load-bearing self-citation chains or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Systems security principles from operating systems, networks, and formal methods provide a foundation for designing agentic systems with predictable guarantees.