arxiv: 2505.23643 · v2 · submitted 2025-05-29 · 💻 cs.CR · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Securing AI Agents with Information-Flow Control

Manuel Costa , Boris K\"opf , Aashish Kolluri , Andrew Paverd , Mark Russinovich , Ahmed Salem , Shruti Tople , Lukas Wutschitz

show 1 more author

Santiago Zanella-B\'eguelin

Authors on Pith no claims yet

Pith reviewed 2026-05-15 11:50 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords information-flow controlAI agentsprompt injectiondynamic taint-trackingsecurity policiesagent plannersconfidentialityintegrity

0 comments

The pith

Fides applies information-flow control to AI agent planners to enforce security policies against prompt injection while preserving task utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a formal model of agent planners to reason about security and expressiveness under information-flow control. It characterizes the properties enforceable by dynamic taint-tracking and builds a taxonomy of tasks to evaluate security-utility trade-offs. From this foundation the authors design Fides, a planner that tracks confidentiality and integrity labels, enforces policies deterministically, and adds primitives for selective information hiding. Evaluation in AgentDojo shows Fides completes a broad range of tasks with security guarantees. A sympathetic reader cares because autonomous AI agents face prompt-injection risks and IFC offers a structured way to mitigate them without eliminating most functionality.

Core claim

Using a formal model of agent planners, the paper characterizes the class of properties enforceable by dynamic taint-tracking and constructs a taxonomy of tasks. Informed by this analysis, Fides is presented as a planner that tracks confidentiality and integrity labels to deterministically enforce security policies and introduces novel primitives for selectively hiding information, enabling completion of a broad range of tasks with security guarantees as demonstrated in AgentDojo.

What carries the argument

Fides, an agent planner that performs dynamic taint-tracking on confidentiality and integrity labels, deterministically enforces security policies, and supplies primitives for selective information hiding.

If this is right

Dynamic taint-tracking suffices to enforce a defined class of security properties in agent planners.
Deterministic policy enforcement is achievable without introducing nondeterminism into planner behavior.
Selective hiding primitives allow agents to suppress sensitive outputs while still completing useful work.
Security guarantees can coexist with high task-completion rates in benchmark environments such as AgentDojo.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same label-tracking discipline could be applied directly to language-model outputs or tool-calling interfaces beyond the planner layer.
The task taxonomy offers a reusable yardstick for comparing IFC planners against other security techniques such as sandboxing or input sanitization.
Extending the model to multi-agent or adaptive planners would reveal whether the same enforceable-property class still holds.

Load-bearing premise

The formal model of agent planners and the taxonomy of tasks accurately capture real-world security and utility trade-offs.

What would settle it

A concrete prompt-injection attack that succeeds on a Fides-protected agent in AgentDojo on a task the taxonomy classifies as securable, or a safe task that Fides prevents from completing.

read the original abstract

As AI agents become increasingly autonomous and capable, ensuring their security against vulnerabilities such as prompt injection becomes critical. This paper explores the use of information-flow control (IFC) to provide security guarantees for AI agents. We present a formal model to reason about the security and expressiveness of agent planners. Using this model, we characterize the class of properties enforceable by dynamic taint-tracking and construct a taxonomy of tasks to evaluate security and utility trade-offs of planner designs. Informed by this exploration, we present Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces security policies, and introduces novel primitives for selectively hiding information. Its evaluation in AgentDojo demonstrates that this approach enables us to complete a broad range of tasks with security guarantees. A tutorial to walk readers through the the concepts introduced in the paper can be found at https://github.com/microsoft/fides

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fides gives a formal model plus new hiding primitives for IFC in agents, but the taint-tracking guarantees look fragile once LLMs start paraphrasing or inferring.

read the letter

The paper's real contribution is the formal model that pins down exactly which security properties dynamic taint-tracking can enforce on agent planners, plus the task taxonomy and the Fides design that adds selective hiding. That combination is new and gives a cleaner way to talk about the security-utility trade-off than the usual prompt-injection patches. The AgentDojo results show the planner can still finish a range of tasks while respecting the labels, which is concrete evidence that the approach is not purely theoretical. Credit for shipping the tutorial and code link too; that makes the ideas easier to test. The soft spot is the one the stress-test note raises. Real LLM planners routinely rephrase, summarize, or draw inferences from tainted inputs, and standard taint rules do not automatically cover those operations. If the model treats the planner as a black-box that only copies or drops labels, the reported success rates do not yet prove the same guarantees survive realistic language-model behavior. The abstract gives no quantitative breakdown of how often inference causes under-tainting or over-tainting, so the central claim rests on an assumption that may not hold at scale. This is worth a serious referee because the model and taxonomy are grounded enough to be checked and extended, even if the current evaluation needs more stress on actual LLM inference patterns. People building or auditing autonomous agents would get value from the formal framing; the rest of the field can treat it as a starting point rather than a finished solution.

Referee Report

2 major / 1 minor

Summary. The paper presents a formal model for analyzing the security and expressiveness of AI agent planners under information-flow control. It characterizes the class of properties enforceable via dynamic taint-tracking, introduces a taxonomy of tasks to assess security-utility trade-offs, and proposes Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces policies, and adds novel primitives for selective information hiding. Evaluation on the AgentDojo benchmark is claimed to show that Fides completes a broad range of tasks while providing security guarantees against vulnerabilities such as prompt injection.

Significance. If the formal model accurately captures LLM planner behavior and the evaluation results hold under realistic conditions, the work could provide a principled foundation for securing autonomous AI agents without fully sacrificing utility. The selective hiding primitives and task taxonomy represent concrete contributions to IFC for agents. However, the absence of detailed quantitative metrics or proof sketches in the provided description limits assessment of whether the security guarantees are practically enforceable.

major comments (2)

[Formal model] Formal model section: The characterization of enforceable properties assumes perfect label propagation semantics for the planner treated as a black-box function. This is load-bearing for the central security claims, yet the model does not address how LLM-specific operations such as paraphrasing, summarization, or inference from tainted inputs affect taint labels, which can produce under-tainting and violate the claimed guarantees.
[Evaluation] Evaluation section: The AgentDojo results are referenced only at a high level with no reported quantitative metrics (e.g., task success rates, security violation counts, or comparisons to baselines). Without these data, the claim that Fides completes a broad range of tasks with security guarantees cannot be verified and undermines the utility-security trade-off analysis.

minor comments (1)

[Abstract] The GitHub tutorial link is mentioned but its stability and coverage of the formal model and Fides primitives should be confirmed in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will make targeted revisions to improve clarity and completeness.

read point-by-point responses

Referee: Formal model section: The characterization of enforceable properties assumes perfect label propagation semantics for the planner treated as a black-box function. This is load-bearing for the central security claims, yet the model does not address how LLM-specific operations such as paraphrasing, summarization, or inference from tainted inputs affect taint labels, which can produce under-tainting and violate the claimed guarantees.

Authors: We appreciate this observation. The formal model intentionally abstracts the planner as a black-box to characterize the class of properties enforceable by dynamic taint-tracking in general. This focuses on theoretical expressiveness rather than implementation details. We agree that LLM behaviors such as paraphrasing can cause under-tainting in practice. Fides mitigates this via its novel selective hiding primitives and deterministic enforcement at the system level. We will revise the manuscript to add explicit discussion of the abstraction's limitations and how Fides addresses practical LLM-specific risks. revision: partial
Referee: Evaluation section: The AgentDojo results are referenced only at a high level with no reported quantitative metrics (e.g., task success rates, security violation counts, or comparisons to baselines). Without these data, the claim that Fides completes a broad range of tasks with security guarantees cannot be verified and undermines the utility-security trade-off analysis.

Authors: We agree that the evaluation section would benefit from more detail. The manuscript references AgentDojo results to support broad task completion with security guarantees, but we will expand it with quantitative metrics including task success rates, security violation counts, and baseline comparisons. This revision will make the utility-security trade-off analysis verifiable and strengthen the empirical claims. revision: yes

Circularity Check

0 steps flagged

No circularity in formal model, taxonomy, or Fides evaluation

full rationale

The paper introduces an independent formal model of agent planners, derives the class of properties enforceable by dynamic taint-tracking from that model, builds a task taxonomy to explore trade-offs, and evaluates the resulting Fides planner against the external AgentDojo benchmark. No load-bearing step reduces by definition, fitted-parameter renaming, or self-citation chain to its own inputs; the security guarantees are grounded in the new planner primitives and benchmark results rather than self-referential constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The formal model of planners is presumed to rest on standard assumptions about agent execution and label propagation.

pith-pipeline@v0.9.0 · 5475 in / 874 out tokens · 35507 ms · 2026-05-15T11:50:59.663951+00:00 · methodology

discussion (0)

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
cs.CR 2026-05 unverdicted novelty 8.0

Oracle Poisoning corrupts knowledge graphs used by AI agents via tool calls, leading tested models to accept fabricated claims at 100% under directed queries in a production-scale demonstration.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents
cs.AI 2026-04 unverdicted novelty 8.0

MCPHunt benchmark finds 11.5-41.3% policy-violating credential propagation in multi-server MCP agents across five models, reducible up to 97% by prompt mitigations while retaining most utility.
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
cs.CR 2026-04 unverdicted novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
cs.CR 2026-04 unverdicted novelty 8.0

TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
cs.CR 2026-04 unverdicted novelty 7.0

A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
cs.CR 2026-04 unverdicted novelty 7.0

Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
cs.CR 2026-04 unverdicted novelty 7.0

The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Formal Policy Enforcement for Real-World Agentic Systems
cs.CR 2026-02 unverdicted novelty 7.0

FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
cs.CR 2026-05 conditional novelty 6.0

MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...
Web Agents Should Adopt the Plan-Then-Execute Paradigm
cs.CR 2026-05 unverdicted novelty 6.0

Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
cs.CR 2026-05 conditional novelty 6.0

Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
cs.CR 2026-05 unverdicted novelty 6.0

Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
Alignment Contracts for Agentic Security Systems
cs.CR 2026-04 conditional novelty 6.0 full

Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 5.0

AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
cs.SE 2026-04 unverdicted novelty 5.0

Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 4.0

AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 20 Pith papers

[1]

Get my drift? catching llm task drift with activation deltas

Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, and Mario Fritz. Get my drift? catching llm task drift with activation deltas. In IEEE Conference on Secure and Trustworthy Machine Learning, SaTML 2025. IEEE, 2025

work page 2025
[2]

Guidance: A guidance language for controlling large language models

Guidance AI. Guidance: A guidance language for controlling large language models. https://github.com/ guidance-ai/guidance, 2025. Accessed: 2025-04-12

work page 2025
[3]

Computer Use (beta)

Anthropic. Computer Use (beta). https://docs.anthropic.com/en/docs/agents-and-tools/ computer-use, 2024

work page 2024
[4]

Ahsan Ayub and Subhabrata Majumdar

Md. Ahsan Ayub and Subhabrata Majumdar. Embedding-based classifiers can detect prompt injection attacks. In Conference on Applied Machine Learning in Information Security (CAMLIS 2024), volume 3920 of CEUR Workshop Proceedings, pages 257–268. CEUR-WS.org, 2024

work page 2024
[5]

AI agents with formal security guarantees

Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. AI agents with formal security guarantees. In ICML 2024 Next Generation of AI Safety Workshop, 2024

work page 2024
[6]

Guiding llms the right way: Fast, non-invasive constrained generation, 2024

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Guiding llms the right way: Fast, non-invasive constrained generation, 2024

work page 2024
[7]

LangChain, 2022

Harrison Chase. LangChain, 2022

work page 2022
[8]

StruQ: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. In 34th USENIX Security Symposium (USENIX Security ’25), 2025. To appear. 19 Securing AI Agents with Information-Flow Control

work page 2025
[9]

Secalign: Defending against prompt injection with preference optimization, 2025

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization, 2025

work page 2025
[10]

Defeating prompt injections by design, 2025

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tram`er. Defeating prompt injections by design, 2025

work page 2025
[11]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents, 2024

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi ´c, Luca Beurer-Kellner, Marc Fischer, and Florian Tram `er. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents, 2024

work page 2024
[12]

A lattice model of secure information flow

Dorothy E Denning. A lattice model of secure information flow. Communications of the ACM, 19(5):236–243, 1976

work page 1976
[13]

Magentic-One: A generalist multi-agent system for solving complex tasks, 2024

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A generalist multi-agent system for solving complex tasks, 2024

work page 2024
[14]

Grammar-constrained decoding for structured nlp tasks without finetuning

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-constrained decoding for structured nlp tasks without finetuning. In 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 10932–10952, 2023

work page 2023
[15]

Security policies and security models

Joseph A Goguen and Jos´e Meseguer. Security policies and security models. In 1982 IEEE Symposium on Security and Privacy, pages 11–11. IEEE, 1982

work page 1982
[16]

Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection, 2023

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection, 2023

work page 2023
[17]

Defending against indirect prompt injection attacks with spotlighting

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against indirect prompt injection attacks with spotlighting. In Conference on Applied Machine Learning in Information Security (CAMLIS 2024) , volume 3920 of CEUR Workshop Proceedings, pages 48–62. CEUR- WS.org, 2024

work page 2024
[18]

The task shield: Enforcing task alignment to defend against indirect prompt injection in llm agents, 2024

Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. The task shield: Enforcing task alignment to defend against indirect prompt injection in llm agents, 2024

work page 2024
[19]

Devin: The first AI software engineer

Cognition Labs. Devin: The first AI software engineer. https://www.cognition-labs.com/, 2024

work page 2024
[20]

Prompt injection attack against LLM-integrated applications, 2023

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against LLM-integrated applications, 2023

work page 2023
[21]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24) , pages 1831–1847. USENIX Association, 2024

work page 2024
[22]

Myers and Barbara Liskov

Andrew C. Myers and Barbara Liskov. A decentralized model for information flow control. In 16th ACM Symposium on Operating Systems Principles, SOSP ’97, page 129–142. ACM, 1997

work page 1997
[23]

Enforcing robust declassification

Andrew C Myers, Andrei Sabelfeld, and Steve Zdancewic. Enforcing robust declassification. In 17th IEEE Computer Security Foundations Workshop, CSF’2004, pages 172–186. IEEE, 2004

work page 2004
[24]

Openai agents sdk

OpenAI. Openai agents sdk. https://openai.github.io/openai-agents-python/, 2024

work page 2024
[25]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page 2022
[26]

Language-based information-flow security

Andrei Sabelfeld and Andrew C Myers. Language-based information-flow security. IEEE J. on Selected Areas in Communications, 21(1):5–19, 2003

work page 2003
[27]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36, pages 68539–68551. Curran Associates, Inc., 2023

work page 2023
[28]

Pierce, and Andrei Sabelfeld

Daniel Schoepe, Musard Balliu, Benjamin C. Pierce, and Andrei Sabelfeld. Explicit secrecy: A policy for taint tracking. In 2016 IEEE European Symposium on Security and Privacy, pages 15–30. IEEE, 2016

work page 2016
[29]

Permissive information-flow analysis for large language models, 2024

Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris K¨opf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, and Santiago Zanella-B´eguelin. Permissive information-flow analysis for large language models, 2024

work page 2024
[30]

Safety versus secrecy

Dennis V olpano. Safety versus secrecy. InStatic Analysis (SAS 1999), volume 1694 of Lecture Notes in Computer Science, pages 303–311. Springer, 1999

work page 1999
[31]

The instruction hierarchy: Training LLMs to prioritize privileged instructions, 2024

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions, 2024

work page 2024
[32]

The dual LLM pattern for building ai assistants that can resist prompt injection

Simon Willison. The dual LLM pattern for building ai assistants that can resist prompt injection. Online: https://simonwillison.net/2023/Apr/25/dual-llm-pattern, April 2023

work page 2023
[33]

System-level defense against indirect prompt injection attacks: An information flow control perspective, 2024

Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. System-level defense against indirect prompt injection attacks: An information flow control perspective, 2024

work page 2024
[34]

AutoGen: Enabling next-gen LLM applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, COLM 2024, 2024

work page 2024
[35]

Instructional segment embedding: Improving LLM safety with instruction hierarchy

Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. Instructional segment embedding: Improving LLM safety with instruction hierarchy. In 13th International Conference on Learning Representations, 2025

work page 2025
[36]

GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis

Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis. In 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 507–518. ACL, 2024

work page 2024
[37]

React: Syner- gizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Syner- gizing reasoning and acting in language models. In 11th International Conference on Learning Representations, 2023

work page 2023
[38]

Benchmarking and defending against indirect prompt injection attacks on large language models, 2023

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models, 2023

work page 2023
[39]

Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. ACL, 2024

work page 2024
[40]

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents

Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. ICLR, 2025

work page 2025
[41]

Titzer, Heather Miller, and Phillip B

Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller, and Phillip B. Gibbons. Rtbas: Defending llm agents against prompt injection and privacy leakage, 2025. 21 Securing AI Agents with Information-Flow Control

work page 2025
[42]

Improving alignment and robustness with circuit breakers

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. In Advances in Neural Information Processing Systems, volume 37, pages 83345–83373. Curran Associates, Inc., 2024. A Defining Explicit Secrecy We define a...

work page 2024
[46]

- **Action:** Refer directly to the returned variable identifiers in your final reply

**Field inside a dict within a list of dicts**: - Format: `#tool_name-result-tool_name_count-index.key#` ## USAGE GUIDELINES FOR VARIABLE IDENTIFIERS 26 Securing AI Agents with Information-Flow Control ### Use Case 1: Directly Showing Information to the User - **Scenario:** You simply need to present the responses directly to the user. - **Action:** Refer...

work page
[47]

**Basic types** (`str`, `bool`, `int`, `float`): - Format: `#tool_name-result-tool_name_count#`

work page
[48]

**Field inside a dict**: - Format: `#tool_name-result-tool_name_count.key#`

work page
[49]

**Basic type inside a list of dicts**: - Format: `#tool_name-result-tool_name_count-index#`

work page
[50]

Summarize the content in #v1# for the user

**Field inside a dict within a list of dicts**: - Format: `#tool_name-result-tool_name_count-index.key#` ## USING THE QUARANTINED LLM Let's say you have a list of variables in your context `[#v1#, #v2#, ...]`. When querying the quarantined LLM, format arguments as follows: - `query`: A clear and precise question about the variable content. - `variables`: ...

work page 2025
[51]

Incomplete Utility Evaluations: Due to the way AgentDojo evaluates success, as we have discussed before

work page
[52]

Guesswork: The planner may guess the correct arguments for the tool calls. For example, in travel tasks, the planner guesses the correct highly rated hotel to recommend from the given options even though it never gets to see the ratings and reviews of the hotels. Sometimes, the planner also guesses based on the data context. For example, in a banking task...

work page
[53]

For example, in banking task 10, the planner is asked to pay the bill like last month

Do Nothing Tasks: Here the planner is supposed to just not finish the task. For example, in banking task 10, the planner is asked to pay the bill like last month. However, the planner does not know the details of the bill and amount. Hence, it should not do anything, as the Variable Passing planner does. Additional Results on Expressivity We provide the t...

work page