pith. machine review for the scientific record. sign in

arxiv: 2505.23643 · v2 · submitted 2025-05-29 · 💻 cs.CR · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Securing AI Agents with Information-Flow Control

Authors on Pith no claims yet

Pith reviewed 2026-05-15 11:50 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords information-flow controlAI agentsprompt injectiondynamic taint-trackingsecurity policiesagent plannersconfidentialityintegrity
0
0 comments X

The pith

Fides applies information-flow control to AI agent planners to enforce security policies against prompt injection while preserving task utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a formal model of agent planners to reason about security and expressiveness under information-flow control. It characterizes the properties enforceable by dynamic taint-tracking and builds a taxonomy of tasks to evaluate security-utility trade-offs. From this foundation the authors design Fides, a planner that tracks confidentiality and integrity labels, enforces policies deterministically, and adds primitives for selective information hiding. Evaluation in AgentDojo shows Fides completes a broad range of tasks with security guarantees. A sympathetic reader cares because autonomous AI agents face prompt-injection risks and IFC offers a structured way to mitigate them without eliminating most functionality.

Core claim

Using a formal model of agent planners, the paper characterizes the class of properties enforceable by dynamic taint-tracking and constructs a taxonomy of tasks. Informed by this analysis, Fides is presented as a planner that tracks confidentiality and integrity labels to deterministically enforce security policies and introduces novel primitives for selectively hiding information, enabling completion of a broad range of tasks with security guarantees as demonstrated in AgentDojo.

What carries the argument

Fides, an agent planner that performs dynamic taint-tracking on confidentiality and integrity labels, deterministically enforces security policies, and supplies primitives for selective information hiding.

If this is right

  • Dynamic taint-tracking suffices to enforce a defined class of security properties in agent planners.
  • Deterministic policy enforcement is achievable without introducing nondeterminism into planner behavior.
  • Selective hiding primitives allow agents to suppress sensitive outputs while still completing useful work.
  • Security guarantees can coexist with high task-completion rates in benchmark environments such as AgentDojo.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same label-tracking discipline could be applied directly to language-model outputs or tool-calling interfaces beyond the planner layer.
  • The task taxonomy offers a reusable yardstick for comparing IFC planners against other security techniques such as sandboxing or input sanitization.
  • Extending the model to multi-agent or adaptive planners would reveal whether the same enforceable-property class still holds.

Load-bearing premise

The formal model of agent planners and the taxonomy of tasks accurately capture real-world security and utility trade-offs.

What would settle it

A concrete prompt-injection attack that succeeds on a Fides-protected agent in AgentDojo on a task the taxonomy classifies as securable, or a safe task that Fides prevents from completing.

read the original abstract

As AI agents become increasingly autonomous and capable, ensuring their security against vulnerabilities such as prompt injection becomes critical. This paper explores the use of information-flow control (IFC) to provide security guarantees for AI agents. We present a formal model to reason about the security and expressiveness of agent planners. Using this model, we characterize the class of properties enforceable by dynamic taint-tracking and construct a taxonomy of tasks to evaluate security and utility trade-offs of planner designs. Informed by this exploration, we present Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces security policies, and introduces novel primitives for selectively hiding information. Its evaluation in AgentDojo demonstrates that this approach enables us to complete a broad range of tasks with security guarantees. A tutorial to walk readers through the the concepts introduced in the paper can be found at https://github.com/microsoft/fides

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a formal model for analyzing the security and expressiveness of AI agent planners under information-flow control. It characterizes the class of properties enforceable via dynamic taint-tracking, introduces a taxonomy of tasks to assess security-utility trade-offs, and proposes Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces policies, and adds novel primitives for selective information hiding. Evaluation on the AgentDojo benchmark is claimed to show that Fides completes a broad range of tasks while providing security guarantees against vulnerabilities such as prompt injection.

Significance. If the formal model accurately captures LLM planner behavior and the evaluation results hold under realistic conditions, the work could provide a principled foundation for securing autonomous AI agents without fully sacrificing utility. The selective hiding primitives and task taxonomy represent concrete contributions to IFC for agents. However, the absence of detailed quantitative metrics or proof sketches in the provided description limits assessment of whether the security guarantees are practically enforceable.

major comments (2)
  1. [Formal model] Formal model section: The characterization of enforceable properties assumes perfect label propagation semantics for the planner treated as a black-box function. This is load-bearing for the central security claims, yet the model does not address how LLM-specific operations such as paraphrasing, summarization, or inference from tainted inputs affect taint labels, which can produce under-tainting and violate the claimed guarantees.
  2. [Evaluation] Evaluation section: The AgentDojo results are referenced only at a high level with no reported quantitative metrics (e.g., task success rates, security violation counts, or comparisons to baselines). Without these data, the claim that Fides completes a broad range of tasks with security guarantees cannot be verified and undermines the utility-security trade-off analysis.
minor comments (1)
  1. [Abstract] The GitHub tutorial link is mentioned but its stability and coverage of the formal model and Fides primitives should be confirmed in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will make targeted revisions to improve clarity and completeness.

read point-by-point responses
  1. Referee: Formal model section: The characterization of enforceable properties assumes perfect label propagation semantics for the planner treated as a black-box function. This is load-bearing for the central security claims, yet the model does not address how LLM-specific operations such as paraphrasing, summarization, or inference from tainted inputs affect taint labels, which can produce under-tainting and violate the claimed guarantees.

    Authors: We appreciate this observation. The formal model intentionally abstracts the planner as a black-box to characterize the class of properties enforceable by dynamic taint-tracking in general. This focuses on theoretical expressiveness rather than implementation details. We agree that LLM behaviors such as paraphrasing can cause under-tainting in practice. Fides mitigates this via its novel selective hiding primitives and deterministic enforcement at the system level. We will revise the manuscript to add explicit discussion of the abstraction's limitations and how Fides addresses practical LLM-specific risks. revision: partial

  2. Referee: Evaluation section: The AgentDojo results are referenced only at a high level with no reported quantitative metrics (e.g., task success rates, security violation counts, or comparisons to baselines). Without these data, the claim that Fides completes a broad range of tasks with security guarantees cannot be verified and undermines the utility-security trade-off analysis.

    Authors: We agree that the evaluation section would benefit from more detail. The manuscript references AgentDojo results to support broad task completion with security guarantees, but we will expand it with quantitative metrics including task success rates, security violation counts, and baseline comparisons. This revision will make the utility-security trade-off analysis verifiable and strengthen the empirical claims. revision: yes

Circularity Check

0 steps flagged

No circularity in formal model, taxonomy, or Fides evaluation

full rationale

The paper introduces an independent formal model of agent planners, derives the class of properties enforceable by dynamic taint-tracking from that model, builds a task taxonomy to explore trade-offs, and evaluates the resulting Fides planner against the external AgentDojo benchmark. No load-bearing step reduces by definition, fitted-parameter renaming, or self-citation chain to its own inputs; the security guarantees are grounded in the new planner primitives and benchmark results rather than self-referential constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The formal model of planners is presumed to rest on standard assumptions about agent execution and label propagation.

pith-pipeline@v0.9.0 · 5475 in / 874 out tokens · 35507 ms · 2026-05-15T11:50:59.663951+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

    cs.CR 2026-05 unverdicted novelty 8.0

    Oracle Poisoning corrupts knowledge graphs used by AI agents via tool calls, leading tested models to accept fabricated claims at 100% under directed queries in a production-scale demonstration.

  2. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  3. MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents

    cs.AI 2026-04 unverdicted novelty 8.0

    MCPHunt benchmark finds 11.5-41.3% policy-violating credential propagation in multi-server MCP agents across five models, reducible up to 97% by prompt mitigations while retaining most utility.

  4. Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

    cs.CR 2026-04 unverdicted novelty 8.0

    NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

  5. TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

    cs.CR 2026-04 unverdicted novelty 8.0

    TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.

  6. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR 2026-05 unverdicted novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  7. Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.

  8. Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

    cs.CR 2026-04 unverdicted novelty 7.0

    Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.

  9. Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

  10. Formal Policy Enforcement for Real-World Agentic Systems

    cs.CR 2026-02 unverdicted novelty 7.0

    FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.

  11. MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

    cs.CR 2026-05 conditional novelty 6.0

    MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...

  12. Web Agents Should Adopt the Plan-Then-Execute Paradigm

    cs.CR 2026-05 unverdicted novelty 6.0

    Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

  13. Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

    cs.CR 2026-05 conditional novelty 6.0

    Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.

  14. AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...

  15. ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

    cs.CR 2026-05 unverdicted novelty 6.0

    ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

  16. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  17. Alignment Contracts for Agentic Security Systems

    cs.CR 2026-04 conditional novelty 6.0 full

    Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...

  18. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  19. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 5.0

    AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.

  20. Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

    cs.SE 2026-04 unverdicted novelty 5.0

    Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.

  21. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 4.0

    AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 20 Pith papers

  1. [1]

    Get my drift? catching llm task drift with activation deltas

    Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, and Mario Fritz. Get my drift? catching llm task drift with activation deltas. In IEEE Conference on Secure and Trustworthy Machine Learning, SaTML 2025. IEEE, 2025

  2. [2]

    Guidance: A guidance language for controlling large language models

    Guidance AI. Guidance: A guidance language for controlling large language models. https://github.com/ guidance-ai/guidance, 2025. Accessed: 2025-04-12

  3. [3]

    Computer Use (beta)

    Anthropic. Computer Use (beta). https://docs.anthropic.com/en/docs/agents-and-tools/ computer-use, 2024

  4. [4]

    Ahsan Ayub and Subhabrata Majumdar

    Md. Ahsan Ayub and Subhabrata Majumdar. Embedding-based classifiers can detect prompt injection attacks. In Conference on Applied Machine Learning in Information Security (CAMLIS 2024), volume 3920 of CEUR Workshop Proceedings, pages 257–268. CEUR-WS.org, 2024

  5. [5]

    AI agents with formal security guarantees

    Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. AI agents with formal security guarantees. In ICML 2024 Next Generation of AI Safety Workshop, 2024

  6. [6]

    Guiding llms the right way: Fast, non-invasive constrained generation, 2024

    Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Guiding llms the right way: Fast, non-invasive constrained generation, 2024

  7. [7]

    LangChain, 2022

    Harrison Chase. LangChain, 2022

  8. [8]

    StruQ: Defending against prompt injection with structured queries

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. In 34th USENIX Security Symposium (USENIX Security ’25), 2025. To appear. 19 Securing AI Agents with Information-Flow Control

  9. [9]

    Secalign: Defending against prompt injection with preference optimization, 2025

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization, 2025

  10. [10]

    Defeating prompt injections by design, 2025

    Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tram`er. Defeating prompt injections by design, 2025

  11. [11]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents, 2024

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovi ´c, Luca Beurer-Kellner, Marc Fischer, and Florian Tram `er. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents, 2024

  12. [12]

    A lattice model of secure information flow

    Dorothy E Denning. A lattice model of secure information flow. Communications of the ACM, 19(5):236–243, 1976

  13. [13]

    Magentic-One: A generalist multi-agent system for solving complex tasks, 2024

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-One: A generalist multi-agent system for solving complex tasks, 2024

  14. [14]

    Grammar-constrained decoding for structured nlp tasks without finetuning

    Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. Grammar-constrained decoding for structured nlp tasks without finetuning. In 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 10932–10952, 2023

  15. [15]

    Security policies and security models

    Joseph A Goguen and Jos´e Meseguer. Security policies and security models. In 1982 IEEE Symposium on Security and Privacy, pages 11–11. IEEE, 1982

  16. [16]

    Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection, 2023

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection, 2023

  17. [17]

    Defending against indirect prompt injection attacks with spotlighting

    Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against indirect prompt injection attacks with spotlighting. In Conference on Applied Machine Learning in Information Security (CAMLIS 2024) , volume 3920 of CEUR Workshop Proceedings, pages 48–62. CEUR- WS.org, 2024

  18. [18]

    The task shield: Enforcing task alignment to defend against indirect prompt injection in llm agents, 2024

    Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. The task shield: Enforcing task alignment to defend against indirect prompt injection in llm agents, 2024

  19. [19]

    Devin: The first AI software engineer

    Cognition Labs. Devin: The first AI software engineer. https://www.cognition-labs.com/, 2024

  20. [20]

    Prompt injection attack against LLM-integrated applications, 2023

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against LLM-integrated applications, 2023

  21. [21]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24) , pages 1831–1847. USENIX Association, 2024

  22. [22]

    Myers and Barbara Liskov

    Andrew C. Myers and Barbara Liskov. A decentralized model for information flow control. In 16th ACM Symposium on Operating Systems Principles, SOSP ’97, page 129–142. ACM, 1997

  23. [23]

    Enforcing robust declassification

    Andrew C Myers, Andrei Sabelfeld, and Steve Zdancewic. Enforcing robust declassification. In 17th IEEE Computer Security Foundations Workshop, CSF’2004, pages 172–186. IEEE, 2004

  24. [24]

    Openai agents sdk

    OpenAI. Openai agents sdk. https://openai.github.io/openai-agents-python/, 2024

  25. [25]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  26. [26]

    Language-based information-flow security

    Andrei Sabelfeld and Andrew C Myers. Language-based information-flow security. IEEE J. on Selected Areas in Communications, 21(1):5–19, 2003

  27. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36, pages 68539–68551. Curran Associates, Inc., 2023

  28. [28]

    Pierce, and Andrei Sabelfeld

    Daniel Schoepe, Musard Balliu, Benjamin C. Pierce, and Andrei Sabelfeld. Explicit secrecy: A policy for taint tracking. In 2016 IEEE European Symposium on Security and Privacy, pages 15–30. IEEE, 2016

  29. [29]

    Permissive information-flow analysis for large language models, 2024

    Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris K¨opf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, and Santiago Zanella-B´eguelin. Permissive information-flow analysis for large language models, 2024

  30. [30]

    Safety versus secrecy

    Dennis V olpano. Safety versus secrecy. InStatic Analysis (SAS 1999), volume 1694 of Lecture Notes in Computer Science, pages 303–311. Springer, 1999

  31. [31]

    The instruction hierarchy: Training LLMs to prioritize privileged instructions, 2024

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions, 2024

  32. [32]

    The dual LLM pattern for building ai assistants that can resist prompt injection

    Simon Willison. The dual LLM pattern for building ai assistants that can resist prompt injection. Online: https://simonwillison.net/2023/Apr/25/dual-llm-pattern, April 2023

  33. [33]

    System-level defense against indirect prompt injection attacks: An information flow control perspective, 2024

    Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. System-level defense against indirect prompt injection attacks: An information flow control perspective, 2024

  34. [34]

    AutoGen: Enabling next-gen LLM applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, COLM 2024, 2024

  35. [35]

    Instructional segment embedding: Improving LLM safety with instruction hierarchy

    Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. Instructional segment embedding: Improving LLM safety with instruction hierarchy. In 13th International Conference on Learning Representations, 2025

  36. [36]

    GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis

    Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis. In 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 507–518. ACL, 2024

  37. [37]

    React: Syner- gizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Syner- gizing reasoning and acting in language models. In 11th International Conference on Learning Representations, 2023

  38. [38]

    Benchmarking and defending against indirect prompt injection attacks on large language models, 2023

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models, 2023

  39. [39]

    Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. ACL, 2024

  40. [40]

    Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. ICLR, 2025

  41. [41]

    Titzer, Heather Miller, and Phillip B

    Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller, and Phillip B. Gibbons. Rtbas: Defending llm agents against prompt injection and privacy leakage, 2025. 21 Securing AI Agents with Information-Flow Control

  42. [42]

    Improving alignment and robustness with circuit breakers

    Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers. In Advances in Neural Information Processing Systems, volume 37, pages 83345–83373. Curran Associates, Inc., 2024. A Defining Explicit Secrecy We define a...

  43. [46]

    - **Action:** Refer directly to the returned variable identifiers in your final reply

    **Field inside a dict within a list of dicts**: - Format: `#tool_name-result-tool_name_count-index.key#` ## USAGE GUIDELINES FOR VARIABLE IDENTIFIERS 26 Securing AI Agents with Information-Flow Control ### Use Case 1: Directly Showing Information to the User - **Scenario:** You simply need to present the responses directly to the user. - **Action:** Refer...

  44. [47]

    **Basic types** (`str`, `bool`, `int`, `float`): - Format: `#tool_name-result-tool_name_count#`

  45. [48]

    **Field inside a dict**: - Format: `#tool_name-result-tool_name_count.key#`

  46. [49]

    **Basic type inside a list of dicts**: - Format: `#tool_name-result-tool_name_count-index#`

  47. [50]

    Summarize the content in #v1# for the user

    **Field inside a dict within a list of dicts**: - Format: `#tool_name-result-tool_name_count-index.key#` ## USING THE QUARANTINED LLM Let's say you have a list of variables in your context `[#v1#, #v2#, ...]`. When querying the quarantined LLM, format arguments as follows: - `query`: A clear and precise question about the variable content. - `variables`: ...

  48. [51]

    Incomplete Utility Evaluations: Due to the way AgentDojo evaluates success, as we have discussed before

  49. [52]

    Guesswork: The planner may guess the correct arguments for the tool calls. For example, in travel tasks, the planner guesses the correct highly rated hotel to recommend from the given options even though it never gets to see the ratings and reviews of the hotels. Sometimes, the planner also guesses based on the data context. For example, in a banking task...

  50. [53]

    For example, in banking task 10, the planner is asked to pay the bill like last month

    Do Nothing Tasks: Here the planner is supposed to just not finish the task. For example, in banking task 10, the planner is asked to pay the bill like last month. However, the planner does not know the details of the bill and amount. Hence, it should not do anything, as the Variable Passing planner does. Additional Results on Expressivity We provide the t...