arxiv: 2504.11703 · v3 · pith:7PZFUWQOnew · submitted 2025-04-16 · 💻 cs.CR · cs.AI

Progent: Securing AI Agents with Privilege Control

Pith reviewed 2026-05-22 21:06 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords AI agents · privilege control · security policies · prompt injection · SMT solver · least privilege · monotonic confinement · tool calls

The pith

Progent secures AI agents by representing privileges as symbolic rules over tool calls: an LLM generates and updates the rules, and an SMT solver classifies each update as a narrowing or an expansion to enforce monotonic confinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI agents that call external tools are vulnerable to attacks such as indirect prompt injection that can trigger unauthorized actions. Security requirements shift with each user task and execution state, and any defense must avoid crippling the agent's ability to finish its work. Progent addresses this by turning the principle of least privilege into concrete symbolic policies that list exactly which tool names and arguments are permitted. An LLM creates the initial policy from the stated task and proposes updates as the agent runs; an SMT solver then classifies each change as a narrowing that applies automatically or an expansion that needs explicit user approval. Every tool call is checked deterministically against the current policy, and the set of allowed actions can only shrink unless the user approves an expansion.

Core claim

Progent represents privilege as a security policy consisting of symbolic rules over tool names and arguments. These rules specify which tool calls are allowed for task completion and which unnecessary ones are blocked for security. Every tool call is checked against such a policy through a deterministic procedure, enforcing the principle of least privilege. To handle diverse user tasks and evolving execution contexts, an LLM automatically generates the initial policy from the user's task and updates it during execution as new information arrives. Each proposed update is determined by an SMT solver to be either a narrowing (applied automatically) or an expansion (requiring explicit approval), ensuring that the agent's effective action space can only shrink without approval (monotonic confinement).
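
As a rough illustration of the deterministic check described above, the following Python sketch models a policy as an allow-list that maps tool names to per-argument predicates and denies anything not explicitly allowed. The names `Policy` and `check_call` and the example tools are hypothetical and are not taken from the paper; Progent's actual policy language (the paper's Figure 4) is not reproduced on this page.

```python
from typing import Any, Callable, Dict

# Hypothetical representation (an assumption for illustration, not Progent's DSL):
# tool name -> {argument name -> predicate that the argument value must satisfy}.
Policy = Dict[str, Dict[str, Callable[[Any], bool]]]


def check_call(policy: Policy, tool: str, args: Dict[str, Any]) -> bool:
    """Return True only if the policy explicitly allows this tool call."""
    constraints = policy.get(tool)
    if constraints is None:
        return False  # tool not listed: denied by default (least privilege)
    for name, predicate in constraints.items():
        # A constrained argument must be present and satisfy its predicate.
        if name not in args or not predicate(args[name]):
            return False
    return True


# Toy policy for a made-up task: read workspace files, email one recipient.
policy: Policy = {
    "read_file": {"path": lambda p: isinstance(p, str) and p.startswith("/workspace/")},
    "send_email": {"to": lambda addr: addr == "alice@example.com"},
}

print(check_call(policy, "read_file", {"path": "/workspace/notes.txt"}))  # True
print(check_call(policy, "read_file", {"path": "/etc/passwd"}))           # False
print(check_call(policy, "delete_repo", {"name": "demo"}))                # False
```

The check is a pure function of the policy and the call, so the same call always gets the same verdict; the default-deny branch is what makes the allow-list behave as least privilege in this sketch.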

What carries the argument

Symbolic security policies over tool names and arguments, checked by a deterministic procedure and updated through LLM proposals that an SMT solver classifies as automatic narrowing or approval-required expansion to maintain monotonic confinement.
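
The narrowing-versus-expansion decision can be illustrated without an SMT solver when policies are restricted to finite sets of allowed argument values. The sketch below substitutes plain set containment for the paper's SMT query: an update is treated as a narrowing only when every call it allows was already allowed. All names (`is_narrowing`, `apply_update`, the example tools and addresses) are hypothetical, and the real system reasons over richer symbolic constraints than finite sets.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical finite-domain policy (an assumption, not Progent's representation):
# tool name -> {argument name -> set of allowed values}.
# An argument that is not listed is unconstrained for that tool.
Policy = Dict[str, Dict[str, FrozenSet[str]]]


def is_narrowing(old: Policy, new: Policy) -> bool:
    """Conservatively decide whether `new` allows a subset of what `old` allows."""
    for tool, new_args in new.items():
        if tool not in old:
            return False  # a newly allowed tool is an expansion
        for arg, old_values in old[tool].items():
            # Every constraint in the old policy must be kept or tightened.
            if arg not in new_args or not new_args[arg] <= old_values:
                return False
    return True


def apply_update(current: Policy, proposed: Policy,
                 user_approves: Callable[[Policy], bool]) -> Policy:
    """Apply narrowings automatically; expansions need explicit approval."""
    if is_narrowing(current, proposed):
        return proposed
    return proposed if user_approves(proposed) else current


current: Policy = {
    "send_email": {"to": frozenset({"alice@example.com", "bob@example.com"})},
    "read_file": {},
}
narrower: Policy = {"send_email": {"to": frozenset({"alice@example.com"})}}
wider: Policy = {"send_email": {"to": frozenset({"alice@example.com", "eve@attacker.test"})}}

print(is_narrowing(current, narrower))  # True
print(is_narrowing(current, wider))     # False
```

In SMT terms the same question is whether "allowed by the new policy and not allowed by the old one" is unsatisfiable. The containment test here is deliberately conservative: anything it cannot show to be a subset is routed to the approval path.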

Load-bearing premise

An LLM can reliably generate initial policies and propose updates that correctly capture the user's intended task scope and security needs without omitting necessary tools or permitting unsafe ones.

What would settle it

A successful indirect prompt injection that causes an unauthorized tool call to execute after policy checking, or a measurable drop in task success rate on the same benchmarks when the policy blocks actions the agent needs.

Figures

Figures reproduced from arXiv: 2504.11703 by Dawn Song, Hongwei Li, Jingxuan He, Linyu Wu, Tianneng Shi, Wenbo Guo, Zhun Wang.

Figure 1: Left: a realistic attack [28] exploiting coding agents to exfiltrate sensitive data about private GitHub repositories. Right top: Progent’s overall design as a proxy to enforce privilege control over agents’ tool calls. Right bottom: Progent’s precise and fine-grained security policies to prevent data leakage while maintaining agent utility. … like GitHub [18] to access code repositories, handle issues, mana…
Figure 2: An example of a workspace agent that performs …
Figure 3: A formal definition of tools in LLM agents.
Figure 4: Progent’s domain-specific language for defining …
Figure 5: Comparison between vanilla agent (no defense), prior defenses, and Progent on AgentDojo […
Figure 6: Comparison results on ASB [70]. Bar values as extracted, per panel (axis 0 to 100; legend: No defense, Progent): Utility (no attack): 77.0, 74.1; Utility (under attack): 19.6, 64.4; ASR (under attack): 72.6, 0.0.
Figure 8: Progent’s consistent effectiveness over different agent LLMs, demonstrated on AgentDojo […
Figure 9: Experimental results of Progent-LLM. … to ensure both utility (the ability to complete the task) and security (preventing unauthorized actions). The LLM.update primitive addresses this challenge. During agent execution, LLM.update takes the original query, the toolkit, current policies, the most recent tool call, and its observation as input. It then generates an updated version of the policies. This is a t…
Figure 10: Progent’s consistent effectiveness of different LLMs for policy …
Figure 12: Progent-LLM’s consistent effectiveness over different agent LLMs, demonstrated on AgentDojo […
Figure 13: The policies in Figure …
Figure 14: The policies in Figure …
Figure 15: The policies for AgentDojo Banking.
Figure 16: Complete prompt for initial policy generation.
Figure 17: Complete prompt for checking if policy update is needed.
Figure 18: Complete prompt for performing policy update.
Original abstract

AI agents interact with external environments through tool calls, exposing them to attacks like indirect prompt injection that can trigger unauthorized actions. Securing these agents is challenging: they behave autonomously and probabilistically, security requirements evolve depending on the user's task and execution state, and there is an inherent tradeofff between security and utility. In this work, we introduce Progent, a novel framework that secures AI agents via privilege control. Progent represents privilege as a security policy consisting of symbolic rules over tool names and arguments. These rules specify which tool calls are allowed for task completion and which unnecessary ones are blocked for security. Every tool call is checked against such a policy through a deterministic procedure, enforcing the principle of least privilege. To handle diverse user tasks and evolving execution contexts, an LLM automatically generates the initial policy from the user's task and updates it during execution as new information arrives. Each proposed update is determined by an SMT solver to be either a narrowing (applied automatically) or an expansion (requiring explicit approval), ensuring that the agent's effective action space can only shrink without approval (monotonic confinement). This deterministic update mechanism preserves utility and prevents silent privilege escalation, even when adversarial inputs are present. Our evaluation on popular benchmarks (i.e., AgentDojo and ASB) shows that Progent significantly reduces attack success rates while maintaining high utility. We further validate Progent's practicality by showcasing its effectiveness in real-world agent frameworks such as LangChain and OpenAI Agents SDK.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Progent, a framework for securing AI agents via privilege control. It represents privileges as symbolic policies over tool names and arguments, uses an LLM to generate initial policies from user tasks and update them during execution, and employs an SMT solver to enforce monotonic confinement (updates are either automatic narrowings or explicit-approval expansions). Every tool call is checked deterministically against the policy. The central claim is that this reduces attack success rates on AgentDojo and ASB while preserving high utility, and that it integrates practically with LangChain and OpenAI Agents SDK.

Significance. If the empirical claims hold, the work provides a practical mechanism for least-privilege enforcement in autonomous agents by combining LLM flexibility for policy creation with deterministic checking and monotonic update rules. The SMT-based confinement is a concrete, verifiable component that directly addresses silent privilege escalation, which is a strength relative to purely LLM-based guardrails.

major comments (2)
  1. [Evaluation] Evaluation section: the abstract claims that Progent 'significantly reduces attack success rates while maintaining high utility' on AgentDojo and ASB, yet reports no quantitative numbers, error bars, baseline comparisons, or details on how utility is measured (e.g., task completion rate, number of tool calls). This absence makes the central empirical claim impossible to assess for robustness or effect size.
  2. [Policy generation and update mechanism] Policy generation and update mechanism (described in the abstract and §3): the security and utility guarantees rest on the assumption that the LLM reliably produces initial policies and updates that correctly encode task scope without omitting required tools or allowing unsafe argument values. The monotonic-confinement property only prevents silent expansion; it cannot correct an initially flawed policy. No independent validation (e.g., manual audit of generated policies or fidelity metrics) is described, so benchmark outcomes are conditional on unverified LLM output quality.
minor comments (1)
  1. [Abstract] Abstract: 'tradeofff' contains a typographical error and should read 'tradeoff'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the abstract claims that Progent 'significantly reduces attack success rates while maintaining high utility' on AgentDojo and ASB, yet reports no quantitative numbers, error bars, baseline comparisons, or details on how utility is measured (e.g., task completion rate, number of tool calls). This absence makes the central empirical claim impossible to assess for robustness or effect size.

    Authors: The referee correctly identifies that the abstract states the empirical claim without supporting numbers. The evaluation section of the manuscript does contain the detailed results, including attack success rates on both benchmarks, baseline comparisons, and utility measured via task completion rate. To address the concern directly, we will revise the abstract to include key quantitative results (e.g., specific attack success rate reductions and utility percentages), error bars where applicable, and explicit baseline comparisons. We will also ensure the utility metric is defined in the abstract. revision: yes

  2. Referee: [Policy generation and update mechanism] Policy generation and update mechanism (described in the abstract and §3): the security and utility guarantees rest on the assumption that the LLM reliably produces initial policies and updates that correctly encode task scope without omitting required tools or allowing unsafe argument values. The monotonic-confinement property only prevents silent expansion; it cannot correct an initially flawed policy. No independent validation (e.g., manual audit of generated policies or fidelity metrics) is described, so benchmark outcomes are conditional on unverified LLM output quality.

    Authors: We agree that the approach relies on the quality of LLM-generated policies and that monotonic confinement only prevents unauthorized expansions rather than correcting initial policy errors. The current manuscript does not include independent validation such as manual audits or fidelity metrics. The reported high utility on the benchmarks provides indirect evidence that the generated policies are generally appropriate for the tasks. We will add a discussion of this assumption and its limitations in the revised manuscript, along with example generated policies in the appendix to improve transparency. A comprehensive manual audit of all policies is not feasible within the scope of this revision. revision: partial
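
The two quantities at issue in the first comment can be computed from per-trial outcomes. The sketch below assumes utility is the fraction of trials in which the user task completes (the task completion rate the authors mention) and attack success rate is the fraction in which the injected goal is achieved. The `Trial` record and the toy outcomes are invented for illustration; they are not results from the paper or its benchmarks, and the benchmarks' own scoring rules may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """Outcome of one benchmark run (hypothetical record layout)."""
    task_completed: bool    # did the agent finish the user's task?
    attack_succeeded: bool  # did the injected instruction take effect?


def _rate(flags: Sequence[bool]) -> float:
    if not flags:
        raise ValueError("at least one trial is required")
    return sum(flags) / len(flags)


def utility(trials: Sequence[Trial]) -> float:
    """Task completion rate over the given trials."""
    return _rate([t.task_completed for t in trials])


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of trials in which the attacker's goal was achieved."""
    return _rate([t.attack_succeeded for t in trials])


# Toy outcomes, made up purely to exercise the functions.
toy_trials = [
    Trial(task_completed=True, attack_succeeded=False),
    Trial(task_completed=True, attack_succeeded=False),
    Trial(task_completed=False, attack_succeeded=True),
    Trial(task_completed=True, attack_succeeded=False),
]

print(utility(toy_trials))              # 0.75
print(attack_success_rate(toy_trials))  # 0.25
```

Reporting both rates over the same set of trials, with and without the defense, is what allows the security and utility sides of the tradeoff to be compared directly.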

Circularity Check

0 steps flagged

No significant circularity; claims rest on external LLM/SMT components and benchmark evaluation

full rationale

The paper defines Progent as LLM-generated symbolic policies that are enforced by a deterministic check on every tool call, with an SMT solver classifying each policy update to enforce monotonic confinement; security and utility claims are then validated directly on external benchmarks (AgentDojo, ASB) and real frameworks (LangChain, OpenAI Agents SDK). No equation or result is shown to reduce by construction to a fitted parameter, self-citation, or renamed input; the central guarantee follows from the SMT decision rule applied to externally supplied policy proposals, making the derivation self-contained against those external oracles.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the unproven reliability of LLM policy generation for security-critical decisions and on the assumption that benchmark tasks adequately represent real-world attack surfaces and utility requirements.

axioms (2)
  • domain assumption: An LLM can generate and update policies that accurately reflect user intent and security requirements for diverse tasks.
    Invoked when the paper states that an LLM automatically generates the initial policy from the user's task and updates it during execution.
  • standard math: The SMT solver correctly classifies every policy update as narrowing or expansion and enforces monotonicity.
    Relies on the deterministic procedure and SMT decision procedure described for update validation.
invented entities (1)
  • monotonic confinement (no independent evidence)
    purpose: Ensures the agent's effective action space can only shrink without explicit approval, preventing silent privilege escalation.
    New mechanism introduced to combine LLM updates with deterministic safety guarantees.

pith-pipeline@v0.9.0 · 5811 in / 1397 out tokens · 31398 ms · 2026-05-22T21:06:56.911472+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

    cs.CR 2026-04 unverdicted novelty 8.0

    TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.

  2. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evidence-carrying multimodal agents decompose tool calls into predicates verified by constrained DOM/OCR/AX checkers to block hallucination-enabled unsafe actions.

  3. Do Coding Agents Understand Least-Privilege Authorization?

    cs.CR 2026-05 unverdicted novelty 7.0

    Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15...

  4. No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills

    cs.CR 2026-05 unverdicted novelty 7.0

    Sefz discovers specification violations in 29.9% of 402 real-world agent skills by translating guardrails into reachability goals and guiding LLM mutations with a multi-armed bandit.

  5. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR 2026-05 unverdicted novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  6. Sealing the Audit-Runtime Gap for LLM Skills

    cs.CR 2026-05 unverdicted novelty 7.0

    SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.

  7. KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents

    cs.SE 2026-03 accept novelty 7.0

    KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.

  8. Formal Policy Enforcement for Real-World Agentic Systems

    cs.CR 2026-02 unverdicted novelty 7.0

    FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.

  9. AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?

    cs.CR 2026-02 accept novelty 7.0

    AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.

  10. PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking actions.

  11. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving...

  12. MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

    cs.CR 2026-05 conditional novelty 6.0

    MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...

  13. SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...

  14. SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response

    cs.CR 2026-05 unverdicted novelty 6.0

    SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.

  15. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  16. Security Considerations for Multi-agent Systems

    cs.CR 2026-03 unverdicted novelty 6.0

    No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

  17. Agent Security is a Systems Problem

    cs.CR 2026-05 unverdicted novelty 5.0

    Agent security must be treated as a systems problem by viewing the AI model as untrusted and applying established systems security principles to enforce invariants.

  18. Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization

    cs.CR 2026-05 unverdicted novelty 5.0

    Conleash uses a risk lattice, policy engine, and refinement loop to deliver scoped, consent-driven authorization for MCP tool calls, reaching 98.2% accuracy and 99.4% escalation catch rate on 984 traces with 8.2 ms ov...

  19. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 5.0

    AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.

  20. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

    cs.CR 2026-05 unverdicted novelty 5.0

    A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

  21. Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

    cs.SE 2026-04 unverdicted novelty 5.0

    Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.

  22. Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

    cs.SE 2026-04 conditional novelty 5.0

    Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

  23. Agent Security is a Systems Problem

    cs.CR 2026-05 unverdicted novelty 4.0

    The paper argues that agent security is best addressed as a systems problem by applying principles from operating systems, networks, and formal methods rather than relying solely on model robustness improvements.

  24. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 4.0

    AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.

  25. From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

    cs.CR 2026-05 unverdicted novelty 3.0

    The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institution...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 22 Pith papers · 9 internal anchors

  1. [1]

    Contributors to all-hands-ai/openhands

    All-Hands-AI/OpenHands. Contributors to all-hands-ai/openhands. https://github.com/All-Hands-AI/OpenHands/graphs/contributors?from=5%2F4%2F2025, 2025. Accessed: 2025-08-24

  2. [2]

    AWS Identity and Access Management (IAM)

    Amazon Web Services. AWS Identity and Access Management (IAM). https://aws.amazon.com/iam/. Accessed: 2025-04-12

  4. [4]

    Claude code

    Anthropic. Claude code. https://www.anthropic.com/claude-code, 2025. Accessed: 2025-08-24

  5. [5]

    Introducing claude 4

    Anthropic. Introducing claude 4. https://www.anthropic.com/news/claude-4, 2025

  6. [6]

    Runtime verification meets android security

    Andreas Bauer, Jan-Christoph Küster, and Gil Vegliach. Runtime verification meets android security. In NASA Formal Methods Symposium, 2012

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  8. [8]

    Struq: Defending against prompt injection with structured queries

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries. In USENIX Security Symposium, 2025

  9. [9]

    Secalign: Defending against prompt injection with preference optimization

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. In The ACM Conference on Computer and Communications Security (CCS), 2025

  10. [10]

    Meta secalign: A secure foundation llm against prompt injection attacks

    Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks. arXiv preprint arXiv:2507.02735, 2025

  11. [11]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems, 2024

  12. [12]

    How not to detect prompt injections with an llm

    Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, and Somesh Jha. How not to detect prompt injections with an llm. arXiv preprint arXiv:2507.05630, 2025

  13. [13]

    Agent overview

    Cursor Team. Agent overview. https://docs.cursor.com/en/agent/overview, 2025. Accessed: 2025-08-24

  14. [14]

    Cedar: A new language for expressive, fast, safe, and analyzable authorization

    Joseph W Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks, Kesha Hietala, Eleftherios Ioannidis, John Kastner, Anwar Mamat, et al. Cedar: A new language for expressive, fast, safe, and analyzable authorization. Proceedings of the ACM on Programming Languages, 8(OOPSLA1):670–697, 2024

  15. [15]

    Z3: An efficient smt solver

    Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. In TACAS, 2008

  16. [16]

    Defeating Prompt Injections by Design

    Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025

  17. [17]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  18. [18]

    Binder, a logic-based security language

    John DeTreville. Binder, a logic-based security language. In Proceedings 2002 IEEE Symposium on Security and Privacy, pages 105–113. IEEE, 2002

  19. [19]

    Github mcp server: Github’s official mcp server

    GitHub. Github mcp server: Github’s official mcp server. https://github.com/github/github-mcp-server, 2024. GitHub repository

  20. [20]

    Gemini 2.5: Updates to our family of thinking models

    Google. Gemini 2.5: Updates to our family of thinking models. https://developers.googleblog.com/en/gemini-2-5-thinking-model-updates/, 2025

  21. [21]

    Identity and Access Management (IAM)

    Google Cloud. Identity and Access Management (IAM). https://cloud.google.com/iam/, 2025. Accessed: 2025-04-12

  22. [22]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023

  23. [23]

    The emerged security and privacy of llm agent: A survey with case studies

    Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S Yu. The emerged security and privacy of llm agent: A survey with case studies. arXiv preprint arXiv:2407.19354, 2024

  24. [24]

    Deberta: Decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021

  25. [25]

    Defending Against Indirect Prompt Injection Attacks With Spotlighting

    Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720, 2024

  26. [26]

    Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, ...

  27. [27]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  28. [28]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

  29. [29]

    Github mcp exploited: Accessing private repositories via mcp

    Invariant Labs. Github mcp exploited: Accessing private repositories via mcp. https://invariantlabs.ai/blog/mcp-github-vulnerability, December 2024. Blog post

  30. [30]

    JSON. JSON. https://www.json.org/json-en.html, 2025. Accessed: 2025-01-10

  31. [31]

    JSON Schema

    JSON Schema. JSON Schema. https://json-schema.org/, 2025. Accessed: 2025-01-10

  32. [32]

    Gmail Toolkit

    LangChain. Gmail Toolkit. https://python.langchain.com/docs/integrations/tools/gmail/, 2025. Accessed: 2025-01-10

  33. [33]

    Instruction defense

    Learn Prompting. Instruction defense. https://learnprompting.org/docs/prompt_hacking/defensive_measures/instruction, 2024. Accessed: 2025-08-24

  34. [34]

    Random sequence enclosure

    Learn Prompting. Random sequence enclosure. https://learnprompting.org/docs/prompt_hacking/defensive_measures/random_sequence. Accessed: 2025-08-24

  36. [36]

    Sandwich defense

    Learn Prompting. Sandwich defense. https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense, 2024. Accessed: 2025-08-24

  37. [37]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In NeurIPS, 2020

  38. [38]

    Gentel-safe: A unified benchmark and shielding framework for defending against prompt injection attacks

    Rongchang Li, Minjie Chen, Chang Hu, Han Chen, Wenpeng Xing, and Meng Han. Gentel-safe: A unified benchmark and shielding framework for defending against prompt injection attacks. arXiv preprint arXiv:2409.19521, 2024

  39. [39]

    Sapper: A language for hardware-level security policy enforcement

    Xun Li, Vineeth Kashyap, Jason K Oberg, Mohit Tiwari, Vasanth Ram Rajarathinam, Ryan Kastner, Timothy Sherwood, Ben Hardekopf, and Frederic T Chong. Sapper: A language for hardware-level security policy enforcement. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pages 97–112, 2014

  40. [40]

    Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

    Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024

  41. [41]

    Eia: Environmental injection attack on generalist web agents for privacy leakage

    Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage. ICLR, 2025

  42. [42]

    Automatic and universal prompt injection attacks against large language models

    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. arXiv preprint arXiv:2403.04957, 2024

  43. [43]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023

  44. [44]

    Datasentinel: A game-theoretic detection of prompt injection attacks

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. Proceedings 2025 IEEE Symposium on Security and Privacy, 2025

  45. [45]

    Llama Prompt Guard 2

    Meta. Llama Prompt Guard 2. https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/, 2025. Accessed: 2025-08-14

  46. [46]

    Azure Policy Documentation

    Microsoft. Azure Policy Documentation. https://learn.microsoft.com/en-us/azure/governance/policy/, 2025. Accessed: 2025-04-12

  47. [47]

    Use agent mode in VS Code

    Microsoft Corporation. Use agent mode in VS Code. https://code.visualstudio.com/docs/copilot/chat/chat-agent-mode, 2025. Accessed: 2025-08-24

  48. [48]

    Adversarial search engine optimization for large language models

    Fredrik Nestaas, Edoardo Debenedetti, and Florian Tramèr. Adversarial search engine optimization for large language models. In ICLR, 2025

  49. [49]

    Function calling – OpenAI API

    OpenAI. Function calling – OpenAI API. https://platform.openai.com/docs/guides/function-calling, 2025. Accessed: 2025-01-10

  50. [50]

    Introducing gpt-4.1 in the api

    OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, 2025

  51. [51]

    Ignore previous prompt: Attack techniques for language models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. NeurIPS ML Safety Workshop, 2022

  52. [52]

    Fine-tuned deberta-v3-base for prompt injection detection

    ProtectAI.com. Fine-tuned deberta-v3-base for prompt injection detection. https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2, 2024

  53. [53]

    python-jsonschema/jsonschema – GitHub

    python-jsonschema. python-jsonschema/jsonschema – GitHub. https://github.com/python-jsonschema/jsonschema, 2025. Accessed: 2025-01-10

  54. [54]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

  55. [55]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In NeurIPS, 2023

  56. [56]

    Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records

    Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May Dongmei Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22315–22339, 2024

  57. [57]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

  58. [58]

    The dual llm pattern for building ai assistants that can resist prompt injection

    Simon Willison. The dual llm pattern for building ai assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm-pattern/, 2023. Accessed: 2025-08-24

  59. [59]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024

  60. [60]

    Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In NeurIPS, 2023

  61. [61]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18, 2024

  62. [62]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In ICML, 2024

  63. [63]

    Openhands: An open platform for AI ...

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for AI ...

  64. [64]

    Agentvigil: Generic black-box red-teaming for indirect prompt injection against llm agents

    Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. Agentvigil: Generic black-box red-teaming for indirect prompt injection against llm agents. arXiv preprint arXiv:2505.05849, 2025

  65. [65]

    Dissecting adversarial robustness of multimodal lm agents

    Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Dissecting adversarial robustness of multimodal lm agents. In NeurIPS 2024 Workshop on Open-World Agents, 2024

  66. [66]

    System-level defense against indirect prompt injection attacks: An information flow control perspective

    Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. System-level defense against indirect prompt injection attacks: An information flow control perspective. arXiv preprint arXiv:2409.19091, 2024

  67. [67]

    A new era in llm security: Exploring security concerns in real-world llm-based systems

    Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick McDaniel, and Chaowei Xiao. A new era in llm security: Exploring security concerns in real-world llm-based systems. arXiv preprint arXiv:2402.18649, 2024

  68. [68]

    Autogen: Enabling next-gen llm applications via multi-agent conversation framework

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. In COLM, 2024

  69. [69]

    IsolateGPT: An Execution Isolation Architecture for LLM-Based Systems

    Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. IsolateGPT: An Execution Isolation Architecture for LLM-Based Systems. In Network and Distributed System Security Symposium (NDSS), 2025

  70. [70]

    Advweb: Controllable black-box attacks on vlm-powered web agents

    Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advweb: Controllable black-box attacks on vlm-powered web agents. arXiv preprint arXiv:2410.17401, 2024

  71. [71]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023

  72. [72]

    Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. In ICLR, 2025

  73. [73]

    Attacking vision-language computer agents via pop-ups

    Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. arXiv preprint arXiv:2411.02391, 2024

  74. [74]

    Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. In USENIX Security Symposium, 2025.

A Sample policies

Our implementation uses the JSON ecosystem. We give samples of the policies in Figures 13 and 14.

B Experiment Details

We consistently use gpt-4...