Progent: Securing AI Agents with Privilege Control

Dawn Song; Hongwei Li; Jingxuan He; Linyu Wu; Tianneng Shi; Wenbo Guo; Zhun Wang

arxiv: 2504.11703 · v3 · pith:7PZFUWQOnew · submitted 2025-04-16 · 💻 cs.CR · cs.AI

Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, Dawn Song

Pith reviewed 2026-05-22 21:06 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords AI agents · privilege control · security policies · prompt injection · SMT solver · least privilege · monotonic confinement · tool calls

The pith

Progent secures AI agents by representing privileges as symbolic rules over tool calls. An LLM generates and updates these rules, and an SMT solver classifies each update as a narrowing or an expansion to enforce monotonic confinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI agents that call external tools are vulnerable to attacks such as indirect prompt injection, which can trigger unauthorized actions. Security requirements shift with each user task and execution state, and any defense must avoid crippling the agent's ability to finish its work. Progent addresses this by turning the principle of least privilege into concrete symbolic policies that list exactly which tool names and arguments are permitted. An LLM creates the initial policy from the stated task and proposes updates as the agent runs; an SMT solver then classifies each change as a narrowing that applies automatically or an expansion that needs explicit user approval. The result is deterministic checking of every tool call against the current policy, so the set of allowed actions can only shrink or stay the same unless the user approves an expansion.

Core claim

Progent represents privilege as a security policy consisting of symbolic rules over tool names and arguments. These rules specify which tool calls are allowed for task completion and which unnecessary ones are blocked for security. Every tool call is checked against such a policy through a deterministic procedure, enforcing the principle of least privilege. To handle diverse user tasks and evolving execution contexts, an LLM automatically generates the initial policy from the user's task and updates it during execution as new information arrives. Each proposed update is determined by an SMT solver to be either a narrowing (applied automatically) or an expansion (requiring explicit approval), ensuring that the agent's effective action space can only shrink without approval (monotonic confinement).
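
A minimal sketch of what a deterministic check over tool names and arguments might look like, in Python. The `Rule` class, the first-match-wins ordering, the default-deny fallback, and the example tools (`send_email`, `read_file`) are assumptions made for illustration; they are not Progent's actual policy language or API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical representation: a condition is a predicate over one argument value.
Condition = Callable[[Any], bool]


@dataclass
class Rule:
    tool: str                                  # tool name the rule applies to
    effect: str                                # "allow" or "forbid"
    conditions: Dict[str, Condition] = field(default_factory=dict)

    def matches(self, tool: str, args: Dict[str, Any]) -> bool:
        if tool != self.tool:
            return False
        # Every constrained argument must be present and satisfy its condition.
        return all(name in args and cond(args[name])
                   for name, cond in self.conditions.items())


def check(policy: List[Rule], tool: str, args: Dict[str, Any]) -> bool:
    """Return True if the call is permitted.

    Assumed semantics: the first matching rule decides; no match means deny.
    """
    for rule in policy:
        if rule.matches(tool, args):
            return rule.effect == "allow"
    return False


# Example policy: email only inside example.com, file reads only under /workspace/.
policy = [
    Rule("send_email", "forbid",
         {"to": lambda v: not str(v).endswith("@example.com")}),
    Rule("send_email", "allow"),
    Rule("read_file", "allow",
         {"path": lambda v: str(v).startswith("/workspace/")}),
]
```

Because the check is a plain function of the policy and the call, the same call against the same policy always yields the same decision, which is the property the text describes as deterministic.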

What carries the argument

Symbolic security policies over tool names and arguments, checked by a deterministic procedure and updated through LLM proposals that an SMT solver classifies as automatic narrowing or approval-required expansion, which maintains monotonic confinement.
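
The narrowing-versus-expansion decision can be illustrated without a solver when policies are simple allowlists. The sketch below is a stand-in for the SMT check, not the paper's implementation: it assumes each tool takes one string argument and that a policy maps a tool name to the finite set of permitted values, so the classification reduces to a subset test. The names `Policy` and `classify_update` are hypothetical.

```python
from typing import Dict, FrozenSet

# Assumed model: tool name -> permitted values of its single argument.
# A tool missing from the mapping cannot be called at all.
Policy = Dict[str, FrozenSet[str]]


def classify_update(old: Policy, new: Policy) -> str:
    """Return "narrowing" if every call allowed by `new` was already allowed
    by `old`; otherwise return "expansion"."""
    for tool, allowed in new.items():
        if not allowed <= old.get(tool, frozenset()):
            return "expansion"
    return "narrowing"


old_policy: Policy = {
    "send_email": frozenset({"alice", "bob"}),
    "read_file": frozenset({"/workspace/notes.txt"}),
}
```

With symbolic constraints (ranges, patterns, arbitrary predicates) the allowed sets are no longer enumerable, which is where a solver would take over from the subset test used here.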

Load-bearing premise

An LLM can reliably generate initial policies and propose updates that correctly capture the user's intended task scope and security needs without omitting necessary tools or permitting unsafe ones.

What would settle it

A successful indirect prompt injection that causes an unauthorized tool call to execute after policy checking, or a measurable drop in task success rate on the same benchmarks when the policy blocks actions the agent needs.
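
Both outcomes can be expressed as percentages over benchmark runs. The sketch below assumes each run is logged with two booleans; the `Run` record and the function names are hypothetical, and the benchmarks' own scoring scripts may define these quantities differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Run:
    task_completed: bool     # did the agent finish the user's task?
    attack_succeeded: bool   # did an injected, unauthorized tool call execute?


def _rate(flags: Iterable[bool]) -> float:
    """Percentage of True values; raises on an empty input."""
    values = list(flags)
    if not values:
        raise ValueError("need at least one run")
    return 100.0 * sum(values) / len(values)


def utility(runs: Iterable[Run]) -> float:
    """Task completion rate, in percent."""
    return _rate(r.task_completed for r in runs)


def attack_success_rate(runs: Iterable[Run]) -> float:
    """Share of runs in which the attack achieved its goal, in percent."""
    return _rate(r.attack_succeeded for r in runs)
```

Comparing these two numbers with and without the policy check, on identical tasks, is what separates a security gain from a utility loss.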

Figures

Figures reproduced from arXiv: 2504.11703 by Dawn Song, Hongwei Li, Jingxuan He, Linyu Wu, Tianneng Shi, Wenbo Guo, Zhun Wang.

**Figure 1.** Left: a realistic attack [28] exploiting coding agents to exfiltrate sensitive data about private GitHub repositories. Right top: Progent’s overall design as a proxy to enforce privilege control over agents’ tool calls. Right bottom: Progent’s precise and fine-grained security policies to prevent data leakage while maintaining agent utility. … like GitHub [18] to access code repositories, handle issues, mana…

**Figure 2.** An example of a workspace agent that performs … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** A formal definition of tools in LLM agents. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Progent’s domain-specific language for defining … [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Comparison between vanilla agent (no defense), prior defenses, and Progent on AgentDojo […] [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]

**Figure 6.** Comparison results on ASB [70]. Panels: Utility (no attack), Utility (under attack), ASR (under attack), each on a 0 to 100 scale; bars: No defense, Progent. Bar labels as extracted, in legend order: 77.0, 74.1; 19.6, 64.4; 72.6, 0.0. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png]

**Figure 8.** Progent’s consistent effectiveness over different agent LLMs, demonstrated on AgentDojo […] [PITH_FULL_IMAGE:figures/full_fig_p010_8.png]

**Figure 9.** Experimental results of Progent-LLM. … to ensure both utility (the ability to complete the task) and security (preventing unauthorized actions). The LLM.update primitive addresses this challenge. During agent execution, LLM.update takes the original query, the toolkit, current policies, the most recent tool call, and its observation as input. It then generates an updated version of the policies. This is a t…

**Figure 10.** Progent’s consistent effectiveness of different LLMs for policy … [PITH_FULL_IMAGE:figures/full_fig_p012_10.png]

**Figure 12.** Progent-LLM’s consistent effectiveness over different agent LLMs, demonstrated on AgentDojo […] [PITH_FULL_IMAGE:figures/full_fig_p019_12.png]

**Figure 13.** The policies in Figure … [PITH_FULL_IMAGE:figures/full_fig_p020_13.png]

**Figure 14.** The policies in Figure … [PITH_FULL_IMAGE:figures/full_fig_p021_14.png]

**Figure 15.** The policies for AgentDojo Banking. [PITH_FULL_IMAGE:figures/full_fig_p022_15.png]

**Figure 16.** Complete prompt for initial policy generation. [PITH_FULL_IMAGE:figures/full_fig_p023_16.png]

**Figure 17.** Complete prompt for checking if policy update is needed. [PITH_FULL_IMAGE:figures/full_fig_p024_17.png]

**Figure 18.** Complete prompt for performing policy update. [PITH_FULL_IMAGE:figures/full_fig_p025_18.png]

Abstract

AI agents interact with external environments through tool calls, exposing them to attacks like indirect prompt injection that can trigger unauthorized actions. Securing these agents is challenging: they behave autonomously and probabilistically, security requirements evolve depending on the user's task and execution state, and there is an inherent tradeofff between security and utility. In this work, we introduce Progent, a novel framework that secures AI agents via privilege control. Progent represents privilege as a security policy consisting of symbolic rules over tool names and arguments. These rules specify which tool calls are allowed for task completion and which unnecessary ones are blocked for security. Every tool call is checked against such a policy through a deterministic procedure, enforcing the principle of least privilege. To handle diverse user tasks and evolving execution contexts, an LLM automatically generates the initial policy from the user's task and updates it during execution as new information arrives. Each proposed update is determined by an SMT solver to be either a narrowing (applied automatically) or an expansion (requiring explicit approval), ensuring that the agent's effective action space can only shrink without approval (monotonic confinement). This deterministic update mechanism preserves utility and prevents silent privilege escalation, even when adversarial inputs are present. Our evaluation on popular benchmarks (i.e., AgentDojo and ASB) shows that Progent significantly reduces attack success rates while maintaining high utility. We further validate Progent's practicality by showcasing its effectiveness in real-world agent frameworks such as LangChain and OpenAI Agents SDK.
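
The enforcement-and-update cycle the abstract describes can be sketched end to end in a few lines of Python. Everything here is an illustrative assumption rather than Progent's interface: policies are finite allowlists, the LLM's proposal is a stub function (`injected_proposal`), user approval is a callback, and the subset test stands in for the SMT solver.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Policy = Dict[str, FrozenSet[str]]   # tool -> permitted values of its one argument
Call = Tuple[str, str]               # (tool name, argument value)


def is_allowed(policy: Policy, call: Call) -> bool:
    tool, arg = call
    return arg in policy.get(tool, frozenset())


def is_narrowing(old: Policy, new: Policy) -> bool:
    # Stand-in for the solver: `new` must allow nothing that `old` did not.
    return all(vals <= old.get(tool, frozenset()) for tool, vals in new.items())


def run_agent(calls: List[Call], policy: Policy,
              propose_update: Callable[[Policy, Call], Policy],
              approve: Callable[[Policy, Policy], bool]):
    """Check each call, then gate the proposed policy update."""
    executed, blocked = [], []
    for call in calls:
        if not is_allowed(policy, call):
            blocked.append(call)          # denied calls never run
            continue
        executed.append(call)
        proposal = propose_update(policy, call)
        # Narrowings apply automatically; expansions need explicit approval.
        if is_narrowing(policy, proposal) or approve(policy, proposal):
            policy = proposal
    return executed, blocked, policy


initial: Policy = {
    "read_file": frozenset({"invoice.txt"}),
    "send_email": frozenset({"alice@corp.test", "bob@corp.test"}),
}


def injected_proposal(policy: Policy, call: Call) -> Policy:
    # Simulates an LLM misled by injected text into adding an attacker address.
    if call[0] == "read_file":
        widened = dict(policy)
        widened["send_email"] = policy["send_email"] | {"attacker@evil.test"}
        return widened
    return policy


calls = [("read_file", "invoice.txt"),
         ("send_email", "attacker@evil.test"),
         ("send_email", "bob@corp.test")]
executed, blocked, final = run_agent(calls, initial, injected_proposal,
                                     approve=lambda old, new: False)
```

In this run the widened proposal is classified as an expansion and rejected because the approval callback declines, so the attacker address never enters the policy and the later call to it is blocked while the legitimate email still goes through.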

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Progent's monotonic SMT confinement is a reasonable mechanism, but the security and utility claims rest on the untested quality of LLM policy generation.

The key takeaway is that Progent uses LLM-generated symbolic policies with SMT-enforced monotonic updates to control agent privileges, but the approach's effectiveness still hinges on the quality of those LLM-generated policies. The framework is new in how it ties together policy synthesis, deterministic checking, and the monotonic guarantee for evolving tasks. It does a good job describing a deployable system that plugs into existing agent libraries and claims strong benchmark results on attack mitigation without much utility loss.

The soft spots are in the evaluation details and the reliance on LLM correctness. The abstract gives no numbers, so we can't judge the effect sizes or variance. More importantly, as the stress test notes, if the LLM initially omits tools or allows unsafe ones, the SMT check only locks that in rather than correcting it. The paper would need to show that policy generation is reliable or provide separate validation for that step.

Overall, this is for security researchers focused on AI agents. It should go to peer review because the problem is real and the proposed mechanism is concrete enough to be worth testing and refining.

Referee Report

2 major / 1 minor

Summary. The paper introduces Progent, a framework for securing AI agents via privilege control. It represents privileges as symbolic policies over tool names and arguments, uses an LLM to generate initial policies from user tasks and update them during execution, and employs an SMT solver to enforce monotonic confinement (updates are either automatic narrowings or explicit-approval expansions). Every tool call is checked deterministically against the policy. The central claim is that this reduces attack success rates on AgentDojo and ASB while preserving high utility, and that it integrates practically with LangChain and OpenAI Agents SDK.

Significance. If the empirical claims hold, the work provides a practical mechanism for least-privilege enforcement in autonomous agents by combining LLM flexibility for policy creation with deterministic checking and monotonic update rules. The SMT-based confinement is a concrete, verifiable component that directly addresses silent privilege escalation, which is a strength relative to purely LLM-based guardrails.

major comments (2)

[Evaluation] Evaluation section: the abstract claims that Progent 'significantly reduces attack success rates while maintaining high utility' on AgentDojo and ASB, yet reports no quantitative numbers, error bars, baseline comparisons, or details on how utility is measured (e.g., task completion rate, number of tool calls). This absence makes the central empirical claim impossible to assess for robustness or effect size.
[Policy generation and update mechanism] Policy generation and update mechanism (described in the abstract and §3): the security and utility guarantees rest on the assumption that the LLM reliably produces initial policies and updates that correctly encode task scope without omitting required tools or allowing unsafe argument values. The monotonic-confinement property only prevents silent expansion; it cannot correct an initially flawed policy. No independent validation (e.g., manual audit of generated policies or fidelity metrics) is described, so benchmark outcomes are conditional on unverified LLM output quality.

minor comments (1)

[Abstract] Abstract: 'tradeofff' contains a typographical error and should read 'tradeoff'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Evaluation] Evaluation section: the abstract claims that Progent 'significantly reduces attack success rates while maintaining high utility' on AgentDojo and ASB, yet reports no quantitative numbers, error bars, baseline comparisons, or details on how utility is measured (e.g., task completion rate, number of tool calls). This absence makes the central empirical claim impossible to assess for robustness or effect size.

Authors: The referee correctly identifies that the abstract states the empirical claim without supporting numbers. The evaluation section of the manuscript does contain the detailed results, including attack success rates on both benchmarks, baseline comparisons, and utility measured via task completion rate. To address the concern directly, we will revise the abstract to include key quantitative results (e.g., specific attack success rate reductions and utility percentages), error bars where applicable, and explicit baseline comparisons. We will also ensure the utility metric is defined in the abstract.
revision: yes

Referee: [Policy generation and update mechanism] Policy generation and update mechanism (described in the abstract and §3): the security and utility guarantees rest on the assumption that the LLM reliably produces initial policies and updates that correctly encode task scope without omitting required tools or allowing unsafe argument values. The monotonic-confinement property only prevents silent expansion; it cannot correct an initially flawed policy. No independent validation (e.g., manual audit of generated policies or fidelity metrics) is described, so benchmark outcomes are conditional on unverified LLM output quality.

Authors: We agree that the approach relies on the quality of LLM-generated policies and that monotonic confinement only prevents unauthorized expansions rather than correcting initial policy errors. The current manuscript does not include independent validation such as manual audits or fidelity metrics. The reported high utility on the benchmarks provides indirect evidence that the generated policies are generally appropriate for the tasks. We will add a discussion of this assumption and its limitations in the revised manuscript, along with example generated policies in the appendix to improve transparency. A comprehensive manual audit of all policies is not feasible within the scope of this revision.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on external LLM/SMT components and benchmark evaluation

full rationale

The paper defines Progent as an LLM-generated symbolic policy checked by a deterministic SMT procedure that enforces monotonic narrowing; security and utility claims are then validated directly on external benchmarks (AgentDojo, ASB) and real frameworks (LangChain, OpenAI SDK). No equation or result is shown to reduce by construction to a fitted parameter, self-citation, or renamed input; the central guarantee follows from the SMT decision rule applied to externally supplied policy proposals, making the derivation self-contained against those external oracles.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the unproven reliability of LLM policy generation for security-critical decisions and on the assumption that benchmark tasks adequately represent real-world attack surfaces and utility requirements.

axioms (2)

domain assumption · LLM can generate and update policies that accurately reflect user intent and security requirements for diverse tasks
Invoked when the paper states that an LLM automatically generates the initial policy from the user's task and updates it during execution.

standard math · SMT solver correctly classifies every policy update as narrowing or expansion and enforces monotonicity
Relies on the deterministic procedure and SMT decision procedure described for update validation.

invented entities (1)

monotonic confinement · no independent evidence
purpose: Ensures the agent's effective action space can only shrink without explicit approval, preventing silent privilege escalation
New mechanism introduced to combine LLM updates with deterministic safety guarantees.
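
One way to state the property formally, offered as a reading of the description above rather than the paper's own definition. The symbols are assumptions introduced here: A(P) denotes the set of tool calls (name plus arguments) that policy P allows, and φ_P denotes a logical encoding of P that a solver could reason about.

```latex
% Narrowing as set inclusion, and its equivalent satisfiability query.
\[
  P \to P' \text{ is a narrowing}
  \iff \mathcal{A}(P') \subseteq \mathcal{A}(P)
  \iff \varphi_{P'} \land \lnot \varphi_{P} \text{ is unsatisfiable}.
\]
% If every unapproved update in a run P_0, P_1, \dots, P_n is a narrowing,
% induction on the number of updates gives the confinement chain:
\[
  \mathcal{A}(P_n) \subseteq \mathcal{A}(P_{n-1}) \subseteq \dots \subseteq \mathcal{A}(P_0).
\]
```

Under this reading, any call that executes without user approval was already permitted by the initial policy, which is why the quality of that initial policy carries so much weight.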

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
cs.CR 2026-04 unverdicted novelty 8.0

TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evidence-carrying multimodal agents decompose tool calls into predicates verified by constrained DOM/OCR/AX checkers to block hallucination-enabled unsafe actions.
Do Coding Agents Understand Least-Privilege Authorization?
cs.CR 2026-05 unverdicted novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15...
No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
cs.CR 2026-05 unverdicted novelty 7.0

Sefz discovers specification violations in 29.9% of 402 real-world agent skills by translating guardrails into reachability goals and guiding LLM mutations with a multi-armed bandit.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
Sealing the Audit-Runtime Gap for LLM Skills
cs.CR 2026-05 unverdicted novelty 7.0

SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.
KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents
cs.SE 2026-03 accept novelty 7.0

KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.
Formal Policy Enforcement for Real-World Agentic Systems
cs.CR 2026-02 unverdicted novelty 7.0

FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?
cs.CR 2026-02 accept novelty 7.0

AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.
PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents
cs.CR 2026-05 unverdicted novelty 6.0

PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking actions.
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
cs.AI 2026-05 unverdicted novelty 6.0

Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving...
MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
cs.CR 2026-05 conditional novelty 6.0

MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
cs.CR 2026-05 unverdicted novelty 6.0

SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response
cs.CR 2026-05 unverdicted novelty 6.0

SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
Security Considerations for Multi-agent Systems
cs.CR 2026-03 unverdicted novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
Agent Security is a Systems Problem
cs.CR 2026-05 unverdicted novelty 5.0

Agent security must be treated as a systems problem by viewing the AI model as untrusted and applying established systems security principles to enforce invariants.
Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization
cs.CR 2026-05 unverdicted novelty 5.0

Conleash uses a risk lattice, policy engine, and refinement loop to deliver scoped, consent-driven authorization for MCP tool calls, reaching 98.2% accuracy and 99.4% escalation catch rate on 984 traces with 8.2 ms ov...
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 5.0

AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
cs.CR 2026-05 unverdicted novelty 5.0

A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
cs.SE 2026-04 unverdicted novelty 5.0

Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution
cs.SE 2026-04 conditional novelty 5.0

Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
Agent Security is a Systems Problem
cs.CR 2026-05 unverdicted novelty 4.0

The paper argues that agent security is best addressed as a systems problem by applying principles from operating systems, networks, and formal methods rather than relying solely on model robustness improvements.
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 4.0

AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
cs.CR 2026-05 unverdicted novelty 3.0

The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institution...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 22 Pith papers · 9 internal anchors

[1] All-Hands-AI/OpenHands. Contributors to all-hands-ai/openhands. https://github.com/All-Hands-AI/OpenHands/graphs/contributors?from=5%2F4%2F2025, 2025. Accessed: 2025-08-24.
[2] Amazon Web Services. AWS Identity and Access Management (IAM). https://aws.amazon.com/iam/. Accessed: 2025-04-12.
[4] Anthropic. Claude code. https://www.anthropic.com/claude-code, 2025. Accessed: 2025-08-24.
[5] Anthropic. Introducing claude 4. https://www.anthropic.com/news/claude-4, 2025.
[6] Andreas Bauer, Jan-Christoph Küster, and Gil Vegliach. Runtime verification meets android security. In NASA Formal Methods Symposium, 2012.
[7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[8] Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries. In USENIX Security Symposium, 2025.
[9] Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. In The ACM Conference on Computer and Communications Security (CCS), 2025.
[10] Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks. arXiv preprint arXiv:2507.02735, 2025.
[11] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems, 2024.
[12] Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, and Somesh Jha. How not to detect prompt injections with an llm. arXiv preprint arXiv:2507.05630, 2025.
[13] Cursor Team. Agent overview. https://docs.cursor.com/en/agent/overview, 2025. Accessed: 2025-08-24.
[14] Joseph W Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks, Kesha Hietala, Eleftherios Ioannidis, John Kastner, Anwar Mamat, et al. Cedar: A new language for expressive, fast, safe, and analyzable authorization. Proceedings of the ACM on Programming Languages, 8(OOPSLA1):670–697, 2024.
[15] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. In TACAS, 2008.
[16] Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025.
[17] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[18] John DeTreville. Binder, a logic-based security language. In Proceedings 2002 IEEE Symposium on Security and Privacy, pages 105–113. IEEE, 2002.
[19] GitHub. Github mcp server: Github’s official mcp server. https://github.com/github/github-mcp-server, 2024. GitHub repository.
[20] Google. Gemini 2.5: Updates to our family of thinking models. https://developers.googleblog.com/en/gemini-2-5-thinking-model-updates/, 2025.
[21] Google Cloud. Identity and Access Management (IAM). https://cloud.google.com/iam/, 2025. Accessed: 2025-04-12.
[22] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023.
[23] Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S Yu. The emerged security and privacy of llm agent: A survey with case studies. arXiv preprint arXiv:2407.19354, 2024.
[24] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021.
[25] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against indirect prompt injection attacks with spotlighting. arXiv preprint arXiv:2403.14720, 2024.
[26]

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qi- hui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vid- gen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, ...

work page 2024
[27]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Ak- ila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29] Invariant Labs. Github mcp exploited: Accessing private repositories via mcp. https://invariantlabs.ai/blog/mcp-github-vulnerability, December 2024. Blog post.

[30] JSON. JSON. https://www.json.org/json-en.html, 2025. Accessed: 2025-01-10.

[31] JSON Schema. JSON Schema. https://json-schema.org/, 2025. Accessed: 2025-01-10.

[32] LangChain. Gmail Toolkit. https://python.langchain.com/docs/integrations/tools/gmail/, 2025. Accessed: 2025-01-10.

[33] Learn Prompting. Instruction defense. https://learnprompting.org/docs/prompt_hacking/defensive_measures/instruction, 2024. Accessed: 2025-08-24.

[34] Learn Prompting. Random sequence enclosure. https://learnprompting.org/docs/prompt_hacking/defensive_measures/random_sequence. Accessed: 2025-08-24.

[36] Learn Prompting. Sandwich defense. https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense, 2024. Accessed: 2025-08-24.

[37] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In NeurIPS, 2020.

[38] Rongchang Li, Minjie Chen, Chang Hu, Han Chen, Wenpeng Xing, and Meng Han. Gentel-safe: A unified benchmark and shielding framework for defending against prompt injection attacks. arXiv preprint arXiv:2409.19521, 2024.

[39] Xun Li, Vineeth Kashyap, Jason K Oberg, Mohit Tiwari, Vasanth Ram Rajarathinam, Ryan Kastner, Timothy Sherwood, Ben Hardekopf, and Frederic T Chong. Sapper: A language for hardware-level security policy enforcement. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pages 97–112, 2014.

[40] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024.

[41] Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage. ICLR, 2025.

[42] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. arXiv preprint arXiv:2403.04957, 2024.

[43] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023.

[44] Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. Proceedings 2025 IEEE Symposium on Security and Privacy, 2025.

[45] Meta. Llama Prompt Guard 2. https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/, 2025. Accessed: 2025-08-14.

[46] Microsoft. Azure Policy Documentation. https://learn.microsoft.com/en-us/azure/governance/policy/, 2025. Accessed: 2025-04-12.

[47] Microsoft Corporation. Use agent mode in VS Code. https://code.visualstudio.com/docs/copilot/chat/chat-agent-mode, 2025. Accessed: 2025-08-24.

[48] Fredrik Nestaas, Edoardo Debenedetti, and Florian Tramèr. Adversarial search engine optimization for large language models. In ICLR, 2025.

[49] OpenAI. Function calling – OpenAI API. https://platform.openai.com/docs/guides/function-calling, 2025. Accessed: 2025-01-10.

[50] OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, 2025.

[51] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. NeurIPS ML Safety Workshop, 2022.

[52] ProtectAI.com. Fine-tuned deberta-v3-base for prompt injection detection. https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2, 2024.

[53] python-jsonschema. python-jsonschema/jsonschema – GitHub. https://github.com/python-jsonschema/jsonschema, 2025. Accessed: 2025-01-10.

[54] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.

[55] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In NeurIPS, 2023.

[56] Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May Dongmei Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22315–22339, 2024.

[57] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023.

[58] Simon Willison. The dual llm pattern for building ai assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm-pattern/, 2023. Accessed: 2025-08-24.

[59] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024.

[60] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In NeurIPS, 2023.

[61] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18, 2024.

[62] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In ICML, 2024.

[63] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for AI ...

[64] Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. Agentvigil: Generic black-box red-teaming for indirect prompt injection against llm agents. arXiv preprint arXiv:2505.05849, 2025.

[65] Chen Henry Wu, Rishi Rajesh Shah, Jing Yu Koh, Russ Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Dissecting adversarial robustness of multimodal lm agents. In NeurIPS 2024 Workshop on Open-World Agents, 2024.

[66] Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. System-level defense against indirect prompt injection attacks: An information flow control perspective. arXiv preprint arXiv:2409.19091, 2024.

[67] Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick McDaniel, and Chaowei Xiao. A new era in llm security: Exploring security concerns in real-world llm-based systems. arXiv preprint arXiv:2402.18649, 2024.

[68] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. In COLM, 2024.

[69] Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. IsolateGPT: An Execution Isolation Architecture for LLM-Based Systems. In Network and Distributed System Security Symposium (NDSS), 2025.

[70] Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advweb: Controllable black-box attacks on vlm-powered web agents. arXiv preprint arXiv:2410.17401, 2024.

[71] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023.

[72] Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. In ICLR, 2025.

[73] Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups. arXiv preprint arXiv:2411.02391, 2024.

[74] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. In USENIX Security Symposium, 2025.

A Sample policies

Our implementation uses the JSON ecosystem. We give samples of the policies in Figures 13 and 14.

B Experiment Details

We consistently use gpt-4...
