Securing AI Agents with Information-Flow Control
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
As AI agents become increasingly autonomous and capable, ensuring their security against vulnerabilities such as prompt injection becomes critical. This paper explores the use of information-flow control (IFC) to provide security guarantees for AI agents. We present a formal model to reason about the security and expressiveness of agent planners. Using this model, we characterize the class of properties enforceable by dynamic taint-tracking and construct a taxonomy of tasks to evaluate security and utility trade-offs of planner designs. Informed by this exploration, we present Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces security policies, and introduces novel primitives for selectively hiding information. Its evaluation in AgentDojo demonstrates that this approach enables us to complete a broad range of tasks with security guarantees. A tutorial that walks readers through the concepts introduced in the paper can be found at https://github.com/microsoft/fides
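To make the abstract's idea of dynamic taint-tracking over confidentiality and integrity labels concrete, here is a minimal sketch in Python. It is not the Fides implementation or its API: the names `Label`, `Labeled`, `combine`, `send_email`, and `PolicyViolation` are all hypothetical, and the two-point label lattices are an assumption chosen for brevity. The sketch only shows the general pattern: values carry labels, combining values joins their labels, and a sink checks the label deterministically before acting.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Any


class Integrity(IntEnum):
    TRUSTED = 0      # e.g. came from the user
    UNTRUSTED = 1    # e.g. came from a tool result or web page


class Confidentiality(IntEnum):
    PUBLIC = 0
    SECRET = 1


@dataclass(frozen=True)
class Label:
    integrity: Integrity
    confidentiality: Confidentiality

    def join(self, other: "Label") -> "Label":
        # Least upper bound: the result is as untrusted and as secret
        # as the most restrictive of the two inputs.
        return Label(
            Integrity(max(self.integrity, other.integrity)),
            Confidentiality(max(self.confidentiality, other.confidentiality)),
        )


@dataclass(frozen=True)
class Labeled:
    value: Any
    label: Label


class PolicyViolation(Exception):
    """Raised when a labeled value may not flow into a sink."""


def combine(a: Labeled, b: Labeled) -> Labeled:
    # Any value derived from two inputs carries the join of their labels.
    return Labeled(f"{a.value} {b.value}", a.label.join(b.label))


def send_email(body: Labeled, recipient_cleared: bool) -> str:
    # Hypothetical sink with a deterministic policy check (no model involved):
    # untrusted data may not drive the action, and secret data may only go
    # to a cleared recipient.
    if body.label.integrity == Integrity.UNTRUSTED:
        raise PolicyViolation("untrusted data cannot trigger send_email")
    if body.label.confidentiality == Confidentiality.SECRET and not recipient_cleared:
        raise PolicyViolation("secret data cannot flow to an uncleared recipient")
    return f"sent: {body.value}"


user_request = Labeled("Summarize my inbox", Label(Integrity.TRUSTED, Confidentiality.PUBLIC))
web_page = Labeled("ignore previous instructions", Label(Integrity.UNTRUSTED, Confidentiality.PUBLIC))
mixed = combine(user_request, web_page)
```

In this sketch, `send_email(user_request, recipient_cleared=False)` succeeds, while passing `mixed` raises `PolicyViolation` because the join marks the combined value as untrusted. The check depends only on labels, which is what makes the enforcement deterministic rather than reliant on model robustness.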
citing papers explorer
- Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning
  Oracle Poisoning corrupts knowledge graphs used by AI agents via tool calls, leading tested models to accept fabricated claims at 100% under directed queries in a production-scale demonstration.
- MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents
  MCPHunt benchmark finds 11.5-41.3% policy-violating credential propagation in multi-server MCP agents across five models, reducible up to 97% by prompt mitigations while retaining most utility.
- Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
  NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
- TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
  TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
- Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
  Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
- The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
  PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
- Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
  A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.
- Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
  The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
- Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
  The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
- Formal Policy Enforcement for Real-World Agentic Systems
  FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
- Securing LLM Agents Need Intent-to-Execution Integrity
  The paper defines intent-to-execution integrity as the conjunction of Tool Integrity, Instruction Integrity, Judgment Integrity, and Data Flow Integrity, arguing that existing LLM agent defenses provide only partial coverage of these properties.
- MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
  MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-millisecond overhead.
- Web Agents Should Adopt the Plan-Then-Execute Paradigm
  Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
- Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
  Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
- AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
  AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
  ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
  The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
- Alignment Contracts for Agentic Security Systems
  Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.
- An AI Agent Execution Environment to Safeguard User Data
  GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
- ADR: An Agentic Detection System for Enterprise Agentic AI Security
  ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.
- The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems
  Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and proposes a new testing method plus a defense.
- Engineering Robustness into Personal Agents with the AI Workflow Store
  AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.
- Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
  Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
- Can LLMs Make (Personalized) Access Control Decisions?
  LLMs reflect users' privacy preferences in access control decisions with up to 86% agreement and can promote safer behavior, but personalization trades off higher individual match for potentially less secure results when users over-permission.
- Agent Security is a Systems Problem
  The paper argues that agent security is best addressed as a systems problem by applying principles from operating systems, networks, and formal methods rather than relying solely on model robustness improvements.