AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

2026 · cs.AI · arXiv 2605.04785

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
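The interception pattern described in the abstract (normalize a tool call, evaluate it, and return one of four verdicts before anything runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not AgentTrust's API or ruleset: the names `ToolCall`, `RULES`, `normalize`, `evaluate`, and `guarded_execute` are hypothetical, the three regex rules are toy examples, and the normalizer only undoes two trivial shell obfuscations. SafeFix, RiskChain, and the LLM-as-Judge are not modeled.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional, Tuple


class Verdict(Enum):
    # The four verdicts named in the abstract.
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass
class ToolCall:
    tool: str      # e.g. "shell", "http", "file" (hypothetical labels)
    argument: str  # raw command, URL, or path


# Hypothetical rules, checked in order: (tool, pattern, verdict, reason).
RULES = [
    ("shell", re.compile(r"\brm\s+-(rf|fr)\s+/(\s|$)"), Verdict.BLOCK,
     "recursive delete of the filesystem root"),
    ("shell", re.compile(r"\brm\s+-(rf|fr)\b"), Verdict.WARN,
     "recursive forced delete"),
    ("shell", re.compile(r"\bcurl\b.*\|\s*(sh|bash)\b"), Verdict.REVIEW,
     "remote script piped into a shell"),
]


def normalize(command: str) -> str:
    """Toy normalizer: drop empty quote pairs, unescape letters, collapse spaces."""
    command = command.replace("''", "").replace('""', "")
    command = re.sub(r"\\([A-Za-z])", r"\1", command)
    return " ".join(command.split())


def evaluate(call: ToolCall) -> Tuple[Verdict, str]:
    """Return the verdict of the first matching rule, or ALLOW if none match."""
    text = normalize(call.argument) if call.tool == "shell" else call.argument
    for tool, pattern, verdict, reason in RULES:
        if tool == call.tool and pattern.search(text):
            return verdict, reason
    return Verdict.ALLOW, "no rule matched"


def guarded_execute(
    call: ToolCall, execute: Callable[[ToolCall], Any]
) -> Tuple[Verdict, Optional[Any]]:
    """Evaluate before execution; BLOCK and REVIEW never reach the executor."""
    verdict, _reason = evaluate(call)
    if verdict in (Verdict.BLOCK, Verdict.REVIEW):
        return verdict, None
    return verdict, execute(call)
```

In this sketch, an obfuscated command such as `r''m  -rf /` is normalized to `rm -rf /` before rule matching, so the executor callback is never invoked for it. Treating `warn` as "execute but flag" and `review` as "hold for a human" is a design assumption of the sketch, not something the abstract specifies.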

representative citing papers

AIRGuard: Guarding Agent Actions with Runtime Authority Control

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

cs.MA · 2026-06-01 · unverdicted · novelty 4.0

Agentic-J is a multi-agent AI assistant that converts natural language descriptions of biological image analysis tasks into executable, reproducible scripts for ImageJ/Fiji with specialised sub-agents for plugin management, code generation, debugging and reporting.

citing papers explorer


AIRGuard: Guarding Agent Actions with Runtime Authority Control cs.CR · 2026-05-27 · unverdicted · none · ref 21 · internal anchor
Agentic-J: An AI Agent for Biological Microscopy Image Analysis cs.MA · 2026-06-01 · unverdicted · none · ref 32 · internal anchor
