AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu


classification cs.AI, cs.CC, cs.CL, cs.CV, cs.LG

keywords agentdog, agentic, safety, agent, guardrail, complex, security, across


The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
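To make the three-dimensional taxonomy concrete, the following is a minimal sketch of how a diagnosis that goes beyond a binary label might be represented. It is an illustration only, not AgentDoG's actual output format: the abstract names the three axes (source, failure mode, consequence) but not their members, so every enum value, the `Diagnosis` class, and its `summary` method are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


# Hypothetical category members: the abstract defines the three axes
# but does not list the categories inside each one.
class RiskSource(Enum):          # "where" the risk originates
    USER_INPUT = "user_input"
    TOOL_OUTPUT = "tool_output"
    ENVIRONMENT = "environment"


class FailureMode(Enum):         # "how" the agent fails
    UNSAFE_ACTION = "unsafe_action"
    UNREASONABLE_ACTION = "unreasonable_action"


class Consequence(Enum):         # "what" harm results
    DATA_LEAK = "data_leak"
    FINANCIAL_LOSS = "financial_loss"


@dataclass(frozen=True)
class Diagnosis:
    """A trajectory-level verdict plus one label per taxonomy axis."""
    safe: bool
    source: Optional[RiskSource] = None
    failure_mode: Optional[FailureMode] = None
    consequence: Optional[Consequence] = None

    def __post_init__(self) -> None:
        axes = (self.source, self.failure_mode, self.consequence)
        # A safe verdict carries no risk labels; an unsafe one needs all three.
        if self.safe and any(a is not None for a in axes):
            raise ValueError("a safe diagnosis must not carry risk labels")
        if not self.safe and any(a is None for a in axes):
            raise ValueError("an unsafe diagnosis needs all three labels")

    def summary(self) -> str:
        if self.safe:
            return "safe"
        return (
            f"unsafe: source={self.source.value}, "
            f"failure_mode={self.failure_mode.value}, "
            f"consequence={self.consequence.value}"
        )


if __name__ == "__main__":
    d = Diagnosis(
        safe=False,
        source=RiskSource.TOOL_OUTPUT,
        failure_mode=FailureMode.UNSAFE_ACTION,
        consequence=Consequence.DATA_LEAK,
    )
    print(d.summary())
```

Because the axes are orthogonal, each one is a separate field rather than a single flattened label, so a downstream consumer can filter or aggregate along any one dimension independently.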


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
cs.AI 2026-04 unverdicted novelty 6.0

ATBench supplies 1,000 trajectories (503 safe, 497 unsafe) organized by risk source, failure mode, and harm to evaluate long-horizon safety in LLM-based agents.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
cs.AI 2026-04 unverdicted novelty 6.0

ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.

HearthNet: Edge Multi-Agent Orchestration for Smart Homes
cs.DC 2026-03 unverdicted novelty 6.0

HearthNet is an edge multi-agent orchestration system that runs role-specialized LLM agents locally to handle natural-language smart-home control, conflict resolution, and failure recovery through MQTT and shared state.

Security Considerations for Multi-agent Systems
cs.CR 2026-03 unverdicted novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
cs.SE 2026-04 unverdicted novelty 5.0

Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.

Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
cs.AI 2026-04 unverdicted novelty 5.0

ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI
cs.AI 2026-04 unverdicted novelty 4.0

The paper presents a layered method to translate governance objectives from standards such as ISO/IEC 42001 into four control layers for agentic AI, with runtime guardrails limited to observable, determinate, and time...