GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
Pith reviewed 2026-05-21 01:53 UTC · model grok-4.3
The pith
GuardAgent protects LLM agents by turning safety guard requests into executable code that blocks violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GuardAgent is presented as the first guardrail agent that dynamically checks whether a target agent's actions satisfy given safety guard requests. It analyzes each request to generate a task plan, then maps the plan into guardrail code for deterministic execution; an LLM performs both steps, supplemented by in-context demonstrations from a memory module of prior tasks. The paper reports over 98 percent guardrail accuracy on the EICU-AC benchmark and over 83 percent on the Mind2Web-SC benchmark.
What carries the argument
GuardAgent uses an LLM to analyze safety guard requests and produce a task plan, which is then mapped into executable guardrail code for deterministic enforcement of safety policies.
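The request-to-plan-to-code flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: `plan_from_request` and `code_from_plan` are hypothetical stand-ins that return canned output where a real system would call an LLM, and the permitted roles are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GuardResult:
    allowed: bool
    reason: str


def plan_from_request(guard_request: str, demos: List[str]) -> List[str]:
    """Hypothetical stand-in for the LLM planning step (returns a fixed plan)."""
    return [
        "extract the user's role from the target agent's input",
        "compare the role against the roles the request permits",
        "deny the action if the role is not permitted",
    ]


def code_from_plan(plan: List[str]) -> str:
    """Hypothetical stand-in for the LLM code-generation step (canned code)."""
    return (
        "def guardrail(action):\n"
        "    permitted = {'physician', 'nurse'}  # assumed roles\n"
        "    if action['role'] in permitted:\n"
        "        return True, 'role permitted'\n"
        "    return False, 'role not permitted'\n"
    )


def run_guardrail(code: str, action: Dict[str, str]) -> GuardResult:
    """Execute the generated code; the same code and action give the same result."""
    namespace: Dict[str, object] = {}
    # Emptying builtins only limits accidents; it is NOT a security sandbox.
    exec(code, {"__builtins__": {}}, namespace)
    allowed, reason = namespace["guardrail"](action)
    return GuardResult(allowed, reason)


plan = plan_from_request("Only clinical staff may query patient data.", demos=[])
code = code_from_plan(plan)
print(run_guardrail(code, {"role": "nurse"}))
print(run_guardrail(code, {"role": "visitor"}))
```

The point of the split is that the LLM is consulted only while producing the plan and the code; the final allow-or-deny decision comes from running ordinary code, so any error lies in the generated code rather than in a runtime judgment.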
If this is right
- Target LLM agents can be safeguarded from violations without any changes to their internal design or training.
- Safety rules expressed in natural language can be enforced consistently through generated code rather than repeated LLM judgments.
- The same guard mechanism applies across different agent types, such as those handling medical data access and those navigating websites.
- Performance improves by retrieving relevant past experiences instead of relying on the LLM alone for each new safety request.
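The retrieval of past experiences mentioned in the last point might look like the sketch below. The text above does not say how GuardAgent scores similarity, so token-overlap (Jaccard) similarity is used here purely as an assumption, and the memory entries are invented.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings (assumed scoring rule)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_demos(
    request: str, memory: List[Tuple[str, str]], k: int = 2
) -> List[Tuple[str, str]]:
    """Return the k stored (request, guardrail code) pairs most similar to request."""
    ranked = sorted(memory, key=lambda item: jaccard(request, item[0]), reverse=True)
    return ranked[:k]


# Hypothetical memory of earlier guard requests and the code written for them.
memory = [
    ("restrict database access by user role", "code A"),
    ("block purchases for underage users", "code B"),
    ("restrict column access by role", "code C"),
]
demos = retrieve_demos("restrict access to database columns by role", memory, k=2)
print([code for _, code in demos])
```

The retrieved pairs would then be placed in the LLM prompt as in-context demonstrations for the planning and code-generation steps.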
Where Pith is reading between the lines
- The approach might extend to agents in other high-stakes areas like financial trading or physical robotics if similar benchmarks are created.
- If the code generation step proves stable, it could reduce the frequency of full LLM calls during runtime enforcement.
- Combining this with formal verification of the generated code could address cases where the LLM produces subtle logical errors.
Load-bearing premise
An LLM can reliably analyze arbitrary safety guard requests and generate correct task plans and executable guardrail code when given in-context demonstrations from a memory of previous tasks.
What would settle it
Running GuardAgent on a set of safety guard requests where the generated code either permits a clear violation or incorrectly blocks a valid action on the EICU-AC or Mind2Web-SC benchmarks.
The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security. In this paper, we propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By performing the code execution, GuardAgent can deterministically follow the safety guard request and safeguard target agents. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing experiences from previous tasks. In addition, we propose two novel benchmarks: EICU-AC benchmark to assess the access control for healthcare agents and Mind2Web-SC benchmark to evaluate the safety policies for web agents. We show that GuardAgent effectively moderates the violation actions for different types of agents on these two benchmarks with over 98% and 83% guardrail accuracies, respectively. Project page: https://guardagent.github.io/
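To make the access-control setting concrete, here is a hedged sketch of the kind of check a generated guardrail might perform. It follows the label convention in the EICU-AC excerpt quoted further down this page ('0' for access granted, '1' for access denied, plus the databases and columns that are required but not accessible). The role, database, and column names are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical permission table: role -> database -> accessible columns.
ROLE_PERMISSIONS: Dict[str, Dict[str, Set[str]]] = {
    "nurse": {"vitals": {"heart_rate", "temperature"}},
}


def check_access(
    role: str, required: Dict[str, List[str]]
) -> Tuple[int, Dict[str, List[str]]]:
    """Return (label, denied): label 0 = access granted, 1 = access denied."""
    allowed = ROLE_PERMISSIONS.get(role, {})
    denied: Dict[str, List[str]] = {}
    for database, columns in required.items():
        # Columns the query needs that this role may not read.
        missing = sorted(set(columns) - allowed.get(database, set()))
        if missing:
            denied[database] = missing
    return (1 if denied else 0), denied


print(check_access("nurse", {"vitals": ["heart_rate"]}))
print(check_access("nurse", {"vitals": ["heart_rate"], "billing": ["total_cost"]}))
```

An unknown role has no permissions in this sketch, so every required column is reported as denied.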
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GuardAgent, a guardrail agent that protects target LLM agents by analyzing given safety guard requests to generate a task plan, then mapping the plan to executable guardrail code whose deterministic execution enforces the policy. An LLM performs the reasoning in both steps, aided by in-context demonstrations retrieved from a memory module of prior tasks. The authors introduce two new benchmarks—EICU-AC for access control in healthcare agents and Mind2Web-SC for safety policies in web agents—and report guardrail accuracies exceeding 98% and 83%, respectively, across different agent types.
Significance. If the performance claims are substantiated, the work offers a potentially useful direction for AI agent safety by combining LLM reasoning with deterministic code execution to enforce policies. The introduction of domain-specific benchmarks for healthcare access control and web-agent safety is a constructive contribution that could serve as testbeds for future research. The emphasis on code generation for enforcement, rather than purely probabilistic checks, is a methodological strength worth exploring further.
major comments (3)
- [§4] §4 (Benchmark Construction): The EICU-AC and Mind2Web-SC benchmarks are central to the performance claims, yet the manuscript provides no details on how safety guard requests were authored, how violation actions were selected or labeled, the distribution of request types, or inter-annotator agreement. Without this information the reported 98% and 83% accuracies cannot be interpreted as evidence of robust generalization.
- [§3.2] §3.2 (Code Generation and Execution): The central claim that 'GuardAgent can deterministically follow the safety guard request' rests on the correctness of LLM-generated guardrail code. The paper does not report systematic manual verification, unit testing, or adversarial edge-case analysis of the generated code; the skeptic's concern that undetected logical errors or incomplete policy enforcement could occur on out-of-distribution requests is therefore unaddressed and load-bearing for the reliability of the accuracy numbers.
- [Evaluation] Evaluation section: No baseline comparisons (e.g., direct LLM safety prompting, static rule-based guards, or simpler retrieval-only methods) are presented. This omission makes it impossible to determine whether the observed accuracies represent an advance over existing guardrailing techniques.
minor comments (2)
- [Abstract] The abstract and §1 should explicitly define the guardrail accuracy metric (e.g., whether it is per-action classification accuracy, policy-level success rate, or something else) rather than reporting raw percentages.
- [Figure 2] Figure 2 (architecture diagram) would benefit from clearer annotation of the memory retrieval step and the interface between the generated code and the target agent’s action stream.
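One possible reading of the metric raised in the first minor comment is per-example classification accuracy. The sketch below computes that together with the two error types a guardrail can make. It is an assumed definition for illustration, not the paper's stated one, and it uses the convention that label 1 means the action should be denied and 0 means it should be allowed.

```python
from typing import Dict, Sequence


def guardrail_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy plus false-permit and false-block rates (1 = deny, 0 = allow)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # A violation that the guard let through.
    false_permits = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # A valid action that the guard blocked.
    false_blocks = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": correct / n,
        "false_permit_rate": false_permits / n,
        "false_block_rate": false_blocks / n,
    }


print(guardrail_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Reporting the two error rates separately matters because a permitted violation and a blocked valid action carry different costs, which a single accuracy figure hides.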
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our manuscript. We believe the suggested revisions will strengthen the paper and address the concerns raised. We respond to each major comment below.
-
Referee: [§4] §4 (Benchmark Construction): The EICU-AC and Mind2Web-SC benchmarks are central to the performance claims, yet the manuscript provides no details on how safety guard requests were authored, how violation actions were selected or labeled, the distribution of request types, or inter-annotator agreement. Without this information the reported 98% and 83% accuracies cannot be interpreted as evidence of robust generalization.
Authors: We agree with the referee that additional details on benchmark construction are necessary for proper interpretation of the results. In the revised manuscript, we will expand Section 4 to provide: (1) a detailed description of the authoring process for safety guard requests, which were developed in collaboration with domain experts in healthcare and web navigation; (2) the methodology for selecting and labeling violation actions based on real-world scenarios; (3) the distribution of different request types with accompanying statistics; and (4) inter-annotator agreement metrics, such as Cohen's kappa, calculated during the labeling process. These additions will substantiate the robustness of our benchmarks and the reported accuracies. revision: yes
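For reference, Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. A small self-contained sketch, with invented labels rather than benchmark data:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both annotators use one identical label;
        # this sketch returns 1.0 in that case as a convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented example: 1 = violation, 0 = no violation.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```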
-
Referee: [§3.2] §3.2 (Code Generation and Execution): The central claim that 'GuardAgent can deterministically follow the safety guard request' rests on the correctness of LLM-generated guardrail code. The paper does not report systematic manual verification, unit testing, or adversarial edge-case analysis of the generated code; the skeptic's concern that undetected logical errors or incomplete policy enforcement could occur on out-of-distribution requests is therefore unaddressed and load-bearing for the reliability of the accuracy numbers.
Authors: We acknowledge the importance of verifying the correctness of the generated guardrail code to support our claims of deterministic enforcement. Although the execution is deterministic once the code is generated, we recognize the need for more rigorous validation. In the revision, we will include: systematic manual verification results on a representative sample of generated codes, unit testing for common guardrail functions, and an analysis of edge cases including some out-of-distribution requests. We will also add a discussion on potential limitations for highly novel scenarios. This will help mitigate concerns about undetected errors. revision: yes
-
Referee: [Evaluation] Evaluation section: No baseline comparisons (e.g., direct LLM safety prompting, static rule-based guards, or simpler retrieval-only methods) are presented. This omission makes it impossible to determine whether the observed accuracies represent an advance over existing guardrailing techniques.
Authors: We thank the referee for highlighting this gap in our evaluation. To demonstrate the advantages of GuardAgent, we will add baseline comparisons in the revised Evaluation section. Specifically, we will compare against: (1) direct LLM safety prompting, where the target agent is prompted to adhere to safety rules without a guard; (2) static rule-based guardrails implemented for the specific benchmarks; and (3) a retrieval-only method that retrieves similar past experiences without generating executable code. These comparisons will show that our knowledge-enabled reasoning and code generation approach yields superior performance in enforcing safety policies. revision: yes
Circularity Check
No circularity: GuardAgent claims rest on empirical evaluation of a proposed system on newly introduced benchmarks
The paper introduces GuardAgent as an LLM-based guard agent that analyzes safety guard requests to produce task plans, maps those plans to executable guardrail code, and uses retrieved in-context demonstrations from a memory module. It defines two new benchmarks (EICU-AC and Mind2Web-SC) and reports guardrail accuracies of over 98% and 83% on them. No mathematical derivations, equations, fitted parameters, or self-referential constructions appear in the provided text. The central results are direct empirical measurements on independent benchmark tasks rather than quantities defined in terms of the method's own outputs or prior self-citations. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can perform reliable reasoning for task planning and code generation when provided with in-context demonstrations retrieved from a memory module.
invented entities (1)
-
GuardAgent
no independent evidence
Forward citations
Cited by 21 Pith papers
-
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
-
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
PolicyBank: Evolving Policy Understanding for LLM Agents
PolicyBank enables LLM agents to iteratively refine structured policy insights from corrective feedback, closing up to 82% of the performance gap on policy-ambiguity scenarios where prior memory methods achieve near-z...
-
Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.
-
Formal Policy Enforcement for Real-World Agentic Systems
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
-
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
-
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
-
Tracking Capabilities for Safer Agents
AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.
-
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
AgentSpec introduces a customizable DSL for runtime enforcement of safety constraints on LLM agents, achieving over 90% prevention of unsafe code actions, zero hazardous embodied actions, and 100% AV compliance in eva...
-
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
-
ADR: An Agentic Detection System for Enterprise Agentic AI Security
ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the...
-
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor uses hierarchical memory with adversarial rule extraction and entropy-driven self-evolution to achieve over 93% refusal on harmful requests while reaching 63.6% benign utility on GPT-4o.
-
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
-
AgentWall: A Runtime Safety Layer for Local AI Agents
AgentWall introduces a policy-enforcing proxy for local AI agents that intercepts actions, requires approvals for sensitive operations, and achieves 92.9% enforcement accuracy with sub-millisecond overhead.
-
Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities
A systematic review of neuro-symbolic AI in cybersecurity finds that deeper integration and causal reasoning improve performance across intrusion detection and vulnerability tasks, while identifying barriers and a res...
-
Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems
GAAT is a proposed architecture extending OpenTelemetry with governance schemas, OPA-based detection, graduated enforcement, and trusted provenance to close the observe-but-do-not-act gap in multi-agent systems.
-
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
Reference graph
Works this paper leans on
-
[1]
Guardrails AI. https://www.guardrailsai.com/. Mahyar Abbasian, Iman Azimi, Amir M. Rahmani, and Ramesh Jain. 2024. Conversational health agents: A personalized llm-powered agent framework. Preprint, arXiv:2310.02374. Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge...
-
[2]
Mind2Web: Towards a Generalist Agent for the Web
Mind2web: Towards a generalist agent for the web. Preprint, arXiv:2306.06070. Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A real-world webagent with planning, long context understanding, and program synthesis. In The Twelfth International Conference on Learning Representations...
-
[3]
Agent hospital: A simulacrum of hospital with evolvable medical agents. Preprint, arXiv:2405.02957. Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. 2024. A language agent for autonomous driving. In First Conference on Language Modeling. Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, and ...
-
[4]
NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May D. Wang. 2024. Ehragent: Code empowers l...
-
[5]
The Rise and Potential of Large Language Model Based Agents: A Survey
The rise and potential of large language model based agents: A survey. Preprint, arXiv:2309.07864. Wei Xiao, J. L. Weissman, and Philip L. F. Johnson
-
[6]
Ecological drivers of crispr immune systems. mSystems, 9(12):e00568–24. Qisen Yang, Zekun Wang, Honghui Chen, Shenzhi Wang, Yifan Pu, Xin Gao, Wenhao Huang, Shiji Song, and Gao Huang. 2024. Llm agents for psychology: A study on gamified assessments. Preprint, arXiv:2402.12326. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan,...
-
[7]
a binary label (either ‘0’ for ‘access granted’ and ‘1’ for ‘access denied’), and 5) databases and the columns required to answer the question but not accessible for the given role (if there are any). The examples in EICU-AC are created by sampling from the original EICU dataset following the steps below. First, from the 580 test examples in EICU, we obta...
-
[8]
When there are multiple answer choices containing the same words (including words with the same root, for example, “eye” and “eyes”, “slow” and “slowly”, “to” in “work to advantage” and “matter to”, etc.), none of these options should be selected
-
[9]
If the question is longer than or equal to 15 words, do not pick A, B, or C
-
[10]
If the question contains animals, the answer should be B
-
[11]
If the question contains a number ranging from one to five, the answer should not be the corresponding 19 Table 8: Performance of GuardAgent on the CSQA compared with the “model guarding agents” baseline, both based on a GPT-4 core model. The prediction recall (in percentage) for each “risk level” is reported for both approaches. GuardAgent outperforms th...
-
[12]
An important future research is to have the agent (or an auxiliary agent) create the required tools
Like most existing LLM agents, the toolbox of GuardAgent is specified manually. An important future research is to have the agent (or an auxiliary agent) create the required tools
-
[13]
The reasoning capabilities of GuardAgent can be further enhanced. Currently, the reasoning is based on a simple chain of thought without any validation of the reasoning steps. One possible future direction is to involve more advanced reasoning strategies, such as self-consistency or reflexion (Wang et al., 2023b; Shinn et al., 2023) to achieve more robust...
-
[14]
GuardAgent is still a single-agent system. The future development of GuardAgent can involve a multi-agent design, for example, with multiple agents handling task planning, code generation, and memory management respectively. The multi-agent system can also handle more complicated guardrail requests. For example, suppose for an access control task, the ...
-
[15]
GuardAgent may potentially be integrated with more complex tools. For example, an ecosystem monitoring agent may incorporate metagenomic tools (Xiao et al., 2024). For another example, an autonomous driving agent may require a complex module (a Python package with a set of functions) to test if there is a collision given the environment information. 21 Fi...