arxiv: 2406.09187 · v3 · pith:2AZN4T3Inew · submitted 2024-06-13 · 💻 cs.LG

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Pith reviewed 2026-05-21 01:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM agents · safety guardrails · guard agent · access control · web agents · safety policies · benchmarks · knowledge-enabled reasoning

The pith

GuardAgent protects LLM agents by turning safety guard requests into executable code that blocks violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GuardAgent as a separate guard agent that watches over target LLM agents to ensure their actions follow given safety rules. It first uses an LLM to break down the safety requests into a step-by-step task plan, then converts that plan into code that runs deterministically to check and stop bad actions. This is tested on new benchmarks for healthcare access control and web agent safety policies, where it stops most violations. A sympathetic reader cares because LLM agents are starting to act independently in real settings, so a reliable external check could prevent harm without rewriting the agents themselves. The method relies on retrieving past examples to help the LLM reason correctly each time.

Core claim

GuardAgent is presented as the first guardrail agent that dynamically checks whether a target agent's actions satisfy given safety guard requests. It analyzes the requests to generate a task plan, then maps the plan into guardrail code for deterministic execution, using an LLM supplemented by in-context demonstrations retrieved from a memory module of prior tasks. It is reported to achieve over 98 percent guardrail accuracy on the EICU-AC benchmark and over 83 percent on the Mind2Web-SC benchmark.

What carries the argument

GuardAgent, which analyzes safety guard requests via LLM to produce a task plan that is then mapped into executable guardrail code for deterministic enforcement of safety policies.
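
As a rough illustration of the plan-to-code idea, the sketch below shows the kind of guardrail code such a pipeline might emit for an access-control request. The role names, database names, `PERMISSIONS` table, `Verdict` type, and `check_access` function are all hypothetical and are not taken from the paper. The LLM planning and code-generation steps are omitted, so only the deterministic execution stage is shown.

```python
from dataclasses import dataclass, field

# Hypothetical policy table (an assumption for illustration):
# role -> database -> set of columns that role may read.
PERMISSIONS = {
    "physician": {
        "diagnosis": {"patient_id", "icd_code"},
        "lab": {"patient_id", "lab_name", "result"},
    },
    "billing": {
        "cost": {"patient_id", "charge"},
    },
}


@dataclass
class Verdict:
    allowed: bool
    # database -> sorted list of columns the role may not read
    denied_items: dict = field(default_factory=dict)


def check_access(role, requested):
    """Compare the requested columns against the role's permissions.

    `requested` maps each database name to the columns the target agent
    wants to read. The same inputs always produce the same verdict.
    """
    granted = PERMISSIONS.get(role, {})  # unknown roles get no access
    denied = {}
    for database, columns in requested.items():
        blocked = set(columns) - granted.get(database, set())
        if blocked:
            denied[database] = sorted(blocked)
    return Verdict(allowed=not denied, denied_items=denied)


if __name__ == "__main__":
    # A billing user asks for one permitted and one forbidden column.
    print(check_access("billing", {"cost": ["charge"], "lab": ["result"]}))
```

Once code like this exists, the allow-or-block decision no longer depends on sampling from the LLM, which appears to be what the paper means by deterministic enforcement. Whether the generated code matches the policy is a separate question.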

If this is right

  • Target LLM agents can be safeguarded from violations without any changes to their internal design or training.
  • Safety rules expressed in natural language can be enforced consistently through generated code rather than repeated LLM judgments.
  • The same guard mechanism applies across different agent types, such as those handling medical data access and those navigating websites.
  • Performance improves by retrieving relevant past experiences instead of relying on the LLM alone for each new safety request.
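
The retrieval step in the last bullet can be pictured with a small sketch. The similarity measure here (Jaccard overlap on lowercase tokens), the memory record layout, and the function names are assumptions chosen for brevity; this page does not say which retrieval method the paper uses.

```python
def tokenize(text):
    """Lowercase whitespace tokenization; a deliberately crude stand-in."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap between two strings, in the range [0, 1]."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve_demonstrations(query, memory, k=2):
    """Return the k stored records whose request is most similar to the query."""
    ranked = sorted(
        memory,
        key=lambda record: jaccard(query, record["request"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical memory of earlier guard tasks (request plus the plan that worked).
MEMORY = [
    {"request": "nurses cannot read billing records",
     "plan": "look up role; compare requested tables with permitted tables"},
    {"request": "users under 18 cannot book a rental car",
     "plan": "read user age; block booking actions when age is below 18"},
    {"request": "physicians may read lab results",
     "plan": "look up role; allow lab table columns"},
]

if __name__ == "__main__":
    for record in retrieve_demonstrations("nurses cannot read cost records", MEMORY):
        print(record["request"])
```

The retrieved records would then be placed in the LLM prompt as in-context demonstrations before planning and code generation.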

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might extend to agents in other high-stakes areas like financial trading or physical robotics if similar benchmarks are created.
  • If the code generation step proves stable, it could reduce the frequency of full LLM calls during runtime enforcement.
  • Combining this with formal verification of the generated code could address cases where the LLM produces subtle logical errors.

Load-bearing premise

An LLM can reliably analyze arbitrary safety guard requests and generate correct task plans and executable guardrail code when given in-context demonstrations from a memory of previous tasks.

What would settle it

Running GuardAgent on a set of safety guard requests where the generated code either permits a clear violation or incorrectly blocks a valid action on the EICU-AC or Mind2Web-SC benchmarks.

Original abstract

The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security. In this paper, we propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By performing the code execution, GuardAgent can deterministically follow the safety guard request and safeguard target agents. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing experiences from previous tasks. In addition, we propose two novel benchmarks: EICU-AC benchmark to assess the access control for healthcare agents and Mind2Web-SC benchmark to evaluate the safety policies for web agents. We show that GuardAgent effectively moderates the violation actions for different types of agents on these two benchmarks with over 98% and 83% guardrail accuracies, respectively. Project page: https://guardagent.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes GuardAgent, a guardrail agent that protects target LLM agents by analyzing given safety guard requests to generate a task plan, then mapping the plan to executable guardrail code whose deterministic execution enforces the policy. An LLM performs the reasoning in both steps, aided by in-context demonstrations retrieved from a memory module of prior tasks. The authors introduce two new benchmarks—EICU-AC for access control in healthcare agents and Mind2Web-SC for safety policies in web agents—and report guardrail accuracies exceeding 98% and 83%, respectively, across different agent types.

Significance. If the performance claims are substantiated, the work offers a potentially useful direction for AI agent safety by combining LLM reasoning with deterministic code execution to enforce policies. The introduction of domain-specific benchmarks for healthcare access control and web-agent safety is a constructive contribution that could serve as testbeds for future research. The emphasis on code generation for enforcement, rather than purely probabilistic checks, is a methodological strength worth exploring further.

major comments (3)
  1. [§4] §4 (Benchmark Construction): The EICU-AC and Mind2Web-SC benchmarks are central to the performance claims, yet the manuscript provides no details on how safety guard requests were authored, how violation actions were selected or labeled, the distribution of request types, or inter-annotator agreement. Without this information the reported 98% and 83% accuracies cannot be interpreted as evidence of robust generalization.
  2. [§3.2] §3.2 (Code Generation and Execution): The central claim that 'GuardAgent can deterministically follow the safety guard request' rests on the correctness of LLM-generated guardrail code. The paper does not report systematic manual verification, unit testing, or adversarial edge-case analysis of the generated code; the skeptic's concern that undetected logical errors or incomplete policy enforcement could occur on out-of-distribution requests is therefore unaddressed and load-bearing for the reliability of the accuracy numbers.
  3. [Evaluation] Evaluation section: No baseline comparisons (e.g., direct LLM safety prompting, static rule-based guards, or simpler retrieval-only methods) are presented. This omission makes it impossible to determine whether the observed accuracies represent an advance over existing guardrailing techniques.
minor comments (2)
  1. [Abstract] The abstract and §1 should explicitly define the guardrail accuracy metric (e.g., whether it is per-action classification accuracy, policy-level success rate, or something else) rather than reporting raw percentages.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from clearer annotation of the memory retrieval step and the interface between the generated code and the target agent’s action stream.
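
Minor comment 1 asks for an explicit metric definition. One plausible reading, sketched below, treats guardrail accuracy as per-example classification accuracy over binary labels, with precision and recall on the "block" class reported alongside it. The label convention (1 = block, 0 = allow) and the function name are assumptions; the paper's own definition may differ.

```python
def guardrail_metrics(predicted, expected):
    """Compute accuracy, precision, and recall for binary guard decisions.

    Both arguments are equal-length sequences of 0/1 labels, where
    1 means "block the action" and 0 means "allow it" (assumed convention).
    """
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p == 1 and e == 1)  # correctly blocked
    fp = sum(1 for p, e in pairs if p == 1 and e == 0)  # valid action blocked
    fn = sum(1 for p, e in pairs if p == 0 and e == 1)  # violation permitted
    tn = sum(1 for p, e in pairs if p == 0 and e == 0)  # correctly allowed

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    print(guardrail_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```

Reporting false positives and false negatives separately matters here because the two failure modes named under "What would settle it" (permitting a violation, blocking a valid action) have different costs.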

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We believe the suggested revisions will strengthen the paper and address the concerns raised. We respond to each major comment below.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction): The EICU-AC and Mind2Web-SC benchmarks are central to the performance claims, yet the manuscript provides no details on how safety guard requests were authored, how violation actions were selected or labeled, the distribution of request types, or inter-annotator agreement. Without this information the reported 98% and 83% accuracies cannot be interpreted as evidence of robust generalization.

    Authors: We agree with the referee that additional details on benchmark construction are necessary for proper interpretation of the results. In the revised manuscript, we will expand Section 4 to provide: (1) a detailed description of the authoring process for safety guard requests, which were developed in collaboration with domain experts in healthcare and web navigation; (2) the methodology for selecting and labeling violation actions based on real-world scenarios; (3) the distribution of different request types with accompanying statistics; and (4) inter-annotator agreement metrics, such as Cohen's kappa, calculated during the labeling process. These additions will substantiate the robustness of our benchmarks and the reported accuracies. revision: yes

  2. Referee: [§3.2] §3.2 (Code Generation and Execution): The central claim that 'GuardAgent can deterministically follow the safety guard request' rests on the correctness of LLM-generated guardrail code. The paper does not report systematic manual verification, unit testing, or adversarial edge-case analysis of the generated code; the skeptic's concern that undetected logical errors or incomplete policy enforcement could occur on out-of-distribution requests is therefore unaddressed and load-bearing for the reliability of the accuracy numbers.

    Authors: We acknowledge the importance of verifying the correctness of the generated guardrail code to support our claims of deterministic enforcement. Although the execution is deterministic once the code is generated, we recognize the need for more rigorous validation. In the revision, we will include: systematic manual verification results on a representative sample of generated code, unit testing for common guardrail functions, and an analysis of edge cases including some out-of-distribution requests. We will also add a discussion on potential limitations for highly novel scenarios. This will help mitigate concerns about undetected errors. revision: yes

  3. Referee: [Evaluation] Evaluation section: No baseline comparisons (e.g., direct LLM safety prompting, static rule-based guards, or simpler retrieval-only methods) are presented. This omission makes it impossible to determine whether the observed accuracies represent an advance over existing guardrailing techniques.

    Authors: We thank the referee for highlighting this gap in our evaluation. To demonstrate the advantages of GuardAgent, we will add baseline comparisons in the revised Evaluation section. Specifically, we will compare against: (1) direct LLM safety prompting, where the target agent is prompted to adhere to safety rules without a guard; (2) static rule-based guardrails implemented for the specific benchmarks; and (3) a retrieval-only method that retrieves similar past experiences without generating executable code. These comparisons will show that our knowledge-enabled reasoning and code generation approach yields superior performance in enforcing safety policies. revision: yes
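
The validation promised in response 2 could take the form of table-driven edge-case tests run against each generated guardrail. The sketch below is one way to set that up. The policy (no car-rental booking for users under 18 or of unknown age), the `guard_rental_booking` function, and the case table are hypothetical and are not drawn from the paper or its benchmarks.

```python
def guard_rental_booking(user, action):
    """Hypothetical generated guardrail; returns True when the action is allowed."""
    if action.get("type") != "book_rental_car":
        return True  # the policy does not apply to other actions
    age = user.get("age")
    # Missing or non-integer ages are treated as violations (fail closed).
    return isinstance(age, int) and age >= 18


BOOK = {"type": "book_rental_car"}

# (case name, user profile, action, expected decision)
EDGE_CASES = [
    ("boundary age", {"age": 18}, BOOK, True),
    ("just under the limit", {"age": 17}, BOOK, False),
    ("missing age field", {}, BOOK, False),
    ("age stored as a string", {"age": "18"}, BOOK, False),
    ("unrelated action", {"age": 12}, {"type": "search_flights"}, True),
]


def run_cases(guard, cases):
    """Run every case and return the ones where the guard disagrees."""
    failures = []
    for name, user, action, expected in cases:
        actual = guard(user, action)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures


if __name__ == "__main__":
    failures = run_cases(guard_rental_booking, EDGE_CASES)
    print("all cases passed" if not failures else failures)
```

A harness like this only checks the cases someone thought to write down, so it complements rather than replaces the manual verification and out-of-distribution analysis the authors also mention.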

Circularity Check

0 steps flagged

No circularity: GuardAgent claims rest on empirical evaluation of a proposed system on newly introduced benchmarks

Full rationale

The paper introduces GuardAgent as an LLM-based guard agent that analyzes safety guard requests to produce task plans, maps those plans to executable guardrail code, and uses retrieved in-context demonstrations from a memory module. It defines two new benchmarks (EICU-AC and Mind2Web-SC) and reports guardrail accuracies of over 98% and 83% on them. No mathematical derivations, equations, fitted parameters, or self-referential constructions appear in the provided text. The central results are direct empirical measurements on independent benchmark tasks rather than quantities defined in terms of the method's own outputs or prior self-citations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The method rests on the assumption that current LLMs can serve as reliable planners and code generators for safety enforcement when given retrieved examples; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: LLMs can perform reliable reasoning for task planning and code generation when provided with in-context demonstrations retrieved from a memory module.
    The two-step process explicitly uses an LLM as the reasoning component supplemented by retrieved demonstrations.
invented entities (1)
  • GuardAgent (no independent evidence)
    purpose: A dedicated guard agent that converts safety requests into executable code to deterministically enforce compliance on target agents.
    Introduced as the core contribution; no independent falsifiable evidence outside the paper's own benchmarks is provided.

pith-pipeline@v0.9.0 · 5750 in / 1408 out tokens · 54111 ms · 2026-05-21T01:53:23.810958+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

    cs.CY 2026-04 accept novelty 8.0

    This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...

  2. GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

    cs.CR 2026-01 unverdicted novelty 8.0

    GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

  3. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

  4. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

  5. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR 2026-05 unverdicted novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  6. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  7. PolicyBank: Evolving Policy Understanding for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    PolicyBank enables LLM agents to iteratively refine structured policy insights from corrective feedback, closing up to 82% of the performance gap on policy-ambiguity scenarios where prior memory methods achieve near-z...

  8. Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

    cs.RO 2026-04 unverdicted novelty 7.0

    A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.

  9. Formal Policy Enforcement for Real-World Agentic Systems

    cs.CR 2026-02 unverdicted novelty 7.0

    FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.

  10. Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems

    cs.CR 2026-05 unverdicted novelty 6.0

    ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...

  11. SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.

  12. Tracking Capabilities for Safer Agents

    cs.AI 2026-03 unverdicted novelty 6.0

    AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.

  13. From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

    cs.AI 2025-10 unverdicted novelty 6.0

    Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.

  14. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    cs.AI 2025-03 unverdicted novelty 6.0

    AgentSpec introduces a customizable DSL for runtime enforcement of safety constraints on LLM agents, achieving over 90% prevention of unsafe code actions, zero hazardous embodied actions, and 100% AV compliance in eva...

  15. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    cs.LG 2024-10 accept novelty 6.0

    AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

  16. ADR: An Agentic Detection System for Enterprise Agentic AI Security

    cs.AI 2026-05 unverdicted novelty 5.0

    ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the...

  17. SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

    cs.CR 2026-05 unverdicted novelty 5.0

    SafeHarbor uses hierarchical memory with adversarial rule extraction and entropy-driven self-evolution to achieve over 93% refusal on harmful requests while reaching 63.6% benign utility on GPT-4o.

  18. Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

    cs.SE 2026-04 unverdicted novelty 5.0

    Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.

  19. AgentWall: A Runtime Safety Layer for Local AI Agents

    cs.AI 2026-03 unverdicted novelty 5.0

    AgentWall introduces a policy-enforcing proxy for local AI agents that intercepts actions, requires approvals for sensitive operations, and achieves 92.9% enforcement accuracy with sub-millisecond overhead.

  20. Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

    cs.CR 2026-01 unverdicted novelty 5.0

    Red-teaming of the Agent Payments Protocol reveals vulnerabilities to direct and indirect prompt injection, with Branded Whisper and Vault Whisper attacks enabling product ranking manipulation and sensitive data extraction.

  21. Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities

    cs.CR 2025-09 unverdicted novelty 5.0

    A systematic review of neuro-symbolic AI in cybersecurity finds that deeper integration and causal reasoning improve performance across intrusion detection and vulnerability tasks, while identifying barriers and a res...

  22. Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems

    cs.MA 2026-04 unverdicted novelty 4.0

    GAAT is a proposed architecture extending OpenTelemetry with governance schemas, OPA-based detection, graduated enforcement, and trusted provenance to close the observe-but-do-not-act gap in multi-agent systems.

  23. Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

    cs.AI 2025-10 unverdicted novelty 4.0

    A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
