Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
Pith reviewed 2026-05-10 10:11 UTC · model grok-4.3
The pith
Symbolic guardrails can enforce 74 percent of the concrete policy requirements stated in agent safety benchmarks while preserving task success rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Symbolic guardrails enforce safety and security policies for AI agents by inserting low-cost, verifiable checks on tool calls and actions. The study finds that 85 percent of 80 examined benchmarks use only high-level goals rather than concrete policies, yet 74 percent of the policies that are stated can be implemented symbolically. On τ²-Bench, CAR-bench, and MedAgentBench the guardrails measurably reduce unsafe behavior while agent utility remains unchanged.
What carries the argument
Symbolic guardrails: programmatic or logical checks placed before tool execution that verify agent actions against explicitly stated policies.
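The mechanism described above can be sketched as a deterministic check that runs before every tool call (a minimal illustration with hypothetical names and an invented example policy, not the paper's implementation):

```python
# Minimal sketch of a symbolic guardrail: deterministic, inspectable checks
# that run before a tool call executes. The policy below is hypothetical.

def refund_policy(tool: str, args: dict) -> bool:
    """Example policy: refunds above 100 require explicit manager approval."""
    if tool == "issue_refund" and args.get("amount", 0) > 100:
        return args.get("manager_approved", False)
    return True

def guarded_call(tool: str, args: dict, policies, execute):
    """Run every symbolic policy check; execute the tool only if all pass."""
    for policy in policies:
        if not policy(tool, args):
            return {"status": "blocked", "policy": policy.__name__}
    return {"status": "ok", "result": execute(tool, args)}
```

Because the check is ordinary code evaluated before execution, a violating call is blocked by construction rather than discouraged by training, which is the source of the guarantee language the paper uses.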
If this is right
- Domain-specific agents can receive stronger safety guarantees than general agents because their policies are easier to formalize.
- Simple, low-cost symbolic checks suffice for the majority of currently stated policy requirements.
- Benchmarks should shift from vague goals to explicit, machine-checkable policies to enable reliable evaluation.
- Symbolic guardrails can be layered on top of existing agents without retraining or performance loss.
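The shift toward explicit, machine-checkable policies could take the form of policy as data rather than prose. A hypothetical schema (not taken from the paper) makes the contrast with an underspecified goal like "handle customer data safely" concrete:

```python
# Hypothetical: one concrete, machine-checkable policy expressed as data.
POLICY = {
    "tool": "send_email",
    "require": {"recipient_domain": ["company.com"]},  # allow-list
    "deny_if": {"contains_pii": True},                  # hard block
}

def check(policy: dict, tool: str, args: dict) -> bool:
    """Evaluate one declarative policy against a proposed tool call."""
    if tool != policy["tool"]:
        return True  # policy does not apply to this tool
    for key, allowed in policy.get("require", {}).items():
        if args.get(key) not in allowed:
            return False
    for key, bad in policy.get("deny_if", {}).items():
        if args.get(key) == bad:
            return False
    return True
```

A benchmark that ships policies in this form can be scored mechanically, whereas a prose goal leaves the pass/fail judgment to the evaluator.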
Where Pith is reading between the lines
- If policies are written first, many new agent applications could adopt guardrails from the start rather than retrofitting them.
- The approach may reduce the need for repeated safety fine-tuning when requirements change.
- Extending the method to multi-agent systems would require handling interactions between separate policy sets.
Load-bearing premise
The concrete policies extracted from the 80 benchmarks are representative of the safety requirements that matter in actual domain-specific deployments.
What would settle it
A controlled experiment on one of the evaluated benchmarks in which a safety violation forbidden by the stated policy still executes after the corresponding symbolic guardrail is added; observing such a bypass would falsify the enforcement claim.
Original abstract
AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all codes and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that symbolic guardrails provide a practical path to strong safety and security guarantees for domain-specific AI agents. Through a three-part study, it reviews 80 benchmarks to identify the policies they evaluate (finding that 85% lack concrete requirements, relying instead on underspecified goals), determines that 74% of the specified policy requirements can be enforced symbolically via simple, low-cost mechanisms, and evaluates the approach on τ²-Bench, CAR-bench, and MedAgentBench, reporting safety and security improvements without utility loss. Code and artifacts are released.
Significance. If the quantitative claims and evaluations hold under scrutiny, this work is significant for demonstrating that symbolic methods can deliver enforceable guarantees where neural guardrails and training-based approaches cannot, especially in domain-specific settings. The benchmark review usefully exposes underspecification issues, the no-utility-loss result is practically relevant, and the public code release supports reproducibility and extension.
Major comments (2)
- Abstract: the 85% and 74% quantitative findings are load-bearing for the central claim, yet the abstract (and by extension the described study) provides no details on the systematic review methodology, policy extraction criteria, or decision procedure for determining enforceability by symbolic guardrails, preventing verification of these percentages.
- Evaluation sections on τ²-Bench, CAR-bench, and MedAgentBench: the 'guarantee' language is not supported because no formal verification, model checking, exhaustive testing, or soundness argument is provided for the guardrail implementations themselves; ad-hoc rule additions could fail to enforce policies correctly while still reporting success.
Minor comments (1)
- The description of the exact symbolic mechanisms (e.g., rule syntax or code patterns) used in the three benchmark evaluations could be expanded for clarity and to allow assessment of generality.
Simulated Author's Rebuttal
We thank the referee for their constructive and positive review, which highlights the potential significance of symbolic guardrails while identifying areas for clarification. We address each major comment below with specific plans for revision.
Point-by-point responses
- Referee: Abstract: the 85% and 74% quantitative findings are load-bearing for the central claim, yet the abstract (and by extension the described study) provides no details on the systematic review methodology, policy extraction criteria, or decision procedure for determining enforceability by symbolic guardrails, preventing verification of these percentages.
  Authors: We agree that the abstract omits key methodological details required to substantiate the 85% and 74% figures. In the revised manuscript, we will expand the abstract with a concise description of the systematic review (including benchmark selection criteria, policy extraction process, and enforceability decision rules). We will also add a dedicated methods subsection (likely in Section 3) that fully documents the review protocol, provides concrete examples of policy extraction and symbolic classification decisions, and explains the decision procedure. These changes will allow independent verification of the reported percentages without altering the core claims. Revision: yes
- Referee: Evaluation sections on τ²-Bench, CAR-bench, and MedAgentBench: the 'guarantee' language is not supported because no formal verification, model checking, exhaustive testing, or soundness argument is provided for the guardrail implementations themselves; ad-hoc rule additions could fail to enforce policies correctly while still reporting success.
  Authors: We acknowledge the validity of this critique. The paper's use of 'guarantee' and similar terms in the title, abstract, and evaluation sections implies formal assurances that our empirical results and rule-based implementations do not rigorously establish. While the guardrails consist of deterministic, inspectable rules that enforce policies by construction when correctly implemented, we provide no formal verification, model checking, or exhaustive soundness proof, leaving room for implementation errors in ad-hoc rules. In the revision, we will replace unqualified 'guarantee' language with precise alternatives such as 'enforce' or 'deliver enforceable constraints' across the manuscript. We will add an explicit limitations discussion noting the lack of formal verification and the manual nature of rule specification, while underscoring that the open-sourced code enables community review and testing. This preserves the practical contribution without overstating formal properties. Revision: partial
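The referee's worry about unverified rules can be partly mitigated short of formal methods by exhaustively testing each rule over a bounded input grid. An illustrative sketch (a hypothetical rule and threshold, not the authors' test suite):

```python
# Illustrative: exhaustively check a small guardrail rule over a finite
# input grid, a cheap substitute for formal verification of the rule itself.
from itertools import product

LIMIT = 100

def rule(amount: int, approved: bool) -> bool:
    """Allow a refund iff it is within the limit or explicitly approved."""
    return amount <= LIMIT or approved

# Intended safety property: no unapproved refund above the limit is allowed.
# Enumerate every point on the grid and collect any counterexamples.
violations = [
    (amount, approved)
    for amount, approved in product(range(0, 201, 10), [False, True])
    if rule(amount, approved) and amount > LIMIT and not approved
]
```

An empty `violations` list certifies the property only on the tested grid; it is weaker than a soundness proof but makes implementation errors in ad-hoc rules far easier to catch.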
Circularity Check
No circularity in empirical analysis
Full rationale
The paper conducts a systematic review of 80 external benchmarks to catalog policies, determines which requirements are amenable to symbolic enforcement, and measures effects on three independent evaluation suites (τ²-Bench, CAR-bench, MedAgentBench). No equations, derivations, fitted parameters, or predictions appear in the work. No self-citations are invoked to establish uniqueness, ansatzes, or load-bearing premises that would reduce the central claims to tautologies. All supporting artifacts are released externally, so the reported findings rest on observable data rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The 80 state-of-the-art benchmarks reviewed capture the relevant range of safety and security policies for AI agents.
- Domain assumption: Symbolic mechanisms can enforce the identified policy requirements without compromising agent success rates in practice.