Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
Pith reviewed 2026-05-10 10:11 UTC · model grok-4.3
The pith
Symbolic guardrails can enforce 74 percent of the concrete policy requirements stated in agent safety benchmarks while preserving task success rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Symbolic guardrails enforce safety and security policies for AI agents by inserting low-cost, verifiable checks on tool calls and actions. The study finds that 85 percent of 80 examined benchmarks use only high-level goals rather than concrete policies, yet 74 percent of the policies that are stated can be implemented symbolically. On τ²-Bench, CAR-bench, and MedAgentBench the guardrails measurably reduce unsafe behavior while agent utility remains unchanged.
What carries the argument
Symbolic guardrails: programmatic or logical checks placed before tool execution that verify agent actions against explicitly stated policies.
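The mechanism described above can be sketched as a deterministic check that runs before every tool call (a minimal illustration with hypothetical names and an invented example policy, not the paper's implementation):

```python
# Minimal sketch of a symbolic guardrail: deterministic, inspectable checks
# that run before a tool call executes. The policy below is hypothetical.

def refund_policy(tool: str, args: dict) -> bool:
    """Example policy: refunds above 100 require explicit manager approval."""
    if tool == "issue_refund" and args.get("amount", 0) > 100:
        return args.get("manager_approved", False)
    return True

def guarded_call(tool: str, args: dict, policies, execute):
    """Run every symbolic policy check; execute the tool only if all pass."""
    for policy in policies:
        if not policy(tool, args):
            return {"status": "blocked", "policy": policy.__name__}
    return {"status": "ok", "result": execute(tool, args)}
```

Because the check is ordinary code evaluated before execution, a violating call is blocked by construction rather than discouraged by training, which is the source of the guarantee language the paper uses.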
If this is right
- Domain-specific agents can receive stronger safety guarantees than general agents because their policies are easier to formalize.
- Simple, low-cost symbolic checks suffice for the majority of currently stated policy requirements.
- Benchmarks should shift from vague goals to explicit, machine-checkable policies to enable reliable evaluation.
- Symbolic guardrails can be layered on top of existing agents without retraining or performance loss.
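The shift toward explicit, machine-checkable policies could take the form of policy as data rather than prose. A hypothetical schema (not taken from the paper) makes the contrast with an underspecified goal like "handle customer data safely" concrete:

```python
# Hypothetical: one concrete, machine-checkable policy expressed as data.
POLICY = {
    "tool": "send_email",
    "require": {"recipient_domain": ["company.com"]},  # allow-list
    "deny_if": {"contains_pii": True},                  # hard block
}

def check(policy: dict, tool: str, args: dict) -> bool:
    """Evaluate one declarative policy against a proposed tool call."""
    if tool != policy["tool"]:
        return True  # policy does not apply to this tool
    for key, allowed in policy.get("require", {}).items():
        if args.get(key) not in allowed:
            return False
    for key, bad in policy.get("deny_if", {}).items():
        if args.get(key) == bad:
            return False
    return True
```

A benchmark that ships policies in this form can be scored mechanically, whereas a prose goal leaves the pass/fail judgment to the evaluator.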
Where Pith is reading between the lines
- If policies are written first, many new agent applications could adopt guardrails from the start rather than retrofitting them.
- The approach may reduce the need for repeated safety fine-tuning when requirements change.
- Extending the method to multi-agent systems would require handling interactions between separate policy sets.
Load-bearing premise
The concrete policies extracted from the 80 benchmarks are representative of the safety requirements that matter in actual domain-specific deployments.
What would settle it
A controlled experiment on one of the evaluated benchmarks in which a safety violation forbidden by the stated policy still executes after the corresponding symbolic guardrail is added; observing such a bypass would falsify the enforcement claim.
Original abstract
AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all codes and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that symbolic guardrails provide a practical path to strong safety and security guarantees for domain-specific AI agents. Through a three-part study, it reviews 80 benchmarks to identify the policies they evaluate (finding that 85% lack concrete requirements, relying instead on underspecified goals), determines that 74% of the specified policy requirements can be enforced symbolically via simple, low-cost mechanisms, and evaluates the approach on τ²-Bench, CAR-bench, and MedAgentBench, reporting safety and security improvements without utility loss. Code and artifacts are released.
Significance. If the quantitative claims and evaluations hold under scrutiny, this work is significant for demonstrating that symbolic methods can deliver enforceable guarantees where neural guardrails and training-based approaches cannot, especially in domain-specific settings. The benchmark review usefully exposes underspecification issues, the no-utility-loss result is practically relevant, and the public code release supports reproducibility and extension.
Major comments (2)
- Abstract: the 85% and 74% quantitative findings are load-bearing for the central claim, yet the abstract (and by extension the described study) provides no details on the systematic review methodology, policy extraction criteria, or decision procedure for determining enforceability by symbolic guardrails, preventing verification of these percentages.
- Evaluation sections on τ²-Bench, CAR-bench, and MedAgentBench: the 'guarantee' language is not supported because no formal verification, model checking, exhaustive testing, or soundness argument is provided for the guardrail implementations themselves; ad-hoc rule additions could fail to enforce policies correctly while still reporting success.
Minor comments (1)
- The description of the exact symbolic mechanisms (e.g., rule syntax or code patterns) used in the three benchmark evaluations could be expanded for clarity and to allow assessment of generality.
Simulated Author's Rebuttal
We thank the referee for their constructive and positive review, which highlights the potential significance of symbolic guardrails while identifying areas for clarification. We address each major comment below with specific plans for revision.
Point-by-point responses
- Referee: Abstract: the 85% and 74% quantitative findings are load-bearing for the central claim, yet the abstract (and by extension the described study) provides no details on the systematic review methodology, policy extraction criteria, or decision procedure for determining enforceability by symbolic guardrails, preventing verification of these percentages.
  Authors: We agree that the abstract omits key methodological details required to substantiate the 85% and 74% figures. In the revised manuscript, we will expand the abstract with a concise description of the systematic review (including benchmark selection criteria, policy extraction process, and enforceability decision rules). We will also add a dedicated methods subsection (likely in Section 3) that fully documents the review protocol, provides concrete examples of policy extraction and symbolic classification decisions, and explains the decision procedure. These changes will allow independent verification of the reported percentages without altering the core claims. Revision: yes
- Referee: Evaluation sections on τ²-Bench, CAR-bench, and MedAgentBench: the 'guarantee' language is not supported because no formal verification, model checking, exhaustive testing, or soundness argument is provided for the guardrail implementations themselves; ad-hoc rule additions could fail to enforce policies correctly while still reporting success.
  Authors: We acknowledge the validity of this critique. The paper's use of 'guarantee' and similar terms in the title, abstract, and evaluation sections implies formal assurances that our empirical results and rule-based implementations do not rigorously establish. While the guardrails consist of deterministic, inspectable rules that enforce policies by construction when correctly implemented, we provide no formal verification, model checking, or exhaustive soundness proof, leaving room for implementation errors in ad-hoc rules. In the revision, we will replace unqualified 'guarantee' language with precise alternatives such as 'enforce' or 'deliver enforceable constraints' across the manuscript. We will add an explicit limitations discussion noting the lack of formal verification and the manual nature of rule specification, while underscoring that the open-sourced code enables community review and testing. This preserves the practical contribution without overstating formal properties. Revision: partial
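The referee's worry about unverified rules can be partly mitigated short of formal methods by exhaustively testing each rule over a bounded input grid. An illustrative sketch (a hypothetical rule and threshold, not the authors' test suite):

```python
# Illustrative: exhaustively check a small guardrail rule over a finite
# input grid, a cheap substitute for formal verification of the rule itself.
from itertools import product

LIMIT = 100

def rule(amount: int, approved: bool) -> bool:
    """Allow a refund iff it is within the limit or explicitly approved."""
    return amount <= LIMIT or approved

# Intended safety property: no unapproved refund above the limit is allowed.
# Enumerate every point on the grid and collect any counterexamples.
violations = [
    (amount, approved)
    for amount, approved in product(range(0, 201, 10), [False, True])
    if rule(amount, approved) and amount > LIMIT and not approved
]
```

An empty `violations` list certifies the property only on the tested grid; it is weaker than a soundness proof but makes implementation errors in ad-hoc rules far easier to catch.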
Circularity Check
No circularity in empirical analysis
Full rationale
The paper conducts a systematic review of 80 external benchmarks to catalog policies, determines which requirements are amenable to symbolic enforcement, and measures effects on three independent evaluation suites (τ²-Bench, CAR-bench, MedAgentBench). No equations, derivations, fitted parameters, or predictions appear in the work. No self-citations are invoked to establish uniqueness, ansatzes, or load-bearing premises that would reduce the central claims to tautologies. All supporting artifacts are released externally, so the reported findings rest on observable data rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The 80 state-of-the-art benchmarks reviewed capture the relevant range of safety and security policies for AI agents.
- Domain assumption: Symbolic mechanisms can enforce the identified policy requirements without compromising agent success rates in practice.