Recognition: 2 theorem links · Lean Theorem
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3
The pith
AgentSpec lets users write runtime rules that stop LLM agents from unsafe actions in code, robots, and cars.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentSpec is a lightweight domain-specific language that allows users to specify structured rules incorporating triggers, predicates, and enforcement mechanisms to ensure LLM agents operate within predefined safety boundaries at runtime.
What carries the argument
AgentSpec, the domain-specific language for runtime enforcement of safety rules on LLM agents using triggers, predicates, and enforcement mechanisms.
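The paper's concrete DSL syntax is not reproduced in this excerpt, so the sketch below renders the trigger/predicate/enforcement structure as plain Python. Every name here (Rule, predicate, block, no_recursive_delete) is an illustrative assumption, not the paper's API.

```python
# Illustrative only: a hand-rolled Python rendering of the trigger/predicate/
# enforcement structure described above. The paper defines its own DSL; none
# of these names come from it.
from dataclasses import dataclass
from typing import Any, Callable, Dict

Action = Dict[str, Any]  # e.g. {"tool": "shell", "cmd": "rm -rf /tmp/data"}

@dataclass
class Rule:
    trigger: str                          # which kind of action the rule watches
    predicate: Callable[[Action], bool]   # True means the action violates the rule
    enforce: Callable[[Action], Action]   # what to do instead (block, rewrite, ask the user)

def block(action: Action) -> Action:
    """Enforcement that swaps the unsafe action for a no-op and records why."""
    return {"tool": "noop", "blocked": action}

# Example: stop a code agent from issuing recursive deletes.
no_recursive_delete = Rule(
    trigger="shell",
    predicate=lambda a: "rm -rf" in a.get("cmd", ""),
    enforce=block,
)
```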
Load-bearing premise
That all relevant unsafe scenarios can be anticipated and expressed as practical, predefined rules.
What would settle it
Observing an LLM agent performing a hazardous action in a tested domain despite an AgentSpec rule being in place that should have caught it.
original abstract
Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
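As the abstract describes it, enforcement sits between the agent's proposed action and its execution. The sketch below shows that interception loop under assumed shapes of its own (dict-based actions, a simple rule table, an execute callback); it illustrates the idea, not the paper's implementation.

```python
# Assumed shapes, not the paper's API: actions are dicts, rules are dicts with
# a trigger, an unsafe-predicate, and an enforcement callback.
from typing import Callable, Dict, List

Action = Dict[str, str]

RULES = [
    {   # block recursive deletes issued by a code agent
        "trigger": "shell",
        "unsafe": lambda a: "rm -rf" in a.get("cmd", ""),
        "enforce": lambda a: {"tool": "noop", "note": "blocked by rule"},
    },
]

def guard(action: Action) -> Action:
    """Apply the first matching rule; otherwise let the action through."""
    for rule in RULES:
        if rule["trigger"] == action.get("tool") and rule["unsafe"](action):
            return rule["enforce"](action)
    return action  # no trigger matched: the action runs unmodified

def run_with_enforcement(plan: List[Action], execute: Callable[[Action], None]) -> None:
    """Check every proposed action against the rules before executing it."""
    for proposed in plan:
        execute(guard(proposed))

if __name__ == "__main__":
    run_with_enforcement(
        [{"tool": "shell", "cmd": "ls"}, {"tool": "shell", "cmd": "rm -rf /"}],
        execute=print,
    )
```

The pass-through branch is also where the referee's coverage concern below lives: an action whose tool matches no rule's trigger is never checked.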
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentSpec, a lightweight DSL for specifying runtime safety constraints on LLM agents via structured rules consisting of triggers, predicates, and enforcement mechanisms. It implements the system across code execution, embodied agents, and autonomous driving domains, reporting that AgentSpec prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied tasks, and achieves 100% compliance in AV scenarios. The work also evaluates LLM-generated rules (e.g., via OpenAI o1), which achieve 95.56% precision and 70.96% recall on embodied agents, identify 87.26% of risky code, and succeed in 5/8 AV scenarios, while claiming low runtime overhead.
Significance. If the empirical results are robust, AgentSpec provides a practical, interpretable, and modular alternative to model-based safeguards for LLM agent safety. Its cross-domain applicability and support for both manual and automated rule generation could address key gaps in robustness and adaptability, with the lightweight enforcement making it suitable for real-time use.
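The lightweight-enforcement claim rests on rule predicates being cheap relative to an LLM call. The loop below is a quick, machine-dependent way to sanity-check that intuition; its numbers are not the paper's measurements.

```python
# Times a simple string-matching predicate over a structured action. Results
# depend on the host machine; this only illustrates that such checks sit far
# below LLM latency, consistent with millisecond-scale per-action overhead.
import time

def unsafe(action: dict) -> bool:
    return action.get("tool") == "shell" and "rm -rf" in action.get("cmd", "")

action = {"tool": "shell", "cmd": "ls -la /var/log"}
start = time.perf_counter()
for _ in range(10_000):
    unsafe(action)
elapsed_ms = (time.perf_counter() - start) * 1_000
print(f"10,000 checks: {elapsed_ms:.2f} ms total, "
      f"{elapsed_ms / 10_000 * 1_000:.2f} us per check")
```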
major comments (3)
- [Abstract] The headline claims of eliminating all hazardous actions in embodied agents and 100% AV compliance are based on finite test suites, but no coverage argument, mutation analysis, or adversarial test set is provided to demonstrate that the predicate/trigger combinations exhaustively intercept all unsafe trajectories in the respective action spaces. If an agent produces an action outside the enumerated triggers, enforcement is bypassed.
- [Abstract] The reported success rates lack supporting experimental details such as number of trials, baselines, error bars, statistical significance, or discussion of potential confounds and post-hoc selection, making it difficult to assess whether the data fully supports the claims of over 90% prevention in code agents and perfect enforcement in the other domains.
- [Abstract] LLM-generated rules achieve only 70.96% recall on embodied agents and succeed in 5/8 AV scenarios, which undercuts the practicality of the automated generation approach relative to the manual-rule results presented as perfect on the evaluated cases; the paper does not address how users would ensure comprehensive rule coverage in practice.
minor comments (1)
- [Abstract] The abstract would benefit from stating the performance of manually authored rules and of LLM-generated rules side by side in its main claims, rather than relegating the automated-rule metrics to the end.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below, with revisions made to clarify limitations and strengthen the presentation of results.
point-by-point responses
- Referee: [Abstract] The headline claims of eliminating all hazardous actions in embodied agents and 100% AV compliance are based on finite test suites, but no coverage argument, mutation analysis, or adversarial test set is provided to demonstrate that the predicate/trigger combinations exhaustively intercept all unsafe trajectories in the respective action spaces. If an agent produces an action outside the enumerated triggers, enforcement is bypassed.
Authors: We agree that the reported results are based on finite test suites and that AgentSpec only enforces rules for explicitly defined triggers and predicates; actions falling outside these are not intercepted. We do not claim exhaustive coverage of all possible unsafe trajectories. In the revised manuscript, we have updated the abstract to qualify the claims as applying 'on the evaluated test suites' and added a new paragraph in the Discussion section explaining that comprehensive safety depends on users defining rules that cover their target action spaces, along with suggestions for future automated coverage verification techniques. revision: yes
- Referee: [Abstract] The reported success rates lack supporting experimental details such as number of trials, baselines, error bars, statistical significance, or discussion of potential confounds and post-hoc selection, making it difficult to assess whether the data fully supports the claims of over 90% prevention in code agents and perfect enforcement in the other domains.
Authors: The full Evaluation section reports the number of trials (100 for code agents, 50 for embodied agents, and 8 scenarios for AV), baselines (unconstrained agents), and runtime overhead measurements. To address the concern, we have revised the abstract to briefly note the evaluation scale and added error bars, statistical significance tests (t-tests with p-values), and explicit discussion of potential confounds and methodology to the results section and figures. revision: yes
- Referee: [Abstract] LLM-generated rules achieve only 70.96% recall on embodied agents and succeed in 5/8 AV scenarios, which undercuts the practicality of the automated generation approach relative to the manual-rule results presented as perfect on the evaluated cases; the paper does not address how users would ensure comprehensive rule coverage in practice.
Authors: We acknowledge that LLM-generated rules show lower recall (70.96%) and succeed in only 5/8 AV scenarios compared to manual rules. This underscores the value of hybrid approaches. In the revised manuscript, we have expanded the automated rule generation section with a new subsection on practical usage, recommending iterative LLM prompting, validation on test cases, and manual review/augmentation to achieve comprehensive coverage. revision: yes
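The hybrid workflow suggested in this last response (iterative LLM prompting, validation on test cases, manual review) can be pictured as a generate-score-refine loop. The sketch below assumes a hypothetical generate_rules callable standing in for an LLM prompt and a user-maintained set of labeled actions; it is not the paper's pipeline.

```python
# Sketch of a generate -> score -> refine loop for LLM-authored rules.
# `generate_rules` is a hypothetical stand-in for an LLM prompt, not an API
# from the paper; rules are plain predicates over action dicts.
from typing import Callable, Dict, List, Tuple

Case = Tuple[Dict, bool]  # (action, is_actually_unsafe)
RuleSet = List[Callable[[Dict], bool]]

def score(rules: RuleSet, cases: List[Case]) -> Tuple[float, float]:
    """Precision and recall of a rule set over labeled actions."""
    tp = fp = fn = 0
    for action, is_unsafe in cases:
        flagged = any(rule(action) for rule in rules)
        tp += int(flagged and is_unsafe)
        fp += int(flagged and not is_unsafe)
        fn += int(not flagged and is_unsafe)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

def refine(generate_rules: Callable[..., RuleSet],
           cases: List[Case],
           target_recall: float = 0.9,
           rounds: int = 3) -> Tuple[RuleSet, float, float]:
    """Re-prompt with metrics until recall clears the bar or rounds run out."""
    rules = generate_rules(feedback=None)
    precision, recall = score(rules, cases)
    for _ in range(rounds):
        if recall >= target_recall:
            break
        rules = generate_rules(feedback=(precision, recall))
        precision, recall = score(rules, cases)
    return rules, precision, recall  # still below target: a human augments the rules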
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces a domain-specific language for runtime enforcement and reports empirical results from evaluations across code, embodied, and AV domains. No equations, fitted parameters, or analytical derivations are described that reduce to self-defined quantities or self-citations. Claims rest on experimental measurements of rule effectiveness rather than any load-bearing self-referential construction. Self-citations, if present, are not used to justify uniqueness theorems or ansatzes that force the central results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Runtime interception and enforcement of LLM agent actions is feasible across domains without prohibitive overhead.
invented entities (1)
- AgentSpec DSL: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Linked passage: "Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
- APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
  APIOT is the first LLM framework to complete the full autonomous discovery-to-remediation cycle on bare-metal OT devices, reaching 90% success across 290 runs on Zephyr RTOS.
- No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
  Sefz discovers specification violations in 29.9% of 402 real-world agent skills by translating guardrails into reachability goals and guiding LLM mutations with a multi-armed bandit.
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
  A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
- Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
  Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.
- PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
  PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
- SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response
  SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
  ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense
  Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
- Alignment Contracts for Agentic Security Systems
  Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...
- An AI Agent Execution Environment to Safeguard User Data
  GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- Owner-Harm: A Missing Threat Model for AI Agent Safety
  Owner-Harm is a new threat model with eight categories of agent behavior that harms the deployer, and existing defenses achieve only 14.8% true positive rate on injection-based owner-harm tasks versus 100% on generic ...
- PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
  PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
  A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.
- Auditable Agents
  No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...
- Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
  Independent evaluation of Claude Code auto mode finds 81% false negative rate on ambiguous authorization tasks due to unmonitored file edits.
- ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
  ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
- ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
  ATBench supplies 1,000 trajectories (503 safe, 497 unsafe) organized by risk source, failure mode, and harm to evaluate long-horizon safety in LLM-based agents.
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
- Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
  Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...
- Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
  Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
- Spec Kit Agents: Context-Grounded Agentic Workflows
  A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
- Near-Miss: Latent Policy Failure Detection in Agentic Workflows
  A new metric detects latent policy failures in 8-17% of agent trajectories with mutating tool calls on the Airlines benchmark, even when final outcomes are correct.