Recognition: 2 theorem links · Lean Theorem
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3
The pith
AgentSpec lets users write runtime rules that stop LLM agents from unsafe actions in code, robots, and cars.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentSpec is a lightweight domain-specific language that allows users to specify structured rules incorporating triggers, predicates, and enforcement mechanisms to ensure LLM agents operate within predefined safety boundaries at runtime.
What carries the argument
AgentSpec, the domain-specific language for runtime enforcement of safety rules on LLM agents using triggers, predicates, and enforcement mechanisms.
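The paper's concrete DSL syntax is not reproduced in this excerpt, so the sketch below renders the trigger/predicate/enforcement structure as plain Python. Every name here (Rule, predicate, block, no_recursive_delete) is an illustrative assumption, not the paper's API.

```python
# Illustrative only: a hand-rolled Python rendering of the trigger/predicate/
# enforcement structure described above. The paper defines its own DSL; none
# of these names come from it.
from dataclasses import dataclass
from typing import Any, Callable, Dict

Action = Dict[str, Any]  # e.g. {"tool": "shell", "cmd": "rm -rf /tmp/data"}

@dataclass
class Rule:
    trigger: str                          # which kind of action the rule watches
    predicate: Callable[[Action], bool]   # True means the action violates the rule
    enforce: Callable[[Action], Action]   # what to do instead (block, rewrite, ask the user)

def block(action: Action) -> Action:
    """Enforcement that swaps the unsafe action for a no-op and records why."""
    return {"tool": "noop", "blocked": action}

# Example: stop a code agent from issuing recursive deletes.
no_recursive_delete = Rule(
    trigger="shell",
    predicate=lambda a: "rm -rf" in a.get("cmd", ""),
    enforce=block,
)
```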
Load-bearing premise
That all relevant unsafe scenarios can be anticipated and expressed as practical, predefined rules.
What would settle it
Observing an LLM agent performing a hazardous action in a tested domain despite an AgentSpec rule being in place that should have caught it.
original abstract
Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
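As the abstract describes it, enforcement sits between the agent's proposed action and its execution. The sketch below shows that interception loop under assumed shapes of its own (dict-based actions, a simple rule table, an execute callback); it illustrates the idea, not the paper's implementation.

```python
# Assumed shapes, not the paper's API: actions are dicts, rules are dicts with
# a trigger, an unsafe-predicate, and an enforcement callback.
from typing import Callable, Dict, List

Action = Dict[str, str]

RULES = [
    {   # block recursive deletes issued by a code agent
        "trigger": "shell",
        "unsafe": lambda a: "rm -rf" in a.get("cmd", ""),
        "enforce": lambda a: {"tool": "noop", "note": "blocked by rule"},
    },
]

def guard(action: Action) -> Action:
    """Apply the first matching rule; otherwise let the action through."""
    for rule in RULES:
        if rule["trigger"] == action.get("tool") and rule["unsafe"](action):
            return rule["enforce"](action)
    return action  # no trigger matched: the action runs unmodified

def run_with_enforcement(plan: List[Action], execute: Callable[[Action], None]) -> None:
    """Check every proposed action against the rules before executing it."""
    for proposed in plan:
        execute(guard(proposed))

if __name__ == "__main__":
    run_with_enforcement(
        [{"tool": "shell", "cmd": "ls"}, {"tool": "shell", "cmd": "rm -rf /"}],
        execute=print,
    )
```

The pass-through branch is also where the referee's coverage concern below lives: an action whose tool matches no rule's trigger is never checked.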
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentSpec, a lightweight DSL for specifying runtime safety constraints on LLM agents via structured rules consisting of triggers, predicates, and enforcement mechanisms. It implements the system across code execution, embodied agents, and autonomous driving domains, reporting that AgentSpec prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied tasks, and achieves 100% compliance in AV scenarios. The work also evaluates LLM-generated rules (e.g., via OpenAI o1), which achieve 95.56% precision and 70.96% recall on embodied agents, identify 87.26% of risky code, and succeed in 5/8 AV scenarios, while claiming low runtime overhead.
Significance. If the empirical results are robust, AgentSpec provides a practical, interpretable, and modular alternative to model-based safeguards for LLM agent safety. Its cross-domain applicability and support for both manual and automated rule generation could address key gaps in robustness and adaptability, with the lightweight enforcement making it suitable for real-time use.
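The lightweight-enforcement claim rests on rule predicates being cheap relative to an LLM call. The loop below is a quick, machine-dependent way to sanity-check that intuition; its numbers are not the paper's measurements.

```python
# Times a simple string-matching predicate over a structured action. Results
# depend on the host machine; this only illustrates that such checks sit far
# below LLM latency, consistent with millisecond-scale per-action overhead.
import time

def unsafe(action: dict) -> bool:
    return action.get("tool") == "shell" and "rm -rf" in action.get("cmd", "")

action = {"tool": "shell", "cmd": "ls -la /var/log"}
start = time.perf_counter()
for _ in range(10_000):
    unsafe(action)
elapsed_ms = (time.perf_counter() - start) * 1_000
print(f"10,000 checks: {elapsed_ms:.2f} ms total, "
      f"{elapsed_ms / 10_000 * 1_000:.2f} us per check")
```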
major comments (3)
- [Abstract] The headline claims of eliminating all hazardous actions in embodied agents and 100% AV compliance are based on finite test suites, but no coverage argument, mutation analysis, or adversarial test set is provided to demonstrate that the predicate/trigger combinations exhaustively intercept all unsafe trajectories in the respective action spaces. If an agent produces an action outside the enumerated triggers, enforcement is bypassed.
- [Abstract] The reported success rates lack supporting experimental details such as number of trials, baselines, error bars, statistical significance, or discussion of potential confounds and post-hoc selection, making it difficult to assess whether the data fully supports the claims of over 90% prevention in code agents and perfect enforcement in the other domains.
- [Abstract] LLM-generated rules achieve only 70.96% recall on embodied agents and succeed in 5/8 AV scenarios, which undercuts the practicality of the automated generation approach relative to the manual-rule results presented as perfect on the evaluated cases; the paper does not address how users would ensure comprehensive rule coverage in practice.
minor comments (1)
- [Abstract] The abstract would benefit from stating the performance of manually authored rules and of LLM-generated rules side by side in its main claims, rather than relegating the automated-rule metrics to the end.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below, with revisions made to clarify limitations and strengthen the presentation of results.
point-by-point responses
- Referee: [Abstract] The headline claims of eliminating all hazardous actions in embodied agents and 100% AV compliance are based on finite test suites, but no coverage argument, mutation analysis, or adversarial test set is provided to demonstrate that the predicate/trigger combinations exhaustively intercept all unsafe trajectories in the respective action spaces. If an agent produces an action outside the enumerated triggers, enforcement is bypassed.
Authors: We agree that the reported results are based on finite test suites and that AgentSpec only enforces rules for explicitly defined triggers and predicates; actions falling outside these are not intercepted. We do not claim exhaustive coverage of all possible unsafe trajectories. In the revised manuscript, we have updated the abstract to qualify the claims as applying 'on the evaluated test suites' and added a new paragraph in the Discussion section explaining that comprehensive safety depends on users defining rules that cover their target action spaces, along with suggestions for future automated coverage verification techniques. revision: yes
- Referee: [Abstract] The reported success rates lack supporting experimental details such as number of trials, baselines, error bars, statistical significance, or discussion of potential confounds and post-hoc selection, making it difficult to assess whether the data fully supports the claims of over 90% prevention in code agents and perfect enforcement in the other domains.
Authors: The full Evaluation section reports the number of trials (100 for code agents, 50 for embodied agents, and 8 scenarios for AV), baselines (unconstrained agents), and runtime overhead measurements. To address the concern, we have revised the abstract to briefly note the evaluation scale and added error bars, statistical significance tests (t-tests with p-values), and explicit discussion of potential confounds and methodology to the results section and figures. revision: yes
- Referee: [Abstract] LLM-generated rules achieve only 70.96% recall on embodied agents and succeed in 5/8 AV scenarios, which undercuts the practicality of the automated generation approach relative to the manual-rule results presented as perfect on the evaluated cases; the paper does not address how users would ensure comprehensive rule coverage in practice.
Authors: We acknowledge that LLM-generated rules show lower recall (70.96%) and succeed in only 5/8 AV scenarios compared to manual rules. This underscores the value of hybrid approaches. In the revised manuscript, we have expanded the automated rule generation section with a new subsection on practical usage, recommending iterative LLM prompting, validation on test cases, and manual review/augmentation to achieve comprehensive coverage. revision: yes
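The hybrid workflow suggested in this last response (iterative LLM prompting, validation on test cases, manual review) can be pictured as a generate-score-refine loop. The sketch below assumes a hypothetical generate_rules callable standing in for an LLM prompt and a user-maintained set of labeled actions; it is not the paper's pipeline.

```python
# Sketch of a generate -> score -> refine loop for LLM-authored rules.
# `generate_rules` is a hypothetical stand-in for an LLM prompt, not an API
# from the paper; rules are plain predicates over action dicts.
from typing import Callable, Dict, List, Tuple

Case = Tuple[Dict, bool]  # (action, is_actually_unsafe)
RuleSet = List[Callable[[Dict], bool]]

def score(rules: RuleSet, cases: List[Case]) -> Tuple[float, float]:
    """Precision and recall of a rule set over labeled actions."""
    tp = fp = fn = 0
    for action, is_unsafe in cases:
        flagged = any(rule(action) for rule in rules)
        tp += int(flagged and is_unsafe)
        fp += int(flagged and not is_unsafe)
        fn += int(not flagged and is_unsafe)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

def refine(generate_rules: Callable[..., RuleSet],
           cases: List[Case],
           target_recall: float = 0.9,
           rounds: int = 3) -> Tuple[RuleSet, float, float]:
    """Re-prompt with metrics until recall clears the bar or rounds run out."""
    rules = generate_rules(feedback=None)
    precision, recall = score(rules, cases)
    for _ in range(rounds):
        if recall >= target_recall:
            break
        rules = generate_rules(feedback=(precision, recall))
        precision, recall = score(rules, cases)
    return rules, precision, recall  # still below target: a human augments the rules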
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces a domain-specific language for runtime enforcement and reports empirical results from evaluations across code, embodied, and AV domains. No equations, fitted parameters, or analytical derivations are described that reduce to self-defined quantities or self-citations. Claims rest on experimental measurements of rule effectiveness rather than any load-bearing self-referential construction. Self-citations, if present, are not used to justify uniqueness theorems or ansatzes that force the central results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Runtime interception and enforcement of LLM agent actions is feasible across domains without prohibitive overhead.
invented entities (1)
- AgentSpec DSL: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Linked passage: "Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
- APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
  APIOT is the first LLM framework to complete the full autonomous discovery-to-remediation cycle on bare-metal OT devices, reaching 90% success across 290 runs on Zephyr RTOS.
- No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
  Sefz discovers specification violations in 29.9% of 402 real-world agent skills by translating guardrails into reachability goals and guiding LLM mutations with a multi-armed bandit.
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
  A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
- Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
  Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.
- PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
  PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
- SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response
  SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
  ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense
  Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
- Alignment Contracts for Agentic Security Systems
  Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...
- An AI Agent Execution Environment to Safeguard User Data
  GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- Owner-Harm: A Missing Threat Model for AI Agent Safety
  Owner-Harm is a new threat model with eight categories of agent behavior that harms the deployer, and existing defenses achieve only 14.8% true positive rate on injection-based owner-harm tasks versus 100% on generic ...
- PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
  PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
  A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.
- Auditable Agents
  No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...
- Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
  Independent evaluation of Claude Code auto mode finds 81% false negative rate on ambiguous authorization tasks due to unmonitored file edits.
- ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
  ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
- ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
  ATBench supplies 1,000 trajectories (503 safe, 497 unsafe) organized by risk source, failure mode, and harm to evaluate long-horizon safety in LLM-based agents.
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
- Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
  Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...
- Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
  Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
- Spec Kit Agents: Context-Grounded Agentic Workflows
  A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
- Near-Miss: Latent Policy Failure Detection in Agentic Workflows
  A new metric detects latent policy failures in 8-17% of agent trajectories with mutating tool calls on the Airlines benchmark, even when final outcomes are correct.