Recognition: no theorem link
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Pith reviewed 2026-05-12 13:30 UTC · model grok-4.3
The pith
LLM-based agents show critical vulnerabilities across prompts, tools, and memory, with average attack success rates reaching 84.30 percent and current defenses offering only limited protection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent Security Bench structures the evaluation of LLM agent security by defining scenarios, agents, tools, attack and defense techniques, and metrics. Applied to 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 defenses across 13 LLM backbones, it exposes critical vulnerabilities at the system prompt, user prompt handling, tool usage, and memory retrieval stages, with a highest average attack success rate of 84.30 percent, while showing that current defenses achieve only limited mitigation.
What carries the argument
The Agent Security Bench (ASB) framework, which organizes standardized scenarios, agents, tools, 27 attack and defense methods, and 7 metrics to test vulnerabilities at each stage of agent execution.
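For a concrete picture of how a benchmark organized this way is typically driven, the sketch below shows a minimal evaluation loop in the spirit of ASB: iterate over scenarios, agents, attacks, defenses, and backbones, run each episode, and record whether the injected goal was executed and whether the benign task still succeeded. The dataclass, the `run_episode` stub, and the repeated-run count are illustrative assumptions, not ASB's actual API; the real implementation is in the authors' repository.

```python
# Minimal sketch of an ASB-style evaluation loop (illustrative only; the real
# framework is at https://github.com/agiresearch/ASB).
from dataclasses import dataclass
from itertools import product
from statistics import mean


@dataclass
class Episode:
    attack_succeeded: bool  # did the injected or poisoned goal get executed?
    task_succeeded: bool    # did the agent still complete the benign task?


def run_episode(scenario, agent, attack, defense, backbone) -> Episode:
    """Hypothetical runner: build the agent on `backbone`, apply `attack` at
    the targeted stage (system prompt, user prompt, tool, or memory), wrap it
    with `defense`, and execute the scenario's task."""
    raise NotImplementedError  # stand-in for actual agent execution


def benchmark(scenarios, agents, attacks, defenses, backbones, runs=3):
    results = {}
    for sc, ag, atk, dfn, bb in product(scenarios, agents, attacks, defenses, backbones):
        episodes = [run_episode(sc, ag, atk, dfn, bb) for _ in range(runs)]
        results[(sc, ag, atk, dfn, bb)] = {
            "ASR": mean(e.attack_succeeded for e in episodes),  # attack success rate
            "TSR": mean(e.task_succeeded for e in episodes),    # task success rate
        }
    return results
```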
If this is right
- Agents remain exposed at system prompt, user prompt, tool usage, and memory retrieval stages.
- Current defenses fail to provide strong protection against the evaluated attacks.
- A utility-security balance metric can guide future agent design choices.
- Vulnerabilities appear consistent across multiple LLM backbones.
- New research on agent security is required to address the identified gaps.
Where Pith is reading between the lines
- Real deployments in finance or autonomous driving could face operational risks from these attack vectors.
- Agent builders may need to integrate security checks at every operational stage rather than relying on post-hoc defenses.
- Extending the benchmark to dynamic multi-agent interactions could reveal additional failure modes.
- The results suggest that tool-calling LLMs require purpose-built security layers beyond standard model alignment.
Load-bearing premise
The selected 10 scenarios, 10 agents, over 400 tools, and 27 attack and defense methods provide representative coverage of real-world LLM agent threats.
What would settle it
A new defense that maintains high task success while reducing attack success rates below 20 percent across all 10 scenarios and tested backbones would undermine the claim of widespread critical vulnerabilities and limited defense effectiveness.
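As a rough illustration of that settling condition, the check below could be run over per-scenario, per-backbone results for a single candidate defense. The 20 percent ASR ceiling comes from the sentence above, while the 80 percent task-success floor is an assumed stand-in for "high task success", not a threshold from the paper.

```python
def defense_settles_it(results, asr_ceiling=0.20, tsr_floor=0.80):
    """`results` maps (scenario, backbone) -> {"ASR": float, "TSR": float}
    for one candidate defense. The claim of widespread critical vulnerability
    would be undermined only if the defense stays under the ASR ceiling and
    above the assumed task-success floor in every cell."""
    return all(cell["ASR"] < asr_ceiling and cell["TSR"] >= tsr_floor
               for cell in results.values())
```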
read the original abstract
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent Security Bench (ASB), a framework for formalizing and benchmarking attacks and defenses in LLM-based agents. It includes 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents, over 400 tools, 27 attack/defense methods, and 7 metrics. The authors evaluate 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 defenses across 13 LLM backbones, reporting a highest average attack success rate of 84.30% with vulnerabilities at system prompt, user prompt, tool usage, and memory stages, plus limited defense effectiveness. They also propose a new metric for balancing utility and security and release the code.
Significance. If the benchmark scope is representative, this provides a timely standardized framework for assessing LLM agent security, which is valuable given the growing use of tool-using agents. The open code enables reproducibility, and the new utility-security balance metric is a constructive addition for practical evaluation. The stage-specific vulnerability findings could usefully inform defense research if the results generalize.
major comments (2)
- [Section 3 (ASB Framework)] The selection criteria, diversity arguments, threat-model justification, and coverage analysis for the 10 scenarios, 10 agents, and >400 tools are not provided. This is load-bearing for the central claims, as the reported 84.30% ASR, 'critical vulnerabilities' across stages, and 'limited effectiveness' of defenses depend on these choices being sufficiently representative and realistic rather than artifacts of simplified or narrow prototypes.
- [Section 5 (Benchmark Results)] The manuscript reports concrete numerical results (e.g., ASRs and defense outcomes) but provides limited detail on exact metric definitions, attack implementations, and aggregation methods for the 'highest average attack success rate' and the 7 metrics. While code availability helps, this reduces independent verifiability of the specific headline numbers without external execution.
minor comments (3)
- The abstract's phrasing 'highest average attack success rate of 84.30%' is ambiguous (e.g., the maximum of per-attack average ASRs vs. the average of per-attack maxima) and should be clarified with reference to the relevant table or definition; two candidate readings are sketched after this list.
- Ensure the new utility-security balance metric is formally defined (with an equation or pseudocode) in the main text rather than only described at a high level; a hypothetical formulation appears in the same sketch.
- Table or figure captions for the main results could more explicitly link back to the 27 methods and 13 backbones for easier cross-referencing.
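To make the ambiguity flagged above concrete, the sketch below contrasts two plausible readings of "highest average attack success rate" and adds one hypothetical utility-security balance score. None of these formulas is taken from the paper; they only illustrate why explicit definitions matter.

```python
from statistics import mean


def highest_average_asr_reading_1(asr_by_attack):
    """Reading 1: average each attack's ASR over backbones and scenarios,
    then report the maximum of those per-attack averages."""
    # asr_by_attack: dict mapping attack name -> list of per-setting ASRs
    return max(mean(values) for values in asr_by_attack.values())


def highest_average_asr_reading_2(asr_by_attack):
    """Reading 2: take each attack's maximum ASR first, then average those
    maxima over attacks. In general this yields a different number."""
    return mean(max(values) for values in asr_by_attack.values())


def balance_score(tsr, asr):
    """Hypothetical utility-security balance: harmonic mean of task success
    and attack resistance (1 - ASR). Not the paper's metric, only a sketch."""
    resistance = 1.0 - asr
    denom = tsr + resistance
    return 0.0 if denom == 0 else 2 * tsr * resistance / denom
```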
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments help us strengthen the presentation of the ASB framework and results. We address each major comment below and will revise the paper to improve clarity and completeness.
read point-by-point responses
- Referee: [Section 3 (ASB Framework)] The selection criteria, diversity arguments, threat-model justification, and coverage analysis for the 10 scenarios, 10 agents, and >400 tools are not provided. This is load-bearing for the central claims, as the reported 84.30% ASR, 'critical vulnerabilities' across stages, and 'limited effectiveness' of defenses depend on these choices being sufficiently representative and realistic rather than artifacts of simplified or narrow prototypes.
Authors: We agree that the manuscript would benefit from a more explicit discussion of these elements. The 10 scenarios were selected to span diverse real-world domains (e-commerce, autonomous driving, finance, etc.) where tool-using agents are actively deployed, the 10 agents reflect common architectural patterns, and the >400 tools were chosen to cover a wide range of functionalities and potential attack surfaces. However, we acknowledge that a dedicated justification, threat-model statement, and coverage analysis are currently insufficiently detailed. In the revision we will add a new subsection to Section 3 that formally states the threat model, explains the selection criteria, and provides a brief diversity and coverage argument supporting the representativeness of the benchmark. revision: yes
- Referee: [Section 5 (Benchmark Results)] The manuscript reports concrete numerical results (e.g., ASRs and defense outcomes) but provides limited detail on exact metric definitions, attack implementations, and aggregation methods for the 'highest average attack success rate' and the 7 metrics. While code availability helps, this reduces independent verifiability of the specific headline numbers without external execution.
Authors: We recognize that the paper should be more self-contained. The public code repository supplies the precise metric implementations, attack code, and aggregation scripts, but readers should not need to execute external code to understand the headline numbers. In the revised Section 5 we will add explicit definitions of all seven metrics, a clear description of how attack success rates (including the reported highest average of 84.30%) are calculated and aggregated across scenarios, models, and runs, and concise overviews of the attack and defense implementations. These additions will improve independent verifiability while retaining the code for full reproducibility. revision: yes
Circularity Check
No circularity: pure empirical benchmark with independent experimental results
full rationale
The paper defines ASB as a fixed collection of 10 scenarios, 10 agents, >400 tools, 27 attack/defense methods and 7 metrics, then executes those attacks on 13 external LLM backbones and reports observed attack success rates (e.g., 84.30%). No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the headline claims; the results are direct outputs of the defined experimental protocol. The representativeness of the chosen components is an external-validity question, not a circular reduction of the reported numbers to the inputs by construction. This is the expected non-circular outcome for a benchmarking study.
Forward citations
Cited by 35 Pith papers
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
  A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions
  Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,...
- Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
  Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
- The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
  No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
- Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
  HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
- EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
  EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
- MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
  MEMSAD uses a provable gradient coupling between anomaly detection and retrieval objectives to deliver certified detection of memory poisoning in LLM agents, achieving optimal sample complexity and perfect TPR/FPR in ...
- MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
  MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.
- Toward a Principled Framework for Agent Safety Measurement
  BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
- Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
  A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
  Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.
- SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
  SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.
- AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
  AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
- Web Agents Should Adopt the Plan-Then-Execute Paradigm
  Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
- Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
  Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
  SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
  DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks ar...
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
  EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
  EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
- When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
  Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
- Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
  MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
- SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
  SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
  ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
- SUDP: Secret-Use Delegation Protocol for Agentic Systems
  SUDP is a protocol allowing untrusted agents to cause bounded, secret-backed operations through fresh user grants redeemed by a custodian, preventing reusable secret exposure.
- Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
  Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
- Security Considerations for Multi-agent Systems
  No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
- EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
  EvoAgent is an evolvable LLM agent framework using structured skill learning, user-feedback loops, and hierarchical delegation that boosts GPT5.2 performance by about 28% in real-world trade scenarios under LLM-as-Jud...
- SafeAgent: A Runtime Protection Architecture for Agentic Systems
  SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.
- SoK: Security of Autonomous LLM Agents in Agentic Commerce
  The paper systematizes security for LLM agents in agentic commerce into five threat dimensions, identifies 12 cross-layer attack vectors, and proposes a layered defense architecture.
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- AI-Driven Security Alert Screening and Alert Fatigue Mitigation in Security Operations Centers: A Comprehensive Survey
  A literature survey synthesizes 119 studies on AI-driven alert screening into a four-stage taxonomy of filtering, triage, correlation, and generative augmentation while identifying gaps in deployment realism and robustness.