Recognition: no theorem link
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Pith reviewed 2026-05-12 13:30 UTC · model grok-4.3
The pith
LLM-based agents show critical vulnerabilities across prompts, tools, and memory, with average attack success rates reaching 84.30 percent and current defenses offering only limited protection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent Security Bench structures the evaluation of LLM agent security by defining scenarios, agents, tools, attack and defense techniques, and metrics. Applied to 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 defenses across 13 LLM backbones, it exposes critical vulnerabilities at the system prompt, user prompt handling, tool usage, and memory retrieval stages, with a highest average attack success rate of 84.30 percent, while showing that current defenses achieve only limited mitigation.
What carries the argument
The Agent Security Bench (ASB) framework, which organizes standardized scenarios, agents, tools, 27 attack and defense methods, and 7 metrics to test vulnerabilities at each stage of agent execution.
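For a concrete picture of how a benchmark organized this way is typically driven, the sketch below shows a minimal evaluation loop in the spirit of ASB: iterate over scenarios, agents, attacks, defenses, and backbones, run each episode, and record whether the injected goal was executed and whether the benign task still succeeded. The dataclass, the `run_episode` stub, and the repeated-run count are illustrative assumptions, not ASB's actual API; the real implementation is in the authors' repository.

```python
# Minimal sketch of an ASB-style evaluation loop (illustrative only; the real
# framework is at https://github.com/agiresearch/ASB).
from dataclasses import dataclass
from itertools import product
from statistics import mean


@dataclass
class Episode:
    attack_succeeded: bool  # did the injected or poisoned goal get executed?
    task_succeeded: bool    # did the agent still complete the benign task?


def run_episode(scenario, agent, attack, defense, backbone) -> Episode:
    """Hypothetical runner: build the agent on `backbone`, apply `attack` at
    the targeted stage (system prompt, user prompt, tool, or memory), wrap it
    with `defense`, and execute the scenario's task."""
    raise NotImplementedError  # stand-in for actual agent execution


def benchmark(scenarios, agents, attacks, defenses, backbones, runs=3):
    results = {}
    for sc, ag, atk, dfn, bb in product(scenarios, agents, attacks, defenses, backbones):
        episodes = [run_episode(sc, ag, atk, dfn, bb) for _ in range(runs)]
        results[(sc, ag, atk, dfn, bb)] = {
            "ASR": mean(e.attack_succeeded for e in episodes),  # attack success rate
            "TSR": mean(e.task_succeeded for e in episodes),    # task success rate
        }
    return results
```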
If this is right
- Agents remain exposed at system prompt, user prompt, tool usage, and memory retrieval stages.
- Current defenses fail to provide strong protection against the evaluated attacks.
- A utility-security balance metric can guide future agent design choices.
- Vulnerabilities appear consistent across multiple LLM backbones.
- New research on agent security is required to address the identified gaps.
Where Pith is reading between the lines
- Real deployments in finance or autonomous driving could face operational risks from these attack vectors.
- Agent builders may need to integrate security checks at every operational stage rather than relying on post-hoc defenses.
- Extending the benchmark to dynamic multi-agent interactions could reveal additional failure modes.
- The results suggest that tool-calling LLMs require purpose-built security layers beyond standard model alignment.
Load-bearing premise
The selected 10 scenarios, 10 agents, over 400 tools, and 27 attack and defense methods provide representative coverage of real-world LLM agent threats.
What would settle it
A new defense that maintains high task success while reducing attack success rates below 20 percent across all 10 scenarios and tested backbones would undermine the claim of widespread critical vulnerabilities and limited defense effectiveness.
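As a rough illustration of that settling condition, the check below could be run over per-scenario, per-backbone results for a single candidate defense. The 20 percent ASR ceiling comes from the sentence above, while the 80 percent task-success floor is an assumed stand-in for "high task success", not a threshold from the paper.

```python
def defense_settles_it(results, asr_ceiling=0.20, tsr_floor=0.80):
    """`results` maps (scenario, backbone) -> {"ASR": float, "TSR": float}
    for one candidate defense. The claim of widespread critical vulnerability
    would be undermined only if the defense stays under the ASR ceiling and
    above the assumed task-success floor in every cell."""
    return all(cell["ASR"] < asr_ceiling and cell["TSR"] >= tsr_floor
               for cell in results.values())
```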
read the original abstract
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent Security Bench (ASB), a framework for formalizing and benchmarking attacks and defenses in LLM-based agents. It includes 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents, over 400 tools, 27 attack/defense methods, and 7 metrics. The authors evaluate 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 defenses across 13 LLM backbones, reporting a highest average attack success rate of 84.30% with vulnerabilities at system prompt, user prompt, tool usage, and memory stages, plus limited defense effectiveness. They also propose a new metric for balancing utility and security and release the code.
Significance. If the benchmark scope is representative, this provides a timely standardized framework for assessing LLM agent security, which is valuable given the growing use of tool-using agents. The open code enables reproducibility, and the new utility-security balance metric is a constructive addition for practical evaluation. The stage-specific vulnerability findings could usefully inform defense research if the results generalize.
major comments (2)
- [Section 3 (ASB Framework)] The selection criteria, diversity arguments, threat-model justification, and coverage analysis for the 10 scenarios, 10 agents, and >400 tools are not provided. This is load-bearing for the central claims, as the reported 84.30% ASR, 'critical vulnerabilities' across stages, and 'limited effectiveness' of defenses depend on these choices being sufficiently representative and realistic rather than artifacts of simplified or narrow prototypes.
- [Section 5 (Benchmark Results)] The manuscript reports concrete numerical results (e.g., ASRs and defense outcomes) but provides limited detail on exact metric definitions, attack implementations, and aggregation methods for the 'highest average attack success rate' and the 7 metrics. While code availability helps, this reduces independent verifiability of the specific headline numbers without external execution.
minor comments (3)
- The abstract's phrasing 'highest average attack success rate of 84.30%' is ambiguous (e.g., the maximum of per-attack average ASRs vs. the average of per-attack maxima) and should be clarified with reference to the relevant table or definition; two candidate readings are sketched after this list.
- Ensure the new utility-security balance metric is formally defined (with an equation or pseudocode) in the main text rather than only described at a high level; a hypothetical formulation appears in the same sketch.
- Table or figure captions for the main results could more explicitly link back to the 27 methods and 13 backbones for easier cross-referencing.
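To make the ambiguity flagged above concrete, the sketch below contrasts two plausible readings of "highest average attack success rate" and adds one hypothetical utility-security balance score. None of these formulas is taken from the paper; they only illustrate why explicit definitions matter.

```python
from statistics import mean


def highest_average_asr_reading_1(asr_by_attack):
    """Reading 1: average each attack's ASR over backbones and scenarios,
    then report the maximum of those per-attack averages."""
    # asr_by_attack: dict mapping attack name -> list of per-setting ASRs
    return max(mean(values) for values in asr_by_attack.values())


def highest_average_asr_reading_2(asr_by_attack):
    """Reading 2: take each attack's maximum ASR first, then average those
    maxima over attacks. In general this yields a different number."""
    return mean(max(values) for values in asr_by_attack.values())


def balance_score(tsr, asr):
    """Hypothetical utility-security balance: harmonic mean of task success
    and attack resistance (1 - ASR). Not the paper's metric, only a sketch."""
    resistance = 1.0 - asr
    denom = tsr + resistance
    return 0.0 if denom == 0 else 2 * tsr * resistance / denom
```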
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments help us strengthen the presentation of the ASB framework and results. We address each major comment below and will revise the paper to improve clarity and completeness.
read point-by-point responses
- Referee: [Section 3 (ASB Framework)] The selection criteria, diversity arguments, threat-model justification, and coverage analysis for the 10 scenarios, 10 agents, and >400 tools are not provided. This is load-bearing for the central claims, as the reported 84.30% ASR, 'critical vulnerabilities' across stages, and 'limited effectiveness' of defenses depend on these choices being sufficiently representative and realistic rather than artifacts of simplified or narrow prototypes.
Authors: We agree that the manuscript would benefit from a more explicit discussion of these elements. The 10 scenarios were selected to span diverse real-world domains (e-commerce, autonomous driving, finance, etc.) where tool-using agents are actively deployed, the 10 agents reflect common architectural patterns, and the >400 tools were chosen to cover a wide range of functionalities and potential attack surfaces. However, we acknowledge that a dedicated justification, threat-model statement, and coverage analysis are currently insufficiently detailed. In the revision we will add a new subsection to Section 3 that formally states the threat model, explains the selection criteria, and provides a brief diversity and coverage argument supporting the representativeness of the benchmark. revision: yes
- Referee: [Section 5 (Benchmark Results)] The manuscript reports concrete numerical results (e.g., ASRs and defense outcomes) but provides limited detail on exact metric definitions, attack implementations, and aggregation methods for the 'highest average attack success rate' and the 7 metrics. While code availability helps, this reduces independent verifiability of the specific headline numbers without external execution.
Authors: We recognize that the paper should be more self-contained. The public code repository supplies the precise metric implementations, attack code, and aggregation scripts, but readers should not need to execute external code to understand the headline numbers. In the revised Section 5 we will add explicit definitions of all seven metrics, a clear description of how attack success rates (including the reported highest average of 84.30%) are calculated and aggregated across scenarios, models, and runs, and concise overviews of the attack and defense implementations. These additions will improve independent verifiability while retaining the code for full reproducibility. revision: yes
Circularity Check
No circularity: pure empirical benchmark with independent experimental results
full rationale
The paper defines ASB as a fixed collection of 10 scenarios, 10 agents, >400 tools, 27 attack/defense methods and 7 metrics, then executes those attacks on 13 external LLM backbones and reports observed attack success rates (e.g., 84.30%). No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the headline claims; the results are direct outputs of the defined experimental protocol. The representativeness of the chosen components is an external-validity question, not a circular reduction of the reported numbers to the inputs by construction. This is the expected non-circular outcome for a benchmarking study.
Forward citations
Cited by 35 Pith papers
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
  A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions
  Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,...
- Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
  Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
- The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
  No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
- Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
  HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
- EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
  EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
- MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
  MEMSAD uses a provable gradient coupling between anomaly detection and retrieval objectives to deliver certified detection of memory poisoning in LLM agents, achieving optimal sample complexity and perfect TPR/FPR in ...
- MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
  MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.
- Toward a Principled Framework for Agent Safety Measurement
  BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
- Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
  A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
  Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.
- SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
  SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.
- AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
  AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
- Web Agents Should Adopt the Plan-Then-Execute Paradigm
  Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
- Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
  Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
  SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
  DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks ar...
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
  EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
- When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
  EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
- When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
  Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
- Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
  MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
- SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
  SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
  ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
- SUDP: Secret-Use Delegation Protocol for Agentic Systems
  SUDP is a protocol allowing untrusted agents to cause bounded, secret-backed operations through fresh user grants redeemed by a custodian, preventing reusable secret exposure.
- Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
  Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
- Security Considerations for Multi-agent Systems
  No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
- EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
  EvoAgent is an evolvable LLM agent framework using structured skill learning, user-feedback loops, and hierarchical delegation that boosts GPT5.2 performance by about 28% in real-world trade scenarios under LLM-as-Jud...
- SafeAgent: A Runtime Protection Architecture for Agentic Systems
  SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.
- SoK: Security of Autonomous LLM Agents in Agentic Commerce
  The paper systematizes security for LLM agents in agentic commerce into five threat dimensions, identifies 12 cross-layer attack vectors, and proposes a layered defense architecture.
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- AI-Driven Security Alert Screening and Alert Fatigue Mitigation in Security Operations Centers: A Comprehensive Survey
  A literature survey synthesizes 119 studies on AI-driven alert screening into a four-stage taxonomy of filtering, triage, correlation, and generative augmentation while identifying gaps in deployment realism and robustness.