Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

Benjamin C. M. Fung; Boyang Li; Farkhund Iqbal; Heba Ismail; Miles Q. Li

arxiv: 2605.16282 · v1 · pith:SO5JUZLZnew · submitted 2026-04-11 · 💻 cs.CY · cs.AI

Miles Q. Li, Benjamin C. M. Fung, Boyang Li, Heba Ismail, Farkhund Iqbal

Pith reviewed 2026-05-21 01:38 UTC · model grok-4.3

classification 💻 cs.CY cs.AI

keywords AI agent safety · safety benchmarks · taxonomy · consistency analysis · LLM agents · risk evaluation · benchmark methodology · Kendall concordance

The pith

Different safety benchmarks for AI agents reach contradictory conclusions about model safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs 40 behavioral safety benchmarks for LLM-based autonomous agents and introduces a six-axis taxonomy to classify their evaluation methods. Applying the taxonomy produces a coverage matrix showing broad but shallow risk coverage with limited convergence across benchmarks. A statistical consistency check using 95 percent confidence intervals and Kendall's W analysis finds no evidence of ranking concordance across evaluation dimensions. This matters because the choice of benchmark can produce opposite assessments of whether an agent is safe, affecting deployment decisions. The analysis also identifies that most benchmarks emphasize externally imposed risks over internal agent behaviors and leave robustness largely untested.

Core claim

The authors catalog 40 agent-safety benchmarks and propose a six-axis taxonomy of evaluation methodology. Applying the taxonomy reveals broad risk coverage but limited methodological convergence, with benchmarks concentrated in sandboxed and constrained settings. The cross-benchmark consistency check with 95% confidence intervals and Kendall's W concordance analysis finds no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94), demonstrating that benchmark choice can yield contradictory safety conclusions.
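For readers unfamiliar with the statistic, the sketch below shows one way Kendall's W and a permutation p-value could be computed for m rankings of the same n agents. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the tie-free formula, the permutation test, and the toy rankings are all hypothetical, and the paper's own rankings are not reproduced here.

```python
import random


def kendalls_w(rankings):
    """Kendall's W for m rankings of the same n items.

    Each ranking is a list of ranks 1..n with no ties (assumption:
    no tie correction is applied in this sketch).
    """
    m, n = len(rankings), len(rankings[0])
    # Sum of ranks each item received across the m rankings.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def permutation_p(rankings, trials=2000, seed=0):
    """Share of random re-rankings whose W is at least the observed W."""
    rng = random.Random(seed)
    observed = kendalls_w(rankings)
    hits = 0
    for _ in range(trials):
        shuffled = [rng.sample(r, len(r)) for r in rankings]
        if kendalls_w(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


# Toy data (hypothetical): three evaluation dimensions ranking four agents.
agree = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
oppose = [[1, 2, 3, 4], [4, 3, 2, 1]]
print(kendalls_w(agree))   # 1.0, perfect concordance
print(kendalls_w(oppose))  # 0.0, two exactly reversed rankings
```

A W near 0 with a large p-value, as reported in the paper, means the rank totals are about as evenly spread as random orderings would produce; it says the rankings do not agree, and only shows conflict on the same agents if every ranking covers the same set.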

What carries the argument

A six-axis taxonomy of benchmark evaluation methodology used to build a coverage matrix and perform cross-benchmark concordance analysis with Kendall's W.
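As a rough illustration of how a coverage matrix can be built from per-benchmark risk annotations, the sketch below assumes a simple three-level coding (none, partial, primary). The benchmark names, risk labels, and coding levels are hypothetical placeholders; the released taxonomy codings may be structured differently.

```python
# Hypothetical coding levels; "primary" mirrors the dark cells in the paper's figures.
LEVELS = {"none": 0, "partial": 1, "primary": 2}


def coverage_matrix(codings, risks):
    """Map each benchmark to a row of coverage levels, one column per risk."""
    return {
        bench: [LEVELS[marks.get(risk, "none")] for risk in risks]
        for bench, marks in codings.items()
    }


def coverage_counts(matrix, risks):
    """Per risk: how many benchmarks touch it at all vs. treat it as primary."""
    counts = {}
    for j, risk in enumerate(risks):
        column = [row[j] for row in matrix.values()]
        counts[risk] = {
            "any": sum(v > 0 for v in column),
            "primary": sum(v == 2 for v in column),
        }
    return counts


# Toy annotations (hypothetical, not taken from the paper).
risks = ["prompt_injection", "privacy_leakage", "shutdown_resistance"]
codings = {
    "bench_A": {"prompt_injection": "primary", "privacy_leakage": "partial"},
    "bench_B": {"prompt_injection": "partial"},
    "bench_C": {"privacy_leakage": "partial", "shutdown_resistance": "primary"},
}
matrix = coverage_matrix(codings, risks)
print(coverage_counts(matrix, risks))
```

Separating the "any" count from the "primary" count is one way to make visible the gap between how many benchmarks mention a risk and how many evaluate it in depth.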

If this is right

Benchmark choice can yield contradictory safety conclusions.
Coverage counts often overstate evaluation depth.
Environment fidelity systematically shapes reported safety.
The field disproportionately tests externally imposed rather than agent-internal risks.
Metric fragmentation limits comparison and robustness remains effectively unbenchmarked.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers and regulators may need to test agents against multiple benchmarks rather than relying on any single one to reach stable safety judgments.
Adopting the proposed minimum reporting standards could reduce future inconsistencies by making benchmark designs more comparable.
The observed lack of concordance suggests value in creating benchmarks that directly probe agent-internal risk generation instead of only external threats.

Load-bearing premise

The manual classification of the 40 benchmarks into the six-axis taxonomy accurately captures the methodological differences that produce divergent safety conclusions.

What would settle it

Re-running the concordance analysis on the same or similar benchmarks and obtaining a Kendall's W substantially higher than 0.10 with p below 0.05 would falsify the no-concordance result.

Figures

Figures reproduced from arXiv: 2605.16282 by Benjamin C. M. Fung, Boyang Li, Farkhund Iqbal, Heba Ismail, Miles Q. Li.

**Figure 2.** Heatmap of the risk–benchmark coverage matrix for 40 core behavioral benchmarks (full 45-entry …

**Figure 3.** Radar chart of risk coverage among the 40 core behavioral benchmarks. Dark = primary coverage; …

Original abstract

The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper catalogs 40 agent safety benchmarks, introduces a six-axis taxonomy, and reports low cross-benchmark ranking agreement via Kendall's W, but the claim of contradictory safety conclusions needs confirmation that the rankings share overlapping agent sets.

Hi colleague, the main takeaway is that this paper pulls together 40 agent safety benchmarks from 2023-2026, lays out a six-axis taxonomy for their methods, and runs a consistency check that finds almost no agreement in how they rank safety (W = 0.10, p = 0.94). That points to benchmark choice mattering a lot for the conclusions you draw, which is worth having on record if you're comparing safety claims across papers.

They do a good job with the catalog and the coverage matrix, which shows broad but shallow risk coverage and highlights things like limited robustness testing and a bias toward external rather than internal risks. Releasing the metadata, codings, and artifacts is useful for anyone who wants to verify or extend the work, and the call for minimum reporting standards follows directly from the fragmentation they document. The taxonomy itself looks like a practical way to sort out differences in environments, metrics, and threat models that prior work left scattered.

The soft spot sits in the statistical section. The low concordance is used to argue that benchmark choice can produce contradictory safety conclusions, but this only holds if the rankings come from a shared set of agents or systems. If the benchmarks instead evaluate largely disjoint collections, then weak agreement is expected and does not demonstrate conflicting verdicts on the same properties. The abstract does not make the overlap explicit, so the full paper should clarify how the rankings were aligned or adjust the interpretation. The manual classification into the taxonomy also rests on author judgment without reported inter-rater checks, though that is a minor issue for a first mapping.

This is for people working on agent evaluation and safety validation who need a current map of the field and its gaps. A serious referee should see it because the catalog and taxonomy stand on their own even if the concordance claim requires some tightening on the data side.

I would send it for peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript catalogs 40 behavioral agent-safety benchmarks (2023-2026) along with 5 adjacent artifacts, proposes a six-axis taxonomy of benchmark evaluation methodology, applies the taxonomy to produce a coverage matrix, and performs a cross-benchmark consistency analysis using Kendall's W concordance and 95% confidence intervals on rankings. It concludes that there is no evidence of ranking concordance (W = 0.10, p = 0.94), that benchmark choice can produce contradictory safety conclusions, that coverage counts overstate depth, that environment fidelity shapes results, and that the field under-tests agent-internal risks and robustness. The authors release structured metadata, taxonomy codings, and artifacts, and propose minimum reporting standards.

Significance. If the consistency analysis is valid, the work is significant for documenting fragmentation in a rapidly growing subfield and for releasing reusable artifacts that enable future meta-analyses. The taxonomy and coverage matrix provide a concrete framework for comparing evaluation instruments, and the call for reporting standards addresses a practical gap. The statistical grounding (explicit W and p-values) is a strength relative to purely qualitative surveys.

major comments (1)

[cross-benchmark consistency check] Consistency analysis (abstract and associated section): The central claim that 'benchmark choice can yield contradictory safety conclusions' is supported by the reported Kendall's W = 0.10 (p = 0.94) across evaluation dimensions. However, the manuscript does not explicitly state whether the compared rankings are derived from a common, overlapping set of evaluated agents or from disjoint agent sets. If the latter, low concordance is expected by construction and does not demonstrate conflicting verdicts on identical safety properties. Clarification or an explicit overlap analysis is required to secure the interpretation.
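The overlap analysis requested here could be sketched as follows: given the set of agents each benchmark reports results for, compute the agents common to all benchmarks and the pairwise overlap sizes. The benchmark and agent names are hypothetical placeholders, and nothing below reflects the paper's actual agent sets.

```python
from itertools import combinations


def shared_agents(evaluated):
    """Agents that appear in every benchmark's evaluated set."""
    sets = list(evaluated.values())
    return set.intersection(*sets) if sets else set()


def pairwise_overlap(evaluated):
    """Number of agents shared by each pair of benchmarks."""
    return {
        (a, b): len(evaluated[a] & evaluated[b])
        for a, b in combinations(sorted(evaluated), 2)
    }


# Toy input (hypothetical): benchmark -> set of agents it reports scores for.
evaluated = {
    "bench_A": {"agent_1", "agent_2", "agent_3"},
    "bench_B": {"agent_2", "agent_3", "agent_4"},
    "bench_C": {"agent_3", "agent_5"},
}
print(shared_agents(evaluated))     # agents usable in a joint concordance test
print(pairwise_overlap(evaluated))  # overlap size for each benchmark pair
```

A concordance statistic is only interpretable over the shared set; if that set is small or empty, low agreement cannot be read as conflicting verdicts on the same agents.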

minor comments (2)

[consistency analysis] Full details on how rankings were derived from each benchmark's metrics are needed; the main text should include a table or subsection enumerating the exact metrics, normalization steps, and agent sets used for each benchmark in the concordance calculation.
[taxonomy section] The six-axis taxonomy is introduced as an invented classification; a brief discussion of inter-rater reliability or sensitivity to axis redefinition would strengthen the claim that it exhaustively captures methodological differences.
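One common way to report inter-rater reliability for a categorical coding like this is Cohen's kappa between two independent coders. The sketch below is a minimal illustration; the axis values ("sandboxed", "live") and the two coders' labels are hypothetical, and the paper does not state that this statistic was used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each coder labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b)
    )
    if p_expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy codings (hypothetical) of one taxonomy axis for four benchmarks.
coder_1 = ["sandboxed", "sandboxed", "live", "live"]
coder_2 = ["sandboxed", "live", "live", "live"]
print(cohens_kappa(coder_1, coder_2))  # 0.5
```

Reporting kappa per axis would show which of the six axes are coded reliably and which depend heavily on individual judgment.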

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the need for greater clarity on our consistency analysis. We agree that explicitly addressing the agent overlap is necessary to fully support our interpretation of the results and will revise the manuscript to include this information.

Referee: Consistency analysis (abstract and associated section): The central claim that 'benchmark choice can yield contradictory safety conclusions' is supported by the reported Kendall's W = 0.10 (p = 0.94) across evaluation dimensions. However, the manuscript does not explicitly state whether the compared rankings are derived from a common, overlapping set of evaluated agents or from disjoint agent sets. If the latter, low concordance is expected by construction and does not demonstrate conflicting verdicts on identical safety properties. Clarification or an explicit overlap analysis is required to secure the interpretation.

Authors: We appreciate this observation. The consistency analysis was performed using a common overlapping set of agents evaluated across multiple benchmarks to enable direct comparison of safety rankings. We acknowledge that this detail was not stated explicitly in the original manuscript. In the revision we will add a dedicated paragraph in the consistency analysis section describing the agent selection criteria, the number of shared agents, the specific benchmarks involved, and a summary of the overlap sizes. This will confirm that the reported low concordance (W = 0.10, p = 0.94) reflects genuine divergence in safety conclusions for the same agents rather than an artifact of disjoint sets.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical survey and standard statistical analysis on external benchmarks.

The paper catalogs 40 external benchmarks from 2023-2026, proposes a six-axis taxonomy derived from methodological inspection of those benchmarks, applies the taxonomy to produce a coverage matrix, and performs a cross-benchmark consistency check using Kendall's W and 95% confidence intervals on observed rankings. These steps rely on external data sources and standard non-parametric statistics rather than any fitted parameters, self-defined quantities, or self-citation chains that reduce the central claims to the paper's own inputs by construction. The manual classification is an analytical coding step whose outputs are then tested against the same external corpus; it does not create a self-referential loop where a result is predicted from a quantity defined in terms of itself. No load-bearing uniqueness theorems, ansatzes, or renamings of known results are invoked in a manner that collapses the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the authors' manual assignment of each benchmark to the six taxonomy axes and on the assumption that the collected set of 40 benchmarks is sufficiently representative for the consistency conclusions to generalize.

axioms (1)

[standard math] Standard statistical assumptions underlying Kendall's W and 95% confidence interval calculations hold for the derived safety rankings.
Invoked when reporting W = 0.10, p = 0.94.
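To make the confidence-interval assumption concrete, the sketch below computes a 95% Wilson score interval for a binomial rate, such as the share of test cases in which an agent behaves unsafely. This is an assumption for illustration only: the source does not say which interval construction the authors used, and the function name and example counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 50 unsafe outcomes observed in 100 test cases.
low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

The interval treats test cases as independent draws with a fixed underlying rate; if a benchmark's cases are clustered or correlated, the true uncertainty is wider than this calculation suggests.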

invented entities (1)

Six-axis taxonomy of benchmark evaluation methodology [no independent evidence]
purpose: To classify and compare agent safety benchmarks along dimensions that affect safety conclusions.
New framework introduced by the authors; no independent evidence outside this paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We propose a six-axis taxonomy of benchmark evaluation methodology... and apply it across the corpus... finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

M. Andriushchenko, A. Souly, M. Dziemian, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. In Proceedings of ICLR, 2025. arXiv:2410.09024

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] N. Arora, S. Joel, I. Kavathekar, et al. Exposing weak links in multi-agent systems under adversarial prompting. arXiv preprint arXiv:2511.10949, 2025

[3] Y. Bai, A. Jones, K. Ndousse, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

[4] S. Black, A. Cooper Stickland, J. Pencharz, et al. RepliBench: Evaluating the autonomous replication capabilities of language model agents. arXiv preprint arXiv:2504.18565, 2025

[5] F. Bordes, C. Ross, J. T. Kao, E. Spiliopoulou, and A. Williams. Eval factsheets: A structured framework for documenting AI evaluations. arXiv preprint arXiv:2512.04062, 2025

[6] Z. Chen, Z. Xiang, C. Xiao, et al. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases. In NeurIPS, 2024. arXiv:2407.12784

[7] A. Chhabra, S. Datta, S. K. Nahin, and P. Mohapatra. Agentic AI security: Threats, defenses, evaluation, and open challenges. arXiv preprint arXiv:2510.23883, 2025

[8] J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh. OR-Bench: An over-refusal benchmark for large language models. In ICML, 2025. arXiv:2405.20947

[9] E. Debenedetti, J. Zhang, M. Balunović, et al. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In NeurIPS, 2024. arXiv:2406.13352

[10] S. Dong, S. Xu, P. He, et al. MINJA: Memory injection attacks on LLM agents via query-only interaction. In NeurIPS, 2025. arXiv:2503.03704

[11] F. El Yagoubi, R. Al Mallah, and G. Badu-Marfo. AgentLeak: A full-stack benchmark for privacy leakage in multi-agent LLM systems. arXiv preprint arXiv:2602.11510, 2026

[12] I. Evtimov, A. Zharmagambetov, A. Grattafiori, et al. WASP: Benchmarking web agent security against prompt injection attacks. arXiv preprint arXiv:2504.18575, 2025

[13] Y. Feng, Y. Li, Y. Wu, et al. BackdoorAgent: A unified framework for backdoor attacks on LLM-based agents. arXiv preprint arXiv:2601.04566, 2026

[14] Y. Fu, X. Yuan, and D. Wang. RAS-Eval: A comprehensive benchmark for security evaluation of LLM agents in real-world environments. arXiv preprint arXiv:2506.15253, 2025

[15] R. Greenblatt, C. Denison, B. Wright, et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024

[16] K. Greshake, S. Abdelnabi, S. Mishra, et al. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023

[17] T. Hadeliya, M. A. Jauhar, N. Sakpal, and D. Cruz. When refusals fail: Unstable safety mechanisms in long-context LLM agents. In AAAI TrustAgent Workshop, 2026. arXiv:2512.02445

[18] L. Hammond, A. Chan, J. Clifton, et al. Multi-agent risks from advanced AI. arXiv preprint arXiv:2502.14143, 2025

[19] M. Hopman, J. Elstner, M. Avramidou, et al. Evaluating and understanding scheming propensity in LLM agents. arXiv preprint arXiv:2603.01608, 2026

[20] W. Hua, X. Yang, M. Jin, et al. TrustAgent: Towards safe and trustworthy LLM-based agents. In EMNLP, 2024. arXiv:2402.01586

[21] E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, and S. Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019

[22] T. Jiang, Y. Wang, J. Liang, and T. Wang. AgentLAB: Benchmarking LLM agents against long-horizon attacks. arXiv preprint arXiv:2602.16901, 2026

[23] G. Juneja, J. N. S. Pasupulati, A. Albalak, et al. MAGPIE: A benchmark for multi-agent contextual privacy evaluation. arXiv preprint arXiv:2510.15186, 2025

[24] I. Kavathekar, H. Jain, A. Rathod, et al. TAMAS: Benchmarking adversarial risks in multi-agent LLM systems. In MAS Workshop, ICML, 2025. arXiv:2511.05269

[25] J. Kutasov, Y. Sun, P. Colognese, et al. SHADE-Arena: Evaluating sabotage and monitoring in LLM agents. arXiv preprint arXiv:2506.15740, 2025

[26] J. Lee, D. Hahm, J. S. Choi, et al. MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control. arXiv preprint arXiv:2410.17520, 2024

[27] I. Levy, B. Wiesel, S. Marreed, et al. ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents. In ICLR, 2026. arXiv:2410.06703

[28] M. Q. Li, B. C. M. Fung, M. Weiss, et al. A benchmark for evaluating outcome-driven constraint violations in autonomous AI agents. arXiv preprint arXiv:2512.20798, 2025

[29] X. Liang, S. Niu, Z. Li, et al. SafeRAG: Benchmarking security in retrieval-augmented generation of large language model. In ACL, 2025. arXiv:2501.18636

[30] A. Liu, Z. Ying, L. Wang, J. Mu, J. Guo, J. Wang, Y. Ma, S. Liang, M. Zhang, X. Liu, and D. Tao. AgentSafe: Benchmarking the safety of embodied agents on hazardous instructions. In MAS Workshop, ICML, 2025. arXiv:2506.14697

[31] X. Lu, Z. Chen, X. Hu, et al. IS-Bench: Evaluating interactive safety of VLM-driven embodied agents in daily household tasks. arXiv preprint arXiv:2506.16402, 2025

[32] H. Luo, S. Dai, C. Ni, et al. AgentAuditor: Human-level safety and security evaluation for LLM agents. In NeurIPS, 2025. arXiv:2506.00641

[33] X. Ma, Y. Gao, Y. Wang, et al. Safety at scale: A comprehensive survey of large model and agent safety. arXiv preprint arXiv:2502.05206, 2025

[34] M. MacDiarmid, B. Wright, J. Uesato, et al. Natural emergent misalignment from reward hacking in production RL. arXiv preprint arXiv:2511.18397, 2025

[35] S. McGregor, V. Lu, V. Tashev, et al. Risk management for mitigating benchmark failure modes: BenchRisk. In NeurIPS, 2025. arXiv:2510.21460

[36] A. Meinke, B. Schoen, J. Scheurer, et al. Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984, 2024

[37] S. Messick. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9):741–749, 1995

[38] M. Mohammadi, Y. Li, J. Lo, and W. Yip. Evaluation and benchmarking of LLM agents: A survey. In KDD, 2025. arXiv:2507.21504

[39] A. Naik, P. Quinn, G. Bosch, et al. AgentMisalignment: Measuring the propensity for misaligned behaviour in LLM-based agents. arXiv preprint arXiv:2506.04018, 2025

[40] M. Nakamura, A. Kumar, S. Das, et al. Colosseum: Auditing collusion in cooperative multi-agent systems. arXiv preprint arXiv:2602.15198, 2026

[41] J. Nöther, A. Singla, and G. Radanovic. Benchmarking the robustness of agentic systems to adversarially-induced harms. arXiv preprint arXiv:2508.16481, 2025

[42] A. Pan, J. S. Chan, A. Zou, et al. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In ICML, 2023. arXiv:2304.03279

[43] Q. Ren, Z. Zheng, J. Guo, et al. When AI agents collude online: Financial fraud risks by collaborative LLM agents on social platforms. arXiv preprint arXiv:2511.06448, 2025

[44] R. Ren, S. Basart, A. Khoja, et al. Safetywashing: Do AI safety benchmarks actually measure safety progress? In NeurIPS, 2024. arXiv:2407.21792

[45] A. Reuel, A. Hardy, C. Smith, M. Lamparth, M. Hardy, and M. J. Kochenderfer. BetterBench: Assessing AI benchmarks, uncovering issues, and establishing best practices. In NeurIPS, 2024. arXiv:2411.12990

[46] Y. Ruan, H. Dong, A. Wang, et al. Identifying the risks of LM agents with an LM-emulated sandbox. In ICLR (Spotlight), 2024. arXiv:2309.15817

[47] J. Schlatter, B. Weinstein-Raun, and J. Ladish. Incomplete tasks induce shutdown resistance in some frontier LLMs. Transactions on Machine Learning Research, 2026. arXiv:2509.14260

[48] U. M. Sehwag, S. Shabihi, A. McAvoy, et al. PropensityBench: Evaluating latent safety risks in large language models via an agentic approach. arXiv preprint arXiv:2511.20703, 2025

[49] O. Tailor. Audit the whisper: Detecting steganographic collusion in multi-agent LLMs. arXiv preprint arXiv:2510.04303, 2025

[50] S. Vijayvargiya, A. B. Soni, X. Zhou, et al. OpenAgentSafety: A comprehensive framework for evaluating real-world AI agent safety. In ICLR, 2026. arXiv:2507.06134

[51] H. Wang, C. M. Poskitt, and J. Sun. AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents. In Proceedings of ICSE, 2026. arXiv:2503.18666

[52] K. Wang, G. Zhang, Z. Zhou, et al. A comprehensive survey in LLM(-agent) full stack safety: Data, training and deployment. arXiv preprint arXiv:2504.15585, 2025

[53] L. Wang, C. Ma, X. Feng, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 2024. arXiv:2308.11432

[54] Z. Xi, W. Chen, X. Guo, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023

[55] H. Xia, H. Wang, Z. Liu, et al. SafeToolBench: Pioneering a prospective benchmark to evaluating tool utilization safety in LLMs. In EMNLP Findings, 2025. arXiv:2509.07315

[56] Z. Xiang, L. Zheng, Y. Li, et al. GuardAgent: Safeguard LLM agents via knowledge-enabled reasoning. In ICML, 2025. arXiv:2406.09187

[57] Y. Xie, Y. Yuan, W. Wang, et al. ToolSafety: A comprehensive dataset for enhancing safety in LLM-based agent tool invocations. In EMNLP, 2025

[58] A. Yehudai, L. Eden, A. Li, et al. Survey on evaluation of LLM-based agents. arXiv preprint arXiv:2503.16416, 2025

[59] S. Yin, X. Pang, Y. Ding, et al. SafeAgentBench: A benchmark for safe task planning of embodied LLM agents. arXiv preprint arXiv:2412.13178, 2024

[60] C. Yu, S. Engelmann, R. Cao, D. Ali, and O. Papakyriakopoulos. How should AI safety benchmarks benchmark safety? arXiv preprint arXiv:2601.23112, 2026

[61] M. Yu, F. Meng, X. Zhou, et al. A survey on trustworthy LLM agents: Threats and countermeasures. In KDD, 2025. arXiv:2503.09648

[62] T. Yuan, Z. He, L. Dong, et al. R-Judge: Benchmarking safety risk awareness for LLM agents. In EMNLP Findings, 2024. arXiv:2401.10019

[63] Summery Yue. Nothing humbles you like telling your OpenClaw "confirm before acting" and watching it speedrun deleting your inbox. X (formerly Twitter), February 2026. https://x.com/summeryue0/status/2025774069124399363

[64] Q. Zhan, Z. Liang, Z. Ying, and D. Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In ACL Findings, 2024. arXiv:2403.02691

[65] H. Zhang, J. Huang, K. Mei, et al. Agent Security Bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. In ICLR, 2025. arXiv:2410.02644

[66] Z. Zhang, S. Cui, Y. Lu, et al. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470, 2024a

[67] Z. Zhang, Y. Zhang, L. Li, et al. PsySafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. In ACL, 2024b. arXiv:2401.11880

[68] K. Zhou, S. Jangam, A. Nagarajan, et al. SafePro: Evaluating the safety of professional-level AI agents. arXiv preprint arXiv:2601.06663, 2026

[69] X. Zong, Z. Shen, L. Wang, et al. MCP-SafetyBench: A benchmark for safety evaluation of large language models with real-world MCP servers. arXiv preprint arXiv:2512.15163, 2025

[70] A. Zou, Z. Wang, N. Carlini, et al. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

[71] W. Zou, R. Geng, B. Wang, and J. Jia. PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models. In USENIX Security, 2025. arXiv:2402.07867