arxiv: 2605.16282 · v1 · pith:SO5JUZLZnew · submitted 2026-04-11 · 💻 cs.CY · cs.AI

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

Pith reviewed 2026-05-21 01:38 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords AI agent safety · safety benchmarks · taxonomy · consistency analysis · LLM agents · risk evaluation · benchmark methodology · Kendall concordance

The pith

Different safety benchmarks for AI agents reach contradictory conclusions about model safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs 40 behavioral safety benchmarks for LLM-based autonomous agents and introduces a six-axis taxonomy to classify their evaluation methods. Applying the taxonomy produces a coverage matrix showing broad but shallow risk coverage with limited convergence across benchmarks. A statistical consistency check using 95 percent confidence intervals and Kendall's W analysis finds no evidence of ranking concordance across evaluation dimensions. This matters because the choice of benchmark can produce opposite assessments of whether an agent is safe, affecting deployment decisions. The analysis also identifies that most benchmarks emphasize externally imposed risks over internal agent behaviors and leave robustness largely untested.

Core claim

The authors catalog 40 agent-safety benchmarks and propose a six-axis taxonomy of evaluation methodology. Applying the taxonomy reveals broad risk coverage but limited methodological convergence, with benchmarks concentrated in sandboxed and constrained settings. The cross-benchmark consistency check with 95% confidence intervals and Kendall's W concordance analysis finds no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94), demonstrating that benchmark choice can yield contradictory safety conclusions.
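For readers unfamiliar with the statistic, the following is a minimal sketch of Kendall's W in its textbook no-ties form. It assumes each row is one evaluation dimension's ranking of the same agents (rank 1 = safest); that layout, the function name `kendalls_w`, and the toy inputs are illustrative assumptions, and the paper's actual rankings and any tie correction are not reproduced here.

```python
from typing import Sequence


def kendalls_w(rankings: Sequence[Sequence[float]]) -> float:
    """Kendall's coefficient of concordance, no-ties formula.

    rankings[j][i] is the rank (1..n) that ranking j assigns to item i.
    Returns a value in [0, 1]: 0 = no agreement, 1 = identical rankings.
    """
    m = len(rankings)
    if m < 2:
        raise ValueError("need at least two rankings")
    n = len(rankings[0])
    if n < 2 or any(len(r) != n for r in rankings):
        raise ValueError("all rankings must cover the same n >= 2 items")

    # Sum of ranks each item receives across the m rankings.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Squared deviation of the rank sums from their mean.
    s = sum((x - mean_rank_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy inputs (hypothetical): three identical rankings, then two opposed ones.
agree = kendalls_w([[1, 2, 3, 4]] * 3)
oppose = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

A value near 0.10, as reported, sits close to the "opposed" end of this scale: the rank sums barely differ between agents, so no agent is consistently placed ahead of the others.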

What carries the argument

A six-axis taxonomy of benchmark evaluation methodology used to build a coverage matrix and perform cross-benchmark concordance analysis with Kendall's W.
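As a rough illustration of how per-benchmark risk annotations could be turned into a coverage matrix, the sketch below uses hypothetical benchmark names, risk categories, and a two-level primary/secondary coding. None of these values come from the paper's released annotations; they are placeholders chosen only to show the data shape.

```python
from collections import Counter

# Hypothetical codings: benchmark -> {risk category: coverage level}.
# Placeholder names and levels, not the paper's released annotations.
codings = {
    "bench_a": {"prompt_injection": "primary", "privacy_leak": "secondary"},
    "bench_b": {"prompt_injection": "primary"},
    "bench_c": {"privacy_leak": "primary", "sabotage": "secondary"},
}


def coverage_matrix(codings):
    """Return (risks, rows); rows[b][k] is 2 (primary), 1 (secondary) or 0."""
    level = {"primary": 2, "secondary": 1}
    risks = sorted({risk for cov in codings.values() for risk in cov})
    rows = {
        bench: [level.get(cov.get(risk), 0) for risk in risks]
        for bench, cov in codings.items()
    }
    return risks, rows


def coverage_counts(codings):
    """Count benchmarks touching each risk at all vs. as a primary target."""
    any_cov, primary = Counter(), Counter()
    for cov in codings.values():
        for risk, lvl in cov.items():
            any_cov[risk] += 1
            if lvl == "primary":
                primary[risk] += 1
    return any_cov, primary


risks, rows = coverage_matrix(codings)
any_cov, primary = coverage_counts(codings)
```

Comparing `any_cov` with `primary` is one simple way to see how a raw coverage count can overstate depth: a risk may be touched by several benchmarks while being the primary target of few or none.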

If this is right

  • Benchmark choice can yield contradictory safety conclusions.
  • Coverage counts often overstate evaluation depth.
  • Environment fidelity systematically shapes reported safety.
  • The field disproportionately tests externally imposed rather than agent-internal risks.
  • Metric fragmentation limits comparison and robustness remains effectively unbenchmarked.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers and regulators may need to test agents against multiple benchmarks rather than relying on any single one to reach stable safety judgments.
  • Adopting the proposed minimum reporting standards could reduce future inconsistencies by making benchmark designs more comparable.
  • The observed lack of concordance suggests value in creating benchmarks that directly probe agent-internal risk generation instead of only external threats.

Load-bearing premise

The manual classification of the 40 benchmarks into the six-axis taxonomy accurately captures the methodological differences that produce divergent safety conclusions.

What would settle it

Re-running the concordance analysis on the same or similar benchmarks but obtaining a Kendall's W value substantially higher than 0.10 with p below 0.05 would falsify the no-concordance result.
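One way such a re-run could be checked without relying on distributional approximations is a permutation test. The sketch below is a swapped-in technique, not the procedure the paper reports: it shuffles each ranking independently and asks how often the shuffled data show at least as much agreement as the observed data. The function names, the permutation count, and the seed are illustrative assumptions. The test statistic is the dispersion of rank sums, which is proportional to Kendall's W when the number of rankings and agents is fixed.

```python
import random


def rank_sum_dispersion(rankings):
    """Sum of squared deviations of per-item rank sums (proportional to W)."""
    m, n = len(rankings), len(rankings[0])
    mean_rank_sum = m * (n + 1) / 2
    return sum(
        (sum(r[i] for r in rankings) - mean_rank_sum) ** 2 for i in range(n)
    )


def permutation_p_value(rankings, n_perm=2000, seed=0):
    """Share of random re-rankings that agree at least as much as observed."""
    rng = random.Random(seed)
    observed = rank_sum_dispersion(rankings)
    hits = 0
    for _ in range(n_perm):
        # Under the null, every ranking is an independent random permutation.
        shuffled = [rng.sample(list(r), len(r)) for r in rankings]
        if rank_sum_dispersion(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy inputs (hypothetical): strong agreement vs. complete opposition.
p_agree = permutation_p_value([[1, 2, 3, 4, 5]] * 3)
p_oppose = permutation_p_value([[1, 2, 3], [3, 2, 1]])
```

A small p-value here would indicate more concordance than chance, which is the outcome that would count against the paper's no-concordance finding.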

Figures

Figures reproduced from arXiv: 2605.16282 by Benjamin C. M. Fung, Boyang Li, Farkhund Iqbal, Heba Ismail, Miles Q. Li.

Figure 1: Year-level timeline of core behavioral agent-safety benchmark publications, color-coded by primary ...
Figure 2: Heatmap of the risk–benchmark coverage matrix for 40 core behavioral benchmarks (full 45-entry ...
Figure 3: Radar chart of risk coverage among the 40 core behavioral benchmarks. Dark = primary coverage; ...
Original abstract

The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript catalogs 40 behavioral agent-safety benchmarks (2023-2026) along with 5 adjacent artifacts, proposes a six-axis taxonomy of benchmark evaluation methodology, applies the taxonomy to produce a coverage matrix, and performs a cross-benchmark consistency analysis using Kendall's W concordance and 95% confidence intervals on rankings. It concludes that there is no evidence of ranking concordance (W = 0.10, p = 0.94), that benchmark choice can produce contradictory safety conclusions, that coverage counts overstate depth, that environment fidelity shapes results, and that the field under-tests agent-internal risks and robustness. The authors release structured metadata, taxonomy codings, and artifacts, and propose minimum reporting standards.

Significance. If the consistency analysis is valid, the work is significant for documenting fragmentation in a rapidly growing subfield and for releasing reusable artifacts that enable future meta-analyses. The taxonomy and coverage matrix provide a concrete framework for comparing evaluation instruments, and the call for reporting standards addresses a practical gap. The statistical grounding (explicit W and p-values) is a strength relative to purely qualitative surveys.

major comments (1)
  1. [cross-benchmark consistency check] Consistency analysis (abstract and associated section): The central claim that 'benchmark choice can yield contradictory safety conclusions' is supported by the reported Kendall's W = 0.10 (p = 0.94) across evaluation dimensions. However, the manuscript does not explicitly state whether the compared rankings are derived from a common, overlapping set of evaluated agents or from disjoint agent sets. If the latter, low concordance is expected by construction and does not demonstrate conflicting verdicts on identical safety properties. Clarification or an explicit overlap analysis is required to secure the interpretation.
minor comments (2)
  1. [consistency analysis] Full details on how rankings were derived from each benchmark's metrics are needed; the main text should include a table or subsection enumerating the exact metrics, normalization steps, and agent sets used for each benchmark in the concordance calculation.
  2. [taxonomy section] The six-axis taxonomy is introduced as an invented classification; a brief discussion of inter-rater reliability or sensitivity to axis redefinition would strengthen the claim that it exhaustively captures methodological differences.
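The inter-rater reliability check suggested in the second minor comment could be sketched with Cohen's kappa for two coders labelling the same benchmarks on one taxonomy axis. The axis labels and coder data below are hypothetical placeholders; the paper's actual axes, label sets, and coding procedure are not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the coders labelled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings of four benchmarks on an "environment" axis.
coder_a = ["sandbox", "sandbox", "live", "live"]
coder_b = ["sandbox", "live", "live", "live"]
kappa = cohens_kappa(coder_a, coder_b)
```

Reporting one kappa per axis would show which axes are coded reliably and which depend heavily on individual judgment.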

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the need for greater clarity on our consistency analysis. We agree that explicitly addressing the agent overlap is necessary to fully support our interpretation of the results and will revise the manuscript to include this information.

Point-by-point responses
  1. Referee: Consistency analysis (abstract and associated section): The central claim that 'benchmark choice can yield contradictory safety conclusions' is supported by the reported Kendall's W = 0.10 (p = 0.94) across evaluation dimensions. However, the manuscript does not explicitly state whether the compared rankings are derived from a common, overlapping set of evaluated agents or from disjoint agent sets. If the latter, low concordance is expected by construction and does not demonstrate conflicting verdicts on identical safety properties. Clarification or an explicit overlap analysis is required to secure the interpretation.

    Authors: We appreciate this observation. The consistency analysis was performed using a common overlapping set of agents evaluated across multiple benchmarks to enable direct comparison of safety rankings. We acknowledge that this detail was not stated explicitly in the original manuscript. In the revision we will add a dedicated paragraph in the consistency analysis section describing the agent selection criteria, the number of shared agents, the specific benchmarks involved, and a summary of the overlap sizes. This will confirm that the reported low concordance (W = 0.10, p = 0.94) reflects genuine divergence in safety conclusions for the same agents rather than an artifact of disjoint sets.
    Revision: yes
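The overlap analysis discussed in this exchange can be sketched as follows: restrict every benchmark to the agents that all of them evaluated, then re-rank within that shared set before computing any concordance statistic. The benchmark names, agent names, and scores below are hypothetical, and the convention that a higher score means a safer agent is an assumption that would need to be set per benchmark.

```python
def common_agents(scores):
    """Agents that every benchmark has scored, in sorted order."""
    agent_sets = [set(per_bench) for per_bench in scores.values()]
    return sorted(set.intersection(*agent_sets)) if agent_sets else []


def rankings_on_overlap(scores, higher_is_safer=True):
    """Re-rank each benchmark on the shared agents only (rank 1 = safest).

    Ties are broken by agent name because sorting is stable; a real
    analysis would likely use average ranks instead.
    """
    agents = common_agents(scores)
    ranked = {}
    for bench, per_bench in scores.items():
        ordered = sorted(
            agents, key=lambda a: per_bench[a], reverse=higher_is_safer
        )
        rank = {agent: i + 1 for i, agent in enumerate(ordered)}
        ranked[bench] = [rank[a] for a in agents]
    return agents, ranked


# Hypothetical scores: benchmark -> {agent: safety score}.
scores = {
    "bench_a": {"agent_w": 0.8, "agent_x": 0.9, "agent_y": 0.7, "agent_z": 0.4},
    "bench_b": {"agent_x": 0.2, "agent_y": 0.6, "agent_z": 0.9},
}
agents, ranked = rankings_on_overlap(scores)
```

Reporting `len(agents)` alongside the concordance statistic would directly address the referee's concern, since a very small shared set limits what any ranking comparison can show.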

Circularity Check

0 steps flagged

No significant circularity; empirical survey and standard statistical analysis on external benchmarks.

full rationale

The paper catalogs 40 external benchmarks from 2023-2026, proposes a six-axis taxonomy derived from methodological inspection of those benchmarks, applies the taxonomy to produce a coverage matrix, and performs a cross-benchmark consistency check using Kendall's W and 95% confidence intervals on observed rankings. These steps rely on external data sources and standard non-parametric statistics rather than any fitted parameters, self-defined quantities, or self-citation chains that reduce the central claims to the paper's own inputs by construction. The manual classification is an analytical coding step whose outputs are then tested against the same external corpus; it does not create a self-referential loop where a result is predicted from a quantity defined in terms of itself. No load-bearing uniqueness theorems, ansatzes, or renamings of known results are invoked in a manner that collapses the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the authors' manual assignment of each benchmark to the six taxonomy axes and on the assumption that the collected set of 40 benchmarks is sufficiently representative for the consistency conclusions to generalize.

axioms (1)
  • [standard math] Standard statistical assumptions underlying Kendall's W and 95% confidence interval calculations hold for the derived safety rankings.
    Invoked when reporting W = 0.10, p = 0.94.
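For reference, the textbook no-ties form of the statistic behind this axiom, together with its usual large-sample significance approximation, is written out below. The symbols are notational assumptions: m is the number of rankings, n the number of ranked agents, and R_i the sum of ranks received by agent i. The values of m and n used in the paper, and whether a tie correction was applied, are not stated on this page.

```latex
W = \frac{12\,S}{m^{2}\,(n^{3}-n)}, \qquad
S = \sum_{i=1}^{n}\left(R_i - \bar{R}\right)^{2}, \qquad
\bar{R} = \frac{m\,(n+1)}{2}

\chi^{2} = m\,(n-1)\,W \;\sim\; \chi^{2}_{\,n-1}
\quad \text{(approximately, under the null of no concordance)}
```

The chi-square approximation is generally treated as rough when n is small, which is one reason the axiom is worth stating explicitly.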
invented entities (1)
  • Six-axis taxonomy of benchmark evaluation methodology (no independent evidence)
    purpose: To classify and compare agent safety benchmarks along dimensions that affect safety conclusions.
    New framework introduced by the authors; no independent evidence outside this paper.

pith-pipeline@v0.9.0 · 5802 in / 1442 out tokens · 36492 ms · 2026-05-21T01:38:16.199810+00:00 · methodology
