Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
Pith reviewed 2026-05-21 01:38 UTC · model grok-4.3
The pith
Different safety benchmarks for AI agents can reach contradictory conclusions about model safety.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors catalog 40 agent-safety benchmarks and propose a six-axis taxonomy of evaluation methodology. Applying the taxonomy reveals broad risk coverage but limited methodological convergence, with benchmarks concentrated in sandboxed and constrained settings. A cross-benchmark consistency check, reported with 95% confidence intervals and a Kendall's W concordance analysis, finds no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). The authors use this result to ground the claim that benchmark choice can yield contradictory safety conclusions.
What carries the argument
A six-axis taxonomy of benchmark evaluation methodology used to build a coverage matrix and perform cross-benchmark concordance analysis with Kendall's W.
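As a rough illustration of the concordance statistic named above, the sketch below computes Kendall's W for the simple case of complete rankings without ties. It assumes each "rater" is one evaluation dimension or benchmark and each ranked item is one agent; the function name and the toy rankings are hypothetical and are not taken from the paper's artifacts.

```python
def kendalls_w(rankings):
    """Kendall's W for m complete rankings of the same n items (no ties).

    rankings: list of m lists; rankings[j][i] is the rank (1..n) that
    rater j assigns to item i.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Total rank each item receives across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical toy data: three raters in full agreement, two in full opposition.
perfect = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
opposed = [[1, 2, 3, 4], [4, 3, 2, 1]]

print(kendalls_w(perfect))
print(kendalls_w(opposed))
```

W ranges from 0 (no agreement among rankings) to 1 (identical rankings), so a value near 0.10 sits close to the no-agreement end of the scale.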
If this is right
- Benchmark choice can yield contradictory safety conclusions.
- Coverage counts often overstate evaluation depth.
- Environment fidelity systematically shapes reported safety.
- The field disproportionately tests externally imposed rather than agent-internal risks.
- Metric fragmentation limits comparison and robustness remains effectively unbenchmarked.
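One way to picture the gap between coverage counts and evaluation depth is to count benchmarks per risk category twice: once for any coverage, and once with a minimum number of test cases. The sketch below uses that idea purely as an illustration. The benchmark names, risk labels, case counts, and the use of test-case count as a proxy for depth are all assumptions, not the paper's coding scheme.

```python
from collections import Counter

# Hypothetical codings: benchmark name -> {risk category: number of test cases}.
codings = {
    "bench_a": {"prompt_injection": 600, "privacy_leak": 5},
    "bench_b": {"prompt_injection": 40},
    "bench_c": {"privacy_leak": 3, "sabotage": 120},
}


def coverage_counts(codings):
    """Number of benchmarks that touch each risk category at all."""
    return Counter(risk for risks in codings.values() for risk in risks)


def deep_coverage_counts(codings, min_cases):
    """Number of benchmarks with at least min_cases test cases for a risk."""
    return Counter(
        risk
        for risks in codings.values()
        for risk, n_cases in risks.items()
        if n_cases >= min_cases
    )


shallow = coverage_counts(codings)
deep = deep_coverage_counts(codings, min_cases=30)  # threshold is arbitrary

for risk in sorted(shallow):
    print(risk, shallow[risk], deep[risk])
```

In this toy data, privacy leakage is "covered" by two benchmarks but by none once the threshold applies, which is the kind of discrepancy the second bullet describes.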
Where Pith is reading between the lines
- Developers and regulators may need to test agents against multiple benchmarks rather than relying on any single one to reach stable safety judgments.
- Adopting the proposed minimum reporting standards could reduce future inconsistencies by making benchmark designs more comparable.
- The observed lack of concordance suggests value in creating benchmarks that directly probe agent-internal risk generation instead of only external threats.
Load-bearing premise
The manual classification of the 40 benchmarks into the six-axis taxonomy accurately captures the methodological differences that produce divergent safety conclusions.
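A common way to probe a manual coding step like this one is to have two people code the same benchmarks independently and measure chance-corrected agreement. The sketch below computes Cohen's kappa for one taxonomy axis. The axis values and both raters' labels are hypothetical, and the source does not say whether the authors ran any such check.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings of six benchmarks on an "environment" axis.
rater_1 = ["sandboxed", "sandboxed", "live", "emulated", "live", "sandboxed"]
rater_2 = ["sandboxed", "live", "live", "emulated", "live", "sandboxed"]

print(round(cohens_kappa(rater_1, rater_2), 3))
```

Kappa is computed per axis here; a six-axis taxonomy would yield six such values, and low agreement on any axis would weaken conclusions that depend on it.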
What would settle it
Re-running the concordance analysis on the same or similar benchmarks but obtaining a Kendall's W value substantially higher than 0.10 with p below 0.05 would falsify the no-concordance result.
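For reference, the standard no-ties definition of Kendall's W and its usual large-sample significance test are given below. The symbols are introduced here for illustration: m is the number of rankings, n is the number of ranked items, and R_i is the total rank received by item i. The source does not report m or n, so the reported W = 0.10 and p = 0.94 cannot be re-derived from this page alone.

```latex
\[
  S = \sum_{i=1}^{n} \left( R_i - \bar{R} \right)^2,
  \qquad
  \bar{R} = \frac{m(n+1)}{2},
  \qquad
  W = \frac{12\,S}{m^{2}\left(n^{3} - n\right)}
\]
\[
  \chi^{2} = m\,(n-1)\,W
  \quad \text{with } n-1 \text{ degrees of freedom (approximate, for larger } n\text{)}
\]
```

Under this approximation, a higher W on the same m and n gives a larger chi-square statistic and a smaller p-value, which is the direction of change the settling condition above describes.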
Original abstract
The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript catalogs 40 behavioral agent-safety benchmarks (2023-2026) along with 5 adjacent artifacts, proposes a six-axis taxonomy of benchmark evaluation methodology, applies the taxonomy to produce a coverage matrix, and performs a cross-benchmark consistency analysis using Kendall's W concordance and 95% confidence intervals on rankings. It concludes that there is no evidence of ranking concordance (W = 0.10, p = 0.94), that benchmark choice can produce contradictory safety conclusions, that coverage counts overstate depth, that environment fidelity shapes results, and that the field under-tests agent-internal risks and robustness. The authors release structured metadata, taxonomy codings, and artifacts, and propose minimum reporting standards.
Significance. If the consistency analysis is valid, the work is significant for documenting fragmentation in a rapidly growing subfield and for releasing reusable artifacts that enable future meta-analyses. The taxonomy and coverage matrix provide a concrete framework for comparing evaluation instruments, and the call for reporting standards addresses a practical gap. The statistical grounding (explicit W and p-values) is a strength relative to purely qualitative surveys.
major comments (1)
- [cross-benchmark consistency check] Consistency analysis (abstract and associated section): The central claim that 'benchmark choice can yield contradictory safety conclusions' is supported by the reported Kendall's W = 0.10 (p = 0.94) across evaluation dimensions. However, the manuscript does not explicitly state whether the compared rankings are derived from a common, overlapping set of evaluated agents or from disjoint agent sets. If the latter, low concordance is expected by construction and does not demonstrate conflicting verdicts on identical safety properties. Clarification or an explicit overlap analysis is required to secure the interpretation.
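The overlap question raised in this comment can be checked mechanically once per-benchmark results are available. The sketch below finds the agents evaluated by every benchmark and restricts each result table to that shared set before any ranking is done. The benchmark names, agent names, and scores are hypothetical.

```python
def common_agents(results):
    """Agents that appear in every benchmark's result table."""
    agent_sets = [set(scores) for scores in results.values()]
    return set.intersection(*agent_sets) if agent_sets else set()


def restrict_to_common(results):
    """Keep only the shared agents in each benchmark's results."""
    shared = sorted(common_agents(results))
    return {
        bench: {agent: scores[agent] for agent in shared}
        for bench, scores in results.items()
    }


# Hypothetical results: benchmark -> {agent: safety score}.
results = {
    "bench_a": {"agent_1": 0.91, "agent_2": 0.55, "agent_3": 0.70},
    "bench_b": {"agent_1": 0.40, "agent_3": 0.82, "agent_4": 0.10},
}

shared = common_agents(results)
print(len(shared), sorted(shared))
print(restrict_to_common(results))
```

Reporting the size of the shared set alongside W lets a reader judge whether low concordance reflects disagreement about the same agents or simply a small or empty overlap.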
minor comments (2)
- [consistency analysis] The abstract reports the concordance result without detailing how rankings were derived from each benchmark's metrics; the main text should include a table or subsection enumerating the exact metrics, normalization steps, and agent sets used for each benchmark in the concordance calculation.
- [taxonomy section] The six-axis taxonomy is introduced as an invented classification; a brief discussion of inter-rater reliability or sensitivity to axis redefinition would strengthen the claim that it exhaustively captures methodological differences.
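The first minor comment asks how raw metrics become rankings. One plausible normalization step, sketched below, converts each benchmark's scores to ranks while accounting for metric direction, since some metrics are better when high (for example a refusal rate) and others when low (for example an attack success rate). The metric names and values are hypothetical, and the paper may use a different procedure.

```python
def to_ranks(scores, higher_is_safer=True):
    """Map agent -> rank, where rank 1 is safest.

    Agents with equal scores share the average of the positions they occupy.
    """
    ordered = sorted(scores, key=lambda a: scores[a], reverse=higher_is_safer)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the run of agents tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    return ranks


# Hypothetical metrics with opposite directions.
refusal_rate = {"agent_1": 0.9, "agent_2": 0.5, "agent_3": 0.5}
attack_success = {"agent_1": 0.2, "agent_2": 0.05, "agent_3": 0.6}

print(to_ranks(refusal_rate))
print(to_ranks(attack_success, higher_is_safer=False))
```

When ties occur, as in the first example, the plain formula for Kendall's W no longer applies exactly and a tie-corrected form is needed.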
Simulated Author's Rebuttal
We thank the referee for the detailed review and for highlighting the need for greater clarity on our consistency analysis. We agree that explicitly addressing the agent overlap is necessary to fully support our interpretation of the results and will revise the manuscript to include this information.
Point-by-point responses
-
Referee: Consistency analysis (abstract and associated section): The central claim that 'benchmark choice can yield contradictory safety conclusions' is supported by the reported Kendall's W = 0.10 (p = 0.94) across evaluation dimensions. However, the manuscript does not explicitly state whether the compared rankings are derived from a common, overlapping set of evaluated agents or from disjoint agent sets. If the latter, low concordance is expected by construction and does not demonstrate conflicting verdicts on identical safety properties. Clarification or an explicit overlap analysis is required to secure the interpretation.
Authors: We appreciate this observation. The consistency analysis was performed using a common overlapping set of agents evaluated across multiple benchmarks to enable direct comparison of safety rankings. We acknowledge that this detail was not stated explicitly in the original manuscript. In the revision we will add a dedicated paragraph in the consistency analysis section describing the agent selection criteria, the number of shared agents, the specific benchmarks involved, and a summary of the overlap sizes. This will confirm that the reported low concordance (W = 0.10, p = 0.94) reflects genuine divergence in safety conclusions for the same agents rather than an artifact of disjoint sets. revision: yes
Circularity Check
No significant circularity; empirical survey and standard statistical analysis on external benchmarks.
Full rationale
The paper catalogs 40 external benchmarks from 2023-2026, proposes a six-axis taxonomy derived from methodological inspection of those benchmarks, applies the taxonomy to produce a coverage matrix, and performs a cross-benchmark consistency check using Kendall's W and 95% confidence intervals on observed rankings. These steps rely on external data sources and standard non-parametric statistics rather than any fitted parameters, self-defined quantities, or self-citation chains that reduce the central claims to the paper's own inputs by construction. The manual classification is an analytical coding step whose outputs are then tested against the same external corpus; it does not create a self-referential loop where a result is predicted from a quantity defined in terms of itself. No load-bearing uniqueness theorems, ansatzes, or renamings of known results are invoked in a manner that collapses the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard statistical assumptions underlying Kendall's W and the 95% confidence interval calculations hold for the derived safety rankings.
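The source does not say how the 95% confidence intervals were computed. As one common choice for a rate-style safety metric (for example, the share of harmful requests refused), the sketch below computes a Wilson score interval. The function name and the counts are hypothetical, and the use of the Wilson method is an assumption rather than the paper's stated procedure.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials == 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts: 45 safe outcomes out of 50 test cases.
low, high = wilson_interval(45, 50)
print(round(low, 3), round(high, 3))
```

Intervals like this matter for the concordance analysis because two agents whose intervals overlap heavily on a benchmark may swap ranks under resampling.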
invented entities (1)
- Six-axis taxonomy of benchmark evaluation methodology (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a six-axis taxonomy of benchmark evaluation methodology... and apply it across the corpus... finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.