Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

Ahmad-Reza Sadeghi; Chris Hicks; Konrad Rieck; Sahar Abdelnabi

arxiv: 2605.22568 · v1 · pith:HVN2PGCTnew · submitted 2026-05-21 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-22 04:58 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords AI agents · security evaluation · benchmarking · benchmark vulnerabilities · temporal staleness · runtime uncertainty · evaluation frameworks

The pith

Benchmarks for AI agents in security roles are undermined by vulnerabilities, staleness, and runtime uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing ways of testing AI agents for security tasks rest on shaky ground. It identifies three specific problems that distort results: flaws that let agents exploit the test itself, benchmarks that quickly become outdated, and unpredictable behavior during execution. If these problems are real, then reported performance numbers cannot be trusted to predict how agents will handle actual threats. Readers care because security decisions based on faulty tests could leave systems exposed or waste resources on ineffective defenses.

Core claim

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, the paper characterizes three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. It then outlines practical directions toward building more robust and trustworthy evaluation frameworks.

What carries the argument

Three core challenges—benchmark vulnerabilities, temporal staleness, and runtime uncertainty—that together explain why current security evaluations of AI agents produce unreliable results.

If this is right

Security evaluations of AI agents can be made more reliable by designing benchmarks that close off the identified vulnerabilities.
Evaluation frameworks that account for temporal changes and runtime variability will produce results that better reflect real deployment conditions.
Practical improvements to benchmarks can reduce the risk of overestimating an agent's security capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three challenges may appear in benchmarks for AI agents outside security, such as in privacy or reliability testing.
Developers could create versioned benchmark suites that are refreshed on a fixed schedule to test the staleness hypothesis directly.
If runtime uncertainty dominates, then repeated runs with fixed seeds or controlled environments should narrow performance variance in future tests.
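The repeated-runs idea in the last point can be made concrete with a small sketch. It is not taken from the paper: `run_once` is a hypothetical stand-in that simulates a benchmark run with a seeded random number generator, and `p_solve` and `n_tasks` are illustrative parameters. The sketch only shows how per-run scores might be summarized so that variance under fixed and varied seeds can be compared.

```python
import random
import statistics


def run_once(seed: int, n_tasks: int = 50, p_solve: float = 0.6) -> float:
    """Hypothetical stand-in for one benchmark run.

    Simulates an agent that solves each task with probability p_solve
    and returns the fraction of tasks solved. A real harness would
    launch the agent here instead.
    """
    rng = random.Random(seed)
    solved = sum(rng.random() < p_solve for _ in range(n_tasks))
    return solved / n_tasks


def summarize(scores):
    """Return mean, sample standard deviation, and range of per-run scores."""
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "min": min(scores),
        "max": max(scores),
    }


# Same seed every time: the simulated runs are identical, so the spread is zero.
fixed_seed_runs = [run_once(seed=0) for _ in range(5)]
# Different seeds: the spread reflects the simulated run-to-run variability.
varied_seed_runs = [run_once(seed=s) for s in range(5)]

print("fixed seed :", summarize(fixed_seed_runs))
print("varied seed:", summarize(varied_seed_runs))
```

In a real evaluation a fixed seed rarely removes all nondeterminism (model sampling, network state, and tool timing can still vary), so any spread that remains under controlled conditions is itself a useful measurement.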

Load-bearing premise

The recent empirical evidence the paper cites is enough to show that these three challenges are the main reasons security evaluations of AI agents are flawed.

What would settle it

A controlled comparison where the same AI agents are tested on both standard benchmarks and newly designed ones that deliberately eliminate vulnerabilities, update frequently, and control runtime conditions, then measuring whether the performance rankings or scores change substantially.
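One way to quantify whether rankings change between a standard benchmark and a hardened one is a rank correlation. The sketch below computes Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) over agents scored on both benchmarks. The choice of statistic, the function name, and the scores in the example are assumptions for illustration; they are placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {agent: score} mappings.

    Only agents present in both mappings are compared. Returns a value
    in [-1, 1]: 1 means identical ordering, -1 means fully reversed.
    """
    agents = sorted(set(scores_a) & set(scores_b))
    n_pairs = len(agents) * (len(agents) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two agents scored on both benchmarks")

    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 is a tie on at least one benchmark; it counts for neither.
    return (concordant - discordant) / n_pairs


# Placeholder scores for three hypothetical agents.
standard = {"agent_a": 0.80, "agent_b": 0.60, "agent_c": 0.40}
hardened = {"agent_a": 0.50, "agent_b": 0.55, "agent_c": 0.20}
print(kendall_tau(standard, hardened))
```

A tau close to 1 would suggest the hardened benchmark preserves the ordering even if absolute scores drop; a low or negative tau would indicate that the original ranking depended on the weaknesses that were removed.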

Original abstract

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper pulls together known issues in benchmarking AI agents for security but stops short of showing its three challenges are systematically the core ones.

The main point is that benchmarks for AI agents in security-critical roles are undermined by benchmark vulnerabilities, temporal staleness, and runtime uncertainty. The authors draw on recent empirical studies to describe these problems and sketch some practical steps toward better evaluation frameworks. This is a useful synthesis that flags how tests can be gamed, lose relevance over time, or produce noisy results due to runtime factors. It gives concrete examples from the literature without claiming a new method or dataset, which keeps the contribution focused and grounded in what others have already observed. The discussion of practical directions is the part that adds the most value, as it tries to move from diagnosis to suggestions for more trustworthy setups.

The softer spot is the framing of these three issues as the core challenges. The paper relies on cited empirical evidence rather than a broad survey or quantification of failure modes, so it is not obvious whether other problems, such as limited coverage of certain attacks or weak environment simulation, are less central. Without that systematic check, the emphasis on these particular three rests on the representativeness of the referenced work.

This is the kind of paper that researchers working on AI security evaluations or agent testing will find relevant. It can help them avoid over-trusting current benchmarks when making deployment decisions. It deserves peer review because the topic is timely and the authors engage directly with real weaknesses in the evaluation pipeline, even if revisions could tighten the evidence for the centrality claim.

Referee Report

2 major / 2 minor

Summary. The manuscript synthesizes recent empirical evidence to argue that benchmarks for evaluating AI agents in security-critical roles are undermined by three core challenges—benchmark vulnerabilities, temporal staleness, and runtime uncertainty—and outlines practical directions for constructing more robust evaluation frameworks.

Significance. If the three challenges are shown to be primary rather than illustrative, the work would be significant for the AI security community by providing a structured critique of current evaluation practices and actionable guidance toward trustworthy benchmarks, especially given the increasing deployment of agents in security contexts.

major comments (2)

[Abstract and §2] The designation of benchmark vulnerabilities, temporal staleness, and runtime uncertainty as the 'core' challenges (Abstract and §2) rests on cited empirical studies without a demonstrated systematic taxonomy or prevalence analysis; the manuscript does not quantify how these dominate over alternatives such as prompt-injection coverage gaps or environment-simulation fidelity, leaving the 'core' claim under-supported.
[§3] §3's outline of practical directions for robust frameworks lacks concrete evaluation criteria or falsifiable tests that would allow readers to assess whether proposed mitigations address the three challenges at the level of the original empirical failures.

minor comments (2)

[§2.2] Notation for 'runtime uncertainty' could be clarified with a short formal definition or example in §2.2 to distinguish it from related concepts like nondeterminism in agent execution.
[References] A small number of citations appear to predate the most recent agent-benchmarking literature; adding 2–3 post-2024 references would strengthen the synthesis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Abstract and §2] The designation of benchmark vulnerabilities, temporal staleness, and runtime uncertainty as the 'core' challenges (Abstract and §2) rests on cited empirical studies without a demonstrated systematic taxonomy or prevalence analysis; the manuscript does not quantify how these dominate over alternatives such as prompt-injection coverage gaps or environment-simulation fidelity, leaving the 'core' claim under-supported.

Authors: The manuscript synthesizes recent empirical evidence from the cited studies in §2 to characterize these three challenges as particularly salient in current security evaluations of AI agents. We did not perform or claim a systematic taxonomy or prevalence quantification, which would require a broader survey beyond the paper's scope. Prompt-injection gaps fall under benchmark vulnerabilities, while simulation fidelity issues relate to runtime uncertainty, as discussed. To address the concern about the 'core' designation, we will revise the abstract and §2 to describe them as 'three key challenges' supported by the reviewed literature, and add a brief paragraph on selection rationale without asserting dominance over all alternatives.
Revision: yes
Referee: [§3] §3's outline of practical directions for robust frameworks lacks concrete evaluation criteria or falsifiable tests that would allow readers to assess whether proposed mitigations address the three challenges at the level of the original empirical failures.

Authors: We agree that §3 would be strengthened by more concrete criteria. In the revision, we will expand each practical direction with specific evaluation criteria and falsifiable tests tied to the empirical failures in §2. For instance, for temporal staleness we will propose a decay metric comparing agent performance on time-stamped benchmark versions; similar testable metrics will be added for benchmark vulnerabilities and runtime uncertainty.
Revision: yes
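The rebuttal does not define the decay metric it promises. As one possible reading, and purely as an assumption for illustration, the sketch below fits a least-squares line to an agent's scores on time-stamped benchmark versions and reports the slope in score change per year. The function name `staleness_slope` and the example dates and scores are hypothetical placeholders.

```python
from datetime import date


def staleness_slope(results):
    """Least-squares slope of score against benchmark release date.

    results: list of (release_date, score) pairs for one agent, one
    pair per time-stamped benchmark version.
    Returns the fitted change in score per year. A negative value means
    the agent scores lower on more recent versions.
    """
    if len(results) < 2:
        raise ValueError("need at least two benchmark versions")

    earliest = min(d for d, _ in results)
    years = [(d - earliest).days / 365.25 for d, _ in results]
    scores = [s for _, s in results]

    mean_x = sum(years) / len(years)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("need at least two distinct release dates")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, scores))
    return sxy / sxx


# Placeholder scores for one hypothetical agent on three benchmark versions.
versions = [
    (date(2024, 1, 1), 0.80),
    (date(2024, 7, 1), 0.72),
    (date(2025, 1, 1), 0.61),
]
print(staleness_slope(versions))
```

A slope alone cannot separate causes: lower scores on newer versions may reflect harder tasks rather than contamination of older ones, so a usable metric would also need version difficulty to be controlled.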

Circularity Check

0 steps flagged

Synthesis of external empirical evidence with no load-bearing circular steps

The paper frames its central contribution as characterizing three challenges (benchmark vulnerabilities, temporal staleness, runtime uncertainty) by building on recent empirical evidence from external studies. No equations, fitted parameters, or derivation chains exist that reduce outputs to inputs by construction. The abstract and outline present the work as a synthesis rather than a self-referential proof or prediction. Any self-citations, if present, are not load-bearing for the core claim per the provided context and do not invoke uniqueness theorems or ansatzes from prior author work. This is a normal low-circularity outcome for a position/survey-style paper relying on cited evidence; concerns about whether the evidence is exhaustive fall under correctness rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no free parameters, axioms, or invented entities are extractable from the provided text.

pith-pipeline@v0.9.0 · 5568 in / 1002 out tokens · 25396 ms · 2026-05-22T04:58:58.840405+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] A. K. Zhang et al., “Cybench: A framework for evaluating cybersecurity capabilities and risks of language models,” in Proc. Int. Conf. Learn. Representations (ICLR), 2025.
[2] Z. Wang et al., “CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale,” in Proc. Int. Conf. Learn. Representations (ICLR), 2026.
[3] G. Deng et al., “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” in Proc. 33rd USENIX Security Symp., 2024.
[4] H. Luo et al., “AgentAuditor: Human-level safety and security evaluation for LLM agents,” in Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2025.
[5] M. Tavallaee et al., “A detailed analysis of the KDD CUP 99 data set,” in Proc. 2nd IEEE Symp. Comput. Intell. Security Defence Appl. (CISDA), 2009.
[6] J. Metzman, L. Szekeres, L. M. R. Simon, R. T. Sprabery, and A. Arya, “FuzzBench: An open fuzzer benchmarking platform and service,” in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), 2021, pp. 1393–1403.
[7] B. Dolan-Gavitt et al., “LAVA: Large-scale automated vulnerability addition,” in Proc. IEEE Symp. Security Privacy (S&P), 2016, pp. 110–121.
[8] Y. Zhu et al., “Establishing best practices for building rigorous agentic benchmarks,” arXiv:2507.02825, 2025.
[9] J. Chen et al., “SecureAgentBench: Benchmarking secure code generation under realistic vulnerability scenarios,” arXiv:2509.22097, 2025. Pith work page (internal anchor): “SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios”.
[10] Hao Wang et al., “How We Broke Top AI Agent Benchmarks: And What Comes Next,” 2026. https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
[11] Anthropic, “Eval awareness in Claude Opus 4.6’s BrowseComp performance,” 2026. https://www.anthropic.com/engineering/eval-awareness-browsecomp
[12] Bates et al., “Beyond Rewards in Reinforcement Learning for Cyber Defence,” ICML, 2026.
[13] Zhang et al., “Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data,” in Proc. IEEE SaTML, 2025.
[14] Moritz Hardt, “The Emerging Science of Machine Learning Benchmarks,” Princeton University Press, 2026.
