CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

Farshad Khorrami; Haoran Xi; Kimberly Milner; Meet Udeshi; Minghao Shao; Muhammad Shafique; Nanda Rani; Prashanth Krishnamurthy; Ramesh Karri; Saksham Aggarwal

arxiv: 2602.08023 · v3 · pith:4A2D2RMDnew · submitted 2026-02-08 · 💻 cs.CR · cs.AI · cs.MA

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

Pith reviewed 2026-05-21 14:27 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.MA

keywords: LLM offensive agents, CTF benchmarking, multi-target evaluation, web vulnerabilities, strategic reasoning, autonomous discovery, offensive security, agent evaluation

The pith

CTFExplorer introduces a multi-target benchmark where LLM agents must discover, prioritize, and chain attacks across 40 unknown web services.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for LLM offensive security agents use isolated single-target setups with known vulnerabilities and measure only exploitation success. This misses how real CTF participants must triage unknown surfaces, allocate effort under uncertainty, and chain multiple attacks. The paper introduces CTFExplorer as a benchmark suite deploying 40 web-based vulnerable services in one environment, where agents explore and exploit without predefined guidance. It adds a reactive multi-agent reference framework and an agent-agnostic evaluation system that records structured reasoning traces to assess behaviors like target selection and handling failed hypotheses. A sympathetic reader would care because this shifts evaluation from narrow exploitation metrics toward the strategic reasoning required in actual offensive security work.

Core claim

The central claim is that shifting offensive security evaluation to a multi-target setting tests how agents explore unknown surfaces, prioritize targets, and chain attacks, using 40 autonomously discoverable web services together with a reactive multi-agent reference setup and an evaluation framework that records reasoning traces for assessment beyond binary flag capture.

What carries the argument

A single simulated environment containing 40 web-based vulnerable services that agents must autonomously discover and distinguish without guidance, paired with a reactive multi-agent reference framework and an agent-agnostic evaluation system that logs structured reasoning traces.
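The paper's actual trace format is not given in this summary. As a minimal sketch of what a structured reasoning trace could look like, the snippet below logs one record per agent step and serializes the log as JSON Lines. Every name here (`TraceEvent`, `ReasoningTrace`, the field names, the `svc-07` identifier, and the phase and outcome labels) is a hypothetical illustration, not the authors' schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TraceEvent:
    """One agent step. All fields are assumed for illustration."""
    step: int
    target: str          # hypothetical service identifier
    phase: str           # e.g. "discover", "hypothesize", "exploit"
    hypothesis: str      # what the agent believed at this step
    outcome: str         # e.g. "pending", "failed", "succeeded"
    flag_captured: bool = False


@dataclass
class ReasoningTrace:
    """Append-only log of events for one agent run."""
    agent: str
    events: List[TraceEvent] = field(default_factory=list)

    def log(self, target, phase, hypothesis, outcome, flag_captured=False):
        # The step index is the event's position in the log.
        event = TraceEvent(len(self.events), target, phase,
                           hypothesis, outcome, flag_captured)
        self.events.append(event)
        return event

    def to_jsonl(self):
        # One JSON object per line, so any agent can emit the same format.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


trace = ReasoningTrace(agent="reference-agent")
trace.log("svc-07", "discover", "port 8080 serves a login form", "succeeded")
trace.log("svc-07", "exploit", "SQL injection in the login form", "failed")
print(trace.to_jsonl())
```

A flat, line-oriented log of this kind is one way an evaluation system could stay agent-agnostic: the evaluator reads only the records, not the agent's internals.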

If this is right

Agents can now be assessed on target selection and effort allocation rather than exploitation alone.
Evaluation includes how agents manage failed hypotheses and coordinate across multiple stages.
Structured reasoning traces enable measurement of security intelligence extraction as a distinct capability.
Different LLM agents can be compared on integrated strategic behavior instead of isolated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could highlight gaps in current LLM agents for handling discovery and triage that single-target tests conceal.
It might be extended to non-web CTF formats or integrated with live network environments for further realism.
Reasoning trace data could support targeted fine-tuning of agents for better multi-step planning under uncertainty.

Load-bearing premise

A simulated environment with 40 autonomously discoverable web services and a reactive multi-agent reference framework sufficiently captures real CTF participant triage and effort allocation under uncertainty.

What would settle it

Comparing agent prioritization patterns and success rates in the CTFExplorer environment against the same agents' performance in live human CTF competitions or real-world web penetration tests would show whether the benchmark captures relevant strategic behavior.
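One way such a comparison of prioritization patterns could be quantified is with a rank correlation between the order in which an agent first engages targets and the order observed for human participants. The sketch below uses Kendall's tau, which is a choice made here for illustration and is not a metric stated in the paper. The service names are hypothetical.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two orderings of the same items.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Assumes no ties and no repeated items.
    """
    if (len(set(order_a)) != len(order_a)
            or len(set(order_b)) != len(order_b)
            or set(order_a) != set(order_b)):
        raise ValueError("orderings must contain the same unique items")
    position_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    # Each pair (x, y) has x before y in order_a; check whether order_b agrees.
    for x, y in combinations(order_a, 2):
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0


# Hypothetical first-engagement orders for three services.
agent_order = ["svc-a", "svc-b", "svc-c"]
human_order = ["svc-a", "svc-c", "svc-b"]
tau = kendall_tau(agent_order, human_order)
```

A value near 1 would suggest the agent triages in roughly the same order as the human reference; a value near 0 or below would suggest the benchmark ordering does not track human prioritization.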

Original abstract

Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CTFExplorer provides a multi-target benchmark to evaluate strategic reasoning in LLM offensive agents beyond single-target exploitation.

This paper's main takeaway is a benchmark called CTFExplorer that moves LLM offensive agent evaluation from single known targets to a multi-target web CTF with 40 autonomously discoverable services. The goal is to test strategic elements like exploration, prioritization, and chaining that current setups ignore. What is new here is the combination of an environment where agents must find targets without help, plus an evaluation method that logs structured reasoning traces for finer analysis of behavior. They also offer a reactive multi-agent reference as a baseline. This setup directly tackles the limitation that single-target tests only measure exploitation and not the triage and effort allocation under uncertainty that real CTF work involves.

The paper does well at laying out why this matters and describing a high-level architecture that could support it. The motivation is clear and grounded in how actual security competitions work.

The soft spots are around the lack of concrete results or validation details. We need to see how the 40 services were selected, what metrics define success beyond flag capture, and whether experiments show that agents behave differently in this multi-target setting. Without that, it's hard to know if the benchmark delivers on its promise or if the reference framework introduces its own artifacts.

This is for researchers in AI security and LLM agent development who want more realistic testing environments. A reader working on offensive AI tools would find the ideas useful for thinking about evaluation design. It deserves serious referee time because the gap it identifies is real and the proposed direction is a logical next step, even if the current version would benefit from more empirical backing.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing single-target benchmarks for LLM-based offensive security agents only measure exploitation and fail to assess strategic reasoning such as exploration, prioritization, and attack chaining under uncertainty. It introduces CTFExplorer, a multi-target benchmark deploying 40 autonomously discoverable web vulnerable services, along with a reactive multi-agent reference framework and an agent-agnostic evaluation system that records structured reasoning traces for behavioral assessment beyond binary flag capture.

Significance. If the benchmark and evaluation framework hold up under testing, the work could meaningfully advance evaluation of LLM offensive agents by shifting focus from isolated exploitation to realistic multi-target triage and effort allocation, with the structured traces offering a useful tool for fine-grained behavioral analysis.

major comments (2)

Abstract: the manuscript describes the benchmark design and goals but provides no empirical results, validation data, or details on how the 40 services were chosen or how success is measured, making it impossible to verify whether the multi-target setup actually surfaces the claimed strategic behaviors.
Benchmark setup description: the assumption that a simulated environment with 40 autonomously discoverable services sufficiently captures real CTF participant triage and effort allocation under uncertainty is load-bearing for the central claim yet lacks concrete justification, preliminary experiments, or comparison to single-target baselines.

minor comments (1)

Evaluation framework section: provide more detail on the exact structure of the reasoning traces recorded and how they enable assessment of target selection and hypothesis handling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each major comment point by point below, noting revisions that will strengthen the presentation of CTFExplorer.

Referee: Abstract: the manuscript describes the benchmark design and goals but provides no empirical results, validation data, or details on how the 40 services were chosen or how success is measured, making it impossible to verify whether the multi-target setup actually surfaces the claimed strategic behaviors.

Authors: We agree that the abstract would benefit from a concise summary of empirical elements to better frame the contribution. The full manuscript details the selection of the 40 services (curated for diversity across common web vulnerability classes such as injection, broken authentication, and misconfiguration, drawn from public CTF repositories and aligned with OWASP categories) and defines success via a combination of flag capture and structured reasoning trace metrics (e.g., target prioritization sequences, hypothesis revision counts, and cross-target chaining). Preliminary agent evaluations demonstrating these behaviors are reported in the experiments section. We will revise the abstract to include a brief statement of these results and validation approach.

Revision: yes

Referee: Benchmark setup description: the assumption that a simulated environment with 40 autonomously discoverable services sufficiently captures real CTF participant triage and effort allocation under uncertainty is load-bearing for the central claim yet lacks concrete justification, preliminary experiments, or comparison to single-target baselines.

Authors: This is a fair observation on the need for stronger grounding. The manuscript motivates the multi-target design by reference to real CTF dynamics (unknown surface discovery, resource allocation across targets), but we will expand the benchmark setup section with additional justification, including a table mapping service types to real-world prevalence and a new subsection presenting preliminary experiments. These will include direct comparisons of agent triage efficiency, effort allocation, and strategic trace quality between the multi-target CTFExplorer environment and equivalent single-target baselines to quantify the added assessment value.

Revision: yes
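
The trace metrics named in the first response (target prioritization sequences, hypothesis revision counts, and cross-target chaining) are not defined in this summary. The sketch below shows one plausible way to compute them from a list of trace events. The event fields, including `uses_info_from`, and the definitions themselves are assumptions made for illustration and may differ from the manuscript's.

```python
def prioritization_sequence(events):
    """Targets in the order the agent first engaged them."""
    seen = []
    for event in events:
        if event["target"] not in seen:
            seen.append(event["target"])
    return seen


def hypothesis_revisions(events):
    """Per target, count how often a failed hypothesis was followed
    by a different hypothesis on the same target (assumed definition)."""
    last_failed = {}   # target -> hypothesis text of the preceding failure
    counts = {}
    for event in events:
        target = event["target"]
        previous = last_failed.get(target)
        if previous and event["hypothesis"] != previous:
            counts[target] = counts.get(target, 0) + 1
        failed = event["outcome"] == "failed"
        last_failed[target] = event["hypothesis"] if failed else None
    return counts


def chaining_links(events):
    """(source, destination) pairs where a step on one target reuses
    information obtained from another target (hypothetical field)."""
    return [
        (event["uses_info_from"], event["target"])
        for event in events
        if event.get("uses_info_from")
        and event["uses_info_from"] != event["target"]
    ]


# Hypothetical events; service names and hypotheses are invented.
events = [
    {"target": "svc-03", "hypothesis": "SQL injection in login",
     "outcome": "failed"},
    {"target": "svc-03", "hypothesis": "default admin credentials",
     "outcome": "succeeded"},
    {"target": "svc-11", "hypothesis": "reuse leaked API key",
     "outcome": "succeeded", "uses_info_from": "svc-03"},
]

sequence = prioritization_sequence(events)
revisions = hypothesis_revisions(events)
links = chaining_links(events)
```

Keeping each metric as a pure function over the event list means the same computation applies to any agent that emits the trace format, which matches the agent-agnostic goal described in the abstract.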

Circularity Check

0 steps flagged

No significant circularity in benchmark design

The paper introduces CTFExplorer as a benchmark suite for evaluating LLM-based offensive agents in a multi-target web CTF environment with 40 autonomously discoverable services. It motivates the shift from single-target setups, describes the environment architecture, a reactive multi-agent reference framework, and an agent-agnostic evaluation method that records reasoning traces. No mathematical derivations, equations, fitted parameters, predictions, or self-citations appear as load-bearing steps in the provided abstract and setup. The central contribution is the direct presentation of the benchmark design and evaluation framework, which stands independently without reducing to prior results or self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's contribution rests on introducing a new benchmark and reference framework rather than deriving results from prior equations or data fits.

axioms (1)

Domain assumption: LLM-based agents can be evaluated for offensive security capabilities in simulated multi-service environments.
Core premise enabling the benchmark's purpose as stated in the abstract.

invented entities (1)

CTFExplorer benchmark suite (no independent evidence)
purpose: To test exploration, prioritization, and attack chaining in multi-target settings
Newly introduced construct with no independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5763 in / 1201 out tokens · 76775 ms · 2026-05-21T14:27:32.704189+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CrackMeBench: Binary Reverse Engineering for Agents
cs.SE · 2026-05 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.