pith. machine review for the scientific record.

arxiv: 2604.19354 · v2 · submitted 2026-04-21 · 💻 cs.AI · cs.CR · cs.SE

Recognition: unknown

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3

classification 💻 cs.AI · cs.CR · cs.SE
keywords LLM agents · Capture the Flag · CTF benchmark · cybersecurity · partial-credit evaluation · autonomous agents · offensive security · DeepRed

The pith

Even the strongest LLM agent achieves only 35 percent average checkpoint completion on realistic Capture the Flag challenges, with the largest gaps on non-standard discovery and longer-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepRed, an open-source benchmark that runs LLM agents inside isolated Kali Linux environments connected to target VMs, giving them terminal tools and optional web search while logging every action. It replaces binary solved-or-failed scoring with partial-credit checkpoints drawn from public writeups, labeled automatically by a summarise-then-judge pipeline. Across ten commercially available models and ten VM-based challenges, the strongest agent reaches just 35 percent average completion, performing best on familiar categories and worst when the task requires novel exploration or sustained adaptation. A reader should care because the result supplies the first granular, reproducible measurement of how far current agents stand from autonomous offensive cybersecurity work.
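
To make the scoring concrete, here is a minimal Python sketch of how checkpoint-based partial credit could be aggregated. All names are hypothetical, and the paper does not specify whether challenges are weighted, so this assumes an unweighted mean over challenges.

```python
# Hypothetical sketch of checkpoint-based partial credit; not DeepRed's actual code.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One milestone extracted from a public writeup for a given challenge."""
    description: str         # e.g. "enumerated open ports on the target"
    completed: bool = False  # set by the automated judge from the execution log

def challenge_score(checkpoints: list[Checkpoint]) -> float:
    """Fraction of checkpoints the agent completed on one challenge."""
    if not checkpoints:
        return 0.0
    return sum(cp.completed for cp in checkpoints) / len(checkpoints)

def average_completion(challenges: dict[str, list[Checkpoint]]) -> float:
    """Unweighted mean of per-challenge scores; the paper's headline 35%
    figure is an average checkpoint completion of this general kind."""
    return sum(challenge_score(cps) for cps in challenges.values()) / len(challenges)
```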

Core claim

DeepRed places agents in virtualized attacker environments with full terminal access and records complete execution traces; checkpoint lists derived from public writeups plus an automated summarise-then-judge pipeline then assign partial credit for each challenge. The resulting evaluation of ten models shows the best performer completing 35 percent of checkpoints on average, with markedly lower scores on challenges that demand non-standard discovery or extended planning sequences.

What carries the argument

Partial-credit scoring via challenge-specific checkpoints extracted from public writeups, together with a summarise-then-judge pipeline that labels completion from full execution logs.
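
As a sketch of what such a pipeline could look like: summarise the long terminal trace in chunks, then ask a judge model about each checkpoint. `call_llm` is a hypothetical stand-in for any chat-completion client; the paper's actual prompts, chunk sizes, and judge model are not given here and may differ.

```python
# Minimal sketch of a summarise-then-judge labelling pipeline; the paper's
# actual prompts, chunking, and judge model are not specified here.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError("plug in an LLM client here")

def summarise_log(raw_log: str, chunk_size: int = 20_000) -> str:
    """Stage 1: compress a long terminal trace into a rolling summary."""
    summary = ""
    for start in range(0, len(raw_log), chunk_size):
        chunk = raw_log[start:start + chunk_size]
        summary = call_llm(
            f"Summary so far:\n{summary}\n\nNew log chunk:\n{chunk}\n\n"
            "Update the summary of what the agent did and achieved."
        )
    return summary

def judge_checkpoint(summary: str, checkpoint: str) -> bool:
    """Stage 2: decide from the summary whether one checkpoint was reached."""
    verdict = call_llm(
        f"Agent activity summary:\n{summary}\n\n"
        f"Checkpoint: {checkpoint}\n"
        "Answer YES if the summary shows this checkpoint was completed, else NO."
    )
    return verdict.strip().upper().startswith("YES")
```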

If this is right

  • Agents perform reliably on common challenge categories but drop sharply on tasks that require non-standard discovery.
  • Longer-horizon adaptation remains the clearest performance bottleneck.
  • Full execution traces enable post-hoc analysis of exactly where agents stall or loop (see the sketch after this list).
  • Current commercial models are not yet capable of reliable autonomous progress on realistic offensive tasks.
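
To illustrate the kind of post-hoc analysis the third bullet refers to, a trivial loop detector over an agent's command sequence might look like the following; this is an editorial sketch, not tooling from the paper.

```python
# Editorial sketch of post-hoc trace analysis: flag steps where an agent
# re-issues a command it has already tried repeatedly, a cheap stall/loop proxy.
from collections import Counter

def find_loops(commands: list[str], window: int = 10, threshold: int = 3) -> list[int]:
    """Return indices of steps whose command already appeared `threshold`
    or more times within the preceding `window` steps."""
    flagged = []
    for i, cmd in enumerate(commands):
        recent = Counter(commands[max(0, i - window):i])
        if recent[cmd] >= threshold:
            flagged.append(i)
    return flagged
```

Flagged indices mark steps where the agent re-issued a command it had already tried several times in its recent window, a rough proxy for stalling.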

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partial-credit approach could be reused on other agent benchmarks to expose incremental progress instead of binary outcomes.
  • The observed discovery and planning gaps suggest that adding explicit exploration mechanisms or longer context memory would produce the largest gains.
  • If future agents close these gaps, the same benchmark format could serve as a safety test before deployment in real networks.

Load-bearing premise

Checkpoints taken from public writeups plus the automated summarise-then-judge process give an unbiased and sufficiently complete measure of meaningful progress toward solving each challenge.

What would settle it

Re-evaluating the same ten challenges with an expanded, independently verified checkpoint list or with human experts judging the same logs would show whether the 35 percent figure and the category gaps hold or shrink substantially.
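
For the human-judging arm of that experiment, agreement between human and automated checkpoint labels could be summarised with Cohen's kappa. A minimal sketch assuming binary per-checkpoint labels; the paper reports no such statistic, so this is purely illustrative.

```python
# Illustrative validation statistic: Cohen's kappa between human and automated
# checkpoint labels (1 = completed, 0 = not completed) over the same logs.

def cohens_kappa(human: list[int], auto: list[int]) -> float:
    """Chance-corrected agreement for two binary raters."""
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    p_h, p_a = sum(human) / n, sum(auto) / n
    expected = p_h * p_a + (1 - p_h) * (1 - p_a)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)
```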

Figures

Figures reproduced from arXiv: 2604.19354 by Ali Al-Kaswan, Arie van Deursen, Maksim Plotnikov, Maliheh Izadi, Maxim Hájek, Roland Vízner.

Figure 1: Overview of the three core components of DeepRed.
Original abstract

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeepRed, an open-source benchmark for LLM agents performing Capture the Flag challenges in isolated virtualized Kali environments with terminal and optional web-search tools. It defines partial-credit scoring via challenge-specific checkpoints extracted from public writeups, labels execution traces with an automated summarise-then-judge LLM pipeline, and reports results for ten commercially available LLMs across ten VM-based CTF challenges. The headline finding is that the strongest model reaches only 35% average checkpoint completion, performing better on common challenge categories and worse on tasks needing non-standard discovery or longer-horizon adaptation.

Significance. If the partial-credit methodology proves reliable, the work supplies a reproducible, open benchmark that moves CTF agent evaluation beyond binary solved/unsolved outcomes and supplies concrete evidence of current limitations in exploratory and adaptive cybersecurity tasks. The provision of full execution traces and the open-source release are clear strengths that enable follow-on research.

major comments (3)
  1. [Section 3.2, Checkpoint Definition] The claim that checkpoints derived from public writeups constitute a complete and unbiased proxy for meaningful partial progress is not supported by evidence that alternative successful paths, failed branches, or non-standard actions are systematically captured; public writeups typically document only one route, which directly affects the validity of both the 35% aggregate score and the reported performance gap between common and non-standard challenges.
  2. [Section 4.2, Summarise-then-Judge Pipeline] No validation is reported for the LLM summarizer and judge (e.g., human agreement rates, error analysis on long terminal logs, or cases of partial success mislabeled as failure), which is load-bearing for the central empirical claims given that checkpoint completion is the sole quantitative outcome measure.
  3. [Results, Table 2 or equivalent] The 35% figure and category-wise comparisons lack error bars, statistical tests, or details on how the ten challenges and ten models were selected, making it impossible to assess whether the observed limitations are robust or artifacts of the particular sample.
minor comments (2)
  1. [Abstract, §1] The selection criteria for the ten challenges and ten models should be stated explicitly to allow readers to judge potential selection bias.
  2. [Figure captions, §5] Clarify whether the reported checkpoint percentages are averaged across all checkpoints per challenge or weighted by difficulty.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each of the three major comments point by point below, indicating the revisions we will make to improve the work.

Point-by-point responses
  1. Referee: [Section 3.2, Checkpoint Definition] The claim that checkpoints derived from public writeups constitute a complete and unbiased proxy for meaningful partial progress is not supported by evidence that alternative successful paths, failed branches, or non-standard actions are systematically captured; public writeups typically document only one route, which directly affects the validity of both the 35% aggregate score and the reported performance gap between common and non-standard challenges.

    Authors: We agree that checkpoints extracted from public writeups represent only documented successful paths and do not systematically capture alternative routes, failed branches, or non-standard actions. This is an inherent limitation of the approach and could affect the interpretation of the 35% average checkpoint completion and the category-wise performance differences. Our rationale for using writeups was that they provide expert-validated milestones that are reproducible and publicly accessible, serving as a practical proxy for partial progress. In the revised manuscript we will expand Section 3.2 to explicitly acknowledge this bias, discuss its potential impact on the reported results, and note that the benchmark is designed to be extensible so that additional checkpoints from multiple sources can be incorporated in future iterations. revision: partial

  2. Referee: [Section 4.2, Summarise-then-Judge Pipeline] No validation is reported for the LLM summarizer and judge (e.g., human agreement rates, error analysis on long terminal logs, or cases of partial success mislabeled as failure), which is load-bearing for the central empirical claims given that checkpoint completion is the sole quantitative outcome measure.

    Authors: We recognize that the absence of validation for the summarise-then-judge pipeline is a significant gap, given its central role in producing the quantitative results. The original submission omitted this validation primarily due to the substantial effort required to manually annotate lengthy terminal traces. For the revised version we will add a validation subsection in Section 4.2 that reports human-expert agreement rates on a sampled subset of traces, together with an error analysis focused on long logs and instances of partial success. This will provide direct evidence of the pipeline's reliability. revision: yes

  3. Referee: [Results, Table 2 or equivalent] The 35% figure and category-wise comparisons lack error bars, statistical tests, or details on how the ten challenges and ten models were selected, making it impossible to assess whether the observed limitations are robust or artifacts of the particular sample.

    Authors: We accept that greater transparency and statistical context are needed. The ten challenges were chosen to span representative CTF categories drawn from public platforms, and the ten models were selected as a cross-section of commercially available LLMs at the time of the study. Each model-challenge pair was evaluated in a single run because of the high computational cost of full VM-based agent executions. Consequently, we cannot retroactively supply error bars from repeated trials without new experiments. In the revision we will add explicit selection criteria for both challenges and models, clarify the single-run design, and discuss the resulting limitations on statistical inference. We will also include any feasible measures of variability across categories (one such bootstrap estimate is sketched after this list). revision: partial
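
For reference, one such feasible measure under the single-run design is a percentile bootstrap over per-challenge completion scores. The sketch below uses made-up scores and is an editorial illustration, not the authors' analysis.

```python
# Editorial sketch: percentile bootstrap over per-challenge completion scores.
# The example scores below are made up; they are not the paper's data.
import random

def bootstrap_ci(scores: list[float], n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for mean checkpoint completion."""
    means = sorted(
        sum(random.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Hypothetical per-challenge completion fractions for one model:
# print(bootstrap_ci([0.8, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.2, 0.1, 0.05]))
```

Note that this captures only spread across the ten challenges; run-to-run stochasticity of the agents cannot be estimated from single runs.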

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

Full rationale

This is a purely empirical benchmark paper that evaluates LLM agents on CTF challenges via recorded execution traces, checkpoints extracted from public writeups, and an automated summarise-then-judge pipeline. The central result (35% average checkpoint completion) is a direct aggregate measurement from the described evaluation process on ten models and ten challenges; no equations, fitted parameters, derivations, or self-citations are invoked to produce it. The study contains no load-bearing self-citations, uniqueness theorems, ansatzes, or renamings that reduce claims to inputs by construction. The evaluation pipeline is presented as a methodological choice whose validity can be assessed externally against the raw logs and writeups, satisfying the criteria for a self-contained, non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper; no mathematical axioms, free parameters, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5507 in / 1021 out tokens · 26312 ms · 2026-05-10T02:34:28.931800+00:00 · methodology

