pith. machine review for the scientific record.

arxiv: 2604.19354 · v2 · submitted 2026-04-21 · 💻 cs.AI · cs.CR · cs.SE

Recognition: unknown

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:34 UTC · model grok-4.3

classification 💻 cs.AI · cs.CR · cs.SE
keywords LLM agents · Capture the Flag · CTF benchmark · cybersecurity · partial-credit evaluation · autonomous agents · offensive security · DeepRed

The pith

Even the strongest LLM agent achieves only 35 percent average checkpoint completion on realistic Capture the Flag challenges, with the largest gaps on non-standard discovery and longer-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepRed, an open-source benchmark that runs LLM agents inside isolated Kali Linux environments connected to target VMs, giving them terminal tools and optional web search while logging every action. It replaces binary solved-or-failed scoring with partial-credit checkpoints drawn from public writeups, labeled automatically by a summarise-then-judge pipeline. Across ten commercially available models and ten VM-based challenges, the strongest agent reaches just 35 percent average completion, performing best on familiar categories and worst when the task requires novel exploration or sustained adaptation. A reader should care because the result supplies the first granular, reproducible measurement of how far current agents stand from autonomous offensive cybersecurity work.
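
To make the scoring concrete, here is a minimal Python sketch of how checkpoint-based partial credit could be aggregated. All names are hypothetical, and the paper does not specify whether challenges are weighted, so this assumes an unweighted mean over challenges.

```python
# Hypothetical sketch of checkpoint-based partial credit; not DeepRed's actual code.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One milestone extracted from a public writeup for a given challenge."""
    description: str         # e.g. "enumerated open ports on the target"
    completed: bool = False  # set by the automated judge from the execution log

def challenge_score(checkpoints: list[Checkpoint]) -> float:
    """Fraction of checkpoints the agent completed on one challenge."""
    if not checkpoints:
        return 0.0
    return sum(cp.completed for cp in checkpoints) / len(checkpoints)

def average_completion(challenges: dict[str, list[Checkpoint]]) -> float:
    """Unweighted mean of per-challenge scores; the paper's headline 35%
    figure is an average checkpoint completion of this general kind."""
    return sum(challenge_score(cps) for cps in challenges.values()) / len(challenges)
```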

Core claim

DeepRed places agents in virtualized attacker environments with full terminal access and records complete execution traces; checkpoint lists derived from public writeups plus an automated summarise-then-judge pipeline then assign partial credit for each challenge. The resulting evaluation of ten models shows the best performer completing 35 percent of checkpoints on average, with markedly lower scores on challenges that demand non-standard discovery or extended planning sequences.

What carries the argument

Partial-credit scoring via challenge-specific checkpoints extracted from public writeups, together with a summarise-then-judge pipeline that labels completion from full execution logs.
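
As a sketch of what such a pipeline could look like: summarise the long terminal trace in chunks, then ask a judge model about each checkpoint. `call_llm` is a hypothetical stand-in for any chat-completion client; the paper's actual prompts, chunk sizes, and judge model are not given here and may differ.

```python
# Minimal sketch of a summarise-then-judge labelling pipeline; the paper's
# actual prompts, chunking, and judge model are not specified here.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError("plug in an LLM client here")

def summarise_log(raw_log: str, chunk_size: int = 20_000) -> str:
    """Stage 1: compress a long terminal trace into a rolling summary."""
    summary = ""
    for start in range(0, len(raw_log), chunk_size):
        chunk = raw_log[start:start + chunk_size]
        summary = call_llm(
            f"Summary so far:\n{summary}\n\nNew log chunk:\n{chunk}\n\n"
            "Update the summary of what the agent did and achieved."
        )
    return summary

def judge_checkpoint(summary: str, checkpoint: str) -> bool:
    """Stage 2: decide from the summary whether one checkpoint was reached."""
    verdict = call_llm(
        f"Agent activity summary:\n{summary}\n\n"
        f"Checkpoint: {checkpoint}\n"
        "Answer YES if the summary shows this checkpoint was completed, else NO."
    )
    return verdict.strip().upper().startswith("YES")
```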

If this is right

  • Agents perform reliably on common challenge categories but drop sharply on tasks that require non-standard discovery.
  • Longer-horizon adaptation remains the clearest performance bottleneck.
  • Full execution traces enable post-hoc analysis of exactly where agents stall or loop (see the sketch after this list).
  • Current commercial models are not yet capable of reliable autonomous progress on realistic offensive tasks.
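
To illustrate the kind of post-hoc analysis the third bullet refers to, a trivial loop detector over an agent's command sequence might look like the following; this is an editorial sketch, not tooling from the paper.

```python
# Editorial sketch of post-hoc trace analysis: flag steps where an agent
# re-issues a command it has already tried repeatedly, a cheap stall/loop proxy.
from collections import Counter

def find_loops(commands: list[str], window: int = 10, threshold: int = 3) -> list[int]:
    """Return indices of steps whose command already appeared `threshold`
    or more times within the preceding `window` steps."""
    flagged = []
    for i, cmd in enumerate(commands):
        recent = Counter(commands[max(0, i - window):i])
        if recent[cmd] >= threshold:
            flagged.append(i)
    return flagged
```

Flagged indices mark steps where the agent re-issued a command it had already tried several times in its recent window, a rough proxy for stalling.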

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partial-credit approach could be reused on other agent benchmarks to expose incremental progress instead of binary outcomes.
  • The observed discovery and planning gaps suggest that adding explicit exploration mechanisms or longer context memory would produce the largest gains.
  • If future agents close these gaps, the same benchmark format could serve as a safety test before deployment in real networks.

Load-bearing premise

Checkpoints taken from public writeups plus the automated summarise-then-judge process give an unbiased and sufficiently complete measure of meaningful progress toward solving each challenge.

What would settle it

Re-evaluating the same ten challenges with an expanded, independently verified checkpoint list or with human experts judging the same logs would show whether the 35 percent figure and the category gaps hold or shrink substantially.
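
For the human-judging arm of that experiment, agreement between human and automated checkpoint labels could be summarised with Cohen's kappa. A minimal sketch assuming binary per-checkpoint labels; the paper reports no such statistic, so this is purely illustrative.

```python
# Illustrative validation statistic: Cohen's kappa between human and automated
# checkpoint labels (1 = completed, 0 = not completed) over the same logs.

def cohens_kappa(human: list[int], auto: list[int]) -> float:
    """Chance-corrected agreement for two binary raters."""
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    p_h, p_a = sum(human) / n, sum(auto) / n
    expected = p_h * p_a + (1 - p_h) * (1 - p_a)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)
```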

Figures

Figures reproduced from arXiv: 2604.19354 by Ali Al-Kaswan, Arie van Deursen, Maksim Plotnikov, Maliheh Izadi, Maxim Hájek, Roland Vízner.

Figure 1: Overview of the three core components of DeepRed.
Original abstract

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeepRed, an open-source benchmark for LLM agents performing Capture the Flag challenges in isolated virtualized Kali environments with terminal and optional web-search tools. It defines partial-credit scoring via challenge-specific checkpoints extracted from public writeups, labels execution traces with an automated summarise-then-judge LLM pipeline, and reports results for ten commercially available LLMs across ten VM-based CTF challenges. The headline finding is that the strongest model reaches only 35% average checkpoint completion, performing better on common challenge categories and worse on tasks needing non-standard discovery or longer-horizon adaptation.

Significance. If the partial-credit methodology proves reliable, the work supplies a reproducible, open benchmark that moves CTF agent evaluation beyond binary solved/unsolved outcomes and supplies concrete evidence of current limitations in exploratory and adaptive cybersecurity tasks. The provision of full execution traces and the open-source release are clear strengths that enable follow-on research.

major comments (3)
  1. [Section 3.2, Checkpoint Definition] The claim that checkpoints derived from public writeups constitute a complete and unbiased proxy for meaningful partial progress is not supported by evidence that alternative successful paths, failed branches, or non-standard actions are systematically captured; public writeups typically document only one route, which directly affects the validity of both the 35% aggregate score and the reported performance gap between common and non-standard challenges.
  2. [Section 4.2, Summarise-then-Judge Pipeline] No validation is reported for the LLM summarizer and judge (e.g., human agreement rates, error analysis on long terminal logs, or cases of partial success mislabeled as failure), which is load-bearing for the central empirical claims given that checkpoint completion is the sole quantitative outcome measure.
  3. [Results, Table 2 or equivalent] The 35% figure and category-wise comparisons lack error bars, statistical tests, or details on how the ten challenges and ten models were selected, making it impossible to assess whether the observed limitations are robust or artifacts of the particular sample.
minor comments (2)
  1. [Abstract, §1] The selection criteria for the ten challenges and ten models should be stated explicitly to allow readers to judge potential selection bias.
  2. [Figure captions, §5] Clarify whether the reported checkpoint percentages are averaged across all checkpoints per challenge or weighted by difficulty.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each of the three major comments point by point below, indicating the revisions we will make to improve the work.

Point-by-point responses
  1. Referee: [Section 3.2, Checkpoint Definition] The claim that checkpoints derived from public writeups constitute a complete and unbiased proxy for meaningful partial progress is not supported by evidence that alternative successful paths, failed branches, or non-standard actions are systematically captured; public writeups typically document only one route, which directly affects the validity of both the 35% aggregate score and the reported performance gap between common and non-standard challenges.

    Authors: We agree that checkpoints extracted from public writeups represent only documented successful paths and do not systematically capture alternative routes, failed branches, or non-standard actions. This is an inherent limitation of the approach and could affect the interpretation of the 35% average checkpoint completion and the category-wise performance differences. Our rationale for using writeups was that they provide expert-validated milestones that are reproducible and publicly accessible, serving as a practical proxy for partial progress. In the revised manuscript we will expand Section 3.2 to explicitly acknowledge this bias, discuss its potential impact on the reported results, and note that the benchmark is designed to be extensible so that additional checkpoints from multiple sources can be incorporated in future iterations. revision: partial

  2. Referee: [Section 4.2, Summarise-then-Judge Pipeline] No validation is reported for the LLM summarizer and judge (e.g., human agreement rates, error analysis on long terminal logs, or cases of partial success mislabeled as failure), which is load-bearing for the central empirical claims given that checkpoint completion is the sole quantitative outcome measure.

    Authors: We recognize that the absence of validation for the summarise-then-judge pipeline is a significant gap, given its central role in producing the quantitative results. The original submission omitted this validation primarily due to the substantial effort required to manually annotate lengthy terminal traces. For the revised version we will add a validation subsection in Section 4.2 that reports human-expert agreement rates on a sampled subset of traces, together with an error analysis focused on long logs and instances of partial success. This will provide direct evidence of the pipeline's reliability. revision: yes

  3. Referee: [Results, Table 2 or equivalent] The 35% figure and category-wise comparisons lack error bars, statistical tests, or details on how the ten challenges and ten models were selected, making it impossible to assess whether the observed limitations are robust or artifacts of the particular sample.

    Authors: We accept that greater transparency and statistical context are needed. The ten challenges were chosen to span representative CTF categories drawn from public platforms, and the ten models were selected as a cross-section of commercially available LLMs at the time of the study. Each model-challenge pair was evaluated in a single run because of the high computational cost of full VM-based agent executions. Consequently, we cannot retroactively supply error bars from repeated trials without new experiments. In the revision we will add explicit selection criteria for both challenges and models, clarify the single-run design, and discuss the resulting limitations on statistical inference. We will also include any feasible measures of variability across categories (one such bootstrap estimate is sketched after this list). revision: partial
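
For reference, one such feasible measure under the single-run design is a percentile bootstrap over per-challenge completion scores. The sketch below uses made-up scores and is an editorial illustration, not the authors' analysis.

```python
# Editorial sketch: percentile bootstrap over per-challenge completion scores.
# The example scores below are made up; they are not the paper's data.
import random

def bootstrap_ci(scores: list[float], n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for mean checkpoint completion."""
    means = sorted(
        sum(random.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Hypothetical per-challenge completion fractions for one model:
# print(bootstrap_ci([0.8, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.2, 0.1, 0.05]))
```

Note that this captures only spread across the ten challenges; run-to-run stochasticity of the agents cannot be estimated from single runs.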

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

Full rationale

This is a purely empirical benchmark paper that evaluates LLM agents on CTF challenges via recorded execution traces, checkpoints extracted from public writeups, and an automated summarise-then-judge pipeline. The central result (35% average checkpoint completion) is a direct aggregate measurement from the described evaluation process on ten models and ten challenges; no equations, fitted parameters, derivations, or self-citations are invoked to produce it. The study contains no load-bearing self-citations, uniqueness theorems, ansatzes, or renamings that reduce claims to inputs by construction. The evaluation pipeline is presented as a methodological choice whose validity can be assessed externally against the raw logs and writeups, satisfying the criteria for a self-contained, non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper; no mathematical axioms, free parameters, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5507 in / 1021 out tokens · 26312 ms · 2026-05-10T02:34:28.931800+00:00 · methodology

