arxiv: 2605.21497 · v1 · pith:VOT4SMH3new · submitted 2026-04-29 · 💻 cs.CR · cs.AI

Autonomous LLM Agents & CTFs: A Second Look

Pith reviewed 2026-05-22 01:02 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agents · Capture the Flag · CTF challenges · offensive security · agent architectures · web vulnerabilities · security automation

The pith

A general-purpose LLM agent solves 19 of 30 web CTF challenges, a success rate comparable to that of custom-engineered architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper takes a second look at reports that LLM agents can automate offensive security tasks at near human levels. It builds several agent architectures with rising complexity and modularity, tests them with multiple language model backbones on 30 web-based CTF challenges that span 14 vulnerability classes, and compares the outcomes to claude-code, a general-purpose agent that sets its own internal structure. The results show the general agent performs about as well as the custom versions, that all agents fail on the same hard categories, and that adding structured roles improves consistency while lowering costs. Readers focused on security tools would care because the work indicates that simpler, off-the-shelf agents can serve as effective starting points for automation.

Core claim

The paper shows that claude-code achieves performance comparable to the engineered architectures, solving 19 out of 30 tasks. Both the custom architectures and claude-code struggle in the same challenge categories, which points to barriers that keep current agents below human-level capability. Using the manually designed architectures, the authors measure the effect of each added component and find that structured orchestration of specialized roles outperforms monolithic designs: it improves run-to-run consistency and reduces execution costs.

What carries the argument

The argument rests on claude-code, a general-purpose agent that automatically determines its internal architecture, used as a baseline against custom modular designs of increasing complexity.

If this is right

  • General-purpose agents act as strong baselines for offensive security tasks without requiring heavy custom engineering.
  • Certain vulnerability classes create persistent barriers that limit all current agents below human performance.
  • Structured orchestration of specialized roles produces higher consistency and lower costs than monolithic agent designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could prioritize base model improvements over added architectural complexity for these security tasks.
  • Applying the agents to real-world security operations rather than isolated CTFs would test practical readiness.
  • Teams might begin with a general agent and layer in modules only for the categories that show consistency problems.

Load-bearing premise

The 30 selected web-based CTF challenges across 14 vulnerability classes are representative enough to support general conclusions about agent capabilities and barriers.

What would settle it

Testing the same set of agents on a new collection of 30 CTF challenges from additional vulnerability classes or non-web settings and checking whether the 19 out of 30 success rate and shared failure patterns remain.
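To judge whether a repeat experiment "remains" at 19 out of 30, it helps to know how much a count like this could move by chance alone. The sketch below computes a Wilson score interval for the reported 19/30. It is an illustration, not part of the paper: treating the 30 challenges as independent trials with a common success probability is an assumption made here, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# 19 solved out of 30 is the figure reported in the paper.
# Assumption: challenges are treated as independent, identically distributed trials.
low, high = wilson_interval(19, 30)
print(f"observed rate: {19 / 30:.1%}, approximate 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval is wide, running from the mid-40s to the high-70s in percent, so a replication on a fresh set of 30 challenges would need to land well outside that range before the headline number could be called contradicted.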

Figures

Figures reproduced from arXiv: 2605.21497 by Dario Rossi, Idilio Drago, Marco Mellia, Matteo Boffa, Thanh Minh Bui, Youness Bouchari.

Figure 1: Tested agent architectures. We progress from a single-agent Executor (a) to a structured multi-agent configuration (c) consisting of a Recon Node, Planner, and Evaluator. We use claude code (not shown) as a baseline. All systems are granted access to a vulnerable (Dockerized) service and terminate by outputting the flag, if successfully captured.
Adjacent paper text (truncated): … state of the environment [21]. An LLM can function as the de…
Figure 2: Tool calls (steps) vs. run status and architectures. For failed runs, simpler architectures often hit the maximum number of steps, whereas structured planning leads to earlier and more deliberate termination.
Adjacent paper text (truncated): … analysis, we examine whether the Planner correctly identifies the target vulnerability. It succeeds in 23 out of 30 benchmarks. The remaining 7 challenges correspond to vulnerabilities that neither o…
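The contrast the two captions describe, a lone Executor that retries until its step budget is spent versus a Recon, Planner, and Evaluator pipeline that can stop early, can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: every role is a stub callable standing in for an LLM call, all names (`run_monolithic`, `run_structured`, the `toy_*` stubs, the `localhost` target) are hypothetical, and counting only executor attempts as "steps" is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    flag: Optional[str]
    steps: int   # executor attempts, standing in for tool calls
    status: str  # "solved", "gave_up", or "max_steps"


# Hypothetical role signatures; in a real agent each would wrap an LLM call.
Recon = Callable[[str], str]                    # target -> observations
Planner = Callable[[str], list[str]]            # observations -> ordered hypotheses
Executor = Callable[[str, str], Optional[str]]  # (target, hypothesis) -> flag or None
Evaluator = Callable[[list[str]], bool]         # failed hypotheses -> keep going?


def run_monolithic(target: str, executor: Executor, max_steps: int = 10) -> RunResult:
    """Single Executor with no plan: retries until the step budget is exhausted."""
    for step in range(1, max_steps + 1):
        flag = executor(target, "")
        if flag is not None:
            return RunResult(flag, step, "solved")
    return RunResult(None, max_steps, "max_steps")


def run_structured(target: str, recon: Recon, planner: Planner, executor: Executor,
                   evaluator: Evaluator, max_steps: int = 10) -> RunResult:
    """Recon feeds a Planner; an Evaluator may end the run before the budget is spent."""
    failed: list[str] = []
    plan = planner(recon(target))
    for step, hypothesis in enumerate(plan[:max_steps], start=1):
        flag = executor(target, hypothesis)
        if flag is not None:
            return RunResult(flag, step, "solved")
        failed.append(hypothesis)
        if not evaluator(failed):
            return RunResult(None, step, "gave_up")
    status = "max_steps" if len(failed) >= max_steps else "gave_up"
    return RunResult(None, len(failed), status)


# Toy stubs, rigged purely to exercise the control flow (no network access happens).
def toy_recon(target: str) -> str:
    return f"{target} exposes /orders/<id>"

def toy_planner(observations: str) -> list[str]:
    return ["sqli", "xss", "idor", "ssti"]

def toy_executor(target: str, hypothesis: str) -> Optional[str]:
    return "FLAG{demo}" if hypothesis == "idor" else None

target = "http://localhost:8080"  # hypothetical Dockerized service
print(run_monolithic(target, toy_executor))
print(run_structured(target, toy_recon, toy_planner, toy_executor, lambda f: len(f) < 3))
print(run_structured(target, toy_recon, toy_planner, toy_executor, lambda f: len(f) < 2))
```

The outcomes here are determined entirely by the stubs, so they say nothing about real success rates. What the sketch shows is the shape of the termination behavior: the monolithic loop can only end by solving or by exhausting `max_steps`, while the structured loop has a third exit where the Evaluator ends the run deliberately.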
Original abstract

Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human-level success rates in Capture-the-Flag (CTF) challenges. We here revisit these results, providing a second look at these claims. We engineer different agent architectures of increasing complexity and modularity on 30 web-based CTF challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones, and compare them with claude-code, a general-purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude-code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general-purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude-code struggle in the same challenge categories, revealing persistent barriers that keep current agents below human-level capability. Third, by leveraging our manually designed architectures we can systematically measure the impact of additional components, finding that structured orchestration of specialized roles outperforms monolithic designs, improving run-to-run consistency and reducing execution costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates engineered LLM agent architectures of increasing complexity and modularity against the general-purpose claude-code agent on 30 web-based CTF challenges spanning 14 vulnerability classes. It reports that claude-code solves 19/30 tasks with performance comparable to the engineered designs, identifies shared struggles across challenge categories as persistent barriers below human level, and finds that structured orchestration of specialized roles improves run-to-run consistency while reducing execution costs.

Significance. If the challenge selection is representative, the work establishes general-purpose agents as strong baselines for offensive security tasks and provides actionable evidence on the value of modular designs. The multi-backbone comparison and direct performance counts add empirical weight to claims about agent limitations in cybersecurity.

major comments (2)
  1. [Abstract and evaluation setup] The central claims that claude-code's 19/30 performance shows general-purpose agents are strong baselines and that persistent barriers are revealed rest on the 30 challenges being representative across 14 classes, yet no selection criteria, difficulty calibration against human solvers, or coverage statistics are provided; this risks selection effects favoring easier web vulnerabilities such as SQLi and XSS.
  2. [Results section] The assertion of improved run-to-run consistency from structured orchestration lacks reported exact run counts per configuration or statistical tests supporting the consistency and cost-reduction claims, which are load-bearing for the third main finding.
minor comments (2)
  1. [Architecture descriptions] Notation for agent components could be standardized across sections to improve readability of the architecture comparisons.
  2. [Results] A table summarizing solved tasks per vulnerability class would help readers assess the distribution of successes and failures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract and evaluation setup] The central claims that claude-code's 19/30 performance shows general-purpose agents are strong baselines and that persistent barriers are revealed rest on the 30 challenges being representative across 14 classes, yet no selection criteria, difficulty calibration against human solvers, or coverage statistics are provided; this risks selection effects favoring easier web vulnerabilities such as SQLi and XSS.

    Authors: We acknowledge that the abstract and evaluation setup do not provide explicit selection criteria, difficulty calibration details, or coverage statistics. The 30 challenges were selected from public CTF platforms specifically to span 14 distinct web vulnerability classes, with the intent of covering a representative sample of common offensive security tasks. To address the concern about potential selection effects, we will revise the evaluation setup section to include a clear description of the challenge sources, the rationale for class coverage, and any available information on typical difficulty levels from CTF leaderboards. While comprehensive human solve-rate calibration data is not uniformly available across all challenges, we can add references to public benchmarks where they exist. These additions will clarify the representativeness of the set without altering the core findings. revision: yes

  2. Referee: [Results section] The assertion of improved run-to-run consistency from structured orchestration lacks reported exact run counts per configuration or statistical tests supporting the consistency and cost-reduction claims, which are load-bearing for the third main finding.

    Authors: We agree that the results section would be strengthened by reporting the exact number of runs performed and supporting statistical information. Our experiments involved multiple independent executions per agent configuration to observe consistency and cost differences, but these were summarized at a high level rather than presented with full counts or tests. We will revise the results section to specify the run counts (for example, the number of trials conducted for each architecture and backbone), report variance or standard deviation in success rates across runs to quantify consistency improvements, and include comparative cost metrics such as average token usage or execution time. This will provide the quantitative backing needed for the third main finding. revision: yes
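The kind of reporting promised in this response can be sketched in a few lines. The example below is illustrative only: the run matrix is made-up data, not results from the paper, and the function name `consistency_summary` and its output keys are hypothetical. It reports the mean and standard deviation of the per-run solve rate, plus the fraction of challenges that are solved in some runs but not in others.

```python
from statistics import mean, pstdev


def consistency_summary(runs: list[list[bool]]) -> dict[str, float]:
    """Summarize repeated runs, where runs[r][c] is True if run r solved challenge c."""
    if not runs or not runs[0] or any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("runs must be a non-empty rectangular matrix")
    n_challenges = len(runs[0])
    per_run_rate = [sum(r) / n_challenges for r in runs]
    # Number of runs that solved each challenge.
    solves_per_challenge = [sum(r[c] for r in runs) for c in range(n_challenges)]
    # A challenge is "unstable" if it is solved in some runs but not all.
    unstable = sum(1 for k in solves_per_challenge if 0 < k < len(runs))
    return {
        "mean_rate": mean(per_run_rate),
        "std_rate": pstdev(per_run_rate),
        "unstable_fraction": unstable / n_challenges,
    }


# Hypothetical data: 3 runs of one configuration on 4 challenges.
example_runs = [
    [True, True, False, False],
    [True, False, False, True],
    [True, True, False, False],
]
print(consistency_summary(example_runs))
```

In this toy matrix every run solves exactly half of the challenges, so the standard deviation of the solve rate is zero even though half of the challenges flip between solved and unsolved across runs. That is the reason for reporting a per-challenge stability measure next to the variance of the aggregate rate: the aggregate alone can hide run-to-run inconsistency.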

Circularity Check

0 steps flagged

Empirical performance counts on external CTF benchmarks show no circular derivation

The paper reports direct experimental results from running engineered and general-purpose LLM agents on a fixed set of 30 web-based CTF challenges, yielding counts such as 19/30 solved tasks and category-specific struggles. These outcomes are measured against external benchmarks rather than derived from fitted parameters, self-referential equations, or load-bearing self-citations. Architectures are manually specified and compared without any reduction of claims to prior author work or ansatz smuggling. The evaluation chain is therefore self-contained against observable performance data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen CTF set adequately samples real-world web vulnerabilities and that agent success rates on these puzzles generalize to broader offensive security utility.

axioms (1)
  • Domain assumption: The 30 web-based CTF challenges spanning 14 vulnerability classes form a representative testbed for evaluating LLM agent capabilities in offensive security.
    Invoked when generalizing from the observed 19/30 success rate and shared failure categories to statements about persistent barriers below human-level performance.

pith-pipeline@v0.9.0 · 5727 in / 1180 out tokens · 35953 ms · 2026-05-22T01:02:09.212856+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  [1] M. Bishop, “About penetration testing,” IEEE Security & Privacy, vol. 5, no. 6, pp. 84–87, 2007.
  [2] K. Scarfone, M. Souppaya, and A. Cody, “Technical Guide to Information Security Testing and Assessment,” National Institute of Standards and Technology, Tech. Rep. NIST Special Publication 800-115, 2008.
  [3] ISC2, “2024 isc2 cybersecurity workforce study,” https://www.isc2.org/Insights/2024/10/ISC2-2024-Cybersecurity-Workforce-Study, October 31 2024, accessed: 2026-03-09.
  [4] Palo Alto Networks Unit 42, “2025 unit 42 global incident response report,” https://www.paloaltonetworks.com/engage/unit42-2025-global-incident-response-report, Palo Alto Networks, 2025, accessed: 2026-03-09.
  [5] J. Zhang, H. Bu, H. Wen, Y. Liu, et al., “When llms meet cybersecurity: A systematic literature review,” Cybersecurity, vol. 8, no. 1, p. 55, 2025.
  [6] G. Deng, Y. Liu, V. Mayoral-Vilches, and P. L. et al., “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” in 33rd USENIX Security Symposium, Aug. 2024, pp. 847–864.
  [7] A. Happe and J. Cito, “Getting pwn’d by ai: Penetration testing with large language models,” in Proceedings of the European Software Engineering Conference, 2023, pp. 2082–2086.
  [8] L. Gioacchini, A. Delsanto, I. Drago, M. Mellia, et al., “Autopenbench: A vulnerability testing benchmark for generative agents,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2025, pp. 1615–1624.
  [9] Y. Zhu, A. Kellermann, A. Gupta, and P. L. et al., “Teams of llm agents can exploit zero-day vulnerabilities,” 2025.
  [10] H. Kong, D. Hu, J. Ge, L. Li, T. Li, and B. Wu, “Vulnbot: Autonomous penetration testing for a multi-agent collaborative framework,” 2025.
  [11] I. David and A. Gervais, “Multi-agent penetration testing ai for the web,” 2025.
  [12] A. K. Zhang, N. Perry, R. Dulepet, and J. J. et al., “Cybench: A framework for evaluating cybersecurity capabilities and risks of language models,” in International Conference on Learning Representations, 2025.
  [13] Y. Zhu, A. Kellermann, D. Bowman, Philip Li, et al., “CVE-bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities,” in International Conference on Machine Learning, 2025.
  [14] Anthropic, “Claude is competitive with humans in (some) cyber competitions,” https://red.anthropic.com/2025/cyber-competitions/, August 9 2025, accessed: 2026-03-09.
  [15] XBOW, “The road to top 1: How xbow did it,” https://xbow.com/blog/top-1-how-xbow-did-it, June 24 2025, accessed: 2026-03-09.
  [16] J. W. Lin, E. K. Jones, D. J. Jasper, and E. J. shen Ho et al., “Comparing AI agents to cybersecurity professionals in real-world penetration testing,” in The Fourteenth International Conference on Learning Representations, 2026.
  [17] G. Vigna, K. Borgolte, J. Corbetta, A. Doupé, et al., “Ten years of iCTF: The good, the bad, and the ugly,” in 2014 USENIX Summit on Gaming, Games, and Gamification in Security Education (3GSE 14), 2014.
  [18] M. Shao, S. Jancheska, M. Udeshi, B. Dolan-Gavitt, et al., “Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security,” Advances in Neural Information Processing Systems, vol. 37, pp. 57472–57498, 2024.
  [19] L. Wang, W. Xu, Y. Lan, Z. Hu, et al., “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” in Proceedings of the Association for Computational Linguistics, Jul. 2023, pp. 2609–2634.
  [20] M. Mohammadi, Y. Li, J. Lo, and W. Yip, “Evaluation and benchmarking of llm agents: A survey,” in Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining, 2025, pp. 6129–6139.
  [21] T. Sumers, S. Yao, K. R. Narasimhan, and T. L. Griffiths, “Cognitive architectures for language agents,” Transactions on Machine Learning Research, 2023.
  [22] S. Yao, J. Zhao, D. Yu, N. Du, et al., “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations, 2022.
  [23] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, et al., “Toolformer: Language models can teach themselves to use tools,” Advances in neural information processing systems, vol. 36, pp. 68539–68551, 2023.
  [24] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, et al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” in Conference on language modeling, 2024.
  [25] W. Chen, Y. Su, J. Zuo, C. Yang, et al., “Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors,” in International Conference on Learning Representations, 2023.
  [26] S. Fumero, K. Huang, M. Boffa, D. Giordano, M. Mellia, Z. B. Houidi, and D. Rossi, “Cybersleuth: Autonomous blue-team llm agent for web attack forensics,” arXiv preprint arXiv:2508.20643, 2025.
  [27] D. Li, B. Jiang, L. Huang, A. Beigi, et al., “From generation to judgment: Opportunities and challenges of llm-as-a-judge,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2025, pp. 2757–2791.
  [28] “Claude code overview,” https://code.claude.com/docs/en/overview, Anthropic, 2026. Claude Code is an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with development tools.
  [29] “How claude remembers your project,” https://code.claude.com/docs/en/memory, Anthropic, 2026. Describes CLAUDE.md and auto memory mechanisms that allow persistent context across sessions.
  [30] OpenAI, “Introducing gpt-4.1 in the api,” https://openai.com/index/gpt-4-1/, 2025. Official OpenAI model release announcement for GPT-4.1.
  [31] OpenAI, “Gpt-5 system card,” https://openai.com/index/gpt-5-system-card/, 2025. Official OpenAI system card describing GPT-5 architecture and safety.
  [32] Anthropic, “Claude opus 4.6 system card,” https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, 2026. Official Anthropic model card for Claude Opus 4.6.

APPENDIX A. Scholar-Like Enumeration – Successful Execution

At step 9, the agent confirms an IDOR vulnerability (arbitrary order IDs accepted). Rather than exploiting it immediat...