Autonomous LLM Agents & CTFs: A Second Look

Dario Rossi; Idilio Drago; Marco Mellia; Matteo Boffa; Thanh Minh Bui; Youness Bouchari

arxiv: 2605.21497 · v1 · pith:VOT4SMH3new · submitted 2026-04-29 · 💻 cs.CR · cs.AI

Youness Bouchari, Matteo Boffa, Marco Mellia, Idilio Drago, Thanh Minh Bui, Dario Rossi

Pith reviewed 2026-05-22 01:02 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM agents · Capture the Flag · CTF challenges · offensive security · agent architectures · web vulnerabilities · security automation

The pith

A general-purpose LLM agent solves 19 of 30 web CTF challenges, a success rate comparable to that of custom-engineered architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper takes a second look at reports that LLM agents can automate offensive security tasks at near human levels. It builds several agent architectures with rising complexity and modularity, tests them with multiple language model backbones on 30 web-based CTF challenges that span 14 vulnerability classes, and compares the outcomes to claude-code, a general-purpose agent that sets its own internal structure. The results show the general agent performs about as well as the custom versions, that all agents fail on the same hard categories, and that adding structured roles improves consistency while lowering costs. Readers focused on security tools would care because the work indicates that simpler, off-the-shelf agents can serve as effective starting points for automation.

Core claim

The paper shows that claude-code achieves performance comparable to the engineered architectures, solving 19 out of 30 tasks. Both the custom architectures and claude-code encounter the same difficulties in specific challenge categories, which points to barriers that keep current agents below human-level capability. By using the manually designed architectures, the authors measure the effect of added components and find that structured orchestration of specialized roles outperforms monolithic designs, improving run-to-run consistency and reducing execution costs.

What carries the argument

claude-code, the general-purpose agent that automatically determines its internal architecture, serving as a baseline against custom modular designs of increasing complexity

If this is right

General-purpose agents act as strong baselines for offensive security tasks without requiring heavy custom engineering.
Certain vulnerability classes create persistent barriers that limit all current agents below human performance.
Structured orchestration of specialized roles produces higher consistency and lower costs than monolithic agent designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future work could prioritize base model improvements over added architectural complexity for these security tasks.
Applying the agents to real-world security operations rather than isolated CTFs would test practical readiness.
Teams might begin with a general agent and layer in modules only for the categories that show consistency problems.

Load-bearing premise

The 30 selected web-based CTF challenges across 14 vulnerability classes are representative enough to support general conclusions about agent capabilities and barriers.

What would settle it

Testing the same set of agents on a new collection of 30 CTF challenges from additional vulnerability classes or non-web settings and checking whether the 19 out of 30 success rate and shared failure patterns remain.
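
To give a sense of how much a replication could differ before it contradicts the reported figure, the sketch below computes a Wilson score interval for 19 successes out of 30. This calculation is not in the paper; it assumes each challenge is an independent trial, which is a simplification, and the function name `wilson_interval` is chosen here for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# 19 solved out of 30 is the count reported in the paper; the interval is our own
# illustration under the independence assumption stated above.
low, high = wilson_interval(19, 30)
print(f"19/30 = {19 / 30:.3f}, approx. 95% Wilson interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 0.46 to 0.78, so a new set of 30 challenges would need a solve count well outside that band before it clearly disagreed with the original result.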

Figures

Figures reproduced from arXiv: 2605.21497 by Dario Rossi, Idilio Drago, Marco Mellia, Matteo Boffa, Thanh Minh Bui, Youness Bouchari.

**Figure 1.** Tested agent architectures. We progress from a single-agent Executor (a) to a structured multi-agent configuration (c) consisting of a Recon Node, Planner, and Evaluator. We use claude-code (not shown) as a baseline. All systems are granted access to a vulnerable (Dockerized) service and terminate by outputting the flag, if successfully captured.

Surrounding text (truncated in extraction): "… state of the environment [21]. An LLM can function as the de…"

**Figure 2.** Tool calls (steps) vs. run status and architecture. For failed runs, simpler architectures often hit the maximum number of steps, whereas structured planning leads to earlier and more deliberate termination.

Surrounding text (truncated in extraction): "… analysis, we examine whether the Planner correctly identifies the target vulnerability. It succeeds in 23 out of 30 benchmarks. The remaining 7 challenges correspond to vulnerabilities that neither o…"
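
The two captions describe a structured configuration (Recon Node, Planner, Executor, Evaluator) and a step budget that failed runs either exhaust or abandon early. The following is a minimal sketch of that control flow, not the authors' implementation: the role names come from the Figure 1 caption, while `RunState`, `run_structured`, the status strings, and the stub roles are hypothetical, and plain functions stand in for LLM calls.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunState:
    target: str
    notes: list = field(default_factory=list)
    steps: int = 0                 # executor tool calls used so far
    flag: Optional[str] = None
    status: str = "running"        # becomes "solved", "gave_up", or "step_limit"


def run_structured(target, recon, planner, executor, evaluator, max_steps=20):
    """Hypothetical orchestration loop: recon once, then plan / act / evaluate."""
    state = RunState(target=target)
    state.notes.append(recon(state))           # Recon Node gathers initial facts
    while state.steps < max_steps:
        plan = planner(state)                  # Planner picks the next action
        if plan is None:                       # deliberate early termination
            state.status = "gave_up"
            return state
        state.steps += 1
        observation = executor(state, plan)    # Executor runs one tool call
        state.notes.append(observation)
        state.flag = evaluator(state, observation)  # Evaluator checks for a flag
        if state.flag is not None:
            state.status = "solved"
            return state
    state.status = "step_limit"                # budget exhausted without a flag
    return state


# Stub roles standing in for LLM-backed components (illustration only).
def stub_recon(state):
    return f"recon: login form found on {state.target}"


def stub_planner(state):
    return "probe the login form" if state.steps < 3 else None


def stub_executor(state, plan):
    return "FLAG{demo}" if state.steps == 2 else "no luck"


def stub_evaluator(state, observation):
    match = re.search(r"FLAG\{[^}]*\}", observation)
    return match.group(0) if match else None


demo = run_structured("http://localhost:8080", stub_recon, stub_planner,
                      stub_executor, stub_evaluator)
print(demo.status, demo.steps, demo.flag)
```

Separating "gave_up" from "step_limit" mirrors the distinction Figure 2 draws between deliberate termination and hitting the maximum number of steps; a monolithic Executor would correspond to running the loop without the planner's option to stop.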

Original abstract

Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human-level success rates in Capture-the-Flag (CTF) challenges. We here revisit these results, providing a second look at these claims. We engineer different agent architectures of increasing complexity and modularity on 30 web-based CTF challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones, and compare them with claude-code, a general-purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude-code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general-purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude-code struggle in the same challenge categories, revealing persistent barriers that keep current agents below human-level capability. Third, by leveraging our manually designed architectures we can systematically measure the impact of additional components, finding that structured orchestration of specialized roles outperforms monolithic designs, improving run-to-run consistency, and reducing execution costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

General-purpose agents like claude-code perform comparably to engineered ones on these CTFs, but questions remain about how well the 30 challenges represent broader agent capabilities.

The key point from this paper is that a general-purpose agent called claude-code solves 19 out of 30 web-based CTF challenges, performing at a level similar to the authors' custom-engineered agent architectures. This suggests that off-the-shelf tools can serve as decent baselines for these offensive security tasks without needing heavy customization.

What stands out is the systematic comparison. They built agents with increasing complexity and modularity, tested them with different LLM backbones, and directly pitted them against claude-code, which figures out its own structure automatically. They also measured how adding structured roles for specialized tasks improves consistency across runs and cuts down on execution costs. These are concrete data points that extend earlier work on LLM agents in CTFs by providing side-by-side results on the same set of challenges. The paper does a good job highlighting that both the simple and complex setups struggle in the same categories, pointing to real barriers like certain vulnerability types that current agents can't handle well, keeping them below human-level performance.

On the soft side, the selection of those 30 challenges across 14 classes raises questions about how representative they are. All are web-based, and without clear criteria for picking them or calibration against human performance, it's possible the set favors easier web exploits where tool-using LLMs already do okay, such as SQL injection or cross-site scripting. That could weaken the claim about persistent barriers applying more generally to other CTF types like binary exploitation. The abstract also doesn't spell out run counts or any statistical checks for the consistency claims, so those need more backing from the full methods section to be fully convincing.

This work is aimed at people studying AI for cybersecurity and automation of offensive tasks. Anyone looking for empirical comparisons of agent designs and baselines will find useful numbers here. It has enough new observations to warrant a serious referee, even if revisions are needed on the methodology details and perhaps expanding the challenge set. I recommend sending it out for peer review to get feedback on the generalizability.

Referee Report

2 major / 2 minor

Summary. The paper evaluates engineered LLM agent architectures of increasing complexity and modularity against the general-purpose claude-code agent on 30 web-based CTF challenges spanning 14 vulnerability classes. It reports that claude-code solves 19/30 tasks with performance comparable to the engineered designs, identifies shared struggles across challenge categories as persistent barriers below human level, and finds that structured orchestration of specialized roles improves run-to-run consistency while reducing execution costs.

Significance. If the challenge selection is representative, the work establishes general-purpose agents as strong baselines for offensive security tasks and provides actionable evidence on the value of modular designs. The multi-backbone comparison and direct performance counts add empirical weight to claims about agent limitations in cybersecurity.

major comments (2)

[Abstract and evaluation setup] The central claims that claude-code's 19/30 performance shows general-purpose agents are strong baselines and that persistent barriers are revealed rest on the 30 challenges being representative across 14 classes, yet no selection criteria, difficulty calibration against human solvers, or coverage statistics are provided; this risks selection effects favoring easier web vulnerabilities such as SQLi and XSS.
[Results section] The assertion of improved run-to-run consistency from structured orchestration lacks reported exact run counts per configuration or statistical tests supporting the consistency and cost-reduction claims, which are load-bearing for the third main finding.

minor comments (2)

[Architecture descriptions] Notation for agent components could be standardized across sections to improve readability of the architecture comparisons.
[Results] A table summarizing solved tasks per vulnerability class would help readers assess the distribution of successes and failures.
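
The per-class table requested in the last minor comment could be produced with a small tally like the sketch below. The record layout, function names, and the example rows are all made up for illustration; they are not data from the paper.

```python
from collections import defaultdict


def solved_per_class(results):
    """Tally (solved, total) per vulnerability class.

    `results` is an iterable of (challenge_id, vulnerability_class, solved) tuples.
    """
    tally = defaultdict(lambda: [0, 0])
    for _challenge_id, vuln_class, solved in results:
        tally[vuln_class][1] += 1
        if solved:
            tally[vuln_class][0] += 1
    return {cls: tuple(counts) for cls, counts in sorted(tally.items())}


def render_table(tally):
    """Render the tally as a fixed-width plain-text table."""
    lines = [f"{'class':<12}{'solved':>8}{'total':>8}"]
    for cls, (solved, total) in tally.items():
        lines.append(f"{cls:<12}{solved:>8}{total:>8}")
    return "\n".join(lines)


# Placeholder rows only; these are NOT results reported by the authors.
example_results = [
    ("c01", "SQLi", True),
    ("c02", "SQLi", True),
    ("c03", "XSS", False),
    ("c04", "IDOR", True),
    ("c05", "XSS", True),
]
example_tally = solved_per_class(example_results)
print(render_table(example_tally))
```

A table of this shape would make it easy to see whether the unsolved challenges cluster in a few classes, which is the pattern the paper's second finding describes.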

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: [Abstract and evaluation setup] The central claims that claude-code's 19/30 performance shows general-purpose agents are strong baselines and that persistent barriers are revealed rest on the 30 challenges being representative across 14 classes, yet no selection criteria, difficulty calibration against human solvers, or coverage statistics are provided; this risks selection effects favoring easier web vulnerabilities such as SQLi and XSS.

Authors: We acknowledge that the abstract and evaluation setup do not provide explicit selection criteria, difficulty calibration details, or coverage statistics. The 30 challenges were selected from public CTF platforms specifically to span 14 distinct web vulnerability classes, with the intent of covering a representative sample of common offensive security tasks. To address the concern about potential selection effects, we will revise the evaluation setup section to include a clear description of the challenge sources, the rationale for class coverage, and any available information on typical difficulty levels from CTF leaderboards. While comprehensive human solve-rate calibration data is not uniformly available across all challenges, we can add references to public benchmarks where they exist. These additions will clarify the representativeness of the set without altering the core findings.
Revision: yes

Referee: [Results section] The assertion of improved run-to-run consistency from structured orchestration lacks reported exact run counts per configuration or statistical tests supporting the consistency and cost-reduction claims, which are load-bearing for the third main finding.

Authors: We agree that the results section would be strengthened by reporting the exact number of runs performed and supporting statistical information. Our experiments involved multiple independent executions per agent configuration to observe consistency and cost differences, but these were summarized at a high level rather than presented with full counts or tests. We will revise the results section to specify the run counts (for example, the number of trials conducted for each architecture and backbone), report variance or standard deviation in success rates across runs to quantify consistency improvements, and include comparative cost metrics such as average token usage or execution time. This will provide the quantitative backing needed for the third main finding.
Revision: yes
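
The reporting promised in this response (run counts plus variance or standard deviation of success rates) can be sketched as follows. The function name, the configuration labels, and every number in `placeholder_runs` are invented for illustration; the paper's actual per-run figures are not given on this page.

```python
import statistics


def consistency_summary(solved_per_run, n_tasks):
    """Summarize run-to-run consistency for one agent configuration.

    `solved_per_run` lists how many of `n_tasks` challenges each independent run solved.
    """
    rates = [solved / n_tasks for solved in solved_per_run]
    return {
        "runs": len(rates),
        "mean_rate": statistics.mean(rates),
        # Sample standard deviation needs at least two runs.
        "stdev_rate": statistics.stdev(rates) if len(rates) > 1 else 0.0,
    }


# Made-up solve counts out of 30 tasks; NOT results from the paper.
placeholder_runs = {
    "monolithic": [14, 20, 11],
    "structured": [17, 18, 17],
}
summaries = {name: consistency_summary(runs, n_tasks=30)
             for name, runs in placeholder_runs.items()}
for name, summary in summaries.items():
    print(f"{name:<11} runs={summary['runs']} "
          f"mean={summary['mean_rate']:.3f} stdev={summary['stdev_rate']:.3f}")
```

A lower standard deviation at a similar mean is one simple way to express "more consistent"; with only a handful of runs per configuration, such differences would still call for the statistical tests the referee requests.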

Circularity Check

0 steps flagged

Empirical performance counts on external CTF benchmarks show no circular derivation

The paper reports direct experimental results from running engineered and general-purpose LLM agents on a fixed set of 30 web-based CTF challenges, yielding counts such as 19/30 solved tasks and category-specific struggles. These outcomes are measured against external benchmarks rather than derived from fitted parameters, self-referential equations, or load-bearing self-citations. Architectures are manually specified and compared without any reduction of claims to prior author work or ansatz smuggling. The evaluation chain is therefore self-contained against observable performance data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the chosen CTF set adequately samples real-world web vulnerabilities and that agent success rates on these puzzles generalize to broader offensive security utility.

axioms (1)

Domain assumption: The 30 web-based CTF challenges spanning 14 vulnerability classes form a representative testbed for evaluating LLM agent capabilities in offensive security.
Invoked when generalizing from the observed 19/30 success rate and shared failure categories to statements about persistent barriers below human-level performance.

pith-pipeline@v0.9.0 · 5727 in / 1180 out tokens · 35953 ms · 2026-05-22T01:02:09.212856+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1] M. Bishop, “About penetration testing,” IEEE Security & Privacy, vol. 5, no. 6, pp. 84–87, 2007.
[2] K. Scarfone, M. Souppaya, and A. Cody, “Technical Guide to Information Security Testing and Assessment,” National Institute of Standards and Technology, Tech. Rep. NIST Special Publication 800-115, 2008.
[3] ISC2, “2024 isc2 cybersecurity workforce study,” https://www.isc2.org/Insights/2024/10/ISC2-2024-Cybersecurity-Workforce-Study, October 31 2024, accessed: 2026-03-09.
[4] Palo Alto Networks Unit 42, “2025 unit 42 global incident response report,” https://www.paloaltonetworks.com/engage/unit42-2025-global-incident-response-report, Palo Alto Networks, 2025, accessed: 2026-03-09.
[5] J. Zhang, H. Bu, H. Wen, Y. Liu et al., “When llms meet cybersecurity: A systematic literature review,” Cybersecurity, vol. 8, no. 1, p. 55, 2025.
[6] G. Deng, Y. Liu, V. Mayoral-Vilches, and P. L. et al., “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” in 33rd USENIX Security Symposium, Aug. 2024, pp. 847–864.
[7] A. Happe and J. Cito, “Getting pwn’d by ai: Penetration testing with large language models,” in Proceeding of the European Software Engineering Conference, 2023, pp. 2082–2086.
[8] L. Gioacchini, A. Delsanto, I. Drago, M. Mellia et al., “Autopenbench: A vulnerability testing benchmark for generative agents,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2025, pp. 1615–1624.
[9] Y. Zhu, A. Kellermann, A. Gupta, and P. L. et al., “Teams of llm agents can exploit zero-day vulnerabilities,” 2025.
[10] H. Kong, D. Hu, J. Ge, L. Li, T. Li, and B. Wu, “Vulnbot: Autonomous penetration testing for a multi-agent collaborative framework,” 2025.
[11] I. David and A. Gervais, “Multi-agent penetration testing ai for the web,” 2025.
[12] A. K. Zhang, N. Perry, R. Dulepet, and J. J. et al., “Cybench: A framework for evaluating cybersecurity capabilities and risks of language models,” in International Conference on Learning Representations, 2025.
[13] Y. Zhu, A. Kellermann, D. Bowman, P. Li et al., “CVE-bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities,” in International Conference on Machine Learning, 2025.
[14] Anthropic, “Claude is competitive with humans in (some) cyber competitions,” https://red.anthropic.com/2025/cyber-competitions/, August 9 2025, accessed: 2026-03-09.
[15] XBOW, “The road to top 1: How xbow did it,” https://xbow.com/blog/top-1-how-xbow-did-it, June 24 2025, accessed: 2026-03-09.
[16] J. W. Lin, E. K. Jones, D. J. Jasper, and E. J. shen Ho et al., “Comparing AI agents to cybersecurity professionals in real-world penetration testing,” in The Fourteenth International Conference on Learning Representations, 2026.
[17] G. Vigna, K. Borgolte, J. Corbetta, A. Doupé et al., “Ten years of iCTF: The good, the bad, and the ugly,” in 2014 USENIX Summit on Gaming, Games, and Gamification in Security Education (3GSE 14), 2014.
[18] M. Shao, S. Jancheska, M. Udeshi, B. Dolan-Gavitt et al., “Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security,” Advances in Neural Information Processing Systems, vol. 37, pp. 57472–57498, 2024.
[19] L. Wang, W. Xu, Y. Lan, Z. Hu et al., “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” in Proceedings of the Association for Computational Linguistics, Jul. 2023, pp. 2609–2634.
[20] M. Mohammadi, Y. Li, J. Lo, and W. Yip, “Evaluation and benchmarking of llm agents: A survey,” in Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining, 2025, pp. 6129–6139.
[21] T. Sumers, S. Yao, K. R. Narasimhan, and T. L. Griffiths, “Cognitive architectures for language agents,” Transactions on Machine Learning Research, 2023.
[22] S. Yao, J. Zhao, D. Yu, N. Du et al., “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations, 2022.
[23] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu et al., “Toolformer: Language models can teach themselves to use tools,” Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, 2023.
[24] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li et al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” in Conference on Language Modeling, 2024.
[25] W. Chen, Y. Su, J. Zuo, C. Yang et al., “Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors,” in International Conference on Learning Representations, 2023.
[26] S. Fumero, K. Huang, M. Boffa, D. Giordano, M. Mellia, Z. B. Houidi, and D. Rossi, “Cybersleuth: Autonomous blue-team llm agent for web attack forensics,” arXiv preprint arXiv:2508.20643, 2025.
[27] D. Li, B. Jiang, L. Huang, A. Beigi et al., “From generation to judgment: Opportunities and challenges of llm-as-a-judge,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2025, pp. 2757–2791.
[28] “Claude code overview,” https://code.claude.com/docs/en/overview, Anthropic, 2026. Claude Code is an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with development tools.
[29] “How claude remembers your project,” https://code.claude.com/docs/en/memory, Anthropic, 2026. Describes CLAUDE.md and auto memory mechanisms that allow persistent context across sessions.
[30] OpenAI, “Introducing gpt-4.1 in the api,” https://openai.com/index/gpt-4-1/, 2025. Official OpenAI model release announcement for GPT-4.1.
[31] OpenAI, “Gpt-5 system card,” https://openai.com/index/gpt-5-system-card/, 2025. Official OpenAI system card describing GPT-5 architecture and safety.
[32] Anthropic, “Claude opus 4.6 system card,” https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, 2026. Official Anthropic model card for Claude Opus 4.6.

APPENDIX A. Scholar-Like Enumeration – Successful Execution. At step 9, the agent confirms an IDOR vulnerability (arbitrary order IDs accepted). Rather than exploiting it immedi- at...
