pith. machine review for the scientific record.

arxiv: 2605.10834 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CR

Recognition: no theorem link

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords AI pentesting agents · evaluation protocol · vulnerability discovery · penetration testing · LLM semantic matching · real-world benchmarks · offensive security

The pith

A new protocol evaluates AI pentesting agents by validated vulnerability discovery in complex targets instead of scripted tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current benchmarks for AI pentesting agents focus on narrow goals like capture-the-flag or exploit reproduction in simplified settings, which overlook the open-ended exploration needed in practice. The paper presents an evaluation protocol that centers on validated vulnerability discovery across multiple attack surfaces and classes in realistic targets. It relies on structured ground truth, LLM-based semantic matching, bipartite resolution to handle scoring ambiguity, repeated testing of stochastic agents, efficiency metrics, and reduced test suites. This change would allow comparisons that better indicate which agents can perform in actual operational security work. If the protocol holds, it supports more sustainable and informative agent development.

Core claim

The paper claims that shifting from task completion metrics to validated vulnerability discovery, supported by structured ground-truth combined with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation, efficiency metrics, and reduced-suite selection, enables a more realistic and operationally informative comparison of AI pentesting agents.
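
To make the matching step concrete, here is a minimal sketch of what an LLM-based semantic match between one agent finding and one structured ground-truth entry could look like. The prompt wording, the JSON scoring contract, and the function names are assumptions for illustration, not the protocol's released matcher.

```python
# Illustrative sketch of LLM-based semantic matching between one agent finding
# and one structured ground-truth entry. `llm` is any callable mapping a prompt
# string to a completion string; prompt and JSON contract are assumptions.
import json
from typing import Callable

PROMPT = (
    "You are scoring a penetration-test finding against annotated ground truth.\n"
    "Ground truth: {gt}\n"
    "Agent finding: {finding}\n"
    "Do they describe the same vulnerability (same class and same location)?\n"
    'Reply as JSON: {{"match_score": <0.0-1.0>, "reason": "<one sentence>"}}'
)

def semantic_match_score(finding: dict, gt_entry: dict,
                         llm: Callable[[str], str]) -> float:
    prompt = PROMPT.format(gt=json.dumps(gt_entry), finding=json.dumps(finding))
    try:
        return float(json.loads(llm(prompt))["match_score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # treat unparseable replies as non-matches
```

Scores like these, computed for every (finding, ground-truth) pair, are what the bipartite-resolution step then has to reconcile.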

What carries the argument

The evaluation protocol, which uses structured ground-truth with LLM-based semantic matching for vulnerability identification and bipartite resolution for scoring under ambiguity, along with continuous maintenance, repeated stochastic evaluations, efficiency metrics, and reduced-suite selection.
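
A minimal sketch of the bipartite-resolution step, assuming the semantic matcher has already produced pairwise similarity scores in [0, 1]; the threshold and function names below are illustrative, not the released implementation.

```python
# Illustrative bipartite resolution of agent findings against ground truth.
# similarity[i, j] is an assumed semantic-match score in [0, 1] between agent
# finding i and ground-truth vulnerability j.
import numpy as np
from scipy.optimize import linear_sum_assignment

def resolve_matches(similarity: np.ndarray, threshold: float = 0.5):
    """Return accepted (finding, ground_truth) index pairs as a maximum-weight
    one-to-one assignment, keeping only pairs scoring at or above threshold."""
    rows, cols = linear_sum_assignment(-similarity)  # negate: maximize similarity
    return [(int(i), int(j)) for i, j in zip(rows, cols)
            if similarity[i, j] >= threshold]

# Three reported findings compete for two ground-truth items; the duplicate
# report (row 2) loses the assignment and is not double-counted.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.7, 0.6]])
print(resolve_matches(sim))  # [(0, 0), (1, 1)]
```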

If this is right

  • Agents can be compared directly on their ability to discover actual vulnerabilities rather than complete predefined tasks.
  • Evaluation extends to targets spanning multiple attack surfaces and vulnerability classes.
  • Repeated and cumulative testing accounts for stochastic behavior in agent performance (a minimal metrics sketch follows this list).
  • Efficiency metrics provide insight into practical resource use during assessments.
  • Reduced-suite selection enables ongoing experimentation without prohibitive costs.
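
A minimal sketch of the repeated/cumulative scoring and an efficiency metric, under assumed inputs (the ground-truth IDs matched in each run and a per-run resource cost); the names and exact metric definitions are illustrative, not the paper's.

```python
# Sketch of per-run, cumulative, and efficiency metrics for a stochastic agent.
from statistics import mean, pstdev

def repeated_metrics(matched_per_run: list[set[str]], n_ground_truth: int,
                     cost_per_run: list[float]) -> dict:
    per_run_recall = [len(m) / n_ground_truth for m in matched_per_run]
    cumulative = set().union(*matched_per_run)  # unique findings across runs
    return {
        "mean_recall": mean(per_run_recall),
        "recall_std": pstdev(per_run_recall),
        "cumulative_recall": len(cumulative) / n_ground_truth,
        "findings_per_unit_cost": sum(len(m) for m in matched_per_run) / sum(cost_per_run),
    }

# Three stochastic runs against a target with 10 annotated vulnerabilities.
print(repeated_metrics([{"V1", "V3"}, {"V1", "V4"}, {"V2", "V3"}],
                       10, [120.0, 95.0, 130.0]))
```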

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The protocol could expose performance differences between agents that narrow benchmarks hide, helping prioritize development efforts.
  • Security teams might apply similar structured ground-truth methods to benchmark commercial tools against live environments.
  • Shared annotated ground truth could support community extensions for testing across additional target types.

Load-bearing premise

LLM-based semantic matching combined with bipartite resolution can reliably identify and score vulnerabilities under realistic ambiguity without introducing significant false positives or negatives that distort comparisons.

What would settle it

An independent expert manual review of the same targets that finds substantially different vulnerability identifications or scores than the protocol's automated matching and bipartite resolution.
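
One way to operationalize that check, sketched minimally: treat the human-triaged (finding, ground-truth) matches as the reference and score the automated matcher's decisions against them. The pair identifiers below are hypothetical.

```python
# Sketch of the settling test: agreement between automated matching and
# independent human triage on the same runs, reported as precision/recall
# of the automation against the human baseline.
def agreement_with_human(auto_pairs: set, human_pairs: set) -> dict:
    tp = len(auto_pairs & human_pairs)
    return {
        "precision": tp / len(auto_pairs) if auto_pairs else 1.0,
        "recall": tp / len(human_pairs) if human_pairs else 1.0,
    }

auto  = {("f1", "V1"), ("f2", "V2"), ("f3", "V4")}   # automated protocol
human = {("f1", "V1"), ("f2", "V2"), ("f4", "V3")}   # expert manual review
print(agreement_with_human(auto, human))  # precision and recall both ~0.67
```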

Figures

Figures reproduced from arXiv: 2605.10834 by André Baptista, Bruno Mendes, Henrique Branquinho, Nuno Moniz, Pedro Conde, Valerio Mazzone.

Figure 1: Findings-to-ground-truth matching system comparison with human-triaged findings as baseline across 3 runs, with mean values and standard deviation.
Figure 2: Overall comparison for all experimental setups, averaged across 3 runs on all targets, with …
Figure 3: First row – comparison considering findings accumulated across 3 runs for all targets.
Figure 4: Overall architecture of the proposed evaluation framework.
Figure 5: Overall comparison for all experimental setups on …
Figure 6: Overall comparison for all experimental setups on …
Figure 7: Overall comparison for all experimental setups on …
Figure 8: First row – comparison considering findings accumulated across 3 runs for …
Figure 9: First row – comparison considering findings accumulated across 3 runs for …
Figure 10: First row – comparison considering findings accumulated across 3 runs for …
Figure 11: Temporal evolution of evaluation metrics across all Claude Code runs, grouped by target.
Original abstract

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an evaluation protocol for AI pentesting agents that shifts from simplified CTF-style or exploit-reproduction benchmarks to validated vulnerability discovery in complex, multi-surface real-world targets. The protocol integrates structured ground truth, LLM-based semantic matching for identifying findings, bipartite resolution to handle scoring under ambiguity, continuous ground-truth maintenance, repeated evaluations to account for stochasticity, efficiency metrics, and reduced-suite selection. The authors release expert-annotated ground truth and implementation code via GitHub to support reproducibility.

Significance. If the LLM semantic matching and bipartite resolution components prove reliable, the protocol would advance the field by enabling more operationally relevant comparisons of pentesting agents that capture open-ended exploration, strategic decision-making, and performance across diverse vulnerability classes. The release of artifacts is a clear strength that facilitates community scrutiny and extension of the framework.

major comments (2)
  1. [§3 (Protocol Components, LLM-based semantic matching subsection)] The central claim of delivering 'validated vulnerability discovery' and 'operationally informative' agent comparisons rests on the LLM semantic matcher plus bipartite resolution correctly mapping agent outputs to expert ground truth under realistic ambiguity. No precision/recall figures, inter-annotator agreement scores, or error analysis on held-out pentesting reports are provided to quantify the matcher's accuracy or potential class-specific biases (e.g., logic flaws vs. misconfigurations). This absence directly undermines the reliability of cumulative scores and efficiency metrics.
  2. [§4 (Evaluation and Metrics)] The manuscript asserts that the protocol supports 'repeated and cumulative evaluation of stochastic agents' and 'reduced-suite selection,' yet provides no concrete examples, pseudocode, or results demonstrating how bipartite resolution resolves ambiguous cases or how the reduced suite preserves coverage of attack surfaces and vulnerability classes. Without such grounding, it is difficult to assess whether the framework achieves its stated realism gains over prior benchmarks.
minor comments (1)
  1. [Abstract and §5] The GitHub link in the abstract is useful, but the manuscript should include a brief summary table in §5 or the appendix listing the exact number of targets, vulnerability classes, and expert annotations released.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate revisions to strengthen the manuscript's support for the proposed protocol.

read point-by-point responses
  1. Referee: [§3 (Protocol Components, LLM-based semantic matching subsection)] The central claim of delivering 'validated vulnerability discovery' and 'operationally informative' agent comparisons rests on the LLM semantic matcher plus bipartite resolution correctly mapping agent outputs to expert ground truth under realistic ambiguity. No precision/recall figures, inter-annotator agreement scores, or error analysis on held-out pentesting reports are provided to quantify the matcher's accuracy or potential class-specific biases (e.g., logic flaws vs. misconfigurations). This absence directly undermines the reliability of cumulative scores and efficiency metrics.

    Authors: We agree that quantitative validation of the LLM semantic matcher is essential to substantiate the protocol's reliability. The current manuscript describes the matcher as part of the overall framework and releases the expert-annotated ground truth to enable such analysis, but does not report precision/recall, inter-annotator agreement, or error analysis. In the revised version we will add these metrics (computed on held-out reports), agreement scores, and a class-specific error breakdown to the LLM-based semantic matching subsection of §3. revision: yes

  2. Referee: [§4 (Evaluation and Metrics)] The manuscript asserts that the protocol supports 'repeated and cumulative evaluation of stochastic agents' and 'reduced-suite selection,' yet provides no concrete examples, pseudocode, or results demonstrating how bipartite resolution resolves ambiguous cases or how the reduced suite preserves coverage of attack surfaces and vulnerability classes. Without such grounding, it is difficult to assess whether the framework achieves its stated realism gains over prior benchmarks.

    Authors: We concur that explicit demonstrations would make the claims more concrete. The manuscript outlines the concepts of repeated evaluation, bipartite resolution, and reduced-suite selection but does not include pseudocode or illustrative results. In revision we will add pseudocode for bipartite resolution (with examples of ambiguous mappings), plus analysis or tables showing how the reduced suite retains coverage of attack surfaces and vulnerability classes, drawing on the released ground truth. revision: yes
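
Pending that revision, a plain greedy set-cover sketch illustrates what coverage-preserving reduced-suite selection could look like; the target names and vulnerability-class assignments below are illustrative, not the authors' method.

```python
# Illustrative greedy reduced-suite selection: pick a small set of targets that
# still covers every vulnerability class present in the full suite.
def reduced_suite(target_classes: dict[str, set[str]]) -> list[str]:
    remaining = set().union(*target_classes.values())  # classes still uncovered
    chosen: list[str] = []
    while remaining:
        # Greedily take the target covering the most uncovered classes.
        best = max(target_classes, key=lambda t: len(target_classes[t] & remaining))
        if not target_classes[best] & remaining:
            break
        chosen.append(best)
        remaining -= target_classes[best]
    return chosen

suite = {
    "vuln-bank": {"sqli", "idor", "auth-bypass"},
    "PAYGoat":   {"logic-flaw", "idor"},
    "target-3":  {"xss", "sqli", "misconfig"},
}
print(reduced_suite(suite))  # ['vuln-bank', 'target-3', 'PAYGoat']
```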

Circularity Check

0 steps flagged

No circularity: methodological protocol without derivations or self-referential reductions

full rationale

The paper presents a practical evaluation protocol for AI pentesting agents, combining structured ground-truth, LLM-based semantic matching, bipartite resolution, efficiency metrics, and artifact release. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The protocol components are introduced as design choices for realistic evaluation rather than derived from prior elements by construction. The claim of extending the state of the art is supported by the released ground truth and code, not by any reduction to inputs. This is a self-contained methodological framework with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The protocol rests on the assumption that expert ground truth can be maintained continuously and that LLM semantic matching provides sufficient accuracy for vulnerability identification; no free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • domain assumption LLM-based semantic matching can accurately align agent findings to expert ground truth under ambiguity
    Invoked in the description of how vulnerabilities are identified and scored.
  • domain assumption Bipartite resolution can handle realistic ambiguity in vulnerability validation without systematic bias
    Central to the scoring mechanism for findings.

pith-pipeline@v0.9.0 · 5524 in / 1303 out tokens · 39020 ms · 2026-05-12T03:31:12.175727+00:00 · methodology

discussion (0)

