Recognition: no theorem link
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3
The pith
A new protocol evaluates AI pentesting agents by validated vulnerability discovery in complex targets instead of scripted tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that shifting assessment from task-completion metrics to validated vulnerability discovery enables a more realistic and operationally informative comparison of AI pentesting agents. The shift is carried by structured ground truth combined with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation, efficiency metrics, and reduced-suite selection.
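One component of this claim, repeated and cumulative evaluation of stochastic agents, can be sketched in a few lines. Everything below (vulnerability IDs, run results, the ground-truth set) is a hypothetical illustration, not data from the paper:

```python
# Hedged sketch: cumulative evaluation of a stochastic agent over repeated
# runs. The vulnerability IDs and run outcomes are invented for illustration.

def cumulative_coverage(runs, ground_truth):
    """For each k, the fraction of ground-truth vulns found in at least one
    of the first k runs (union over runs), plus the per-run mean."""
    found = set()
    curve = []
    for run in runs:
        found |= (set(run) & set(ground_truth))
        curve.append(len(found) / len(ground_truth))
    per_run_mean = sum(len(set(r) & set(ground_truth)) for r in runs) / (
        len(runs) * len(ground_truth))
    return curve, per_run_mean

ground_truth = {"V1", "V2", "V3", "V4"}
runs = [{"V1", "V2"}, {"V2", "V3"}, {"V1"}]  # three stochastic runs
curve, mean = cumulative_coverage(runs, ground_truth)
```

The cumulative curve shows how union coverage grows across repeated runs, while the per-run mean captures single-shot reliability — the two views the protocol's repeated-evaluation component appears designed to separate.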
What carries the argument
The evaluation protocol, which uses structured ground truth with LLM-based semantic matching for vulnerability identification and bipartite resolution for scoring under ambiguity, along with continuous ground-truth maintenance, repeated evaluations of stochastic agents, efficiency metrics, and reduced-suite selection.
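As a rough illustration of the matching step, the sketch below stands in a token-overlap (Jaccard) score for the paper's LLM-based semantic judgment; the finding texts, ground-truth entries, and the 0.3 threshold are invented for the example:

```python
# Hedged sketch of the matching step. The paper uses LLM-based semantic
# matching; a token-overlap (Jaccard) score stands in for the LLM judgment
# purely for illustration. All texts below are hypothetical.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_finding(finding, ground_truth, threshold=0.3):
    """Return the best-matching ground-truth ID above threshold, or None."""
    best_id, best_score = None, 0.0
    for gt_id, gt_text in ground_truth.items():
        score = jaccard(finding, gt_text)
        if score > best_score:
            best_id, best_score = gt_id, score
    return best_id if best_score >= threshold else None

gt = {
    "V1": "sql injection in login form parameter username",
    "V2": "stored xss in comment field of blog post",
}
print(match_finding("sql injection via the username parameter of login", gt))
```

A real matcher must also handle paraphrase and class-level ambiguity (the referee's point below), which is exactly what a surface-overlap proxy like this cannot.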
If this is right
- Agents can be compared directly on their ability to discover actual vulnerabilities rather than complete predefined tasks.
- Evaluation extends to targets spanning multiple attack surfaces and vulnerability classes.
- Repeated and cumulative testing accounts for stochastic behavior in agent performance.
- Efficiency metrics provide insight into practical resource use during assessments.
- Reduced-suite selection enables ongoing experimentation without prohibitive costs.
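The last bullet, reduced-suite selection, admits a simple reading as greedy set cover over vulnerability classes; the target names and class labels below are hypothetical, and the paper's actual selection criterion may differ:

```python
# Hedged sketch: one plausible reading of "reduced-suite selection" as a
# greedy set cover over vulnerability classes. Targets and labels are
# hypothetical illustrations, not the paper's benchmark suite.

def reduce_suite(targets):
    """Greedily pick targets until every vulnerability class seen across
    the full suite is covered at least once."""
    universe = set().union(*targets.values())
    covered, chosen = set(), []
    while covered != universe:
        # pick the target contributing the most not-yet-covered classes
        name = max(targets, key=lambda t: len(targets[t] - covered))
        chosen.append(name)
        covered |= targets[name]
    return chosen

targets = {
    "shop-app": {"sqli", "xss", "idor"},
    "blog-app": {"xss", "csrf"},
    "api-gw":   {"ssrf", "idor"},
    "misc-app": {"csrf"},
}
suite = reduce_suite(targets)
```

Under this reading, the reduced suite keeps full class coverage while dropping redundant targets, which is the cost-control property the bullet describes.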
Where Pith is reading between the lines
- The protocol could expose performance differences between agents that narrow benchmarks hide, helping prioritize development efforts.
- Security teams might apply similar structured ground-truth methods to benchmark commercial tools against live environments.
- Shared annotated ground truth could support community extensions for testing across additional target types.
Load-bearing premise
LLM-based semantic matching combined with bipartite resolution can reliably identify and score vulnerabilities under realistic ambiguity without introducing significant false positives or negatives that distort comparisons.
What would settle it
An independent expert manual review of the same targets that finds substantially different vulnerability identifications or scores than the protocol's automated matching and bipartite resolution.
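Such a settling experiment could be scored with ordinary precision/recall, treating the independent expert's matches as the reference; the (finding, vulnerability) pairs below are hypothetical:

```python
# Hedged sketch: comparing the protocol's automated matches against an
# independent expert review, with the expert taken as reference.
# Both label sets are invented for illustration.

def precision_recall(automated, expert):
    """automated/expert are sets of (finding_id, vuln_id) matched pairs."""
    tp = len(automated & expert)
    precision = tp / len(automated) if automated else 1.0
    recall = tp / len(expert) if expert else 1.0
    return precision, recall

automated = {("f1", "V1"), ("f2", "V2"), ("f3", "V3")}
expert    = {("f1", "V1"), ("f2", "V2"), ("f4", "V4")}
p, r = precision_recall(automated, expert)
```

Large gaps on either axis — automated matches the expert rejects, or expert matches the matcher misses — would be the "substantially different" outcome described above.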
Figures
Original abstract
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an evaluation protocol for AI pentesting agents that shifts from simplified CTF-style or exploit-reproduction benchmarks to validated vulnerability discovery in complex, multi-surface real-world targets. The protocol integrates structured ground truth, LLM-based semantic matching for identifying findings, bipartite resolution to handle scoring under ambiguity, continuous ground-truth maintenance, repeated evaluations to account for stochasticity, efficiency metrics, and reduced-suite selection. The authors release expert-annotated ground truth and implementation code via GitHub to support reproducibility.
Significance. If the LLM semantic matching and bipartite resolution components prove reliable, the protocol would advance the field by enabling more operationally relevant comparisons of pentesting agents that capture open-ended exploration, strategic decision-making, and performance across diverse vulnerability classes. The release of artifacts is a clear strength that facilitates community scrutiny and extension of the framework.
Major comments (2)
- [§3 (Protocol Components, LLM-based semantic matching subsection)] The central claim of delivering 'validated vulnerability discovery' and 'operationally informative' agent comparisons rests on the LLM semantic matcher plus bipartite resolution correctly mapping agent outputs to expert ground truth under realistic ambiguity. No precision/recall figures, inter-annotator agreement scores, or error analysis on held-out pentesting reports are provided to quantify the matcher's accuracy or potential class-specific biases (e.g., logic flaws vs. misconfigurations). This absence directly undermines the reliability of cumulative scores and efficiency metrics.
- [§4 (Evaluation and Metrics)] The manuscript asserts that the protocol supports 'repeated and cumulative evaluation of stochastic agents' and 'reduced-suite selection,' yet provides no concrete examples, pseudocode, or results demonstrating how bipartite resolution resolves ambiguous cases or how the reduced suite preserves coverage of attack surfaces and vulnerability classes. Without such grounding, it is difficult to assess whether the framework achieves its stated realism gains over prior benchmarks.
Minor comments (1)
- [Abstract and §5] The GitHub link in the abstract is useful, but the manuscript should include a brief summary table in §5 or the appendix listing the exact number of targets, vulnerability classes, and expert annotations released.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate revisions to strengthen the manuscript's support for the proposed protocol.
Point-by-point responses
- Referee: [§3 (Protocol Components, LLM-based semantic matching subsection)] The central claim of delivering 'validated vulnerability discovery' and 'operationally informative' agent comparisons rests on the LLM semantic matcher plus bipartite resolution correctly mapping agent outputs to expert ground truth under realistic ambiguity. No precision/recall figures, inter-annotator agreement scores, or error analysis on held-out pentesting reports are provided to quantify the matcher's accuracy or potential class-specific biases (e.g., logic flaws vs. misconfigurations). This absence directly undermines the reliability of cumulative scores and efficiency metrics.
Authors: We agree that quantitative validation of the LLM semantic matcher is essential to substantiate the protocol's reliability. The current manuscript describes the matcher as part of the overall framework and releases the expert-annotated ground truth to enable such analysis, but does not report precision/recall, inter-annotator agreement, or error analysis. In the revised version we will add these metrics (computed on held-out reports), agreement scores, and a class-specific error breakdown to the LLM-based semantic matching subsection of §3. revision: yes
- Referee: [§4 (Evaluation and Metrics)] The manuscript asserts that the protocol supports 'repeated and cumulative evaluation of stochastic agents' and 'reduced-suite selection,' yet provides no concrete examples, pseudocode, or results demonstrating how bipartite resolution resolves ambiguous cases or how the reduced suite preserves coverage of attack surfaces and vulnerability classes. Without such grounding, it is difficult to assess whether the framework achieves its stated realism gains over prior benchmarks.
Authors: We concur that explicit demonstrations would make the claims more concrete. The manuscript outlines the concepts of repeated evaluation, bipartite resolution, and reduced-suite selection but does not include pseudocode or illustrative results. In revision we will add pseudocode for bipartite resolution (with examples of ambiguous mappings), plus analysis or tables showing how the reduced suite retains coverage of attack surfaces and vulnerability classes, drawing on the released ground truth. revision: yes
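As a hedged editorial illustration (not the authors' implementation) of what the promised bipartite-resolution pseudocode could look like: a one-to-one assignment of findings to ground-truth entries that maximizes total matcher score, so a single finding cannot claim credit for two vulnerabilities. The scores and threshold are invented, and the exhaustive search is for illustration only; a real implementation would likely use the Hungarian algorithm:

```python
# Hedged sketch of bipartite resolution: pick the one-to-one assignment of
# findings to ground-truth vulns that maximizes total matcher confidence.
# Scores are hypothetical; exhaustive search is illustrative only (use the
# Hungarian algorithm, e.g. scipy's linear_sum_assignment, at scale).
from itertools import permutations

def resolve(scores, threshold=0.5):
    """scores[f][g] = matcher confidence that finding f matches ground truth g.
    Returns the highest-scoring one-to-one assignment, dropping pairs
    whose confidence falls below threshold."""
    findings = list(scores)
    gts = sorted({g for row in scores.values() for g in row})
    best, best_total = {}, -1.0
    for perm in permutations(gts, min(len(findings), len(gts))):
        pairs = {f: g for f, g in zip(findings, perm)
                 if scores[f].get(g, 0.0) >= threshold}
        total = sum(scores[f][g] for f, g in pairs.items())
        if total > best_total:
            best, best_total = pairs, total
    return best

scores = {
    "f1": {"V1": 0.9, "V2": 0.6},
    "f2": {"V1": 0.7, "V2": 0.8},
}
assignment = resolve(scores)
```

In the example, greedily giving f1 its best match (V1) happens to coincide with the globally optimal assignment, but the global objective is what prevents two plausible findings from both being scored against the same vulnerability.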
Circularity Check
No circularity: methodological protocol without derivations or self-referential reductions
Full rationale
The paper presents a practical evaluation protocol for AI pentesting agents, combining structured ground-truth, LLM-based semantic matching, bipartite resolution, efficiency metrics, and artifact release. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The protocol components are introduced as design choices for realistic evaluation rather than derived from prior elements by construction. The claim of extending the state of the art is supported by the released ground truth and code, not by any reduction to inputs. This is a self-contained methodological framework with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLM-based semantic matching can accurately align agent findings to expert ground truth under ambiguity
- domain assumption Bipartite resolution can handle realistic ambiguity in vulnerability validation without systematic bias