Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Fabian Kaczmarczyck; Ilia Shumailov; Ivan Petrov; Jamie Hayes; Luca Invernizzi; Mark Vero; Martin Vechev; Niels Heinen; Tianqi Fan

arXiv: 2605.29963 · v1 · pith:GAASFT3Knew · submitted 2026-05-28 · 💻 cs.CR · cs.AI · cs.LG

Pith reviewed 2026-06-29 06:59 UTC · model grok-4.3

keywords honeypots · LLM · HTTP · evaluation framework · AI agents · cybersecurity · deception systems · cost analysis

The pith

When tested against AI agents, LLM-powered HTTP honeypots produce longer attacker sessions and lower detection rates than rule-based baselines, and on average cost less to run than the agents attacking them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Honeyval, a framework that evaluates LLM-powered HTTP honeypots by grounding them in 16 backend applications that the honeypots simulate and letting AI hacking agents attempt defined exploit goals. It adds two control tasks to check that both the attacking agent and the honeypot retain their baseline capabilities across customizations. This setup replaces earlier ad-hoc tests with scalable, reproducible measurements that reflect practical attack patterns. Experiments using the framework find that LLM honeypots sustain substantially longer interactions than rule-based baselines, get detected less often even by frontier models, and keep an average running-cost edge against agentic attackers. The same framework also surfaces trade-offs when defenders add counter-offensive features.

Core claim

Honeyval grounds honeypot evaluations in 16 backend applications, deploys AI hacking agents as attackers, monitors both sides with two control tasks, and measures success against clear exploit goals; under this protocol, LLM-powered honeypots produce substantially longer interactions with the attacker than rule-based baselines, are detected far less frequently even by frontier models, and preserve an average running-cost advantage against agentic attackers.

What carries the argument

Honeyval evaluation protocol that pairs LLM honeypots with AI hacking agents pursuing verifiable exploit goals on 16 backend applications while running control tasks to track capability.

If this is right

LLM honeypots can be tuned via configuration changes that trade longer sessions for higher detection risk.
The cost advantage allows wider deployment of high-interaction honeypots without increasing average running expense against automated attackers.
Control tasks provide a reproducible way to verify that honeypot customizations preserve both deception and operational stability.
The framework supports direct comparison of new cost-efficient LLMs as they become available for honeypot backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If AI agents turn out to be poor stand-ins for real attackers, real attackers might end sessions sooner or detect the honeypot more often, narrowing the reported advantage.
The framework could be extended to measure how well LLM honeypots handle mixed human-plus-automated attack campaigns.
Developers might combine LLM and rule-based components in the same honeypot to balance the observed trade-offs.
Similar agent-driven testing could be applied to non-HTTP honeypots once equivalent backend applications and exploit goals are defined.

Load-bearing premise

AI hacking agents serve as realistic proxies for practical real-world attacks and the defined exploit goals accurately measure honeypot performance across configurations.

What would settle it

Direct comparison of session lengths and detection rates when the same LLM honeypots face real human attackers or non-AI exploit tools instead of the AI agents used in the Honeyval tests.
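
For the detection-rate half of that comparison, one possible framing is a two-proportion z-test on how often each attacker population flags the same honeypot. The sketch below is illustrative only: the function name and its arguments are hypothetical, the counts in the usage line are placeholders rather than figures from the paper, and the paper does not say which statistical test, if any, it uses.

```python
import math


def two_proportion_z(detected_a, n_a, detected_b, n_b):
    """Two-sided z-test for a difference between two detection rates.

    detected_a / n_a: honeypot detections and sessions for population A
    (for example AI agents); detected_b / n_b: the same for population B
    (for example human red teamers). All names are hypothetical.
    """
    p_a = detected_a / n_a
    p_b = detected_b / n_b
    # Pooled rate under the null hypothesis that both populations detect equally often.
    pooled = (detected_a + detected_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled detection rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts for illustration, not results from the paper.
z, p_value = two_proportion_z(detected_a=30, n_a=100, detected_b=10, n_b=100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value here would only say that the two populations detect the honeypot at different rates; session lengths would need a separate comparison, since they are not proportions.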

Figures

Figures reproduced from arXiv: 2605.29963 by Fabian Kaczmarczyck, Ilia Shumailov, Ivan Petrov, Jamie Hayes, Luca Invernizzi, Mark Vero, Martin Vechev, Niels Heinen, Tianqi Fan.

**Figure 1.** Overview of Honeyval. The framework employs 16 backend applications that the honeypots simulate. Honeyval defines three evaluation tasks. In the main task (middle), the LLM-powered honeypot is evaluated directly against an agentic attacker, simulating a practical attack scenario in a scalable way. To measure the baseline capabilities of both the agentic attacker and the honeypot, we employ two control task…

**Figure 2.** Honeypot detection rates by the agents. The false positive rate (FPR) of the detection is measured on the agents’ control task; i.e., how often they mark a real implementation of the app as a honeypot. While all honeypots are detected at a higher rate than their corresponding FPRs, LLM-powered honeypots remain significantly more stealthy than the rule-based baseline.

**Figure 3.** Honeypot relative running cost advantage against the hacking agents. The relative cost advantage is calculated by dividing the hacking agents’ costs with the honeypot running costs. On average, across direct interactions, the LLM-powered honeypots are more cost-efficient than the attacking agent.
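
The quantities named in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `Run` record and its field names are hypothetical, and the cost advantage is computed here as a ratio of summed costs, whereas the caption only says that agent costs are divided by honeypot costs and does not specify the aggregation.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One attacker session (hypothetical record layout, not from the paper)."""
    target_is_honeypot: bool    # False on the agent's control task (real app)
    flagged_as_honeypot: bool   # the agent's verdict for this session
    agent_cost: float           # what the attacking agent spent
    honeypot_cost: float        # what the honeypot spent; 0.0 for a real app


def _rate(flags):
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def detection_rate(runs):
    """Share of honeypot sessions that the agent flagged as a honeypot."""
    return _rate(r.flagged_as_honeypot for r in runs if r.target_is_honeypot)


def false_positive_rate(runs):
    """Share of real-app sessions (control task) wrongly flagged as a honeypot."""
    return _rate(r.flagged_as_honeypot for r in runs if not r.target_is_honeypot)


def relative_cost_advantage(runs):
    """Agent cost divided by honeypot cost over honeypot sessions.

    Values above 1 mean the attacker spends more than the defender.
    Assumption: ratio of sums; a mean of per-session ratios is equally plausible.
    """
    honeypot_runs = [r for r in runs if r.target_is_honeypot]
    honeypot_total = sum(r.honeypot_cost for r in honeypot_runs)
    if honeypot_total == 0:
        raise ValueError("no honeypot cost recorded")
    return sum(r.agent_cost for r in honeypot_runs) / honeypot_total
```

Reading the detection rate next to the false positive rate matters because an agent that flags everything as a honeypot would score a high detection rate without actually telling honeypots apart from real applications.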

read the original abstract

Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypot configurations, and observe unique trade-offs, such as longer interactions at the cost of increased detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Honeyval gives a reproducible way to benchmark LLM honeypots against AI agents across 16 backends, but the headline advantages rest on an unvalidated proxy for real attackers.

The paper's main contribution is Honeyval, a framework that replaces ad-hoc tests with a setup using 16 backend apps, AI hacking agents as attackers, two control tasks, and explicit exploit goals. This is new for the subfield and lets researchers run consistent comparisons across honeypot configs, including counter-offensive variants. The experiments show that LLM backbones produce longer sessions and are detected less often by frontier models than rule-based baselines, and that on average they cost less to run than the agents attacking them.

That setup is useful because prior work relied on fixed commands or manual checks that do not scale or adapt well. The control tasks and verifiable goals add a layer of monitoring that makes results easier to interpret and reproduce.

The soft spot is the attacker model. All the reported gains come from interactions with the same class of AI agents. The paper does not show that these agents match the exploration patterns, goals, or detection triggers of human red teams or non-LLM tools. If the agents respond differently to LLM-generated replies than real attackers would, the longer sessions and lower detection numbers will not carry over. The abstract and high-level results give no external validation against public exploit traces or human logs, so the practical claim stays conditional on that assumption.

The work is aimed at researchers building or evaluating LLM-based security tools. A reader already working on honeypots or LLM security will get concrete methods and trade-off data they can extend. The framework itself is worth referee time even if the proxy question needs more evidence; the structure is clear enough that reviewers can ask for targeted additions rather than reject outright.

Referee Report

3 major / 2 minor

Summary. The paper proposes Honeyval, a framework for evaluating LLM-powered HTTP honeypots. It grounds evaluations in 16 backend applications, employs AI hacking agents as attackers, uses two control tasks to monitor capabilities, and defines verifiable exploit goals. Experiments with cost-efficient LLMs as honeypots claim substantially longer attacker interactions than rule-based baselines, lower detection rates even by frontier models, and average running cost advantages, with additional tests on counter-offensive configurations showing trade-offs such as longer interactions at the expense of higher detection.

Significance. If the AI-agent proxy assumption holds, the framework provides a scalable, reproducible evaluation method that improves on prior ad-hoc testing by incorporating control tasks and explicit goals; this could enable systematic comparison of honeypot configurations. The explicit use of defined exploit goals and control tasks is a strength for reproducibility and falsifiability within the agent-based setting.

major comments (3)

[Abstract and §4 (Experiments)] Abstract and evaluation methodology section: the central claim that LLM-powered honeypots show 'promise' for practical use (longer interactions, lower detection, cost advantage) rests entirely on measurements against AI hacking agents; no external validation against human red-team logs, public exploit datasets, or non-LLM tools is described, which directly affects whether the reported advantages generalize beyond the proxy.
[§3 (Framework)] Framework description (likely §3): the two control tasks and defined exploit goals are presented as accurately measuring honeypot performance, yet the manuscript provides no evidence that these tasks reproduce attacker behavior patterns observed outside the evaluation (e.g., differences in exploration or response to LLM-generated vs. rule-based replies), making the proxy assumption load-bearing for all quantitative results.
[§4 (Results)] Results on detection and interaction length: claims of 'far less frequently detected even by frontier models' are measured only within the agent-attacker loop; without a comparison baseline using non-agent attackers or real-world traces, it is unclear whether lower detection reflects honeypot quality or limitations in how the agents model detection.

minor comments (2)

[Abstract] The abstract states results without referencing specific tables or figures for the quantitative claims (e.g., interaction lengths or detection rates); adding these cross-references would improve clarity.
[§3] Notation for the 16 backend applications and the two control tasks could be more explicitly defined early in the framework section to aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback correctly identifies that our evaluation framework relies on an AI-agent proxy for attackers, which is a core design decision for scalability and reproducibility. We address each major comment below and will incorporate clarifications and an expanded limitations discussion in the revised manuscript.

Referee: [Abstract and §4 (Experiments)] Abstract and evaluation methodology section: the central claim that LLM-powered honeypots show 'promise' for practical use (longer interactions, lower detection, cost advantage) rests entirely on measurements against AI hacking agents; no external validation against human red-team logs, public exploit datasets, or non-LLM tools is described, which directly affects whether the reported advantages generalize beyond the proxy.

Authors: We agree that all quantitative results are obtained using AI hacking agents as attackers. This choice enables controlled, reproducible experiments across many configurations, as motivated in the introduction and framework sections. The manuscript frames Honeyval as an agent-based evaluation framework rather than a direct simulator of all attacker types. In revision we will update the abstract and §4 to state explicitly that advantages are measured against AI agents and add a limitations subsection discussing the proxy assumption along with suggestions for future validation using human red-team data or public exploit traces.

Revision: partial

Referee: [§3 (Framework)] Framework description (likely §3): the two control tasks and defined exploit goals are presented as accurately measuring honeypot performance, yet the manuscript provides no evidence that these tasks reproduce attacker behavior patterns observed outside the evaluation (e.g., differences in exploration or response to LLM-generated vs. rule-based replies), making the proxy assumption load-bearing for all quantitative results.

Authors: The control tasks and exploit goals function as internal consistency checks to confirm that observed differences arise from honeypot customizations rather than agent or LLM capability collapse. We acknowledge that the manuscript supplies no external empirical comparison showing these tasks replicate real-world attacker patterns. We will revise §3 to clarify their role as capability monitors within the agent setting and to note the absence of external behavioral validation as a limitation.

Revision: partial

Referee: [§4 (Results)] Results on detection and interaction length: claims of 'far less frequently detected even by frontier models' are measured only within the agent-attacker loop; without a comparison baseline using non-agent attackers or real-world traces, it is unclear whether lower detection reflects honeypot quality or limitations in how the agents model detection.

Authors: Detection and interaction metrics are computed inside the closed agent-attacker loop, which isolates the influence of honeypot responses on agent continuation and success. We will revise §4 to emphasize that these outcomes are specific to the agent attacker model and to discuss how the agents' internal detection heuristics may affect the measured rates, while preserving the comparative results against the rule-based baseline within the same evaluation setting.

Revision: partial

Circularity Check

0 steps flagged

No circularity in empirical evaluation framework

The paper introduces Honeyval as an empirical evaluation framework for LLM-powered HTTP honeypots, comparing them against baselines using AI hacking agents, control tasks, and defined exploit goals. No derivation chain, equations, or first-principles predictions are claimed; results are reported from direct experiments on interaction length, detection rates, and costs. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. The central claims rest on observable experimental outcomes rather than any construction that reduces to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted on abstract only; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5831 in / 1072 out tokens · 19605 ms · 2026-06-29T06:59:56.561884+00:00 · methodology

[1] [1]

An evening with berferd in which a cracker is lured, endured, and studied

Bill Cheswick. An evening with berferd in which a cracker is lured, endured, and studied. In Proceedings of the Winter 1992 USENIX Conference, pages 163–173, 1992

1992

[2] [2]

A virtual honeypot framework

Niels Provos. A virtual honeypot framework. In13th USENIX Security Symposium (USENIX Security 04), San Diego, CA, August 2004. USENIX As- sociation. URL https://www.usenix.org/conference/13th-usenix-security- symposium/virtual-honeypot-framework

2004

[3] [3]

Chatbots in a honeypot world, 2023

Forrest McKee and David Noever. Chatbots in a honeypot world, 2023. URL https://arxiv. org/abs/2301.03771

work page arXiv 2023

[4] [4]

Strong supermartingales and limits of nonnegative martingales

Muris Sladi ´c, Veronica Valeros, Carlos Catania, and Sebastian Garcia. Llm in the shell: Generative honeypots. In2024 IEEE European Symposium on Security and Privacy Workshops, page 430–435. IEEE, July 2024. doi: 10.1109/eurospw61312.2024.00054. URL http://dx. doi.org/10.1109/EuroSPW61312.2024.00054

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/eurospw61312.2024.00054 2024

[5] [5]

Beelzebub: Ai-native security platforms

Beelzebub. Beelzebub: Ai-native security platforms. https://beelzebub.ai/, 2026. Last accessed: 18.04.2026

2026

[6] Robert A. Bridges, Thomas R. Mitchell, Mauricio Muñoz, and Ted Henriksson. SoK: Honeypots & LLMs, more than the sum of their parts?, 2025. URL https://arxiv.org/abs/2510.25939

[7] Niclas Ilg, Dominik Germek, Paul Duplys, and Michael Menth. Beekeeper: Accelerating honeypot analysis with LLM-driven feedback. IEEE Access, 13:168034–168054, 2025. doi: 10.1109/ACCESS.2025.3613118

[8] Reworr, Artem Petrov, and Dmitrii Volkov. GPT-5 at CTFs: Case studies from top-tier cybersecurity events, 2025. URL https://arxiv.org/abs/2511.04860

[9] Nicholas Carlini, Keane Lucas, Evyatar Ben Asher, Newton Cheng, Hasnain Lakhani, David Forsythe, and Kyla Guru. Evaluating and mitigating the growing risk of LLM-discovered 0-days. https://red.anthropic.com/2026/zero-days/, 2026. Last accessed: 17.04.2026

[10] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[11] Anthropic. Claude Code, 2026. URL https://github.com/anthropics/claude-code

[12] Google. Gemini CLI, 2026. URL https://github.com/google-gemini/gemini-cli

[13] Iyatiti Mokube and Michele Adams. Honeypots: concepts, approaches, and challenges. In Proceedings of the 45th Annual ACM Southeast Conference, ACMSE ’07, pages 321–326, New York, NY, USA, 2007. Association for Computing Machinery. ISBN 9781595936295. doi: 10.1145/1233341.1233399. URL https://doi.org/10.1145/1233341.1233399

[14] Cowrie. Cowrie: Advanced SSH honeypot for enterprise security, 2026. URL https://www.cowrie.org/. Last accessed: 17.04.2026

[15] Tongbo Luo, Zhaoyan Xu, Xing Jin, Yanhui Jia, and Xin Ouyang. IoTCandyJar: Towards an intelligent-interaction honeypot for IoT devices. In Black Hat USA, 2017. URL https://blackhat.com/docs/us-17/thursday/us-17-Luo-Iotcandyjar-Towards-An-Intelligent-Interaction-Honeypot-For-IoT-Devices-wp.pdf

[16] Chongqi Guan, Guohong Cao, and Sencun Zhu. HoneyLLM: Enabling shell honeypots with large language models. In 2024 IEEE Conference on Communications and Network Security (CNS), 2024. doi: 10.1109/CNS62487.2024.10735663

[18] Hakan T. Otal and M. Abdullah Canbaz. LLM honeypot: Leveraging large language models as advanced interactive honeypot systems. In 2024 IEEE Conference on Communications and Network Security (CNS), pages 1–6. IEEE, September 2024. doi: 10.1109/cns62487.2024.10735607. URL http://dx.doi.org/10.1109/CNS62487.2024.10735607

[19] Pranjay Malhotra. LLMHoney: A real-time SSH honeypot with large language model-driven dynamic response generation, 2025. URL https://arxiv.org/abs/2509.01463

[20] Anastasia Safargalieva, Artur Rüffer, and Emmanouil Vasilomanolakis. Ohra: dynamic multi-protocol LLM-based cyber deception. In Proceedings of the 30th Nordic Conference on Secure IT Systems (Nordsec 2025). Springer, 2025.

[21] Guan Yang, Zhengzheng Sun, and Yu Wang. Shellbox: Adversarially enhanced LLM-interactive honeypot framework. IEEE Access, 13:143618–143630, 2025. doi: 10.1109/ACCESS.2025.3598779

[22] Muris Sladić, Veronica Valeros, Carlos Catania, and Sebastian Garcia. Vellmes: A high-interaction AI-based deception framework. In 2025 IEEE European Symposium on Security and Privacy Workshops, pages 671–679. IEEE, June 2025. doi: 10.1109/eurospw67616.2025.00082. URL http://dx.doi.org/10.1109/EuroSPW67616.2025.00082

[23] Ziyang Wang, Jianzhou You, Haining Wang, Tianwei Yuan, Shichao Lv, Yang Wang, and Limin Sun. HoneyGPT: Breaking the trilemma in terminal honeypots with large language model, 2025. URL https://arxiv.org/abs/2406.01882

[24] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: An LLM-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782, 2023

[25] Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. PentestAgent: Incorporating LLM agents to automated penetration testing. URL https://arxiv.org/abs/2411.05185

[27] Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, and Ofir Press. Enigma: Interactive tools substantially assist LM agents in finding security vulnerabilities. URL https://arxiv.org/abs/2409.16165

[29] Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. D-CIPHER: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security, 2025. UR...

[30] Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. Craken: Cybersecurity LLM agent with knowledge-based execution, 2025. URL https://arxiv.org/abs/2505.17107

[31] XBOW. XBOW: The intelligence of a hacker at the speed of a machine, 2026. URL https://xbow.com/. Last accessed: 17.04.2026

[32] FireCompass. FireCompass: Automated pen testing & red teaming across web, API, cloud & infra., 2026. URL https://firecompass.com/. Last accessed: 17.04.2026

[33] Rustem Turtayev, Artem Petrov, Dmitrii Volkov, and Denis Volk. Hacking CTFs with plain agents. URL https://arxiv.org/abs/2412.02776

[35] Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=2YvbLQEdYt

[36] Nicholas Carlini, Newton Cheng, Keane Lucas, Michael Moore, Milad Nasr, Vinay Prabhushankar, Winnie Xiao Hakeem Angulu, Evyatar Ben Asher, Jackie Bow, Keir Bradwell, Ben Buchanan, David Forsythe, Daniel Freeman, Alex Gaynor, Xinyang Ge, Logan Graham, Kyla Guru, Hasnain Lakhani, Matt McNiece, Mojtaba Mehrara, Renee Nichol, Adnan Pirzada, Sophia Porter, A... Assessing Claude Mythos Preview’s cybersecurity capabilities, 2026.

[37] Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, and Martin Vechev. BaxBench: Can LLMs generate correct and secure backends? arXiv preprint arXiv:2502.11844, 2025

[38] OpenAPI Initiative. The OpenAPI specification. https://github.com/OAI/OpenAPI-Specification, 2026. Last accessed: 16.04.2026

[39] CodeWall. How we hacked McKinsey’s AI platform. https://codewall.ai/blog/how-we-hacked-mckinseys-ai-platform, 2026. Last accessed: 16.04.2026.

[40] Google. Gemini 3 Flash: frontier intelligence built for speed, 2026. URL https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/. Last accessed: 21.05.2026

[41] Anthropic. Claude Sonnet 4.6, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6. Last accessed: 15.04.2026

[42] Anthropic. Claude Haiku 4.5, 2025. URL https://www.anthropic.com/claude/haiku. Last accessed: 15.04.2026

[43] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

[44] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[45] OpenAI. Introducing GPT-5.4 mini and nano, 2026. URL https://openai.com/index/introducing-gpt-5-4-mini-and-nano/. Last accessed: 15.04.2026

[46] Together AI. Serverless inference on Together AI, 2026. URL https://www.together.ai/serverless-inference. Last accessed: 15.04.2026

[47] Taalas. The model is the computer. https://taalas.com/, 2026. Last accessed: 06.05.2026

[48] Groq. Groq delivers fast, low cost inference that doesn’t flake when things get real. https://groq.com/, 2026. Last accessed: 06.05.2026

[49] Daniel Ayzenshteyn, Roy Weiss, and Yisroel Mirsky. Cloak, honey, trap: Proactive defenses against LLM agents. In Lujo Bauer and Giancarlo Pellegrino, editors, 34th USENIX Security Symposium, USENIX Security 2025, Seattle, WA, USA, August 13-15, 2025, pages 8095–8114. USENIX Association, 2025. URL https://www.usenix.org/conference/usenixsecurity25/presen...

[50] Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, and Martin Vechev. AutoBaxBuilder: Bootstrapping code security benchmarking, 2025. URL https://arxiv.org/abs/2512.21132.

Appendix A Broader Impact Statement

In this paper, we introduced a comprehensive framework for evaluating, testing, and developing LLM-powered HTTP honeypots. As honeyp...