APT-Agent: Automated Penetration Testing using Large Language Models

(2) CSIRO Data61); Alsharif Abuadbba (2); Dan Dongseong Kim (1) ((1) University of Queensland; Kristen Moore (2); William Guanting Li (1)

arxiv: 2605.24949 · v1 · pith:HNFVYY7Rnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI

APT-Agent: Automated Penetration Testing using Large Language Models

William Guanting Li (1) , Alsharif Abuadbba (2) , Kristen Moore (2) , Dan Dongseong Kim (1) ((1) University of Queensland , (2) CSIRO Data61) This is my paper

Pith reviewed 2026-06-30 00:08 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords penetration testinglarge language modelsautomationhallucination recoverymemory architecturecybersecurityexploitation success

0 comments

The pith

APT-Agent uses a rectification module and command memory to let an LLM complete full penetration testing sequences at 84 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Penetration testing chains reconnaissance, exploitation, and exfiltration steps on complex systems, but LLMs often generate invalid commands or lose track of prior actions. The paper presents APT-Agent as a framework that adds a hybrid rectification step to fix bad commands and a command-specific memory store to keep operational context across sequences. It evaluates the system on Metasploitable 2 against seven services covering web, database, and network protocols. Under matched conditions the approach reaches 84.29 percent end-to-end exploitation success while baselines achieve 48.57 percent and 18.57 percent. This setup reduces the need for constant human oversight in security assessments.

Core claim

APT-Agent is a fully automated LLM-driven penetration testing framework that systematically orchestrates reconnaissance, exploitation, and exfiltration. It introduces a hybrid rectification module to recover hallucinated commands and a command-specific memory architecture to preserve operational context across multi-step attack sequences, achieving an 84.29 percent end-to-end exploitation success rate on Metasploitable 2 against seven vulnerable services.

What carries the argument

The hybrid rectification module paired with command-specific memory architecture, which corrects invalid LLM outputs and retains attack history for sequential decision making.

If this is right

Enables more complete automation of reconnaissance through exfiltration phases without constant operator input.
Produces higher success rates than both script-based and earlier LLM pen-testing tools on the same target services.
Applies across web, database, and network protocol vulnerabilities in a single framework.
Lowers the cognitive load on security teams by handling error recovery internally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar rectification and memory techniques could transfer to other multi-step security tasks such as vulnerability scanning or incident response.
Testing the framework on production networks rather than lab images would reveal how well the modules handle real-world noise and defenses.
The architecture suggests that targeted memory designs may improve LLM reliability in other long-horizon technical domains beyond security.

Load-bearing premise

The rectification module reliably fixes hallucinated commands and the memory system preserves enough context for the LLM to finish multi-step sequences without human fixes.

What would settle it

Measure end-to-end success rates on the same seven services after disabling the rectification module or the command memory and compare against the reported 84.29 percent baseline.

Figures

Figures reproduced from arXiv: 2605.24949 by (2) CSIRO Data61), Alsharif Abuadbba (2), Dan Dongseong Kim (1) ((1) University of Queensland, Kristen Moore (2), William Guanting Li (1).

**Figure 1.** Figure 1: Example of Hallucination Challenge #2: Insufficient long-term contextual memory. Multi-step penetration tests require preserving accurate, verarXiv:2605.24949v1 [cs.CR] 24 May 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Humans carry context across stages, while LLMs repeat outputs and [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of APT-Agent and a CMM, which maintains stage-specific logs of executed commands and outcomes, enabling long-term context retention and adaptive reasoning. Mirroring real-world penetration testing teams, APT-Agent separates strategic reasoning from tactical execution. The Tactic Selection Module acts as the controller, the Command Generation Module issues commands, and the Output Translation Modu… view at source ↗

**Figure 4.** Figure 4: reports service-wise results across seven target services, with ten independent runs per service. APT-Agent consistently outperforms both baselines across nearly all services. The largest gaps appear on Apache 2.2.8, OpenSSH 4.7, and Telnet, where PentestGPT achieves at most 0–5 successes and Script Kiddie 0–2, while APT-Agent succeeds in 7–9 runs. On services including Apache 2.2.8, UnrealIRCd, and Samba… view at source ↗

**Figure 5.** Figure 5: Ablation Study: Success Rate [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Rectification Methods Success Rate runs, APT-Agent issued 53 Metasploit module invocations, 16 of which (30.2%) were hallucinated; the Rectification Module recovered 10 of these, yielding a 62.5% in-run correction rate. The CMM’s contribution to efficiency was equally pronounced: on UnrealIRCd, average EXFILTRATE iterations fell from 52 without the CMM to 3 with it, a 17-fold reduction. Together, these re… view at source ↗

**Figure 7.** Figure 7: LLM Context Comparison: Duplication Rate [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: LLM Context Methods Comparison: Success Rate [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their scale and complexity. Large Language Models (LLMs) offer new opportunities for automating these tasks, but existing approaches face two persistent challenges: hallucination of technical entities and insufficient long-term contextual memory. To address these issues, we present APT-Agent, a fully automated LLM-driven penetration testing framework that systematically orchestrates reconnaissance, exploitation, and exfiltration. APT-Agent introduces a hybrid rectification module to recover hallucinated commands and a command-specific memory architecture to preserve operational context across multi-step attack sequences. We evaluate our APT-Agent on Metasploitable 2 against seven vulnerable services spanning web, database, and network protocols. APT-Agent achieves an 84.29% end-to-end exploitation success rate, compared to 48.57% (Script Kiddie) and 18.57% (PentestGPT) under matched conditions. By reducing cognitive burden and minimizing reliance on human intervention, APT-Agent represents a step toward scalable, reliable, and cognitively efficient automation for penetration testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

APT-Agent adds a rectification module and command memory to LLM pentesting and reports a large success-rate lift on Metasploitable 2, but the evaluation is too thin to judge whether the gains are reliable.

read the letter

The main takeaway is that APT-Agent combines an LLM with a hybrid rectification module to fix hallucinated commands and a command-specific memory to hold context across steps. On Metasploitable 2 with seven services it reaches 84% end-to-end success against 49% and 19% for the two baselines under matched conditions.

The paper does a clean job of naming the two practical problems it targets and showing a measurable difference. The fixed target machine and explicit baseline comparison make the headline claim easy to understand and potentially replicable.

The soft spot is the evaluation. The abstract gives a single success percentage with no information on number of runs, variance, temperature settings, or exactly how exploitation success was scored. Without those controls it is difficult to know whether the improvement comes from the new modules or from other factors in the setup. If the full paper has the same gaps, the result is harder to rely on.

This is the sort of work that matters to people building LLM agents for security tasks. The architecture is concrete enough that others can try the modules even if they rerun the experiments. It is worth sending to peer review because the claim is testable and the reported delta is large enough to be worth checking.

Referee Report

1 major / 0 minor

Summary. The paper presents APT-Agent, an LLM-based automated penetration testing framework that introduces a hybrid rectification module to recover from hallucinated commands and a command-specific memory architecture to maintain context over multi-step attacks. It evaluates the system on Metasploitable 2 targeting seven vulnerable services and reports an 84.29% end-to-end exploitation success rate, outperforming Script Kiddie (48.57%) and PentestGPT (18.57%) under matched conditions.

Significance. If the empirical results hold under rigorous evaluation, the work would demonstrate a meaningful advance in applying LLMs to security automation by addressing hallucination and context retention, with the explicit baseline comparisons on a fixed target environment providing a clear point of reference for the field.

major comments (1)

[Evaluation] The central empirical claim of 84.29% success rate is presented without any description of the experimental protocol, number of trials, success criteria, variance across runs, or statistical comparison to baselines. This is load-bearing for the headline result and prevents assessment of whether the reported delta is robust or reproducible.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the evaluation requires substantially more detail on the experimental protocol to support the central claims.

read point-by-point responses

Referee: [Evaluation] The central empirical claim of 84.29% success rate is presented without any description of the experimental protocol, number of trials, success criteria, variance across runs, or statistical comparison to baselines. This is load-bearing for the headline result and prevents assessment of whether the reported delta is robust or reproducible.

Authors: We acknowledge that this comment is correct and that the manuscript as submitted does not contain the requested details. In the revised version we will add a dedicated Experimental Protocol subsection that specifies the number of independent trials, the precise success criteria (end-to-end exploitation resulting in shell access or data exfiltration), observed variance across runs, and statistical comparisons (e.g., appropriate significance tests) to the baselines. This addition will make the 84.29% result reproducible and allow readers to assess its robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical systems contribution: it describes an LLM orchestration framework (hybrid rectification + command-specific memory) and reports measured success rates on Metasploitable 2 against explicit baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim (84.29 % vs. 48.57 % / 18.57 %) is an observable experimental outcome, not a quantity forced by definition or prior self-work. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not detail any free parameters, axioms, or invented entities beyond the described modules.

pith-pipeline@v0.9.1-grok · 5747 in / 1106 out tokens · 40447 ms · 2026-06-30T00:08:53.153043+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 24 canonical work pages · 4 internal anchors

[1]

Technical guide to information security testing and assessment,

National Institute of Standards and Technology, “Technical guide to information security testing and assessment,” U.S. Department of Commerce, Tech. Rep. Special Publication 800-115, Sep. 2008, accessed: September 03, 2025. [Online]. Available: https://csrc.nist.gov/ publications/detail/sp/800-115/final

2008
[2]

Nautilus: Automated restful api vulnerability detection,

G. Deng, Z. Zhang, Y . Li, Y . Liu, T. Zhang, Y . Liu, Y . Guo, and D. Wang, “Nautilus: Automated restful api vulnerability detection,” inProceedings of the 32nd USENIX Security Symposium. USENIX Association, 2023

2023
[3]

Automated progressive red teaming,

B. Jiang, Y . Jing, T. Shen, T. Wu, Q. Yang, and D. Xiong, “Automated progressive red teaming,”arXiv preprint arXiv:2407.03876, 2024. [Online]. Available: https://arxiv.org/abs/2407.03876

work page arXiv 2024
[4]

Automated penetration testing: An overview,

F. Abu-Dabaseh and E. Alshammari, “Automated penetration testing: An overview,” inComputer Science & Information Technology (CS & IT), Apr. 2018, pp. 121–129

2018
[6]

A Survey of Large Language Models

[Online]. Available: https://arxiv.org/abs/2303.18223

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Digital twin for smart manufacturing, A review,

Y . Liu, T. Han, S. Ma, J. Zhang, Y . Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, L. Zhao, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, and B. Ge, “Summary of chatgpt-related research and perspective towards the future of large language models,”Meta- Radiology, vol. 1, no. 2, p. 100017, Sep. 2023. [Online]. Available: http://dx.doi.org/10.1016/j.metra...

work page doi:10.1016/j.metrad.2023.100017 2023
[8]

Exploitflow, cyber security exploitation routes for game theory and ai research in robotics,

V . Mayoral-Vilches, G. Deng, Y . Liu, M. Pinzger, and S. Rass, “Exploitflow, cyber security exploitation routes for game theory and ai research in robotics,” 2023. [Online]. Available: https: //arxiv.org/abs/2308.02152

work page arXiv 2023
[9]

How well does llm generate security tests?

Y . Zhang, W. Song, Z. Ji, D. Yao, and N. Meng, “How well does llm generate security tests?”arXiv preprint arXiv:2310.00710, 2023. [Online]. Available: https://arxiv.org/abs/2310.00710

work page arXiv 2023
[10]

Large language models for blockchain security: A systematic literature review,

Z. He, Z. Li, S. Yang, H. Ye, A. Qiao, X. Zhang, X. Luo, and T. Chen, “Large language models for blockchain security: A systematic literature review,”arXiv preprint arXiv:2403.14280, 2025. [Online]. Available: https://arxiv.org/abs/2403.14280

work page arXiv 2025
[11]

From promise to peril: Rethinking cybersecu- rity red and blue teaming in the age of llms,

A. Abuadbba, K. Moore, D. Goel, C. Hicks, V . Mavroudis, B. Hasir- cioglu, and P. Jennings, “From promise to peril: Rethinking cybersecu- rity red and blue teaming in the age of llms,”IEEE Security & Privacy, vol. 24, no. 2, pp. 53–63, 2026

2026
[12]

doi:10.48550/arXiv.2402.06664

R. Fang, R. Bindu, A. Gupta, Q. Zhan, and D. Kang, “Llm agents can autonomously hack websites,”arXiv preprint arXiv:2402.06664, Feb

work page arXiv
[13]

doi:10.48550/arXiv.2402.06664

[Online]. Available: https://doi.org/10.48550/arXiv.2402.06664

work page doi:10.48550/arxiv.2402.06664
[14]

LLM Agents can Autonomously Exploit One-day Vulnerabilities

R. Fang, R. Bindu, A. Gupta, and D. Kang, “Llm agents can autonomously exploit one-day vulnerabilities,”arXiv preprint arXiv:2404.08144, Apr. 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2404.08144

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

G. Deng, Y . Liu, V . Mayoral-Vilches, P. Liu, Y . Li, Y . Xu, T. Zhang, Y . Liu, M. Pinzger, and S. Rass, “Pentestgpt: An llm-empowered automatic penetration testing tool,”arXiv preprint arXiv:2308.06782, Aug. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308. 06782

work page doi:10.48550/arxiv.2308 2023
[16]

AutoAttacker: A large language model guided system to implement automatic cyber-attacks,

J. Xu, J. W. Stokes, G. McDonald, X. Bai, D. Marshall, S. Wang, A. Swaminathan, and Z. Li, “Autoattacker: A large language model guided system to implement automatic cyber-attacks,”arXiv preprint arXiv:2403.01038, Mar. 2024. [Online]. Available: https: //doi.org/10.48550/arXiv.2403.01038

work page doi:10.48550/arxiv.2403.01038 2024
[17]

Zhang, O

M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith, “How language model hallucinations can snowball,” 2023. [Online]. Available: https://arxiv.org/abs/2305.13534

work page arXiv 2023
[18]

Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,

N. Li, Y . Li, Y . Liu, L. Shi, K. Wang, and H. Wang, “Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2405. 00648

2024
[19]

SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models,

P. Manakul, A. Liusie, and M. Gales, “SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://openreview.net/forum?id=RwzFNbJ3Ez

2023
[20]

Attention Is All You Need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems. Curran Associates Inc., 2023. [Online]. Available: https://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,

L. Yang, H. Chen, Z. Li, X. Ding, and X. Wu, “Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,”arXiv preprint arXiv:2306.11489, Jun

work page arXiv
[22]

Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,

[Online]. Available: https://doi.org/10.48550/arXiv.2306.11489

work page doi:10.48550/arxiv.2306.11489
[23]

Efficient interactive fuzzy keyword search,

S. Ji, G. Li, C. Li, and J. Feng, “Efficient interactive fuzzy keyword search,” inProceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). New York, NY , USA: Association for Computing Machinery, 2009. [Online]. Available: https://doi.org/10.1145/1526709.1526760

work page doi:10.1145/1526709.1526760 2009
[24]

Metasploitable 2,

Rapid7, “Metasploitable 2,” 2025, accessed: September 03, 2025. [On- line]. Available: https://docs.rapid7.com/metasploit/metasploitable-2/

2025
[25]

Taken out of context: On measuring situational awareness in llms.arXiv preprint arXiv:2309.00667,

S. Moskal, S. Laney, E. Hemberg, and U.-M. O’Reilly, “Llms killed the script kiddie: How agents supported by large language models change the landscape of network threat testing,”arXiv preprint arXiv:2309.00667, Sep. 2023. [Online]. Available: https: //arxiv.org/abs/2309.00667

work page arXiv 2023
[26]

Pomdp + information- decay: Incorporating defender’s behaviour in autonomous penetration testing,

J. Schwartz, H. Kurniawati, and E. El-Mahassni, “Pomdp + information- decay: Incorporating defender’s behaviour in autonomous penetration testing,” inProceedings of the 30th International Conference on Automated Planning and Scheduling (ICAPS), 2020, pp. 235–243. [Online]. Available: https://doi.org/10.1609/icaps.v30i1.6666

work page doi:10.1609/icaps.v30i1.6666 2020
[27]

Gail-pt: An intelligent penetration testing framework with generative adversarial imitation learning,

J. Chen, S. Hu, H. Zheng, C. Xing, and G. Zhang, “Gail-pt: An intelligent penetration testing framework with generative adversarial imitation learning,”Computers & Security, vol. 126, p. 103055, 2023

2023
[28]

Evaluation of reinforcement learning for autonomous pene- tration testing using A3C, q-learning and DQN,

N. Becker, D. Reti, E. V . N. Ntagiou, M. Wallum, and H. D. Schotten, “Evaluation of reinforcement learning for autonomous penetration testing using a3c, q-learning and dqn,”arXiv preprint arXiv:2407.15656, Jul. 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2407.15656

work page arXiv 2024
[29]

A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing,

Q. Li, M. Zhang, Y . Shen, R. Wang, M. Hu, Y . Li, and H. Hao, “A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing,”Computers & Security, vol. 132, p. 103358, 2023. [Online]. Available: https: //doi.org/10.1016/j.cose.2023.103358

work page doi:10.1016/j.cose.2023.103358 2023
[30]

Pentestagent: Incorporating llm agents to automated penetration testing.arXiv preprint arXiv:2411.05185, 2024

X. Shen, L. Wang, Z. Li, Y . Chen, W. Zhao, D. Sun, J. Wang, and W. Ruan, “Pentestagent: Incorporating llm agents to automated penetration testing,”arXiv preprint arXiv:2411.05185, 2025. [Online]. Available: https://arxiv.org/abs/2411.05185

work page arXiv 2025
[31]

Metasploit,

Rapid7, “Metasploit,” https://www.metasploit.com/, 2025, accessed: September 03, 2025

2025
[32]

Penetration testing for system security: Methods and practical approaches,

W. Zhang, J. Xing, and X. Li, “Penetration testing for system security: Methods and practical approaches,”arXiv preprint arXiv:2505.19174, May 2025. [Online]. Available: https://arxiv.org/abs/2505.19174

work page arXiv 2025
[33]

Penetration Testing == POMDP Solving?

C. Sarraute, O. Buffet, and J. Hoffmann, “Penetration testing == pomdp solving?”arXiv preprint arXiv:1306.4714, 2013. [Online]. Available: https://doi.org/10.48550/arXiv.1306.4714

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1306.4714 2013
[34]

Gpt-3.5 turbo,

OpenAI, “Gpt-3.5 turbo,” https://platform.openai.com/docs/models/ gpt-3.5-turbo, 2022, accessed: September 03, 2025

2022
[35]

——, “Gpt-4o,” https://platform.openai.com/docs/models/gpt-4o, 2024, accessed: September 03, 2025

2024
[36]

Mitre att&ck ® — enterprise tactics,

MITRE Corporation, “Mitre att&ck ® — enterprise tactics,” accessed 2025-10-08. [Online]. Available: https://attack.mitre.org/ tactics/enterprise/

2025
[37]

PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation,

J. Huang and Q. Zhu, “PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation,” inProceedings of the Workshop on Autonomous Cybersecurity. ACM, Nov. 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3689933.3690831 APPENDIXA COMPARISON WITHPENHEAL PenHeal [34] also evaluates its framework on the Metas- ploitable II envi...

work page doi:10.1145/3689933.3690831 2024

[1] [1]

Technical guide to information security testing and assessment,

National Institute of Standards and Technology, “Technical guide to information security testing and assessment,” U.S. Department of Commerce, Tech. Rep. Special Publication 800-115, Sep. 2008, accessed: September 03, 2025. [Online]. Available: https://csrc.nist.gov/ publications/detail/sp/800-115/final

2008

[2] [2]

Nautilus: Automated restful api vulnerability detection,

G. Deng, Z. Zhang, Y . Li, Y . Liu, T. Zhang, Y . Liu, Y . Guo, and D. Wang, “Nautilus: Automated restful api vulnerability detection,” inProceedings of the 32nd USENIX Security Symposium. USENIX Association, 2023

2023

[3] [3]

Automated progressive red teaming,

B. Jiang, Y . Jing, T. Shen, T. Wu, Q. Yang, and D. Xiong, “Automated progressive red teaming,”arXiv preprint arXiv:2407.03876, 2024. [Online]. Available: https://arxiv.org/abs/2407.03876

work page arXiv 2024

[4] [4]

Automated penetration testing: An overview,

F. Abu-Dabaseh and E. Alshammari, “Automated penetration testing: An overview,” inComputer Science & Information Technology (CS & IT), Apr. 2018, pp. 121–129

2018

[5] [6]

A Survey of Large Language Models

[Online]. Available: https://arxiv.org/abs/2303.18223

work page internal anchor Pith review Pith/arXiv arXiv

[6] [7]

Digital twin for smart manufacturing, A review,

Y . Liu, T. Han, S. Ma, J. Zhang, Y . Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, L. Zhao, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, and B. Ge, “Summary of chatgpt-related research and perspective towards the future of large language models,”Meta- Radiology, vol. 1, no. 2, p. 100017, Sep. 2023. [Online]. Available: http://dx.doi.org/10.1016/j.metra...

work page doi:10.1016/j.metrad.2023.100017 2023

[7] [8]

Exploitflow, cyber security exploitation routes for game theory and ai research in robotics,

V . Mayoral-Vilches, G. Deng, Y . Liu, M. Pinzger, and S. Rass, “Exploitflow, cyber security exploitation routes for game theory and ai research in robotics,” 2023. [Online]. Available: https: //arxiv.org/abs/2308.02152

work page arXiv 2023

[8] [9]

How well does llm generate security tests?

Y . Zhang, W. Song, Z. Ji, D. Yao, and N. Meng, “How well does llm generate security tests?”arXiv preprint arXiv:2310.00710, 2023. [Online]. Available: https://arxiv.org/abs/2310.00710

work page arXiv 2023

[9] [10]

Large language models for blockchain security: A systematic literature review,

Z. He, Z. Li, S. Yang, H. Ye, A. Qiao, X. Zhang, X. Luo, and T. Chen, “Large language models for blockchain security: A systematic literature review,”arXiv preprint arXiv:2403.14280, 2025. [Online]. Available: https://arxiv.org/abs/2403.14280

work page arXiv 2025

[10] [11]

From promise to peril: Rethinking cybersecu- rity red and blue teaming in the age of llms,

A. Abuadbba, K. Moore, D. Goel, C. Hicks, V . Mavroudis, B. Hasir- cioglu, and P. Jennings, “From promise to peril: Rethinking cybersecu- rity red and blue teaming in the age of llms,”IEEE Security & Privacy, vol. 24, no. 2, pp. 53–63, 2026

2026

[11] [12]

doi:10.48550/arXiv.2402.06664

R. Fang, R. Bindu, A. Gupta, Q. Zhan, and D. Kang, “Llm agents can autonomously hack websites,”arXiv preprint arXiv:2402.06664, Feb

work page arXiv

[12] [13]

doi:10.48550/arXiv.2402.06664

[Online]. Available: https://doi.org/10.48550/arXiv.2402.06664

work page doi:10.48550/arxiv.2402.06664

[13] [14]

LLM Agents can Autonomously Exploit One-day Vulnerabilities

R. Fang, R. Bindu, A. Gupta, and D. Kang, “Llm agents can autonomously exploit one-day vulnerabilities,”arXiv preprint arXiv:2404.08144, Apr. 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2404.08144

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [15]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

G. Deng, Y . Liu, V . Mayoral-Vilches, P. Liu, Y . Li, Y . Xu, T. Zhang, Y . Liu, M. Pinzger, and S. Rass, “Pentestgpt: An llm-empowered automatic penetration testing tool,”arXiv preprint arXiv:2308.06782, Aug. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308. 06782

work page doi:10.48550/arxiv.2308 2023

[15] [16]

AutoAttacker: A large language model guided system to implement automatic cyber-attacks,

J. Xu, J. W. Stokes, G. McDonald, X. Bai, D. Marshall, S. Wang, A. Swaminathan, and Z. Li, “Autoattacker: A large language model guided system to implement automatic cyber-attacks,”arXiv preprint arXiv:2403.01038, Mar. 2024. [Online]. Available: https: //doi.org/10.48550/arXiv.2403.01038

work page doi:10.48550/arxiv.2403.01038 2024

[16] [17]

Zhang, O

M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith, “How language model hallucinations can snowball,” 2023. [Online]. Available: https://arxiv.org/abs/2305.13534

work page arXiv 2023

[17] [18]

Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,

N. Li, Y . Li, Y . Liu, L. Shi, K. Wang, and H. Wang, “Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2405. 00648

2024

[18] [19]

SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models,

P. Manakul, A. Liusie, and M. Gales, “SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://openreview.net/forum?id=RwzFNbJ3Ez

2023

[19] [20]

Attention Is All You Need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems. Curran Associates Inc., 2023. [Online]. Available: https://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [21]

Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,

L. Yang, H. Chen, Z. Li, X. Ding, and X. Wu, “Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,”arXiv preprint arXiv:2306.11489, Jun

work page arXiv

[21] [22]

Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,

[Online]. Available: https://doi.org/10.48550/arXiv.2306.11489

work page doi:10.48550/arxiv.2306.11489

[22] [23]

Efficient interactive fuzzy keyword search,

S. Ji, G. Li, C. Li, and J. Feng, “Efficient interactive fuzzy keyword search,” inProceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). New York, NY , USA: Association for Computing Machinery, 2009. [Online]. Available: https://doi.org/10.1145/1526709.1526760

work page doi:10.1145/1526709.1526760 2009

[23] [24]

Metasploitable 2,

Rapid7, “Metasploitable 2,” 2025, accessed: September 03, 2025. [On- line]. Available: https://docs.rapid7.com/metasploit/metasploitable-2/

2025

[24] [25]

Taken out of context: On measuring situational awareness in llms.arXiv preprint arXiv:2309.00667,

S. Moskal, S. Laney, E. Hemberg, and U.-M. O’Reilly, “Llms killed the script kiddie: How agents supported by large language models change the landscape of network threat testing,”arXiv preprint arXiv:2309.00667, Sep. 2023. [Online]. Available: https: //arxiv.org/abs/2309.00667

work page arXiv 2023

[25] [26]

Pomdp + information- decay: Incorporating defender’s behaviour in autonomous penetration testing,

J. Schwartz, H. Kurniawati, and E. El-Mahassni, “Pomdp + information- decay: Incorporating defender’s behaviour in autonomous penetration testing,” inProceedings of the 30th International Conference on Automated Planning and Scheduling (ICAPS), 2020, pp. 235–243. [Online]. Available: https://doi.org/10.1609/icaps.v30i1.6666

work page doi:10.1609/icaps.v30i1.6666 2020

[26] [27]

Gail-pt: An intelligent penetration testing framework with generative adversarial imitation learning,

J. Chen, S. Hu, H. Zheng, C. Xing, and G. Zhang, “Gail-pt: An intelligent penetration testing framework with generative adversarial imitation learning,”Computers & Security, vol. 126, p. 103055, 2023

2023

[27] [28]

Evaluation of reinforcement learning for autonomous pene- tration testing using A3C, q-learning and DQN,

N. Becker, D. Reti, E. V . N. Ntagiou, M. Wallum, and H. D. Schotten, “Evaluation of reinforcement learning for autonomous penetration testing using a3c, q-learning and dqn,”arXiv preprint arXiv:2407.15656, Jul. 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2407.15656

work page arXiv 2024

[28] [29]

A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing,

Q. Li, M. Zhang, Y . Shen, R. Wang, M. Hu, Y . Li, and H. Hao, “A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing,”Computers & Security, vol. 132, p. 103358, 2023. [Online]. Available: https: //doi.org/10.1016/j.cose.2023.103358

work page doi:10.1016/j.cose.2023.103358 2023

[29] [30]

Pentestagent: Incorporating llm agents to automated penetration testing.arXiv preprint arXiv:2411.05185, 2024

X. Shen, L. Wang, Z. Li, Y . Chen, W. Zhao, D. Sun, J. Wang, and W. Ruan, “Pentestagent: Incorporating llm agents to automated penetration testing,”arXiv preprint arXiv:2411.05185, 2025. [Online]. Available: https://arxiv.org/abs/2411.05185

work page arXiv 2025

[30] [31]

Metasploit,

Rapid7, “Metasploit,” https://www.metasploit.com/, 2025, accessed: September 03, 2025

2025

[31] [32]

Penetration testing for system security: Methods and practical approaches,

W. Zhang, J. Xing, and X. Li, “Penetration testing for system security: Methods and practical approaches,”arXiv preprint arXiv:2505.19174, May 2025. [Online]. Available: https://arxiv.org/abs/2505.19174

work page arXiv 2025

[32] [33]

Penetration Testing == POMDP Solving?

C. Sarraute, O. Buffet, and J. Hoffmann, “Penetration testing == pomdp solving?”arXiv preprint arXiv:1306.4714, 2013. [Online]. Available: https://doi.org/10.48550/arXiv.1306.4714

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1306.4714 2013

[33] [34]

Gpt-3.5 turbo,

OpenAI, “Gpt-3.5 turbo,” https://platform.openai.com/docs/models/ gpt-3.5-turbo, 2022, accessed: September 03, 2025

2022

[34] [35]

——, “Gpt-4o,” https://platform.openai.com/docs/models/gpt-4o, 2024, accessed: September 03, 2025

2024

[35] [36]

Mitre att&ck ® — enterprise tactics,

MITRE Corporation, “Mitre att&ck ® — enterprise tactics,” accessed 2025-10-08. [Online]. Available: https://attack.mitre.org/ tactics/enterprise/

2025

[36] [37]

PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation,

J. Huang and Q. Zhu, “PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation,” inProceedings of the Workshop on Autonomous Cybersecurity. ACM, Nov. 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3689933.3690831 APPENDIXA COMPARISON WITHPENHEAL PenHeal [34] also evaluates its framework on the Metas- ploitable II envi...

work page doi:10.1145/3689933.3690831 2024