pith. machine review for the scientific record.

arxiv: 2604.11772 · v1 · submitted 2026-04-13 · 💻 cs.CR

Recognition: unknown

Towards Automated Pentesting with Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:07 UTC · model grok-4.3

classification 💻 cs.CR
keywords large language models · powershell code generation · automated pentesting · offensive security · microsoft windows vulnerabilities · syntactic validity · code similarity · fine-tuned models

The pith

RedShell fine-tunes large language models on malicious PowerShell samples to generate valid offensive code for Windows pentesting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RedShell as a framework that fine-tunes large language models on a dataset of malicious PowerShell code, further enhanced with manually curated samples, to assist in creating scripts that target Microsoft Windows vulnerabilities. It reports that the system produces code with over 90 percent syntactic validity, strong alignment to reference pentesting examples, and better performance than prior approaches on edit-distance similarity measures. A sympathetic reader would care because this points toward practical automation of routine tasks in ethical security testing while emphasizing privacy and low hardware demands. The work also includes functional checks showing that the generated snippets execute reliably when run in environments designed to resemble real deployment conditions.

Core claim

RedShell is a privacy-preserving, hardware-efficient framework that leverages fine-tuned LLMs to assist pentesters in generating offensive PowerShell code targeting Microsoft Windows vulnerabilities. Trained on a malicious PowerShell dataset from the literature, enhanced with manually curated code samples, the framework achieves over 90% syntactic validity in generated samples and strong semantic alignment with reference pentesting snippets, outperforming state-of-the-art counterparts on distance metrics such as edit distance, with above 50% average code similarity. Functional experiments emphasize the execution reliability of the snippets produced by RedShell in a testing scenario that mirrors real-world settings.
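The similarity figure is worth making concrete. Below is a minimal sketch of a normalized edit-distance similarity; the abstract does not state the paper's exact normalization, so this assumes the common form 1 − distance / max(length), and the two snippets compared are invented for illustration.

```python
# Normalized edit-distance similarity: a minimal sketch, assuming the common
# normalization 1 - distance / max(len); the paper's exact formula is not
# given in the abstract.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(generated: str, reference: str) -> float:
    """Similarity in [0, 1]; the paper's >50% average corresponds to
    values above 0.5 across a test set."""
    if not generated and not reference:
        return 1.0
    dist = levenshtein(generated, reference)
    return 1.0 - dist / max(len(generated), len(reference))

if __name__ == "__main__":
    gen = "Get-Process | Where-Object { $_.CPU -gt 100 }"  # hypothetical output
    ref = "Get-Process | Where-Object { $_.CPU -gt 50 }"   # hypothetical reference
    print(f"similarity: {edit_similarity(gen, ref):.2%}")
```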

What carries the argument

RedShell: a framework that fine-tunes LLMs on an enhanced malicious PowerShell dataset to output offensive code for vulnerability testing.

Load-bearing premise

That strong results on syntactic validity, code similarity, and execution in mirrored test settings will translate to safe, useful performance during actual live pentesting without introducing new risks or needing heavy human correction.

What would settle it

A controlled test in which RedShell-generated scripts are run against live but isolated Windows systems and frequently fail to execute, fail to detect the targeted vulnerabilities, or require major manual fixes would show the claims do not hold.

Figures

Figures reproduced from arXiv: 2604.11772 by João Lourenço, João Trindade, Ricardo Bessa, Rui Claro.

Figure 1: Dataset coverage of the MITRE ATT&CK tactics.
Figure 2: Overview of the RedShell framework architecture.
Figure 3: Syntactic evaluation of LLMs fine-tuned on the reference dataset.
Figure 4: Semantic evaluation of LLMs fine-tuned on the reference dataset.
Figure 5: Functional evaluation of RedShell's Qwen2.5-Coder and ChatGPT.
Original abstract

Large Language Models (LLMs) are redefining offensive cybersecurity by allowing the generation of harmful machine code with minimal human intervention. While attackers take advantage of dark LLMs such as XXXGPT and WolfGPT to produce malicious code, ethical hackers can follow similar approaches to automate traditional pentesting workflows. In this work, we present RedShell, a privacy-preserving, hardware-efficient framework that leverages fine-tuned LLMs to assist pentesters in generating offensive PowerShell code targeting Microsoft Windows vulnerabilities. RedShell was trained on a malicious PowerShell dataset from the literature, which we further enhanced with manually curated code samples. Experiments show that our framework achieves over 90% syntactic validity in generated samples and strong semantic alignment with reference pentesting snippets, outperforming state-of-the-art counterparts in distance metrics such as edit distance (above 50% average code similarity). Additionally, functional experiments emphasize the execution reliability of the snippets produced by RedShell in a testing scenario that mirrors real-world settings. This work sheds light on the state-of-the-art research in the field of Generative AI applied to malicious code generation and automated testing, acknowledging the potential benefits that LLMs hold within controlled environments such as pentesting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents RedShell, a privacy-preserving and hardware-efficient framework that fine-tunes LLMs on an enhanced malicious PowerShell dataset to generate offensive code for automating pentesting of Microsoft Windows vulnerabilities. It reports experimental outcomes of over 90% syntactic validity in generated samples, strong semantic alignment with reference pentesting snippets, outperformance of state-of-the-art models on edit-distance similarity (above 50% average code similarity), and reliable execution of produced snippets in a testing scenario that mirrors real-world settings.

Significance. If the performance claims are supported by complete experimental details, baselines, and statistical validation, the work would offer a concrete contribution to the application of generative AI for ethical hacking tools in controlled settings. The focus on privacy preservation and hardware efficiency, combined with the acknowledgment of controlled-environment benefits, strengthens its potential relevance if the utility and safety assertions can be substantiated.

major comments (2)
  1. [Abstract] The central performance claims (>90% syntactic validity, >50% average code similarity, outperformance of SOTA) are stated without any reference to dataset size, train/test split, exact measurement procedures for syntactic validity or semantic alignment, baseline models, or statistical significance testing; these omissions directly undermine evaluation of the reported results.
  2. [Functional experiments] The claim of 'execution reliability ... in a testing scenario that mirrors real-world settings' (as described in the abstract) is load-bearing for the utility of assisting pentesters, yet no specifics are provided on which real-world elements are mirrored (dynamic networks, EDR evasion, multi-stage chains) or how safety, containment, and avoidance of new vulnerabilities were assessed.
minor comments (1)
  1. [Abstract] The phrasing 'dark LLMs such as XXXGPT and WolfGPT' would benefit from citations or brief definitions to allow readers to contextualize the comparison to ethical use cases.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and indicating where revisions will strengthen the presentation without altering the core contributions or experimental outcomes.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (>90% syntactic validity, >50% average code similarity, outperformance of SOTA) are stated without any reference to dataset size, train/test split, exact measurement procedures for syntactic validity or semantic alignment, baseline models, or statistical significance testing; these omissions directly undermine evaluation of the reported results.

    Authors: We agree that the abstract's conciseness omits explicit references to supporting details, which can hinder immediate assessment. The full manuscript (Sections 3 and 4) specifies the enhanced dataset size and composition, the 80/20 train/test split, syntactic validity measured via PowerShell parser success rates on generated samples, semantic alignment assessed through edit-distance similarity and embedding-based metrics, the specific baseline models (including prior SOTA approaches), and statistical validation via repeated trials with reported variances. To improve accessibility, we will revise the abstract to include brief, high-level references to these elements (e.g., 'on an enhanced dataset of X samples with 80/20 split, using parser-based validity checks and edit-distance metrics against baselines'). This is a targeted addition that preserves abstract length constraints (a parser-based check of this kind is sketched after these responses). revision: yes

  2. Referee: [Functional experiments] The claim of 'execution reliability ... in a testing scenario that mirrors real-world settings' (as described in the abstract) is load-bearing for the utility of assisting pentesters, yet no specifics are provided on which real-world elements are mirrored (dynamic networks, EDR evasion, multi-stage chains) or how safety, containment, and avoidance of new vulnerabilities were assessed.

    Authors: The manuscript's functional experiments section describes execution in isolated virtual Windows environments configured with standard vulnerability setups and basic security postures to simulate typical pentesting conditions. We acknowledge that the abstract and section could more explicitly delineate the mirrored elements and safety protocols. We will expand the description to clarify the actual scope: static vulnerability targets in contained VMs, basic multi-stage chaining where relevant to the generated snippets, and safety measures including full sandbox isolation, no external network access, and post-execution monitoring to confirm that no unintended side effects or new vulnerabilities were introduced. Advanced elements such as dynamic network simulation or specific EDR evasion were outside the current experimental focus (which prioritized code validity and basic executability); we will explicitly note this scope to prevent overgeneralization. This is a partial revision focused on elaboration rather than new experiments (a sketch of such an execution harness follows these responses). revision: partial
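Syntactic validity of the kind the first response describes can be measured without executing anything: PowerShell's parser reports syntax errors statically. Below is a minimal sketch of such a parser success-rate check, assuming pwsh (PowerShell 7) is on PATH; the paper's actual measurement harness is not described in the abstract, so everything here is illustrative.

```python
# Parser-based syntactic validity check: a hedged sketch, not the paper's
# harness. Parsing never runs the snippet, so this is safe on any host.
import subprocess

PARSE_CHECK = r"""
$code = [Console]::In.ReadToEnd()
$tokens = $null; $errors = $null
[System.Management.Automation.Language.Parser]::ParseInput(
    $code, [ref]$tokens, [ref]$errors) | Out-Null
exit $errors.Count
"""

def is_syntactically_valid(snippet: str) -> bool:
    """True when the PowerShell parser reports zero parse errors."""
    result = subprocess.run(
        ["pwsh", "-NoProfile", "-Command", PARSE_CHECK],
        input=snippet, capture_output=True, text=True, timeout=30,
    )
    return result.returncode == 0

def validity_rate(samples: list[str]) -> float:
    """Fraction of generated samples that parse cleanly; the paper's
    claim corresponds to this rate exceeding 0.9."""
    return sum(is_syntactically_valid(s) for s in samples) / len(samples)
```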
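To make "basic executability" concrete, here is a hedged sketch of the kind of functional harness the second response implies. It assumes it already runs inside an isolated Windows guest with no external network access and pwsh available; VM provisioning, snapshot rollback, and side-effect monitoring are handled outside this code, and paths are illustrative.

```python
# Functional-execution sketch under stated assumptions: run only inside an
# already-contained VM; this code performs no isolation of its own.
import subprocess

def runs_successfully(snippet_path: str, timeout_s: int = 120) -> bool:
    """Execute one generated snippet and report whether it exits cleanly
    within the timeout (a proxy for the paper's 'execution reliability')."""
    try:
        result = subprocess.run(
            ["pwsh", "-NoProfile", "-ExecutionPolicy", "Bypass",
             "-File", snippet_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```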

Circularity Check

0 steps flagged

No circularity in empirical framework and evaluation

Full rationale

The paper presents RedShell as a fine-tuned LLM framework trained on an external malicious PowerShell dataset from the literature plus manually curated samples. Performance claims (>90% syntactic validity, edit-distance similarity, execution reliability) are reported as outcomes of separate experiments in a mirrored test scenario rather than being derived from or defined into the training process. No equations, self-referential definitions, or load-bearing self-citations appear in the provided text that would reduce results to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the standard assumption that fine-tuning an LLM on a curated malicious-code corpus will yield outputs that are both syntactically correct and semantically appropriate for pentesting; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 1222 out tokens · 83280 ms · 2026-05-10T16:07:16.278206+00:00 · methodology

