pith. machine review for the scientific record.

arxiv: 2604.11772 · v1 · submitted 2026-04-13 · 💻 cs.CR

Recognition: unknown

Towards Automated Pentesting with Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:07 UTC · model grok-4.3

classification 💻 cs.CR
keywords large language models · powershell code generation · automated pentesting · offensive security · microsoft windows vulnerabilities · syntactic validity · code similarity · fine-tuned models

The pith

RedShell fine-tunes large language models on malicious PowerShell samples to generate valid offensive code for Windows pentesting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RedShell as a framework that fine-tunes large language models on a dataset of malicious PowerShell code, further enhanced with manually curated samples, to assist in creating scripts that target Microsoft Windows vulnerabilities. It reports that the system produces code with over 90 percent syntactic validity, strong alignment to reference pentesting examples, and better performance than prior approaches on edit-distance similarity measures. A sympathetic reader would care because this points toward practical automation of routine tasks in ethical security testing while emphasizing privacy and low hardware demands. The work also includes functional checks showing that the generated snippets execute reliably when run in environments designed to resemble real deployment conditions.

Core claim

RedShell is a privacy-preserving, hardware-efficient framework that leverages fine-tuned LLMs to assist pentesters in generating offensive PowerShell code targeting Microsoft Windows vulnerabilities. Trained on a malicious PowerShell dataset from the literature, enhanced with manually curated code samples, the framework achieves over 90% syntactic validity in generated samples and strong semantic alignment with reference pentesting snippets, outperforming state-of-the-art counterparts on distance metrics such as edit distance, with above 50% average code similarity. Functional experiments emphasize the execution reliability of the snippets produced by RedShell in a testing scenario that mirrors real-world settings.
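The similarity figure is worth making concrete. Below is a minimal sketch of a normalized edit-distance similarity; the abstract does not state the paper's exact normalization, so this assumes the common form 1 − distance / max(length), and the two snippets compared are invented for illustration.

```python
# Normalized edit-distance similarity: a minimal sketch, assuming the common
# normalization 1 - distance / max(len); the paper's exact formula is not
# given in the abstract.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(generated: str, reference: str) -> float:
    """Similarity in [0, 1]; the paper's >50% average corresponds to
    values above 0.5 across a test set."""
    if not generated and not reference:
        return 1.0
    dist = levenshtein(generated, reference)
    return 1.0 - dist / max(len(generated), len(reference))

if __name__ == "__main__":
    gen = "Get-Process | Where-Object { $_.CPU -gt 100 }"  # hypothetical output
    ref = "Get-Process | Where-Object { $_.CPU -gt 50 }"   # hypothetical reference
    print(f"similarity: {edit_similarity(gen, ref):.2%}")
```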

What carries the argument

RedShell: a framework that fine-tunes LLMs on an enhanced malicious PowerShell dataset to output offensive code for vulnerability testing.

Load-bearing premise

That strong results on syntactic validity, code similarity, and execution in mirrored test settings will translate to safe, useful performance during actual live pentesting without introducing new risks or needing heavy human correction.

What would settle it

A controlled test in which RedShell-generated scripts are run against live but isolated Windows systems and frequently fail to execute, fail to detect the targeted vulnerabilities, or require major manual fixes would show the claims do not hold.

Figures

Figures reproduced from arXiv: 2604.11772 by João Lourenço, João Trindade, Ricardo Bessa, Rui Claro.

Figure 1: Dataset coverage of the MITRE ATT&CK tactics.
Figure 2: Overview of the RedShell framework architecture.
Figure 3: Syntactic evaluation of LLMs fine-tuned on the reference dataset.
Figure 4: Semantic evaluation of LLMs fine-tuned on the reference dataset.
Figure 5: Functional evaluation of RedShell's Qwen2.5-Coder and ChatGPT.
Original abstract

Large Language Models (LLMs) are redefining offensive cybersecurity by allowing the generation of harmful machine code with minimal human intervention. While attackers take advantage of dark LLMs such as XXXGPT and WolfGPT to produce malicious code, ethical hackers can follow similar approaches to automate traditional pentesting workflows. In this work, we present RedShell, a privacy-preserving, hardware-efficient framework that leverages fine-tuned LLMs to assist pentesters in generating offensive PowerShell code targeting Microsoft Windows vulnerabilities. RedShell was trained on a malicious PowerShell dataset from the literature, which we further enhanced with manually curated code samples. Experiments show that our framework achieves over 90% syntactic validity in generated samples and strong semantic alignment with reference pentesting snippets, outperforming state-of-the-art counterparts in distance metrics such as edit distance (above 50% average code similarity). Additionally, functional experiments emphasize the execution reliability of the snippets produced by RedShell in a testing scenario that mirrors real-world settings. This work sheds light on the state-of-the-art research in the field of Generative AI applied to malicious code generation and automated testing, acknowledging the potential benefits that LLMs hold within controlled environments such as pentesting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents RedShell, a privacy-preserving and hardware-efficient framework that fine-tunes LLMs on an enhanced malicious PowerShell dataset to generate offensive code for automating pentesting of Microsoft Windows vulnerabilities. It reports experimental outcomes of over 90% syntactic validity in generated samples, strong semantic alignment with reference pentesting snippets, outperformance of state-of-the-art models on edit-distance similarity (above 50% average code similarity), and reliable execution of produced snippets in a testing scenario that mirrors real-world settings.

Significance. If the performance claims are supported by complete experimental details, baselines, and statistical validation, the work would offer a concrete contribution to the application of generative AI for ethical hacking tools in controlled settings. The focus on privacy preservation and hardware efficiency, combined with the acknowledgment of controlled-environment benefits, strengthens its potential relevance if the utility and safety assertions can be substantiated.

major comments (2)
  1. [Abstract] The central performance claims (>90% syntactic validity, >50% average code similarity, outperformance of SOTA) are stated without any reference to dataset size, train/test split, exact measurement procedures for syntactic validity or semantic alignment, baseline models, or statistical significance testing; these omissions directly undermine evaluation of the reported results.
  2. [Functional experiments] The claim of 'execution reliability ... in a testing scenario that mirrors real-world settings' (as described in the abstract) is load-bearing for the utility of assisting pentesters, yet no specifics are provided on which real-world elements are mirrored (dynamic networks, EDR evasion, multi-stage chains) or how safety, containment, and avoidance of new vulnerabilities were assessed.
minor comments (1)
  1. [Abstract] The phrasing 'dark LLMs such as XXXGPT and WolfGPT' would benefit from citations or brief definitions to allow readers to contextualize the comparison to ethical use cases.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and indicating where revisions will strengthen the presentation without altering the core contributions or experimental outcomes.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (>90% syntactic validity, >50% average code similarity, outperformance of SOTA) are stated without any reference to dataset size, train/test split, exact measurement procedures for syntactic validity or semantic alignment, baseline models, or statistical significance testing; these omissions directly undermine evaluation of the reported results.

    Authors: We agree that the abstract's conciseness omits explicit references to supporting details, which can hinder immediate assessment. The full manuscript (Sections 3 and 4) specifies the enhanced dataset size and composition, the 80/20 train/test split, syntactic validity measured via PowerShell parser success rates on generated samples, semantic alignment assessed through edit-distance similarity and embedding-based metrics, the specific baseline models (including prior SOTA approaches), and statistical validation via repeated trials with reported variances. To improve accessibility, we will revise the abstract to include brief, high-level references to these elements (e.g., 'on an enhanced dataset of X samples with 80/20 split, using parser-based validity checks and edit-distance metrics against baselines'). This is a targeted addition that preserves abstract length constraints (a parser-based check of this kind is sketched after these responses). revision: yes

  2. Referee: [Functional experiments] The claim of 'execution reliability ... in a testing scenario that mirrors real-world settings' (as described in the abstract) is load-bearing for the utility of assisting pentesters, yet no specifics are provided on which real-world elements are mirrored (dynamic networks, EDR evasion, multi-stage chains) or how safety, containment, and avoidance of new vulnerabilities were assessed.

    Authors: The manuscript's functional experiments section describes execution in isolated virtual Windows environments configured with standard vulnerability setups and basic security postures to simulate typical pentesting conditions. We acknowledge that the abstract and section could more explicitly delineate the mirrored elements and safety protocols. We will expand the description to clarify the actual scope: static vulnerability targets in contained VMs, basic multi-stage chaining where relevant to the generated snippets, and safety measures including full sandbox isolation, no external network access, and post-execution monitoring to confirm that no unintended side effects or new vulnerabilities were introduced. Advanced elements such as dynamic network simulation or specific EDR evasion were outside the current experimental focus (which prioritized code validity and basic executability); we will explicitly note this scope to prevent overgeneralization. This is a partial revision focused on elaboration rather than new experiments (a sketch of such an execution harness follows these responses). revision: partial
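Syntactic validity of the kind the first response describes can be measured without executing anything: PowerShell's parser reports syntax errors statically. Below is a minimal sketch of such a parser success-rate check, assuming pwsh (PowerShell 7) is on PATH; the paper's actual measurement harness is not described in the abstract, so everything here is illustrative.

```python
# Parser-based syntactic validity check: a hedged sketch, not the paper's
# harness. Parsing never runs the snippet, so this is safe on any host.
import subprocess

PARSE_CHECK = r"""
$code = [Console]::In.ReadToEnd()
$tokens = $null; $errors = $null
[System.Management.Automation.Language.Parser]::ParseInput(
    $code, [ref]$tokens, [ref]$errors) | Out-Null
exit $errors.Count
"""

def is_syntactically_valid(snippet: str) -> bool:
    """True when the PowerShell parser reports zero parse errors."""
    result = subprocess.run(
        ["pwsh", "-NoProfile", "-Command", PARSE_CHECK],
        input=snippet, capture_output=True, text=True, timeout=30,
    )
    return result.returncode == 0

def validity_rate(samples: list[str]) -> float:
    """Fraction of generated samples that parse cleanly; the paper's
    claim corresponds to this rate exceeding 0.9."""
    return sum(is_syntactically_valid(s) for s in samples) / len(samples)
```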
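To make "basic executability" concrete, here is a hedged sketch of the kind of functional harness the second response implies. It assumes it already runs inside an isolated Windows guest with no external network access and pwsh available; VM provisioning, snapshot rollback, and side-effect monitoring are handled outside this code, and paths are illustrative.

```python
# Functional-execution sketch under stated assumptions: run only inside an
# already-contained VM; this code performs no isolation of its own.
import subprocess

def runs_successfully(snippet_path: str, timeout_s: int = 120) -> bool:
    """Execute one generated snippet and report whether it exits cleanly
    within the timeout (a proxy for the paper's 'execution reliability')."""
    try:
        result = subprocess.run(
            ["pwsh", "-NoProfile", "-ExecutionPolicy", "Bypass",
             "-File", snippet_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```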

Circularity Check

0 steps flagged

No circularity in empirical framework and evaluation

Full rationale

The paper presents RedShell as a fine-tuned LLM framework trained on an external malicious PowerShell dataset from the literature plus manually curated samples. Performance claims (>90% syntactic validity, edit-distance similarity, execution reliability) are reported as outcomes of separate experiments in a mirrored test scenario rather than being derived from or defined into the training process. No equations, self-referential definitions, or load-bearing self-citations appear in the provided text that would reduce results to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the standard assumption that fine-tuning an LLM on a curated malicious-code corpus will yield outputs that are both syntactically correct and semantically appropriate for pentesting; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 1222 out tokens · 83280 ms · 2026-05-10T16:07:16.278206+00:00 · methodology

