pith. machine review for the scientific record.

arxiv: 2604.12601 · v1 · submitted 2026-04-14 · 💻 cs.CR · cs.AI

Recognition: unknown

LLM-Guided Prompt Evolution for Password Guessing


Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: password guessing · LLM prompt optimization · evolutionary computation · credential leaks · password auditing · authentication security

The pith

Evolving prompts with LLMs raises password guessing success from 2.02% to 8.48% on leaked test data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can guide an evolutionary search to automatically improve prompts used for generating password guesses. The process creates prompt variations, tests how many real passwords each one cracks from a reference collection of leaked passwords, and retains those that perform best. This produces guesses whose character patterns align more closely with actual user choices than the starting prompts do. A sympathetic reader would care because passwords remain the main login method, and better automated guessing helps organizations check policy strength and understand attacker capabilities without expert manual tuning.

Core claim

The central claim is that LLM-driven evolutionary computation optimizes prompts for an LLM password guessing system, raising the cracking rate on a RockYou-derived test set from 2.02% to 8.48% while producing character distributions that are statistically closer to those in real leaked passwords.

What carries the argument

The evolutionary loop in which LLMs propose prompt variants that are scored and retained according to how many passwords they crack from the test collection.
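The loop is simple enough to sketch. In this hypothetical illustration, `mutate_prompt` stands in for the LLM call that rewrites a prompt and `generate_guesses` for the guessing model; the hint strings are invented for the example, not taken from the paper.

```python
import random

def cracked_rate(guesses, test_passwords):
    """Fraction of the test collection recovered by the guess set."""
    return len(set(guesses) & set(test_passwords)) / len(test_passwords)

def mutate_prompt(prompt, rng):
    # Stand-in for the LLM rewriting step; a real system would ask a model
    # to propose a variation of the prompt instead of appending a hint.
    hints = ["use keyboard walks", "append a year", "try leetspeak"]
    return prompt + " " + rng.choice(hints)

def evolve(seed_prompt, generate_guesses, test_passwords, generations=10, seed=0):
    """Hill-climbing variant of the evolve-score-retain loop."""
    rng = random.Random(seed)
    best_prompt = seed_prompt
    best_score = cracked_rate(generate_guesses(seed_prompt), test_passwords)
    for _ in range(generations):
        candidate = mutate_prompt(best_prompt, rng)
        score = cracked_rate(generate_guesses(candidate), test_passwords)
        if score > best_score:  # retain only variants that crack more passwords
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

OpenEvolve replaces this single-lineage loop with a MAP-Elites archive and multiple islands, but the score-and-retain step is the same idea.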

If this is right

  • Evolved prompts generate password guesses whose character frequencies match real distributions more closely than fixed prompts.
  • The same evolutionary method works across local, cloud, and ensemble LLM configurations.
  • Automated prompt improvement removes the need for manual engineering in LLM-based password auditing pipelines.
  • Attack modeling tools can be strengthened through repeated cycles of prompt variation and selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt quality appears to be a major bottleneck in current LLM password guessing, and systematic search can address it without domain expertise.
  • The same evolutionary approach might improve LLM performance on other security tasks that rely on generating realistic examples, such as phishing message creation.
  • Over time, repeated application to new leaks could produce guessing systems that adapt to shifting user password patterns.

Load-bearing premise

Gains measured on one specific leaked-password test set will indicate better guessing performance against passwords that users actually choose in practice.

What would settle it

Testing the evolved prompts on a separate collection of leaked passwords never seen during the evolutionary process and checking whether the cracking rate stays higher than with the original prompts.
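Operationally, that check is one comparison: score the evolved and baseline guess sets against a corpus that never entered the fitness loop. A minimal sketch (helper names are hypothetical):

```python
def cracked_rate(guesses, passwords):
    """Fraction of the target corpus recovered by the guess set."""
    return len(set(guesses) & set(passwords)) / len(passwords)

def generalization_check(evolved_guesses, baseline_guesses, held_out_passwords):
    # held_out_passwords must come from a leak never used during evolution
    evolved = cracked_rate(evolved_guesses, held_out_passwords)
    baseline = cracked_rate(baseline_guesses, held_out_passwords)
    return evolved, baseline, evolved > baseline
```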

Figures

Figures reproduced from arXiv: 2604.12601 by Dmitrii A. Bolokhov, Dmitrii S. Korzh, Elvir Z. Karimov, Mikhail A. Zorin, Oleg Y. Rogov, Vladimir A. Mazin.

Figure 1: The proposed OpenEvolve-based pipeline. An initial prompt is fed to the evolutionary controller, which … [image omitted]
Figure 2: MAP-Elites archive and island model. Three islands each maintain an independent … [image omitted]
Figure 3: Per-symbol F-score curves for the baseline … [image omitted]
Figure 4: Per-iteration cracked rate (first 45 iterations). [image omitted]
Figure 5: Full 100-iteration ensemble run. Teal: per … [image omitted]
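The MAP-Elites archive and island model shown in Figure 2 can be sketched as follows; the behavior descriptor (a word-count bucket) and the ring-migration policy here are illustrative assumptions, not the paper's exact configuration.

```python
class MapElitesArchive:
    """One cell per behavior descriptor; each cell keeps its best prompt."""

    def __init__(self):
        self.cells = {}  # descriptor -> (fitness, prompt)

    def descriptor(self, prompt):
        # Illustrative behavior descriptor: bucket prompts by word count.
        return len(prompt.split()) // 5

    def add(self, prompt, fitness):
        key = self.descriptor(prompt)
        if key not in self.cells or fitness > self.cells[key][0]:
            self.cells[key] = (fitness, prompt)
            return True  # new cell illuminated or elite replaced
        return False

def migrate(islands):
    """Ring migration: seed each island with its neighbor's best elite."""
    bests = [max(isl.cells.values()) for isl in islands]
    for i, isl in enumerate(islands):
        fitness, prompt = bests[i - 1]  # i == 0 wraps to the last island
        isl.add(prompt, fitness)
```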
Original abstract

Passwords still remain a dominant authentication method, yet their security is routinely subverted by predictable user choices and large-scale credential leaks. Automated password guessing is a key tool for stress-testing password policies and modeling attacker behavior. This paper applies LLM-driven evolutionary computation to automatically optimize prompts for the LLM password guessing framework. Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model we evolve prompts that maximize cracking rate on a RockYou-derived test set. We evaluate three configurations: a local setup with Qwen3 8B, a single compact cloud model Gemini-2.5 Flash, and a two-model ensemble of frontier LLMs. The approach raises the cracking rates from 2.02% to 8.48%. Character distribution analysis further confirms how evolved prompts produce statistically more realistic passwords. Automated prompt evolution is a low-barrier yet effective way to strengthen LLM-based password auditing and underlining how attack pipelines show tendency via automated improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LLM-guided prompt evolution using OpenEvolve (MAP-Elites combined with an island population model) to optimize prompts for password guessing with LLMs. It evaluates three setups (local Qwen3 8B, Gemini-2.5 Flash, and a frontier LLM ensemble) and reports that evolved prompts raise cracking rates on a RockYou-derived test set from a 2.02% baseline to 8.48%, with supporting character-distribution analysis indicating more realistic generated passwords.

Significance. If the performance gains prove robust to proper generalization testing, the work offers a practical, low-barrier method for improving automated LLM-based password auditing and attacker modeling. The use of an open-source evolutionary framework and multi-configuration evaluation are positive elements that could aid reproducibility in security research.

major comments (3)
  1. [Abstract] Abstract and evaluation description: Prompts are evolved explicitly to maximize cracking rate on the RockYou-derived test set, and the 8.48% figure is measured on the same set. This creates a direct optimization-to-evaluation loop that risks overfitting to dataset-specific quirks (length distributions, common substrings) rather than producing generally improved prompts. The central claim that evolved prompts yield meaningfully better guessing therefore requires re-evaluation on a held-out set never seen during evolution.
  2. [Results] Results and methods: The baseline 2.02% cracking rate and the exact evolutionary hyperparameters (population size, mutation rates, selection criteria, number of generations) are not fully specified, nor are controls for confounding factors such as prompt length, temperature settings, or number of guesses per prompt. Without these details and statistical tests, it is unclear whether the reported lift is attributable to the evolutionary process itself.
  3. [Results] Character distribution analysis: While the post-hoc analysis shows shifts toward more realistic character frequencies, it does not rule out overfitting; surface-level distributional matches can be achieved by prompts that exploit test-set artifacts without improving generalization to other password corpora.
minor comments (2)
  1. [Abstract] The final sentence of the abstract contains awkward phrasing ("underlining how attack pipelines show tendency via automated improvements"); reword for clarity.
  2. [Discussion] The paper should include a dedicated limitations or future-work subsection discussing the risk of overfitting and plans for cross-dataset validation.
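The distributional comparisons at issue in major comment 3 reduce to counting: character frequencies, and overlap between the most common n-grams of generated and real corpora. A hedged sketch (the trigram length and top-k cutoff are illustrative choices, not the paper's settings):

```python
from collections import Counter

def char_frequencies(passwords):
    """Relative frequency of each character across a password corpus."""
    counts = Counter(c for pw in passwords for c in pw)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def top_ngram_overlap(generated, real, n=3, k=100):
    """Jaccard overlap between the k most common n-grams of two corpora."""
    def top(corpus):
        grams = Counter(
            pw[i:i + n] for pw in corpus for i in range(len(pw) - n + 1)
        )
        return {g for g, _ in grams.most_common(k)}
    a, b = top(generated), top(real)
    return len(a & b) / len(a | b) if a | b else 0.0
```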

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our work. We address each major point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: Prompts are evolved explicitly to maximize cracking rate on the RockYou-derived test set, and the 8.48% figure is measured on the same set. This creates a direct optimization-to-evaluation loop that risks overfitting to dataset-specific quirks (length distributions, common substrings) rather than producing generally improved prompts. The central claim that evolved prompts yield meaningfully better guessing therefore requires re-evaluation on a held-out set never seen during evolution.

    Authors: We acknowledge the risk of overfitting inherent in optimizing and evaluating on the same distribution. The RockYou-derived set was chosen as a standard benchmark representative of common password patterns, consistent with prior password-guessing literature. To strengthen the claim of meaningful improvement, we will add evaluations of the evolved prompts on a held-out set drawn from a separate leak (e.g., a disjoint portion of LinkedIn or another corpus never used during evolution or prompt design). Updated cracking rates on this unseen set will be reported in the revision. revision: yes

  2. Referee: [Results] Results and methods: The baseline 2.02% cracking rate and the exact evolutionary hyperparameters (population size, mutation rates, selection criteria, number of generations) are not fully specified, nor are controls for confounding factors such as prompt length, temperature settings, or number of guesses per prompt. Without these details and statistical tests, it is unclear whether the reported lift is attributable to the evolutionary process itself.

    Authors: The referee is correct that full specification is required for reproducibility. In the revised Methods section we will explicitly state the baseline configuration, all OpenEvolve hyperparameters (population size, mutation rates, selection criteria, generation count), and controls for prompt length, temperature, and per-prompt guess budget. We will also add statistical significance testing (bootstrap confidence intervals and paired comparisons across repeated runs) to demonstrate that the performance lift is attributable to the evolutionary process rather than confounding factors. revision: yes

  3. Referee: [Results] Character distribution analysis: While the post-hoc analysis shows shifts toward more realistic character frequencies, it does not rule out overfitting; surface-level distributional matches can be achieved by prompts that exploit test-set artifacts without improving generalization to other password corpora.

    Authors: We agree that distributional similarity is only supporting evidence and does not by itself prove generalization. We will revise the relevant section to clarify the auxiliary role of the character-frequency analysis and will supplement it with the new held-out evaluations plus additional metrics (password-length histograms and top-n-gram overlap) to provide a more robust assessment of whether the evolved prompts improve guessing beyond test-set artifacts. revision: partial
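The bootstrap confidence intervals promised in response 2 could be computed roughly as follows; the 95% level and resample count are illustrative, not the authors' stated settings.

```python
import random

def bootstrap_ci(cracked_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a cracking rate.

    cracked_flags: list of 0/1 indicators, one per test password,
    marking whether each password was recovered.
    """
    rng = random.Random(seed)
    n = len(cracked_flags)
    rates = sorted(
        sum(rng.choices(cracked_flags, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```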

Circularity Check

1 step flagged

Evolution maximizes cracking rate on the same RockYou-derived test set used for final reported gains

specific steps
  1. fitted input called prediction [Abstract]
    "Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model we evolve prompts that maximize cracking rate on a RockYou-derived test set. ... The approach raises the cracking rates from 2.02% to 8.48%."

    Prompts are evolved explicitly to maximize the cracking rate on the RockYou-derived test set, after which the paper reports the achieved cracking rate (2.02% to 8.48%) on that identical set as the key result. The performance gain is therefore the direct outcome of search optimizing for the reported metric, not an independent measurement.

full rationale

The paper's central empirical claim is an increase in cracking rate achieved by evolving prompts specifically to maximize that rate on the evaluation set. This reduces the reported improvement to the direct output of the optimization procedure on the data being measured, matching the fitted-input-called-prediction pattern. No held-out test set or external validation is described in the abstract or provided context, so the result is statistically forced by construction rather than independently demonstrated.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work rests on standard assumptions about LLM generative capabilities for passwords and the effectiveness of evolutionary algorithms for prompt search; no new entities are postulated and free parameters are typical algorithmic hyperparameters not detailed in the abstract.

free parameters (1)
  • Evolutionary hyperparameters (population size, mutation rates, selection criteria)
    Standard in MAP-Elites and island models but not quantified in the abstract; these are typically tuned to achieve reported performance.
axioms (2)
  • domain assumption LLMs can produce plausible password guesses when supplied with suitable prompts
    Invoked as the foundation for the entire guessing framework.
  • domain assumption Cracking rate on a RockYou-derived test set is a valid proxy for prompt quality
    Used to drive the evolutionary fitness function.

pith-pipeline@v0.9.0 · 5494 in / 1436 out tokens · 50803 ms · 2026-05-10T14:45:45.858658+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    Nikita Afonin, Nikita Andriyanov, Vahagn Hovhannisyan, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Oleg Rogov, Elena Tutubalina, Alexander Panchenko, and Mikhail Seleznyov. 2026. Emergent misalignment via in-context learning: Narrow in-context examples can produce broadly misaligned LLMs. Preprint, arXiv:2510.11288

  2. [2]

    Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Stajano. 2012. The quest to replace passwords: A framework for comparative evaluation of web authentication schemes. In 2012 IEEE Symposium on Security and Privacy, pages 553–567

  3. [3]

    Davis Brown, Jesse He, Helen Jenne, Henry Kvinge, and Max Vargas. 2025. Even with AI, bijection discovery is still hard: The opportunities and challenges of OpenEvolve for novel bijection construction. Preprint, arXiv:2511.20987

  4. [4]

    Adel ElZemity, Budi Arief, and Shujun Li. 2025. CyberLLMInstruct: A pseudo-malicious dataset revealing safety-performance trade-offs in cyber security LLM fine-tuning. Preprint, arXiv:2503.09334

  5. [5]

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2023. Promptbreeder: Self-referential self-improvement via prompt evolution. Preprint, arXiv:2309.16797

  6. [6]

    Dinei Florêncio, Cormac Herley, and Paul C. Van Oorschot. 2014. An administrator's guide to internet password research. In Proceedings of the 28th USENIX Conference on Large Installation System Administration, LISA'14, pages 35–52, USA. USENIX Association

  7. [7]

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Preprint, arXiv:2302.12173

  9. [9]

    Hashcat. 2025. Hashcat

  10. [10]

    Cormac Herley and Paul Van Oorschot. 2012. A research agenda acknowledging the persistence of passwords. IEEE Security and Privacy, 10(1):28–36

  11. [11]

    Briland Hitaj, Paolo Gasti, Giuseppe Ateniese, and Fernando Perez-Cruz. 2019. PassGAN: A deep learning approach for password guessing. Preprint, arXiv:1709.00440

  12. [12]

    Troy Hunt. 2019. The 773 million record "Collection #1" data breach. Troy Hunt's Blog

  13. [13]

    Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, and Elena Tutubalina. 2026. The rogue scalpel: Activation steering compromises LLM safety. Preprint, arXiv:2509.22067

  14. [14]

    William Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. 2016. Fast, lean, and accurate: modeling password guessability using neural networks. In Proceedings of the 25th USENIX Conference on Security Symposium, SEC'16, pages 175–191, USA. USENIX Association

  15. [15]

    Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. Preprint, arXiv:1504.04909

  16. [16]

    Arvind Narayanan and Vitaly Shmatikov. 2005. Fast dictionary attacks on passwords using time-space tradeoff. In Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS '05, pages 364–372, New York, NY, USA. Association for Computing Machinery

  17. [17]

    Openwall. 2025. John the Ripper

  18. [18]

    Xiangyu Qi, Yi Zeng, Ting Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2024. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations

  19. [19]

    Divya Saxena and Jiannong Cao. 2021. Generative adversarial networks (GANs): Challenges, solutions, and future directions. ACM Comput. Surv., 54(3)

  20. [20]

    Asankhaya Sharma. 2025. OpenEvolve: an open-source evolutionary coding agent

  21. [21]

    Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jade Whittlestone, Jade Leung, and Daniel Kokotajlo. 2023. Model evaluation for extreme risks. Preprint, arXiv:2305.15324

  22. [22]

    Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, and Deyi Xiong. 2024. Large language model safety: A holistic survey. Preprint, arXiv:2412.17686

  23. [23]

    Chunhui Wan, Xunan Dai, Zhuo Wang, Minglei Li, Yanpeng Wang, Yinan Mao, Yu Lan, and Zhiwen Xiao. 2025. Loongflow: Directed evolutionary search via a cognitive plan-execute-summarize paradigm. Preprint, arXiv:2512.24077

  24. [24]

    Yangde Wang, Weidong Qiu, Peng Tang, Hao Tian, and Shujun Li. 2025. SE#PCFG: Semantically enhanced PCFG for password analysis and cracking. IEEE Transactions on Dependable and Secure Computing, 22(4):4428–4441

  25. [25]

    Matt Weir, Sudhir Aggarwal, Breno de Medeiros, and Bill Glodek. 2009. Password cracking using probabilistic context-free grammars. In 2009 30th IEEE Symposium on Security and Privacy, pages 391–405

  26. [26]

    Ming Xu, Jitao Yu, Xinyi Zhang, Chuanwang Wang, Shenghao Zhang, Haoqi Wu, and Weili Han. 2023. Improving real-world password guessing attacks via bi-directional transformers. In 32nd USENIX Security Symposium (USENIX Security 23), pages 1001–1018, Anaheim, CA. USENIX Association

  27. [27]

    Jie Zhang, Haoyu Bu, Hui Wen, Yongji Liu, Haiqiang Fei, Rongrong Xi, Lun Li, Yun Yang, Hongsong Zhu, and Dan Meng. 2025. When LLMs meet cybersecurity: a systematic literature review. Cybersecurity, 8(1):55

  28. [28]

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large language models are human-level prompt engineers. Preprint, arXiv:2211.01910

  29. [29]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043

  30. [30]

    Yunkai Zou, Maoxiang An, and Ding Wang. 2025. Password guessing using large language models. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC '25, USA. USENIX Association