Recognition: unknown
LLM-Guided Prompt Evolution for Password Guessing
Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3
The pith
Evolving prompts with LLMs raises password guessing success from 2.02% to 8.48% on leaked test data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLM-driven evolutionary computation optimizes prompts for an LLM password guessing system, raising the cracking rate on a RockYou-derived test set from 2.02% to 8.48% while producing character distributions that are statistically closer to those in real leaked passwords.
What carries the argument
The evolutionary loop in which LLMs propose prompt variants that are scored and retained according to how many passwords they crack from the test collection.
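The variation-and-selection loop can be sketched in a few lines. This is a minimal toy illustration, not the paper's actual OpenEvolve machinery: `generate_guesses` stands in for querying an LLM with a prompt and collecting its password guesses, and the `mutate` operator stands in for asking an LLM to rewrite a prompt; both are assumptions for illustration.

```python
import random

def evolve_prompts(seed_prompts, generate_guesses, test_passwords,
                   generations=10, population=8, seed=0):
    """Toy evolutionary loop: propose prompt variants, score each by
    cracking rate on the test collection, keep the fittest."""
    rng = random.Random(seed)
    targets = set(test_passwords)

    def fitness(prompt):
        # Fraction of the test set covered by this prompt's guesses.
        guesses = set(generate_guesses(prompt))
        return len(guesses & targets) / len(targets)

    def mutate(prompt):
        # Placeholder mutation; the real system would have an LLM
        # propose a rewritten prompt variant.
        return prompt + " " + rng.choice(["vary length", "add digits",
                                          "use common names", "append years"])

    pool = list(seed_prompts)
    for _ in range(generations):
        pool += [mutate(p) for p in pool]       # propose variants
        pool.sort(key=fitness, reverse=True)    # score every candidate
        pool = pool[:population]                # retain the fittest
    best = pool[0]
    return best, fitness(best)
```

The real system replaces this single-population loop with MAP-Elites plus an island model, but the fitness signal is the same: passwords cracked from the test collection.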
If this is right
- Evolved prompts generate password guesses whose character frequencies match real distributions more closely than fixed prompts.
- The same evolutionary method works across local, cloud, and ensemble LLM configurations.
- Automated prompt improvement removes the need for manual engineering in LLM-based password auditing pipelines.
- Attack modeling tools can be strengthened through repeated cycles of prompt variation and selection.
Where Pith is reading between the lines
- Prompt quality appears to be a major bottleneck in current LLM password guessing, and systematic search can address it without domain expertise.
- The same evolutionary approach might improve LLM performance on other security tasks that rely on generating realistic examples, such as phishing message creation.
- Over time, repeated application to new leaks could produce guessing systems that adapt to shifting user password patterns.
Load-bearing premise
Gains measured on one specific leaked-password test set will indicate better guessing performance against passwords that users actually choose in practice.
What would settle it
Testing the evolved prompts on a separate collection of leaked passwords never seen during the evolutionary process and checking whether the cracking rate stays higher than with the original prompts.
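The proposed check is mechanically simple. A sketch, assuming per-prompt guess lists and a held-out leak corpus are available (the function names are illustrative, not from the paper):

```python
def cracking_rate(guesses, leaked_passwords):
    """Fraction of a leaked-password set covered by a guess list."""
    hits = set(guesses) & set(leaked_passwords)
    return len(hits) / len(leaked_passwords)

def generalizes(evolved_guesses, baseline_guesses, held_out_set):
    """Does the evolved prompt still beat the baseline on a corpus
    never seen during evolution?"""
    return (cracking_rate(evolved_guesses, held_out_set)
            > cracking_rate(baseline_guesses, held_out_set))
```

If `generalizes` holds on a disjoint leak, the gain is not just test-set memorization; if it collapses, the 8.48% figure reflects overfitting to RockYou-specific quirks.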
read the original abstract
Passwords still remain a dominant authentication method, yet their security is routinely subverted by predictable user choices and large-scale credential leaks. Automated password guessing is a key tool for stress-testing password policies and modeling attacker behavior. This paper applies LLM-driven evolutionary computation to automatically optimize prompts for the LLM password guessing framework. Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model we evolve prompts that maximize cracking rate on a RockYou-derived test set. We evaluate three configurations: a local setup with Qwen3 8B, a single compact cloud model Gemini-2.5 Flash, and a two-model ensemble of frontier LLMs. The approach raises the cracking rates from 2.02% to 8.48%. Character distribution analysis further confirms how evolved prompts produce statistically more realistic passwords. Automated prompt evolution is a low-barrier yet effective way to strengthen LLM-based password auditing and underlining how attack pipelines show tendency via automated improvements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLM-guided prompt evolution using OpenEvolve (MAP-Elites combined with an island population model) to optimize prompts for password guessing with LLMs. It evaluates three setups (local Qwen3 8B, Gemini-2.5 Flash, and a frontier LLM ensemble) and reports that evolved prompts raise cracking rates on a RockYou-derived test set from a 2.02% baseline to 8.48%, with supporting character-distribution analysis indicating more realistic generated passwords.
Significance. If the performance gains prove robust to proper generalization testing, the work offers a practical, low-barrier method for improving automated LLM-based password auditing and attacker modeling. The use of an open-source evolutionary framework and multi-configuration evaluation are positive elements that could aid reproducibility in security research.
major comments (3)
- [Abstract] Abstract and evaluation description: Prompts are evolved explicitly to maximize cracking rate on the RockYou-derived test set, and the 8.48% figure is measured on the same set. This creates a direct optimization-to-evaluation loop that risks overfitting to dataset-specific quirks (length distributions, common substrings) rather than producing generally improved prompts. The central claim that evolved prompts yield meaningfully better guessing therefore requires re-evaluation on a held-out set never seen during evolution.
- [Results] Results and methods: The baseline 2.02% cracking rate and the exact evolutionary hyperparameters (population size, mutation rates, selection criteria, number of generations) are not fully specified, nor are controls for confounding factors such as prompt length, temperature settings, or number of guesses per prompt. Without these details and statistical tests, it is unclear whether the reported lift is attributable to the evolutionary process itself.
- [Results] Character distribution analysis: While the post-hoc analysis shows shifts toward more realistic character frequencies, it does not rule out overfitting; surface-level distributional matches can be achieved by prompts that exploit test-set artifacts without improving generalization to other password corpora.
minor comments (2)
- [Abstract] The final sentence of the abstract contains awkward phrasing ("underlining how attack pipelines show tendency via automated improvements"); reword for clarity.
- [Discussion] The paper should include a dedicated limitations or future-work subsection discussing the risk of overfitting and plans for cross-dataset validation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our work. We address each major point below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation description: Prompts are evolved explicitly to maximize cracking rate on the RockYou-derived test set, and the 8.48% figure is measured on the same set. This creates a direct optimization-to-evaluation loop that risks overfitting to dataset-specific quirks (length distributions, common substrings) rather than producing generally improved prompts. The central claim that evolved prompts yield meaningfully better guessing therefore requires re-evaluation on a held-out set never seen during evolution.
Authors: We acknowledge the risk of overfitting inherent in optimizing and evaluating on the same distribution. The RockYou-derived set was chosen as a standard benchmark representative of common password patterns, consistent with prior password-guessing literature. To strengthen the claim of meaningful improvement, we will add evaluations of the evolved prompts on a held-out set drawn from a separate leak (e.g., a disjoint portion of LinkedIn or another corpus never used during evolution or prompt design). Updated cracking rates on this unseen set will be reported in the revision. revision: yes
-
Referee: [Results] Results and methods: The baseline 2.02% cracking rate and the exact evolutionary hyperparameters (population size, mutation rates, selection criteria, number of generations) are not fully specified, nor are controls for confounding factors such as prompt length, temperature settings, or number of guesses per prompt. Without these details and statistical tests, it is unclear whether the reported lift is attributable to the evolutionary process itself.
Authors: The referee is correct that full specification is required for reproducibility. In the revised Methods section we will explicitly state the baseline configuration, all OpenEvolve hyperparameters (population size, mutation rates, selection criteria, generation count), and controls for prompt length, temperature, and per-prompt guess budget. We will also add statistical significance testing (bootstrap confidence intervals and paired comparisons across repeated runs) to demonstrate that the performance lift is attributable to the evolutionary process rather than confounding factors. revision: yes
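A percentile bootstrap over per-password outcomes is one plausible way to implement the promised interval estimate. The paired 0/1 layout below is an assumption about how per-password cracking outcomes might be logged, not the authors' reported protocol:

```python
import random

def bootstrap_ci(cracked_evolved, cracked_baseline, n_boot=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap CI for the paired difference in cracking
    rate. cracked_evolved[i] / cracked_baseline[i] are 0/1 indicators
    of whether test password i was cracked under each prompt set."""
    rng = random.Random(seed)
    n = len(cracked_evolved)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample passwords
        d = (sum(cracked_evolved[i] for i in idx)
             - sum(cracked_baseline[i] for i in idx)) / n
        diffs.append(d)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A confidence interval for the rate difference that excludes zero would support attributing the lift to the evolved prompts rather than run-to-run noise.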
-
Referee: [Results] Character distribution analysis: While the post-hoc analysis shows shifts toward more realistic character frequencies, it does not rule out overfitting; surface-level distributional matches can be achieved by prompts that exploit test-set artifacts without improving generalization to other password corpora.
Authors: We agree that distributional similarity is only supporting evidence and does not by itself prove generalization. We will revise the relevant section to clarify the auxiliary role of the character-frequency analysis and will supplement it with the new held-out evaluations plus additional metrics (password-length histograms and top-n-gram overlap) to provide a more robust assessment of whether the evolved prompts improve guessing beyond test-set artifacts. revision: partial
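For the distributional comparison, Jensen-Shannon divergence over character frequencies is one plausible closeness metric; the paper's actual statistic is not specified in the provided context, so this is a sketch of the kind of analysis at issue:

```python
from collections import Counter
import math

def char_distribution(passwords):
    """Relative character frequencies over a password list."""
    counts = Counter(c for pw in passwords for c in pw)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def jensen_shannon(p, q):
    """Base-2 Jensen-Shannon divergence between two character
    distributions, bounded in [0, 1]; 0 means identical."""
    keys = set(p) | set(q)
    m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in keys}

    def kl(a, b):
        return sum(a[c] * math.log2(a[c] / b[c]) for c in a if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

As the rebuttal concedes, a low divergence between generated and leaked character distributions is supporting evidence only; it can be achieved by exploiting test-set artifacts, so it cannot substitute for the held-out evaluation.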
Circularity Check
Evolution maximizes cracking rate on the same RockYou-derived test set used for final reported gains
specific steps
-
fitted input called prediction
[Abstract]
"Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model we evolve prompts that maximize cracking rate on a RockYou-derived test set. ... The approach raises the cracking rates from 2.02% to 8.48%."
Prompts are evolved explicitly to maximize the cracking rate on the RockYou-derived test set, after which the paper reports the achieved cracking rate (2.02% to 8.48%) on that identical set as the key result. The performance gain is therefore the direct outcome of search optimizing for the reported metric, not an independent measurement.
full rationale
The paper's central empirical claim is an increase in cracking rate achieved by evolving prompts specifically to maximize that rate on the evaluation set. This reduces the reported improvement to the direct output of the optimization procedure on the data being measured, matching the fitted-input-called-prediction pattern. No held-out test set or external validation is described in the abstract or provided context, so the result is statistically forced by construction rather than independently demonstrated.
Axiom & Free-Parameter Ledger
free parameters (1)
- Evolutionary hyperparameters (population size, mutation rates, selection criteria)
axioms (2)
- domain assumption LLMs can produce plausible password guesses when supplied with suitable prompts
- domain assumption Cracking rate on a RockYou-derived test set is a valid proxy for prompt quality
Reference graph
Works this paper leans on
- [1] Nikita Afonin, Nikita Andriyanov, Vahagn Hovhannisyan, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Oleg Rogov, Elena Tutubalina, Alexander Panchenko, and Mikhail Seleznyov. 2026. Emergent misalignment via in-context learning: Narrow in-context examples can produce broadly misaligned LLMs. Preprint, arXiv:2510.11288.
- [2] Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Stajano. 2012. The quest to replace passwords: A framework for comparative evaluation of web authentication schemes. In 2012 IEEE Symposium on Security and Privacy, pages 553–567.
- [3]
- [4]
- [5]
- [6] Dinei Florêncio, Cormac Herley, and Paul C. Van Oorschot. 2014. An administrator's guide to internet password research. In Proceedings of the 28th USENIX Conference on Large Installation System Administration, LISA'14, pages 35–52, USA. USENIX Association.
- [7] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Preprint, arXiv:2302.12173.
- [8]
- [9] Hashcat. 2025. Hashcat.
- [10] Cormac Herley and Paul Van Oorschot. 2012. A research agenda acknowledging the persistence of passwords. IEEE Security and Privacy, 10(1):28–36.
- [11]
- [12] Troy Hunt. 2019. The 773 million record "Collection #1" data breach. Troy Hunt's Blog.
- [13] Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, and Elena Tutubalina. 2026. The rogue scalpel: Activation steering compromises LLM safety. Preprint, arXiv:2509.22067.
- [14] William Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. 2016. Fast, lean, and accurate: modeling password guessability using neural networks. In Proceedings of the 25th USENIX Conference on Security Symposium, SEC'16, pages 175–191, USA. USENIX Association.
- [15] Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. Preprint, arXiv:1504.04909.
- [16] Arvind Narayanan and Vitaly Shmatikov. 2005. Fast dictionary attacks on passwords using time-space tradeoff. In Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS '05, pages 364–372, New York, NY, USA. Association for Computing Machinery.
- [17] Openwall. 2025. John the Ripper.
- [18] Xiangyu Qi, Yi Zeng, Ting Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2024. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations.
- [19] Divya Saxena and Jiannong Cao. 2021. Generative adversarial networks (GANs): Challenges, solutions, and future directions. ACM Comput. Surv., 54(3).
- [20] Asankhaya Sharma. 2025. OpenEvolve: an open-source evolutionary coding agent.
- [21]
- [22]
- [23]
- [24] Yangde Wang, Weidong Qiu, Peng Tang, Hao Tian, and Shujun Li. 2025. SE#PCFG: Semantically enhanced PCFG for password analysis and cracking. IEEE Transactions on Dependable and Secure Computing, 22(4):4428–4441.
- [25] Matt Weir, Sudhir Aggarwal, Breno de Medeiros, and Bill Glodek. 2009. Password cracking using probabilistic context-free grammars. In 2009 30th IEEE Symposium on Security and Privacy, pages 391–405.
- [26] Ming Xu, Jitao Yu, Xinyi Zhang, Chuanwang Wang, Shenghao Zhang, Haoqi Wu, and Weili Han. 2023. Improving real-world password guessing attacks via bi-directional transformers. In 32nd USENIX Security Symposium (USENIX Security 23), pages 1001–1018, Anaheim, CA. USENIX Association.
- [27] Jie Zhang, Haoyu Bu, Hui Wen, Yongji Liu, Haiqiang Fei, Rongrong Xi, Lun Li, Yun Yang, Hongsong Zhu, and Dan Meng. 2025. When LLMs meet cybersecurity: a systematic literature review. Cybersecurity, 8(1):55.
- [28]
- [29] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043.
- [30] Yunkai Zou, Maoxiang An, and Ding Wang. 2025. Password guessing using large language models. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC '25, USA. USENIX Association.