pith. machine review for the scientific record.

arxiv: 2604.09861 · v1 · submitted 2026-04-10 · 💻 cs.AI · cs.NE

Recognition: unknown

Evolutionary Token-Level Prompt Optimization for Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:58 UTC · model grok-4.3

classification 💻 cs.AI cs.NE
keywords prompt optimization · genetic algorithm · diffusion models · token vectors · text-to-image generation · evolutionary computation · prompt engineering

The pith

A genetic algorithm optimizes prompts for diffusion models by evolving token vectors to improve aesthetic quality and text alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates automating the process of finding effective prompts for generating images from text using diffusion models. Instead of changing the words in the prompt, it evolves the underlying numerical representations of those words using a genetic algorithm. The evolution is guided by a score that rewards both visually appealing images and close matches to the original prompt description. A reader might care because these models often need careful wording to produce desired results, and this offers a systematic way to search for better versions without endless manual testing. Tests on a set of 36 prompts indicate improvements over previous automated techniques.

Core claim

The central claim is that a genetic algorithm can be used for prompt optimization in text-to-image diffusion models by directly evolving the token vectors. This is done by optimizing a fitness function that combines measures of aesthetic quality with prompt-image alignment. On 36 prompts from a standard dataset, this approach outperforms baseline methods such as text rewriting and random search, with fitness improvements reaching up to 23.93%. The method is adaptable to image generation models with tokenized text encoders and provides a modular framework.

What carries the argument

A genetic algorithm that directly evolves the token vectors used by the text encoder in diffusion models, guided by a fitness function that combines aesthetic quality and prompt-image alignment measures.
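As a concrete illustration, the evolutionary loop could look like the sketch below. Everything in it is an assumption rather than the paper's implementation: the abstract specifies neither the genetic operators nor the hyperparameters, so the truncation selection, token-level uniform crossover, Gaussian mutation, and every default value are placeholders, and `fitness_fn` stands in for the aesthetic-plus-alignment score computed through the diffusion model.

```python
import numpy as np

def evolve_token_vectors(seed_vectors, fitness_fn, pop_size=16,
                         generations=50, mutation_scale=0.05, rng=None):
    """Minimal elitist GA over token-vector prompts (illustrative only).

    seed_vectors: (n_tokens, dim) array of initial token embeddings.
    fitness_fn:   maps a (n_tokens, dim) candidate to a scalar score.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    # Initialize the population as noisy copies of the seed prompt.
    pop = [seed_vectors + mutation_scale * rng.standard_normal(seed_vectors.shape)
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness_fn(ind) for ind in pop])
        # Truncation selection: keep the top half as parents (elitism).
        parents = [pop[i] for i in np.argsort(scores)[::-1][:pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.choice(len(parents), size=2, replace=False)
            # Uniform crossover at the token level, then Gaussian mutation.
            mask = rng.random(seed_vectors.shape[0]) < 0.5
            child = np.where(mask[:, None], parents[a], parents[b])
            child = child + mutation_scale * rng.standard_normal(child.shape)
            children.append(child)
        pop = parents + children
    scores = np.array([fitness_fn(ind) for ind in pop])
    return pop[int(np.argmax(scores))]
```

Because the parents are carried over unchanged, the best fitness in the population is non-decreasing across generations; the expensive part in practice is that each fitness evaluation requires a full image generation.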

Load-bearing premise

That an automated combination of aesthetic quality and prompt alignment scores reliably indicates desirable image results, and that directly changing the token vectors provides a meaningful way to explore the input space.
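The premise can be made concrete with a minimal proxy-fitness sketch. The equal weighting, the normalization constants, and the function names are all assumptions; the abstract does not state how the two scores are combined (the ledger below flags exactly this).

```python
def combined_fitness(image, prompt, aesthetic_fn, clip_fn,
                     w_aesthetic=0.5, w_align=0.5):
    """Proxy fitness: weighted sum of an aesthetic score and a
    prompt-image alignment score. Weights are illustrative; the
    abstract does not specify how the two terms are mixed.
    """
    # Bring both terms to a roughly comparable scale before mixing.
    a = aesthetic_fn(image) / 10.0   # LAION Aesthetic V2 scores fall in ~[0, 10]
    c = clip_fn(image, prompt)       # CLIPScore is roughly on a [0, 1] scale
    return w_aesthetic * a + w_align * c
```

The referee's objection is visible directly in this form: any monotone rescaling of either term, or a different weight split, changes which candidates win selection, which is why an ablation over the weights matters.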

What would settle it

A side-by-side comparison by human judges of images from the evolved prompts versus those from the baseline methods, where the absence of consistent preference for the evolved versions would challenge the claim.

Figures

Figures reproduced from arXiv: 2604.09861 by Domício Pereira Neto, João Correia, Penousal Machado.

Figure 1
Figure 1: General structure and workflow of the evolutionary token optimization studied in this work.
Figure 2
Figure 2: Final outputs from baseline SDXL Turbo, GA Mutated, GA Empty, GA Random, Random Search, and …
Figure 3
Figure 3: Final outputs from baseline SDXL Turbo, GA Mutated, GA Empty, GA Random, Random Search, and …
original abstract

Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms the baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions, the limitations and prospects of which are discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a genetic algorithm (GA) to optimize text prompts for CLIP-based diffusion models by directly evolving token vectors in embedding space rather than rewriting text. A composite fitness function is defined using the LAION Aesthetic Predictor V2 and CLIPScore; the GA is evaluated on 36 prompts from the Parti Prompts (P2) dataset and reported to outperform Promptist and random search by up to 23.93% in fitness.

Significance. If the empirical superiority holds under proper validation, the work supplies a model-agnostic, evolutionary framework for automated prompt optimization that could reduce manual trial-and-error for any tokenized text encoder. The modular design and discussion of limitations are positive; however, the absence of human-preference correlation or embedding-validity checks limits the practical significance of the reported fitness gains.

major comments (3)
  1. [Experiments] Experiments section: the 23.93% fitness improvement is presented without reported variance across runs, number of independent trials, or statistical significance tests; this makes it impossible to assess whether the gains over Promptist and random search are reliable or could be explained by stochasticity in the GA or the diffusion sampler.
  2. [Methods] Methods, fitness definition: the linear combination of LAION Aesthetic V2 and CLIPScore is introduced without ablation on the weighting coefficients or any correlation analysis against human ratings; consequently the claim that higher fitness corresponds to subjectively better images remains unanchored.
  3. [Method] Token-vector evolution: no mechanism or post-hoc check is described to ensure that mutated or crossed-over vectors remain inside the support of the CLIP text encoder’s training distribution; out-of-distribution embeddings could silently degrade diffusion conditioning even while proxy scores rise.
minor comments (2)
  1. [Introduction] The abstract and introduction cite only a handful of prompt-optimization baselines; a more complete comparison table including recent token-level or gradient-based methods would strengthen the positioning.
  2. [Results] Figure captions and axis labels in the results plots should explicitly state the number of GA generations, population size, and mutation rate used for each curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to improve its rigor and clarity.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the 23.93% fitness improvement is presented without reported variance across runs, number of independent trials, or statistical significance tests; this makes it impossible to assess whether the gains over Promptist and random search are reliable or could be explained by stochasticity in the GA or the diffusion sampler.

    Authors: We agree that reporting variability and statistical significance is necessary to substantiate the claims. In the revised manuscript, we will rerun all experiments over at least five independent trials with different random seeds, report mean fitness scores with standard deviations, and include paired t-tests (or Wilcoxon tests) against Promptist and random search to establish statistical significance of the observed improvements. revision: yes

  2. Referee: [Methods] Methods, fitness definition: the linear combination of LAION Aesthetic V2 and CLIPScore is introduced without ablation on the weighting coefficients or any correlation analysis against human ratings; consequently the claim that higher fitness corresponds to subjectively better images remains unanchored.

    Authors: We accept that the weighting coefficients require justification. We will add an ablation study in the revised paper that varies the relative weights (e.g., 0.3/0.7, 0.5/0.5, 0.7/0.3) and reports the resulting fitness and qualitative image quality. While a comprehensive human preference study lies outside the scope of this work, we will cite prior validation of both predictors against human judgments and include a small-scale qualitative review of selected outputs by the authors in the appendix. We will also revise the text to present the fitness function explicitly as a proxy metric rather than claiming direct subjective superiority. revision: partial

  3. Referee: [Method] Token-vector evolution: no mechanism or post-hoc check is described to ensure that mutated or crossed-over vectors remain inside the support of the CLIP text encoder’s training distribution; out-of-distribution embeddings could silently degrade diffusion conditioning even while proxy scores rise.

    Authors: This is a legitimate concern. Although the fitness function is evaluated directly through the diffusion model, we will add a post-hoc analysis in the revised version that computes the average Euclidean distance of evolved token vectors to the nearest original CLIP vocabulary embeddings. We will also report any observed degradation in image coherence and discuss the risk of out-of-distribution embeddings in the limitations section, along with a simple projection heuristic that could be applied in future extensions. revision: yes
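The paired tests promised in response 1 are straightforward to set up. A stdlib-only sketch of the paired t statistic over per-prompt fitness scores (the data layout, per-prompt averaging over seeds, is an assumption); with 36 prompts (df = 35), |t| above roughly 2.03 would indicate significance at the 5% level.

```python
import math

def paired_t_statistic(a, b):
    """Paired t statistic for per-prompt fitness scores of two methods.

    a, b: equal-length sequences, one score per prompt (e.g. averaged
    over independent seeded runs). Assumes the paired differences are
    not all identical (sample variance must be nonzero).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

A Wilcoxon signed-rank test on the same paired differences, as the authors propose, is the natural nonparametric companion when normality of the differences is doubtful.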
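The post-hoc check proposed in response 3 amounts to a nearest-neighbor distance query against the text encoder's vocabulary embedding table. A sketch, assuming both arrays are plain matrices; large minimum distances would flag evolved vectors drifting off the embedding manifold:

```python
import numpy as np

def nearest_vocab_distances(evolved, vocab):
    """Euclidean distance from each evolved token vector to its nearest
    vocabulary embedding.

    evolved: (n_tokens, dim) array of evolved token vectors.
    vocab:   (vocab_size, dim) embedding table of the text encoder.
    """
    # Pairwise squared distances via ||e - v||^2 = ||e||^2 - 2 e.v + ||v||^2.
    e2 = (evolved ** 2).sum(axis=1, keepdims=True)   # (n, 1)
    v2 = (vocab ** 2).sum(axis=1)                    # (V,)
    d2 = e2 - 2.0 * evolved @ vocab.T + v2           # (n, V)
    # Clamp tiny negative values from floating-point cancellation.
    return np.sqrt(np.maximum(d2, 0.0).min(axis=1))
```

The projection heuristic the authors mention could reuse the same computation: snap any vector whose minimum distance exceeds a threshold back to its nearest vocabulary embedding.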

Circularity Check

0 steps flagged

No significant circularity detected.

full rationale

The paper applies a standard genetic algorithm to evolve CLIP token vectors, with fitness explicitly defined as a linear combination of two external, pre-trained predictors (LAION Aesthetic V2 and CLIPScore). Experimental claims consist of comparative fitness scores on a fixed 36-prompt subset of Parti Prompts against Promptist and random search; these are direct measurements of the stated objective rather than self-referential derivations. No equations reduce to their own inputs by construction, no load-bearing self-citations justify core premises, and no ansatz or uniqueness result is imported from prior author work. The method is therefore self-contained as an empirical optimization procedure whose outputs are evaluated against independently trained proxies.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard genetic-algorithm machinery plus two external scoring models whose combination is treated as an effective fitness signal; no implementation details are given.

free parameters (2)
  • Fitness combination weights
    How the LAION Aesthetic Predictor V2 and CLIPScore are weighted or normalized is not stated.
  • Genetic algorithm hyperparameters
    Population size, number of generations, mutation rate, and selection mechanism are required for the method but absent from the abstract.
axioms (2)
  • domain assumption Directly evolving token vectors produces valid conditioning signals for the diffusion model
    The method assumes token-level mutation remains inside the embedding manifold without additional constraints or repair steps.
  • domain assumption LAION Aesthetic Predictor V2 plus CLIPScore is a sufficient proxy for prompt quality
    The fitness function is used to drive selection without reported validation against human judgments.
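To make the ledger's free parameters explicit, a hypothetical configuration object collecting everything the abstract leaves unspecified; every default here is a placeholder, not a value reported by the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GAConfig:
    """Free parameters the ledger flags as unreported.
    All defaults are illustrative placeholders."""
    population_size: int = 16
    generations: int = 50
    mutation_rate: float = 0.1      # per-token probability of mutation
    mutation_scale: float = 0.05    # std of the Gaussian noise added
    tournament_size: int = 3        # selection-mechanism parameter
    w_aesthetic: float = 0.5        # fitness weight: LAION Aesthetic V2
    w_alignment: float = 0.5        # fitness weight: CLIPScore
```

Reporting a table like this (plus seeds and number of trials) would resolve the free-parameter entries above without lengthening the paper.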

pith-pipeline@v0.9.0 · 5467 in / 1548 out tokens · 104221 ms · 2026-05-10T16:58:30.831586+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June:10674–10685, 2022. ISSN 10636919. doi: 10.1109/CVPR52688.2022.01042

  2. [2]

    Optimizing prompts for text-to-image generation

    Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 66923–66939. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/d346d91...

  3. [3]

    Evolving prompts for synthetic image generation with genetic algorithm

    Khoi Dinh Tran, Dat Viet Bui, and Ngoc Hoang Luong. Evolving prompts for synthetic image generation with genetic algorithm. In 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pages 1–6, 2023. doi: 10.1109/MAPR59823.2023.10288925

  4. [4]

    Generating adversarial examples through latent space exploration of generative adversarial networks

    Luana Clare and João Correia. Generating adversarial examples through latent space exploration of generative adversarial networks. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, GECCO ’23 Companion, page 1760–1767, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701207. doi: 10.1145/3583133....

  5. [5]

    Promptcharm: Text-to-image generation through multi-modal prompting and refinement

    Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024

  6. [6]

    Evolving the embedding space of diffusion models in the field of visual arts

    Marcel Salvenmoser and Michael Affenzeller. Evolving the embedding space of diffusion models in the field of visual arts. In Artificial Intelligence in Music, Sound, Art and Design: 14th International Conference, EvoMUSART 2025, Held as Part of EvoStar 2025, Trieste, Italy, April 23–25, 2025, Proceedings, page 402–416, Berlin, Heidelberg, 2025. Springer-Ve...

  7. [7]

    What’s in a text-to-image prompt? The potential of stable diffusion in visual arts education

    Nassim Dehouche and Kullathida Dehouche. What’s in a text-to-image prompt? The potential of stable diffusion in visual arts education. Heliyon, 9(6):e16757, 2023. ISSN 2405-8440. doi: 10.1016/j.heliyon.2023.e16757

  8. [8]

    Test-time prompt refinement for text-to-image models

    Mohammad Abdul Hafeez Khan, Yash Jain, Siddhartha Bhattacharyya, and Vibhav Vineet. Test-time prompt refinement for text-to-image models. ArXiv, abs/2507.22076, 2025

  9. [9]

    Prompt stealing attacks against text-to-image generation models

    Xinyue Shen, Yiting Qu, Michael Backes, and Yang Zhang. Prompt stealing attacks against text-to-image generation models. arXiv, 2024. URL https://arxiv.org/abs/2302.09923

  10. [10]

    Promptify: Text-to-image generation through interactive prompt exploration with large language models

    Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. Promptify: Text-to-image generation through interactive prompt exploration with large language models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 97984007013...

  11. [11]

    Evolutionary algorithms

    Thomas Bartz-Beielstein, Jürgen Branke, Jörn Mehnen, and Olaf Mersmann. Evolutionary algorithms. Wiley Int. Rev. Data Min. and Knowl. Disc., 4(3):178–195, May 2014. doi: 10.1002/widm.1124. URL https://doi.org/10.1002/widm.1124

  12. [12]

    Evogen-prompt-evolution

    Magnus Petersen. Evogen-prompt-evolution. https://github.com/MagnusPetersen/EvoGen-Prompt-Evolution, 2022. Accessed: 2023-07-16

  13. [13]

    IF by DeepFloyd

    DeepFloyd Team. IF by DeepFloyd. https://github.com/deep-floyd/IF, 2023. Accessed: 2025-10-08

  14. [14]

    PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. ArXiv, abs/2310.00426, 2023. URL https://arxiv.org/abs/2310.00426

  15. [16]

    URL https://arxiv.org/abs/2307.01952

  16. [17]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 87–103, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-73016-0

  17. [18]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine ...

  18. [19]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020. ISSN 1532-4435

  19. [20]

    Laion-aesthetics, 8 2022

    Christoph Schuhmann. Laion-aesthetics, 8 2022. URL https://laion.ai/blog/laion-aesthetics/

  20. [21]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. ArXiv, abs/2104.08718, 2021. URL https://api.semanticscholar.org/CorpusID:233296711