Evolutionary Optimization Trumps Adam Optimization on Embedding Space Exploration
Pith reviewed 2026-05-18 00:23 UTC · model grok-4.3
The pith
Evolutionary optimization with sep-CMA-ES outperforms Adam when searching prompt embeddings for Stable Diffusion XL Turbo.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On 36 prompts from Parti Prompts under three weight settings for the objective combining LAION Aesthetic Predictor V2 and CLIPScore, sep-CMA-ES achieves higher objective values than Adam when optimizing prompt embeddings for Stable Diffusion XL Turbo, while also allowing analysis of divergence via cosine similarity and SSIM and reporting of compute and memory use.
What carries the argument
sep-CMA-ES as a gradient-free evolutionary strategy that adapts the covariance matrix to search the high-dimensional prompt embedding space for higher values of the weighted aesthetic and alignment objective.
If this is right
- sep-CMA-ES provides an effective inference-time optimizer for prompt-embedding search in diffusion models.
- It improves trade-offs between aesthetics and alignment without requiring model fine-tuning.
- Resource usage in terms of compute and memory can be compared directly between the two optimizers.
- The divergence of optimized images from baseline can be quantified using cosine similarity and SSIM.
Where Pith is reading between the lines
- Similar evolutionary optimizers might outperform gradient methods in other embedding optimization tasks where the objective landscape is non-smooth.
- This method could extend to controlling other generative models at inference time for specific goals.
- Future work might test whether these gains hold when using different aesthetic predictors or alignment measures.
Load-bearing premise
That the specific objective function and the choice of 36 prompts under the three weight settings create a fair test that generalizes beyond this setup.
What would settle it
If additional experiments on more prompts or different models show Adam achieving equal or higher objective values on average, the consistent superiority of sep-CMA-ES would be called into question.
Figures
read the original abstract
Deep diffusion models have revolutionized image generation by producing high-quality outputs. However, achieving specific objectives with these models often requires costly adaptations such as fine-tuning, which can be resource-intensive and time-consuming. An alternative approach is inference-time control, which involves optimizing the prompt embeddings to guide the generation process without altering the model weights. We explore prompt-embedding search optimization for the Stable Diffusion XL Turbo model, comparing a gradient-free evolutionary approach, the Separable Covariance Matrix Adaptation Evolution Strategy (sep-CMA-ES), against the widely used gradient-based optimizer Adaptive Moment Estimation (Adam). Candidate images are evaluated by a weighted objective that combines LAION Aesthetic Predictor V2 and CLIPScore, enabling explicit trade-offs between aesthetic quality and prompt-image alignment. On 36 prompts sampled from Parti Prompts (P2) under three weight settings (aesthetics-only, balanced, alignment-only), sep-CMA-ES consistently achieves higher objective values than Adam. We additionally analyze divergence from the unoptimized baseline using cosine similarity and SSIM and report the compute and memory footprints. These results suggest that sep-CMA-ES is an effective inference-time optimizer for prompt-embedding search, improving aesthetics-alignment trade-offs and resource usage without model fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that for inference-time prompt embedding optimization in Stable Diffusion XL Turbo, the gradient-free sep-CMA-ES evolutionary optimizer consistently outperforms the gradient-based Adam optimizer when maximizing a weighted objective combining LAION Aesthetic Predictor V2 and CLIPScore. This is demonstrated on 36 prompts sampled from Parti Prompts under three explicit weight settings (aesthetics-only, balanced, alignment-only), with additional reporting of cosine similarity/SSIM divergence from the unoptimized baseline and compute/memory footprints.
Significance. If the reported outperformance holds under controlled evaluation budgets and properly tuned baselines, the result would indicate that evolutionary strategies can offer advantages over gradient descent for non-convex prompt-embedding search in diffusion models. This could support more efficient inference-time control methods that avoid model fine-tuning while improving aesthetics-alignment trade-offs.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Results): the central claim of 'consistent' outperformance by sep-CMA-ES across 36 prompts and three weight settings is presented without statistical tests, per-prompt variances, standard deviations, or confidence intervals. This makes it impossible to determine whether observed differences exceed run-to-run variability.
- [§3 and §4] §3 (Experimental Protocol) and §4: no information is given on the total number of objective evaluations, wall-clock time, or iteration budgets allocated to each optimizer. Because sep-CMA-ES is gradient-free while Adam uses gradients, unequal evaluation budgets or initialization strategies could produce the reported gap without reflecting intrinsic optimizer superiority.
- [§3] §3: Adam-specific hyperparameters (learning rate, betas, scheduler, or any tuning protocol) are not reported. Without evidence that Adam was given a fair, well-tuned baseline, the headline comparison that 'evolutionary optimization trumps Adam' cannot be interpreted as a general result.
minor comments (2)
- [§3] Provide explicit numerical weights for the three settings (aesthetics-only, balanced, alignment-only) rather than qualitative labels.
- [§4] Include a table summarizing mean objective values, standard deviations, and win rates per weight setting to support the 'consistently achieves higher' statement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of experimental rigor. We address each major comment point-by-point below and indicate the revisions planned for the next manuscript version.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the central claim of 'consistent' outperformance by sep-CMA-ES across 36 prompts and three weight settings is presented without statistical tests, per-prompt variances, standard deviations, or confidence intervals. This makes it impossible to determine whether observed differences exceed run-to-run variability.
Authors: We agree that statistical support is necessary to substantiate the consistency claim. In the revised manuscript we will add per-prompt standard deviations (computed from the multiple independent runs already performed) and report the results of paired statistical tests (Wilcoxon signed-rank test with Bonferroni correction) comparing sep-CMA-ES and Adam objective values under each weight setting. These additions will appear in a new subsection of §4 and will be summarized in the abstract. revision: yes
-
Referee: [§3 and §4] §3 (Experimental Protocol) and §4: no information is given on the total number of objective evaluations, wall-clock time, or iteration budgets allocated to each optimizer. Because sep-CMA-ES is gradient-free while Adam uses gradients, unequal evaluation budgets or initialization strategies could produce the reported gap without reflecting intrinsic optimizer superiority.
Authors: This point is well taken; explicit budget reporting is required for interpretability. The revised §3 will state that both optimizers were allocated an identical budget of 1000 objective evaluations per prompt (with the same random seed for initialization of the embedding), and §4 will include tables of wall-clock time and iteration counts on the same hardware. We maintain that the comparison is therefore controlled, but we will make the equality of budgets explicit so readers can verify it. revision: yes
-
Referee: [§3] §3: Adam-specific hyperparameters (learning rate, betas, scheduler, or any tuning protocol) are not reported. Without evidence that Adam was given a fair, well-tuned baseline, the headline comparison that 'evolutionary optimization trumps Adam' cannot be interpreted as a general result.
Authors: We accept that full hyperparameter transparency is essential. The revised §3 will document the exact Adam configuration used (learning rate 1e-3, betas=(0.9, 0.999), no learning-rate scheduler, and the same random initialization as sep-CMA-ES) together with a brief description of the limited grid search performed to select the learning rate. If the referee believes additional tuning is warranted, we are prepared to conduct and report it in a follow-up experiment. revision: yes
Circularity Check
No circularity: empirical optimizer comparison is self-contained
full rationale
The paper reports direct experimental runs of sep-CMA-ES versus Adam on prompt-embedding optimization for Stable Diffusion XL Turbo, using an external objective (weighted LAION Aesthetic Predictor V2 plus CLIPScore) evaluated on 36 Parti Prompts under three weight settings. No derivation chain, equations, or first-principles predictions are present; the central claim consists of measured objective values, cosine/SSIM divergence, and resource footprints obtained by executing the two optimizers. Because the results rest on independent empirical evaluation against a fixed external scorer rather than any fitted parameter, self-citation, or ansatz that reduces to the input, the work is self-contained with no circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- objective weights
- optimizer hyperparameters
axioms (1)
- domain assumption LAION Aesthetic Predictor V2 and CLIPScore together form a reliable scalar proxy for desired image quality.
Reference graph
Works this paper leans on
-
[1]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models.Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June:10674–10685, 2022
work page 2022
-
[2]
Jun Li, Chenyang Zhang, Wei Zhu, and Yawei Ren. A Comprehensive Survey of Image Generation Models Based on Deep Learning.Annals of Data Science, 12(1):141–170, 2025
work page 2025
-
[3]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[4]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019
work page 2019
-
[5]
Springer Nature Singapore, Singapore, 2024
João Correia, Francisco Baeta, and Tiago Martins.Evolutionary Generative Models, pages 283–329. Springer Nature Singapore, Singapore, 2024
work page 2024
-
[6]
A simple modification in cma-es achieving linear time and space complexity
Raymond Ros and Nikolaus Hansen. A simple modification in cma-es achieving linear time and space complexity. In Günter Rudolph, Thomas Jansen, Nicola Beume, Simon Lucas, and Carlo Poloni, editors,Parallel Problem Solving from Nature – PPSN X, pages 296–305, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg
work page 2008
-
[7]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 87–103, Cham, 2025. Springer Nature Switzerland
work page 2024
- [8]
-
[9]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.ArXiv, abs/2104.08718, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[10]
Conditional Generative Adversarial Nets
Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets.ArXiv, abs/1411.1784, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[11]
Semantic image synthesis with spatially-adaptive normalization
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2332–2341, 2019
work page 2019
-
[12]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023
work page 2023
-
[13]
Imagen 3.arXiv preprint arXiv:2408.07009, 2024
Imagen-Team-Google et al. Imagen 3.arXiv preprint arXiv:2408.07009, 2024. 13 Evolutionary Optimization Trumps Adam Optimization on Embedding Space Manipulation and Optimization
-
[14]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Machine Learning, ICML...
work page 2024
-
[15]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20406–20417, October 2023
work page 2023
-
[17]
Pick-a-pic: an open dataset of user preferences for text-to-image generation
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: an open dataset of user preferences for text-to-image generation. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc
work page 2023
-
[18]
Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont- Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam. Rich human feedback for text-to- image generation. In2024 IEEE/CVF Conference...
work page 2024
-
[19]
Learning multi-dimensional human preference for text-to-image generation
Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8018–8027, 2024
work page 2024
-
[20]
Imagereward: learning and evaluating human preferences for text-to-image generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: learning and evaluating human preferences for text-to-image generation. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc
work page 2023
-
[21]
Evolving prompts for synthetic image generation with genetic algorithm
Khoi Dinh Tran, Dat Viet Bui, and Ngoc Hoang Luong. Evolving prompts for synthetic image generation with genetic algorithm. In2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pages 1–6, 2023
work page 2023
-
[22]
Optimizing prompts for text-to-image generation
Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 66923–66939. Curran Associates, Inc., 2023
work page 2023
-
[23]
Prompt evolution for generative ai: A classifier-guided approach
Melvin Wong, Yew-Soon Ong, Abhishek Gupta, Kavitesh Kumar Bali, and Caishun Chen. Prompt evolution for generative ai: A classifier-guided approach. In2023 IEEE Conference on Artificial Intelligence (CAI), pages 226–229, 2023
work page 2023
-
[24]
Promptcharm: Text-to-image generation through multi-modal prompting and refinement
Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery
work page 2024
-
[25]
Promptist: Automated prompt optimization for text-to-image synthesis
WeiJie Li, Jin Wang, and Xuejie Zhang. Promptist: Automated prompt optimization for text-to-image synthesis. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part II, page 295–306, Berlin, Heidelberg, 2024. Springer-Verlag
work page 2024
-
[26]
Cunha, João Correia, and Penousal Machado
Tiago Martins, João M. Cunha, João Correia, and Penousal Machado. Towards the Evolution of Prompts with MetaPrompter. In Colin Johnson, Nereida Rodriguez-Fernandez, and Sergio M. Rebelo, editors,Artificial Intelligence in Music, Sound, Art and Design, pages 180–195, Cham, 2023. Springer Nature Switzerland
work page 2023
-
[27]
Exploring generative adversarial networks for text-to-image generation with evolution strategies
Victor Costa, Nuno Lourenço, João Correia, and Penousal Machado. Exploring generative adversarial networks for text-to-image generation with evolution strategies. InProceedings of the Companion Conference on Genetic and Evolutionary Computation, GECCO ’23 Companion, page 271–274, New York, NY , USA, 2023. Association for Computing Machinery
work page 2023
-
[28]
Image generation with diffusion model by interactive evolutionary computation
Haruka Kobayashi, Adam Kotaro Pindur, Suryanarayanan Nagar Anthel Venkatesh, and Hitoshi Iba. Image generation with diffusion model by interactive evolutionary computation. In2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2984–2990, 2023
work page 2023
-
[29]
Generating adversarial examples through latent space exploration of generative adversarial networks
Luana Clare and João Correia. Generating adversarial examples through latent space exploration of generative adversarial networks. InProceedings of the Companion Conference on Genetic and Evolutionary Computation, GECCO ’23 Companion, page 1760–1767, New York, NY , USA, 2023. Association for Computing Machinery. 14 Evolutionary Optimization Trumps Adam Op...
work page 2023
-
[30]
Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models.Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023-June:1900–1910, 2023
work page 2023
-
[31]
Uncovering the Text Embedding in Text-to-Image Diffusion Models
Hu Yu, Hao Luo, Fan Wang, and Feng Zhao. Uncovering the Text Embedding in Text-to-Image Diffusion Models. ArXiv, abs/2404.01154, 2024
-
[32]
Association for Computing Machinery, New York, NY , USA, 2025
Dominik Sobania, Martin Briesch, and Franz Rothlauf.ImageBreeder: Guiding Diffusion Models with Evolution- ary Computation, page 463–471. Association for Computing Machinery, New York, NY , USA, 2025
work page 2025
-
[33]
Evolving the embedding space of diffusion models in the field of visual arts
Marcel Salvenmoser and Michael Affenzeller. Evolving the embedding space of diffusion models in the field of visual arts. InArtificial Intelligence in Music, Sound, Art and Design: 14th International Conference, EvoMUSART 2025, Held as Part of EvoStar 2025, Trieste, Italy, April 23–25, 2025, Proceedings, page 402–416, Berlin, Heidelberg, 2025. Springer-Verlag
work page 2025
-
[34]
Completely derandomized self-adaptation in evolution strategies
Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001
work page 2001
-
[35]
If by deepfloyd.https://github.com/deep-floyd/IF, 2023
DeepFloyd Team. If by deepfloyd.https://github.com/deep-floyd/IF, 2023. Accessed: 2025-10-08
work page 2023
-
[36]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.ArXiv, abs/2310.00426, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.ArXiv, abs/2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Distilling the knowledge in a neural network, 2015
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015
work page 2015
-
[39]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine ...
work page 2021
-
[40]
A computationally efficient limited memory cma-es for large scale optimization
Ilya Loshchilov. A computationally efficient limited memory cma-es for large scale optimization. InProceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14, page 397–404, New York, NY , USA, 2014. Association for Computing Machinery
work page 2014
-
[41]
J. Kennedy and R. Eberhart. Particle swarm optimization. InProceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1942–1948 vol.4, 1995
work page 1942
-
[42]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InProceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 4302–4310, Red Hook, NY , USA, 2017. Curran Associates Inc. 15 Evolutionary Optimization Trumps Adam Optimiz...
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.