Evolutionary Optimization Trumps Adam Optimization on Embedding Space Exploration

Domício Pereira Neto; João Correia; Penousal Machado

arXiv: 2511.03913 · v2 · submitted 2025-11-05 · cs.NE · cs.AI

Pith reviewed 2026-05-18 00:23 UTC · model grok-4.3

keywords: prompt embedding optimization · evolutionary strategy · sep-CMA-ES · Adam optimizer · Stable Diffusion · inference-time optimization · aesthetic evaluation · image generation

The pith

Evolutionary optimization with sep-CMA-ES outperforms Adam when searching prompt embeddings for Stable Diffusion XL Turbo.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares a gradient-free evolutionary optimizer, sep-CMA-ES, to the gradient-based Adam optimizer for finding better prompt embeddings in the Stable Diffusion XL Turbo model. It uses an objective that balances aesthetic quality from the LAION Aesthetic Predictor V2 and prompt alignment via CLIPScore, with different weight settings. Across 36 sampled prompts from Parti Prompts, the evolutionary method consistently reaches higher objective scores. This matters because it offers a way to control image generation at inference time without the cost of fine-tuning the model. The authors also track how much the optimized images diverge from the unoptimized ones and measure the resources used.

Core claim

On 36 prompts from Parti Prompts, under three weight settings for an objective that combines LAION Aesthetic Predictor V2 and CLIPScore, sep-CMA-ES achieves higher objective values than Adam when optimizing prompt embeddings for Stable Diffusion XL Turbo. The paper also measures how far the optimized images diverge from the unoptimized baseline, using cosine similarity and SSIM, and reports compute and memory use.
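To make the shape of such a weighted objective concrete, here is a minimal sketch. It assumes a convex combination of two normalized scores; the normalization ranges and the numeric weights for the three settings are hypothetical, since the summary above gives only qualitative labels for them.

```python
def normalize(value, low, high):
    """Map value from [low, high] to [0, 1], clipping outside the range."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def weighted_objective(aesthetic, clip_score, w_aesthetic):
    """Convex combination of a normalized aesthetic score and a CLIPScore.

    The ranges below are assumptions for illustration, not values from the paper.
    """
    a = normalize(aesthetic, 1.0, 10.0)   # assumed aesthetic predictor range
    c = normalize(clip_score, 0.0, 1.0)   # assumed CLIPScore range
    return w_aesthetic * a + (1.0 - w_aesthetic) * c


# Hypothetical numeric weights for the three qualitative settings.
SETTINGS = {"aesthetics-only": 1.0, "balanced": 0.5, "alignment-only": 0.0}

if __name__ == "__main__":
    for name, w in SETTINGS.items():
        print(name, weighted_objective(6.2, 0.31, w))
```

With this form, the aesthetics-only and alignment-only settings reduce to optimizing a single scorer, which is why the balanced setting is the one that actually tests the trade-off.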

What carries the argument

sep-CMA-ES, a gradient-free evolution strategy that adapts only the diagonal of the covariance matrix, which keeps its time and memory cost linear in the number of dimensions. It searches the high-dimensional prompt embedding space for higher values of the weighted aesthetic and alignment objective.
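For readers unfamiliar with the method, the following is a compact sketch of a diagonal-covariance CMA-ES in the style of sep-CMA-ES, written against a toy objective. It is not the authors' implementation: the parameter defaults follow the commonly published CMA-ES recommendations, the (n + 2) / 3 learning-rate scaling is an assumption taken from the usual description of the separable variant, and the function name `sep_cma_es` is made up for this example.

```python
import math
import random


def sep_cma_es(f, x0, sigma0, generations=400, seed=0):
    """Maximize f with a diagonal-covariance CMA-ES (sep-CMA-ES style sketch)."""
    rng = random.Random(seed)
    n = len(x0)
    lam = 4 + int(3 * math.log(n))            # population size
    mu = lam // 2                              # number of selected parents
    w = [math.log(mu + 0.5) - math.log(i) for i in range(1, mu + 1)]
    total = sum(w)
    w = [wi / total for wi in w]               # recombination weights
    mueff = 1.0 / sum(wi * wi for wi in w)

    cs = (mueff + 2) / (n + mueff + 5)
    ds = 1 + 2 * max(0.0, math.sqrt((mueff - 1) / (n + 1)) - 1) + cs
    cc = (4 + mueff / n) / (n + 4 + 2 * mueff / n)
    scale = (n + 2) / 3                        # assumed speed-up for the diagonal model
    c1 = min(1.0, scale * 2 / ((n + 1.3) ** 2 + mueff))
    cmu = min(1.0 - c1,
              scale * 2 * (mueff - 2 + 1 / mueff) / ((n + 2) ** 2 + mueff))
    chi_n = math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n * n))

    m, sigma = list(x0), sigma0
    var = [1.0] * n                            # diagonal of the covariance matrix
    ps, pc = [0.0] * n, [0.0] * n              # evolution paths

    for g in range(1, generations + 1):
        pop = []
        for _ in range(lam):
            z = [rng.gauss(0.0, 1.0) for _ in range(n)]
            y = [math.sqrt(var[j]) * z[j] for j in range(n)]
            x = [m[j] + sigma * y[j] for j in range(n)]
            pop.append((f(x), z, y))
        pop.sort(key=lambda t: t[0], reverse=True)   # best (highest f) first
        best = pop[:mu]
        zw = [sum(w[i] * best[i][1][j] for i in range(mu)) for j in range(n)]
        yw = [sum(w[i] * best[i][2][j] for i in range(mu)) for j in range(n)]
        m = [m[j] + sigma * yw[j] for j in range(n)]

        # Step-size path; for a diagonal model C^(-1/2) * yw equals zw.
        k = math.sqrt(cs * (2 - cs) * mueff)
        ps = [(1 - cs) * ps[j] + k * zw[j] for j in range(n)]
        ps_norm = math.sqrt(sum(p * p for p in ps))
        stalled = ps_norm / math.sqrt(1 - (1 - cs) ** (2 * g)) \
            >= (1.4 + 2 / (n + 1)) * chi_n
        hsig = 0.0 if stalled else 1.0
        k = hsig * math.sqrt(cc * (2 - cc) * mueff)
        pc = [(1 - cc) * pc[j] + k * yw[j] for j in range(n)]

        # Rank-one and rank-mu updates applied per coordinate only.
        for j in range(n):
            rank_mu = sum(w[i] * best[i][2][j] ** 2 for i in range(mu))
            rank_one = pc[j] ** 2 + (1 - hsig) * cc * (2 - cc) * var[j]
            var[j] = (1 - c1 - cmu) * var[j] + c1 * rank_one + cmu * rank_mu
        sigma *= math.exp((cs / ds) * (ps_norm / chi_n - 1))
    return m


if __name__ == "__main__":
    # Toy stand-in for the image objective: maximize -||x||^2 in 10 dimensions.
    best = sep_cma_es(lambda v: -sum(t * t for t in v), [3.0] * 10, 1.0)
    print(sum(t * t for t in best))
```

In the paper's setting `f` would wrap image generation plus scoring, so each call is expensive; the sketch only needs objective values, never gradients, which is the property the comparison with Adam turns on.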

If this is right

sep-CMA-ES provides an effective inference-time optimizer for prompt-embedding search in diffusion models.
It improves trade-offs between aesthetics and alignment without requiring model fine-tuning.
Resource usage in terms of compute and memory can be compared directly between the two optimizers.
The divergence of optimized images from baseline can be quantified using cosine similarity and SSIM.
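As a rough illustration of the two divergence measures, the sketch below computes cosine similarity between two vectors and a single-window ("global") SSIM between two flattened grayscale images. Both are simplifications made for this example: standard SSIM averages the same formula over local windows, and which embeddings the paper feeds into the cosine measure is not stated here, so the inputs are placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_ssim(x, y, data_range=1.0):
    """SSIM computed once over the whole image (no sliding window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2          # standard stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))


if __name__ == "__main__":
    baseline = [0.1, 0.5, 0.9, 0.3]        # placeholder pixel values in [0, 1]
    optimized = [0.2, 0.4, 0.8, 0.3]
    print(cosine_similarity(baseline, optimized))
    print(global_ssim(baseline, optimized))
```

Values near 1 for either measure indicate that the optimized image stayed close to the baseline; lower values indicate larger drift.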

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar evolutionary optimizers might outperform gradient methods in other embedding optimization tasks where the objective landscape is non-smooth.
This method could extend to controlling other generative models at inference time for specific goals.
Future work might test whether these gains hold when using different aesthetic predictors or alignment measures.

Load-bearing premise

The specific objective function and the choice of 36 prompts under three weight settings must amount to a fair test whose outcome generalizes beyond this setup.

What would settle it

If additional experiments on more prompts or different models show Adam achieving equal or higher objective values on average, the consistent superiority of sep-CMA-ES would be called into question.

Figures

Figures reproduced from arXiv: 2511.03913 by Domício Pereira Neto, João Correia, Penousal Machado.

**Figure 1.** General structure and workflow of EIGO. The main components and their respective inputs and outputs are ...

**Figure 2.** Distribution of categories (left plot) and challenge types (right plot) related to the 36 prompts that were ...

**Figure 3.** Mean fitness evolution comparison between Adam (blue line) and sep-CMA-ES (orange line) for each ...

**Figure 4.** Final outputs comparison between the baseline (non-optimized SDXL Turbo), Adam, and sep-CMA-ES for ...

**Figure 5.** Final outputs comparison between the baseline (non-optimized SDXL Turbo), Adam, and sep-CMA-ES ...

**Figure 6.** Final outputs comparison between the baseline (non-optimized SDXL Turbo), Adam, and sep-CMA-ES for ...

**Figure 7.** Cosine Distance (left plot) and SSIM (right plot) averages between the final image for each approach and the ...

Original abstract

Deep diffusion models have revolutionized image generation by producing high-quality outputs. However, achieving specific objectives with these models often requires costly adaptations such as fine-tuning, which can be resource-intensive and time-consuming. An alternative approach is inference-time control, which involves optimizing the prompt embeddings to guide the generation process without altering the model weights. We explore prompt-embedding search optimization for the Stable Diffusion XL Turbo model, comparing a gradient-free evolutionary approach, the Separable Covariance Matrix Adaptation Evolution Strategy (sep-CMA-ES), against the widely used gradient-based optimizer Adaptive Moment Estimation (Adam). Candidate images are evaluated by a weighted objective that combines LAION Aesthetic Predictor V2 and CLIPScore, enabling explicit trade-offs between aesthetic quality and prompt-image alignment. On 36 prompts sampled from Parti Prompts (P2) under three weight settings (aesthetics-only, balanced, alignment-only), sep-CMA-ES consistently achieves higher objective values than Adam. We additionally analyze divergence from the unoptimized baseline using cosine similarity and SSIM and report the compute and memory footprints. These results suggest that sep-CMA-ES is an effective inference-time optimizer for prompt-embedding search, improving aesthetics-alignment trade-offs and resource usage without model fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that for inference-time prompt embedding optimization in Stable Diffusion XL Turbo, the gradient-free sep-CMA-ES evolutionary optimizer consistently outperforms the gradient-based Adam optimizer when maximizing a weighted objective combining LAION Aesthetic Predictor V2 and CLIPScore. This is demonstrated on 36 prompts sampled from Parti Prompts under three explicit weight settings (aesthetics-only, balanced, alignment-only), with additional reporting of cosine similarity/SSIM divergence from the unoptimized baseline and compute/memory footprints.

Significance. If the reported outperformance holds under controlled evaluation budgets and properly tuned baselines, the result would indicate that evolutionary strategies can offer advantages over gradient descent for non-convex prompt-embedding search in diffusion models. This could support more efficient inference-time control methods that avoid model fine-tuning while improving aesthetics-alignment trade-offs.

major comments (3)

[Abstract and §4] Abstract and §4 (Experimental Results): the central claim of 'consistent' outperformance by sep-CMA-ES across 36 prompts and three weight settings is presented without statistical tests, per-prompt variances, standard deviations, or confidence intervals. This makes it impossible to determine whether observed differences exceed run-to-run variability.
[§3 and §4] §3 (Experimental Protocol) and §4: no information is given on the total number of objective evaluations, wall-clock time, or iteration budgets allocated to each optimizer. Because sep-CMA-ES is gradient-free while Adam uses gradients, unequal evaluation budgets or initialization strategies could produce the reported gap without reflecting intrinsic optimizer superiority.
[§3] §3: Adam-specific hyperparameters (learning rate, betas, scheduler, or any tuning protocol) are not reported. Without evidence that Adam was given a fair, well-tuned baseline, the headline comparison that 'evolutionary optimization trumps Adam' cannot be interpreted as a general result.

minor comments (2)

[§3] Provide explicit numerical weights for the three settings (aesthetics-only, balanced, alignment-only) rather than qualitative labels.
[§4] Include a table summarizing mean objective values, standard deviations, and win rates per weight setting to support the 'consistently achieves higher' statement.
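The kind of summary the referee asks for can be sketched in a few lines. The example below computes per-optimizer means, standard deviations, a win rate, and a paired Wilcoxon signed-rank p-value with a Bonferroni adjustment for three weight settings. It is an illustration only: the p-value uses the normal approximation without tie or continuity corrections, the function names are made up for this sketch, and the scores in the usage section are synthetic placeholders, not results from the paper.

```python
import math
import random
import statistics


def wilcoxon_signed_rank(a, b):
    """Two-sided paired Wilcoxon p-value (normal approximation, no corrections)."""
    d = [x - y for x, y in zip(a, b) if x != y]    # drop zero differences
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                   # average ranks over ties
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p, num_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * num_tests)


def summarize(a, b, num_tests=3):
    """Mean, standard deviation, win rate of a over b, and adjusted p-value."""
    return {
        "mean_a": statistics.mean(a), "std_a": statistics.stdev(a),
        "mean_b": statistics.mean(b), "std_b": statistics.stdev(b),
        "win_rate_a": sum(x > y for x, y in zip(a, b)) / len(a),
        "p_adjusted": bonferroni(wilcoxon_signed_rank(a, b), num_tests),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic placeholder scores for 36 prompts; NOT data from the paper.
    scores_b = [rng.uniform(0.4, 0.6) for _ in range(36)]
    scores_a = [s + rng.uniform(-0.02, 0.08) for s in scores_b]
    print(summarize(scores_a, scores_b))
```

A paired test is the natural choice here because both optimizers are run on the same prompts, so per-prompt differences remove prompt difficulty as a source of variance.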

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental rigor. We address each major comment point-by-point below and indicate the revisions planned for the next manuscript version.

Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the central claim of 'consistent' outperformance by sep-CMA-ES across 36 prompts and three weight settings is presented without statistical tests, per-prompt variances, standard deviations, or confidence intervals. This makes it impossible to determine whether observed differences exceed run-to-run variability.

Authors: We agree that statistical support is necessary to substantiate the consistency claim. In the revised manuscript we will add per-prompt standard deviations (computed from the multiple independent runs already performed) and report the results of paired statistical tests (Wilcoxon signed-rank test with Bonferroni correction) comparing sep-CMA-ES and Adam objective values under each weight setting. These additions will appear in a new subsection of §4 and will be summarized in the abstract.
revision: yes

Referee: [§3 and §4] §3 (Experimental Protocol) and §4: no information is given on the total number of objective evaluations, wall-clock time, or iteration budgets allocated to each optimizer. Because sep-CMA-ES is gradient-free while Adam uses gradients, unequal evaluation budgets or initialization strategies could produce the reported gap without reflecting intrinsic optimizer superiority.

Authors: This point is well taken; explicit budget reporting is required for interpretability. The revised §3 will state that both optimizers were allocated an identical budget of 1000 objective evaluations per prompt (with the same random seed for initialization of the embedding), and §4 will include tables of wall-clock time and iteration counts on the same hardware. We maintain that the comparison is therefore controlled, but we will make the equality of budgets explicit so readers can verify it.
revision: yes

Referee: [§3] §3: Adam-specific hyperparameters (learning rate, betas, scheduler, or any tuning protocol) are not reported. Without evidence that Adam was given a fair, well-tuned baseline, the headline comparison that 'evolutionary optimization trumps Adam' cannot be interpreted as a general result.

Authors: We accept that full hyperparameter transparency is essential. The revised §3 will document the exact Adam configuration used (learning rate 1e-3, betas=(0.9, 0.999), no learning-rate scheduler, and the same random initialization as sep-CMA-ES) together with a brief description of the limited grid search performed to select the learning rate. If the referee believes additional tuning is warranted, we are prepared to conduct and report it in a follow-up experiment.
revision: yes
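For reference, the Adam configuration quoted in the simulated rebuttal (learning rate 1e-3, betas (0.9, 0.999), no scheduler, 1000 steps) corresponds to the update rule sketched below, written as gradient ascent because the objective is maximized. The toy quadratic objective and the name `adam_ascent` are assumptions for this example; in the paper's setting the gradient would come from backpropagating through the generation and scoring pipeline, and treating one Adam step as one objective evaluation is also an assumption.

```python
import math


def adam_ascent(grad, x0, steps=1000, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Maximize an objective with Adam, given a function returning its gradient."""
    b1, b2 = betas
    x = list(x0)
    m = [0.0] * len(x)                      # first-moment estimate
    v = [0.0] * len(x)                      # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(x)
        for j in range(len(x)):
            m[j] = b1 * m[j] + (1 - b1) * g[j]
            v[j] = b2 * v[j] + (1 - b2) * g[j] ** 2
            m_hat = m[j] / (1 - b1 ** t)    # bias correction
            v_hat = v[j] / (1 - b2 ** t)
            x[j] += lr * m_hat / (math.sqrt(v_hat) + eps)   # ascent step
    return x


if __name__ == "__main__":
    target = [0.2, -0.1]                    # hypothetical optimum of a toy objective

    def objective(x):
        return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

    def gradient(x):
        return [-2 * (xi - ti) for xi, ti in zip(x, target)]

    start = [0.5, 0.3]
    end = adam_ascent(gradient, start)
    print(objective(start), objective(end))
```

Because each Adam step moves a coordinate by roughly the learning rate at most, a budget of 1000 steps at 1e-3 bounds how far the embedding can travel from its initialization, which is one reason the step size and budget matter for a fair comparison.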

Circularity Check

0 steps flagged

No circularity: empirical optimizer comparison is self-contained

The paper reports direct experimental runs of sep-CMA-ES versus Adam on prompt-embedding optimization for Stable Diffusion XL Turbo, using an external objective (weighted LAION Aesthetic Predictor V2 plus CLIPScore) evaluated on 36 Parti Prompts under three weight settings. No derivation chain, equations, or first-principles predictions are present; the central claim consists of measured objective values, cosine/SSIM divergence, and resource footprints obtained by executing the two optimizers. Because the results rest on independent empirical evaluation against a fixed external scorer rather than any fitted parameter, self-citation, or ansatz that reduces to the input, the work is self-contained with no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the pre-trained aesthetic and CLIP predictors as proxies and on the assumption that the two optimizers were given comparable search budgets; no new entities are introduced.

free parameters (2)

objective weights
Three discrete weight settings (aesthetics-only, balanced, alignment-only) are chosen to explore trade-offs.
optimizer hyperparameters
Standard settings for sep-CMA-ES and Adam are presumably tuned for the embedding space.

axioms (1)

domain assumption: LAION Aesthetic Predictor V2 and CLIPScore together form a reliable scalar proxy for desired image quality.
The abstract uses this composite score to rank candidate embeddings.

[1] [1]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models.Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June:10674–10685, 2022

work page 2022

[2] [2]

A Comprehensive Survey of Image Generation Models Based on Deep Learning.Annals of Data Science, 12(1):141–170, 2025

Jun Li, Chenyang Zhang, Wei Zhu, and Yawei Ren. A Comprehensive Survey of Image Generation Models Based on Deep Learning.Annals of Data Science, 12(1):141–170, 2025

work page 2025

[3] [3]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[4] [4]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

work page 2019

[5] [5]

Springer Nature Singapore, Singapore, 2024

João Correia, Francisco Baeta, and Tiago Martins.Evolutionary Generative Models, pages 283–329. Springer Nature Singapore, Singapore, 2024

work page 2024

[6] [6]

A simple modification in cma-es achieving linear time and space complexity

Raymond Ros and Nikolaus Hansen. A simple modification in cma-es achieving linear time and space complexity. In Günter Rudolph, Thomas Jansen, Nicola Beume, Simon Lucas, and Carlo Poloni, editors,Parallel Problem Solving from Nature – PPSN X, pages 296–305, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg

work page 2008

[7] [7]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 87–103, Cham, 2025. Springer Nature Switzerland

work page 2024

[8] [8]

Laion-aesthetics, 8 2022

Christoph Schuhmann. Laion-aesthetics, 8 2022

work page 2022

[9] [9]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.ArXiv, abs/2104.08718, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Conditional Generative Adversarial Nets

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets.ArXiv, abs/1411.1784, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[11] [11]

Semantic image synthesis with spatially-adaptive normalization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2332–2341, 2019

work page 2019

[12] [12]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023

work page 2023

[13] [13]

Imagen 3.arXiv preprint arXiv:2408.07009, 2024

Imagen-Team-Google et al. Imagen 3.arXiv preprint arXiv:2408.07009, 2024. 13 Evolutionary Optimization Trumps Adam Optimization on Embedding Space Manipulation and Optimization

work page arXiv 2024

[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, ICML...

[15] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv, 2025.

[16] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20406–20417, October 2023.

[17] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: an open dataset of user preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[18] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam. Rich human feedback for text-to-image generation. In 2024 IEEE/CVF Conference...

[19] Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8018–8027, 2024.

[20] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[21] Khoi Dinh Tran, Dat Viet Bui, and Ngoc Hoang Luong. Evolving prompts for synthetic image generation with genetic algorithm. In 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pages 1–6, 2023.

[22] Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 66923–66939. Curran Associates, Inc., 2023.

[23] Melvin Wong, Yew-Soon Ong, Abhishek Gupta, Kavitesh Kumar Bali, and Caishun Chen. Prompt evolution for generative AI: A classifier-guided approach. In 2023 IEEE Conference on Artificial Intelligence (CAI), pages 226–229, 2023.

[24] Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY, USA, 2024. Association for Computing Machinery.

[25] WeiJie Li, Jin Wang, and Xuejie Zhang. Promptist: Automated prompt optimization for text-to-image synthesis. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part II, pages 295–306, Berlin, Heidelberg, 2024. Springer-Verlag.

[26] Tiago Martins, João M. Cunha, João Correia, and Penousal Machado. Towards the Evolution of Prompts with MetaPrompter. In Colin Johnson, Nereida Rodriguez-Fernandez, and Sergio M. Rebelo, editors, Artificial Intelligence in Music, Sound, Art and Design, pages 180–195, Cham, 2023. Springer Nature Switzerland.

[27] Victor Costa, Nuno Lourenço, João Correia, and Penousal Machado. Exploring generative adversarial networks for text-to-image generation with evolution strategies. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, GECCO ’23 Companion, pages 271–274, New York, NY, USA, 2023. Association for Computing Machinery.

[28] Haruka Kobayashi, Adam Kotaro Pindur, Suryanarayanan Nagar Anthel Venkatesh, and Hitoshi Iba. Image generation with diffusion model by interactive evolutionary computation. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2984–2990, 2023.

[29] Luana Clare and João Correia. Generating adversarial examples through latent space exploration of generative adversarial networks. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, GECCO ’23 Companion, pages 1760–1767, New York, NY, USA, 2023. Association for Computing Machinery.

[30] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023-June:1900–1910, 2023.

[31] Hu Yu, Hao Luo, Fan Wang, and Feng Zhao. Uncovering the Text Embedding in Text-to-Image Diffusion Models. ArXiv, abs/2404.01154, 2024.

[32] Dominik Sobania, Martin Briesch, and Franz Rothlauf. ImageBreeder: Guiding Diffusion Models with Evolutionary Computation, pages 463–471. Association for Computing Machinery, New York, NY, USA, 2025.

[33] Marcel Salvenmoser and Michael Affenzeller. Evolving the embedding space of diffusion models in the field of visual arts. In Artificial Intelligence in Music, Sound, Art and Design: 14th International Conference, EvoMUSART 2025, Held as Part of EvoStar 2025, Trieste, Italy, April 23–25, 2025, Proceedings, pages 402–416, Berlin, Heidelberg, 2025. Springer-Verlag.

[34] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.

[35] DeepFloyd Team. IF by DeepFloyd. https://github.com/deep-floyd/IF, 2023. Accessed: 2025-10-08.

[36] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. ArXiv, abs/2310.00426, 2023.

[37] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. ArXiv, abs/2307.01952, 2023.

[38] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.

[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine ...

[40] Ilya Loshchilov. A computationally efficient limited memory CMA-ES for large scale optimization. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 397–404, New York, NY, USA, 2014. Association for Computing Machinery.

[41] J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1942–1948, 1995.

[42] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun... Qwen-Image Technical Report. arXiv, 2025.

[43] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 4302–4310, Red Hook, NY, USA, 2017. Curran Associates Inc.