Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection
Pith reviewed 2026-05-20 06:21 UTC · model grok-4.3
The pith
Attention to core tokens in the first few denoising steps predicts which random seeds produce high-quality, prompt-aligned images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that attention dynamics over prompt core tokens, measured during the first few denoising steps, strongly predict final generation quality. They introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play approach that ranks seeds for a given prompt by cross-attention to those core tokens, retains only the top-k for complete generation, and discards the rest. Experiments across three benchmarks show consistent improvements in text-image alignment and visual quality for Stable Diffusion models, supported by human preference and automatic metrics.
What carries the argument
The central mechanism is the observed predictive correlation between early cross-attention maps on prompt core tokens (the content-bearing words) and the eventual image quality, which ABSS exploits to score and rank seeds before committing to full denoising runs.
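The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `select_top_k_seeds` and `score_fn` are hypothetical names, the toy scores are made up, and the sketch assumes that a higher core-token attention score marks a better seed, which this page does not state explicitly.

```python
from typing import Callable, List, Sequence, Tuple


def select_top_k_seeds(
    seeds: Sequence[int],
    score_fn: Callable[[int], float],
    k: int,
) -> List[Tuple[int, float]]:
    """Score every candidate seed once and keep the k best (seed, score) pairs.

    score_fn is a hypothetical stand-in for "run a few denoising steps and
    read off the core-token cross-attention score". Higher is assumed better.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    scored = [(seed, score_fn(seed)) for seed in seeds]
    # Sort by score, best first; no accept/reject threshold is involved.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy scores standing in for real early-step attention measurements.
toy_scores = {11: 0.42, 23: 0.77, 35: 0.15, 47: 0.61, 59: 0.58}
selected = select_top_k_seeds(list(toy_scores), toy_scores.__getitem__, k=3)
print(selected)
```

Because the rule is purely relative, it always returns k seeds for a non-empty pool of at least k candidates, however low the absolute scores are.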
If this is right
- ABSS produces consistent gains in prompt alignment and visual quality without retraining or altering the base diffusion model.
- The method serves as a lightweight pre-filter that can be added to existing seed-optimization pipelines for further gains.
- Early discarding of low-scoring seeds avoids full computation on generations unlikely to succeed.
- Results hold across multiple Stable Diffusion variants and are verified by both automatic metrics and human judgments.
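The compute argument in the list above can be made concrete with a rough count of denoiser calls. The sketch below is a back-of-the-envelope estimate under stated assumptions: the function names are hypothetical, classifier-free guidance overhead is ignored, and the 50-step schedule is an assumed value. The pool size of 10, top-3 selection, and scoring at the 10th step are the ABSS settings quoted elsewhere on this page. Whether the partially denoised latents of the selected seeds are reused is not stated here, so both cases are shown.

```python
def nfe_full_pool(pool_size: int, total_steps: int) -> int:
    """Denoiser calls needed to fully generate every seed in the pool."""
    return pool_size * total_steps


def nfe_with_preselection(
    pool_size: int,
    k: int,
    early_steps: int,
    total_steps: int,
    reuse_partial_latents: bool = True,
) -> int:
    """Denoiser calls when all seeds are scored early and only k are finished.

    reuse_partial_latents=True assumes the selected seeds resume from the
    step where they were scored; False assumes they restart from step 0.
    """
    if not (0 < k <= pool_size):
        raise ValueError("k must be in 1..pool_size")
    if not (0 < early_steps <= total_steps):
        raise ValueError("early_steps must be in 1..total_steps")
    scoring_cost = pool_size * early_steps
    per_seed_rest = total_steps - early_steps if reuse_partial_latents else total_steps
    return scoring_cost + k * per_seed_rest


print(nfe_full_pool(10, 50))                      # every seed run to completion
print(nfe_with_preselection(10, 3, 10, 50))       # resume the top-3 after scoring
print(nfe_with_preselection(10, 3, 10, 50, reuse_partial_latents=False))
```

Under these assumptions the saving grows with the pool size and shrinks as the scoring step moves later in the schedule.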
Where Pith is reading between the lines
- Early attention patterns may serve as a general probe for semantic fidelity in diffusion processes before later steps refine details.
- The approach could transfer to other generative architectures that use cross-attention or similar internal signals.
- Combining the attention score with additional lightweight checks might allow even earlier termination of poor seeds.
Load-bearing premise
The correlation between early cross-attention to core tokens and final image quality generalizes across prompts, Stable Diffusion variants, and benchmarks without model-specific tuning or threshold selection.
What would settle it
The claim would fail if, on a fresh set of prompts or a different diffusion backbone, images from ABSS top-k seeds showed no improvement over randomly chosen seeds in human preference ratings or in standard alignment metrics such as CLIP score.
Original abstract
Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this "seed effect" and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that cross-attention dynamics to prompt core tokens (content-bearing words) during the first few denoising steps of text-to-image diffusion models strongly predict final image quality and alignment. It introduces ABSS, a training-free inference-only method that ranks candidate seeds by this attention signal, retains only the top-k for full generation, and reports consistent gains in alignment metrics and human preference studies across three benchmarks on Stable Diffusion variants.
Significance. If the early-attention correlation generalizes, ABSS would be a lightweight, plug-and-play addition to existing diffusion pipelines that mitigates seed sensitivity without training or model changes. The training-free nature and lack of fixed thresholds are genuine strengths that distinguish it from optimization-based seed search methods.
major comments (3)
- [Abstract and §3 (Method)] The claim that attention dynamics over core tokens 'strongly predict' final quality is central, yet it is unsupported by any reported correlation coefficient, R² value, or statistical test. Without these numbers the ranking justification remains qualitative, and the top-k selection rule lacks a validated decision criterion.
- [§4 (Experiments)] No ablation or quantitative breakdown is given for the core-token extraction method (POS tagging vs. attention thresholding vs. manual curation), the precise attention statistic (mean, max, or sum over steps), or the early-step window size. These choices are load-bearing for the 'no fixed threshold' and 'generalizes without per-model tuning' assertions.
- [§4.2 and Table 2] The reported gains on three benchmarks lack error bars, controls for prompt difficulty, and a comparison against a fixed-threshold baseline. This weakens the claim that ABSS reliably outperforms random seed selection beyond the tested Stable Diffusion variants.
minor comments (2)
- [§2 (Related Work)] The discussion of prior seed-optimization methods could contrast ABSS more explicitly with CLIP-guided or reward-model approaches to clarify the novelty of the attention-based ranking.
- [Figure 3] The attention-map visualizations would benefit from explicit annotation of which tokens are designated 'core' and of the exact timestep range used for scoring.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Abstract and §3 (Method)] The claim that attention dynamics over core tokens 'strongly predict' final quality is central, yet it is unsupported by any reported correlation coefficient, R² value, or statistical test. Without these numbers the ranking justification remains qualitative, and the top-k selection rule lacks a validated decision criterion.
Authors: We acknowledge that the current manuscript supports the predictive relationship primarily through qualitative visualizations and illustrative examples in Section 3. To address this, we will add quantitative analysis to the revised Section 3, including Pearson correlation coefficients and R² values computed between the early-step core-token attention scores and the final CLIP alignment and human preference scores over a large sample of prompts and seeds. This will provide a validated statistical basis for the ranking criterion.
revision: yes
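The statistics promised in this response are simple to compute once per-seed scores are available. The following is a minimal sketch with made-up numbers: `pearson_r` and `r_squared` are hypothetical helper names, and R² is taken to be the coefficient of determination of a one-variable linear fit, which equals the squared Pearson coefficient. The paper may define or report these quantities differently.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of a simple linear fit of ys on xs (the squared Pearson r)."""
    return pearson_r(xs, ys) ** 2


# Made-up per-seed values: early attention score vs. final alignment score.
attention = [0.21, 0.35, 0.40, 0.52, 0.66, 0.71]
alignment = [0.24, 0.27, 0.31, 0.30, 0.36, 0.38]
print(pearson_r(attention, alignment), r_squared(attention, alignment))
```

A rank correlation such as Spearman's would arguably match a top-k rule more closely, since only the ordering of seeds matters for selection; that is a suggestion, not something the rebuttal commits to.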
- Referee: [§4 (Experiments)] No ablation or quantitative breakdown is given for the core-token extraction method (POS tagging vs. attention thresholding vs. manual curation), the precise attention statistic (mean, max, or sum over steps), or the early-step window size. These choices are load-bearing for the 'no fixed threshold' and 'generalizes without per-model tuning' assertions.
Authors: We agree that systematic ablations are needed to substantiate the design choices. In the revised manuscript we will include a dedicated ablation subsection in §4 that reports quantitative results for alternative core-token extraction methods (POS tagging, attention-based thresholding, and manual curation), different aggregation statistics (mean, max, sum), and varying early-step windows (e.g., steps 1-5, 1-10, 5-15). These experiments will directly support the claims of threshold-free operation and generalization.
revision: yes
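The aggregation choices named in this response (mean, max, or sum over an early-step window) differ only in how per-step scores are collapsed into one number per seed. A minimal sketch, assuming one scalar core-token attention score per denoising step is already available; `aggregate_window` is a hypothetical name and the per-step values are invented.

```python
from typing import Sequence


def aggregate_window(
    per_step_scores: Sequence[float],
    first_step: int,
    last_step: int,
    statistic: str = "mean",
) -> float:
    """Collapse per-step scores over steps first_step..last_step (1-indexed, inclusive)."""
    if not (1 <= first_step <= last_step <= len(per_step_scores)):
        raise ValueError("window must lie inside the recorded steps")
    window = per_step_scores[first_step - 1:last_step]
    if statistic == "mean":
        return sum(window) / len(window)
    if statistic == "max":
        return max(window)
    if statistic == "sum":
        return sum(window)
    raise ValueError(f"unknown statistic: {statistic}")


# Invented scores for the first ten denoising steps of one seed.
per_step = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
for stat in ("mean", "max", "sum"):
    print(stat, aggregate_window(per_step, 1, 5, stat), aggregate_window(per_step, 1, 10, stat))
```

Note that sum and mean produce the same seed ranking when every seed uses the same window, so an ablation would mainly separate max from the other two and one window from another.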
- Referee: [§4.2 and Table 2] The reported gains on three benchmarks lack error bars, controls for prompt difficulty, and a comparison against a fixed-threshold baseline. This weakens the claim that ABSS reliably outperforms random seed selection beyond the tested Stable Diffusion variants.
Authors: We appreciate this observation. We will revise Table 2 and the associated figures to include error bars (standard deviation across repeated runs with different random seeds). We will also add a prompt-difficulty stratification analysis and a direct comparison against a fixed-threshold baseline that accepts seeds only when their attention score exceeds a preset value. These additions will be placed in §4.2 of the revised version.
revision: yes
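The difference between the proposed fixed-threshold baseline and top-k ranking shows up most clearly on prompts where every seed scores low. The sketch below uses invented scores and hypothetical function names, and assumes higher scores are better; it illustrates the two selection rules only and says nothing about which performs better in practice.

```python
from typing import Dict, List


def select_by_threshold(scores: Dict[int, float], threshold: float) -> List[int]:
    """Fixed-threshold baseline: keep every seed whose score reaches the threshold."""
    kept = [seed for seed, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda seed: scores[seed], reverse=True)


def select_top_k(scores: Dict[int, float], k: int) -> List[int]:
    """Threshold-free rule: keep the k best-scoring seeds, whatever their absolute scores."""
    ranked = sorted(scores, key=lambda seed: scores[seed], reverse=True)
    return ranked[:k]


# Invented scores for an "easy" prompt and a "hard" prompt (seed -> score).
easy_prompt = {1: 0.82, 2: 0.64, 3: 0.71, 4: 0.55}
hard_prompt = {1: 0.31, 2: 0.22, 3: 0.40, 4: 0.18}

for name, scores in (("easy", easy_prompt), ("hard", hard_prompt)):
    print(name, select_by_threshold(scores, 0.5), select_top_k(scores, 2))
```

With a preset value of 0.5, the baseline accepts all four seeds for the easy prompt and none for the hard one, while top-k returns two seeds in both cases. This is why prompt-difficulty stratification matters for a fair comparison.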
Circularity Check
No significant circularity: empirical proxy used without definitional reduction
The paper's central claim rests on an empirical observation that cross-attention dynamics to prompt core tokens in early denoising steps correlate with final image quality. ABSS then applies this observed correlation as a ranking criterion for seed selection. No equations, fitted parameters, or self-citations are shown that define the ranking score in terms of the target quality metric itself or reduce the prediction to the input by construction. The method is explicitly training-free and operates at inference time on attention maps derived directly from the diffusion process. This is a standard self-contained empirical approach rather than a circular derivation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (relation to the cited Recognition theorem: unclear)
  Paper passage: early-stage cross-attention on core tokens is a strong predictor of final prompt alignment and image quality
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (relation to the cited Recognition theorem: unclear)
  Paper passage: M^s_t(B) = (1 / (|B|·H·W)) · Σ (core-token attention after Gaussian smoothing)
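The quoted score reads as an average of Gaussian-smoothed attention over the core tokens in B and the H×W spatial positions. The following pure-Python sketch is one possible reading of that passage, not the paper's code: the 3×3 kernel, the border renormalization, and the function names are all assumptions, and the paper's exact smoothing parameters, index conventions, and layer or head averaging are not given on this page.

```python
from typing import List

AttentionMap = List[List[float]]  # H rows of W attention values for one token

# Assumed 3x3 binomial approximation of a Gaussian kernel (weights sum to 16).
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_smooth(attn: AttentionMap) -> AttentionMap:
    """Smooth one map; border pixels renormalize over the in-bounds weights."""
    height, width = len(attn), len(attn[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc, weight_sum = 0.0, 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        w = KERNEL[di + 1][dj + 1]
                        acc += w * attn[ii][jj]
                        weight_sum += w
            out[i][j] = acc / weight_sum
    return out


def core_token_score(core_maps: List[AttentionMap]) -> float:
    """Mean smoothed attention over all core tokens and all H x W positions."""
    if not core_maps:
        raise ValueError("need at least one core-token attention map")
    height, width = len(core_maps[0]), len(core_maps[0][0])
    total = 0.0
    for attn in core_maps:
        total += sum(sum(row) for row in gaussian_smooth(attn))
    return total / (len(core_maps) * height * width)


# Two toy 4x4 maps, one per core token, for a single seed at a single step.
map_a = [[0.2] * 4 for _ in range(4)]
map_b = [[0.4] * 4 for _ in range(4)]
print(core_token_score([map_a, map_b]))
```

With constant maps the smoothing leaves the values unchanged, so the toy score is simply the average of the two constants. On real attention maps the smoothing would damp isolated spikes before averaging.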
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A Definition of Core Tokens
In ABSS, core tokens are the content-bearing words that specify th...
- GOLDEN. We use the first 100 prompts from InitNO and DrawBench, and the first 50 prompts from Pick-a-Pic, as a validation set to extract the “golden seeds”; all remaining prompts are used for evaluation. Specifically, GOLDEN ranks candidate seeds by their average HPS-v2 score on the validation set, and then applies the top-ranked seeds to all test prompts in the...
- NS. For a fair comparison with the ABSS setting using a 10-seed pool, we use K=10 noise candidates per prompt instead of the default candidate number of 100. Following the official implementation, each candidate is ranked by its DDIM inversion stability: we first run 50-step DDIM sampling from z_T to z_0, then perform 50-step DDIM inversion back to z′_T, a...
- INITNO. We follow the official INITNO implementation, optimizing the initial noise mean and log-variance with Adam using learning rate 1×10⁻². We use the default attention thresholds τ_cross = 0.2 and τ_self = 0.3, with at most 5 restart rounds of 10 optimization steps. According to the pseudo-code in INITNO, each denoising optimization step invokes one full de...
- AE. We follow the official AE latent optimization setting, using attention maps at 16×16 resolution. Latent updates are applied within the first 25 denoising steps with scale factor 20, and iterative refinement is triggered at steps 10 and 20 using thresholds τ_cross = 0.2 and τ_self = 0.3, with at most 20 refinement steps. Final images are decoded after the st...
- ND. We do not follow the default ND setting, as its original configuration uses 50 optimization epochs and 50 noise candidates, which incurs very high computational cost. Instead, for a fairer comparison, we reduce the setting to 10 optimization epochs and 10 noise candidates. We otherwise follow the official ND implementation, which uses VQAScore with cl...
- CORE2. We follow the official CORE2 implementation on SD 3.5-Large with the released noise model checkpoint. The refinement module uses a LoRA-based PromptSD35Net with 28 LoRA slots and rank 64. We set the weak-to-strong guidance scale to 1.5 and apply the refinement branch at every denoising step. For each prompt, we sample 3 images with different rand...
- NPNET. We follow the official NPNET inference pipeline and use the released pretrained noise-prompt model for each backbone. For Hunyuan-DiT, we use the DiT branch with the released dit.pth checkpoint to predict the golden initial noise. For each prompt, we generate 3 golden-noise samples with deterministic seeds and use the generated images as the NPNETba...
- ABSS. We use a seed pool of 10 candidates per prompt and select the top-3 seeds for final image generation. For SD 1.x and SD 2.x, attention maps are collected at the 10th denoising step across all layers and heads, using spatial resolution 16×16 for SD 1.x and 24×24 for SD 2.0/2.1. Since this requires a full forward pass, the coarse NFE per reported imag...