Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

Hao Li

arxiv: 2604.25314 · v1 · submitted 2026-04-28 · 💻 cs.CV

Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

Hao Li This is my paper

Pith reviewed 2026-05-07 16:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords compositional text-to-imagediffusion modelsregion-aware noisecross-attentionadaptive blendingprompt fidelitystarting noisemulti-region generation

0 comments

The pith

Region-aware noise prediction with adaptive blending lets diffusion models better respect multiple distinct entities in a single text prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the core limitation in compositional text-to-image generation is the global text embedding used to predict starting noise, which cannot adequately capture spatially separated regions. It introduces Golden RPG as a lightweight extension to a frozen noise predictor, adding per-region FiLM adapters and cross-attention to reshape noise locally plus a blending head that scales the regional signal by predicted . If the approach works, generated images would show stronger alignment with each part of a multi-region prompt while keeping overall quality intact and adding almost no parameters or runtime cost. A sympathetic reader would view this as a targeted way to fix prompt fidelity without retraining large base models.

Core claim

Golden RPG extends a frozen NPNet with a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt and a Region Cross-Attention layer that lets spatial locations attend to different sub-prompt tokens. A Confidence-Adaptive Blending head then predicts per sample how strongly the regional signal should override the global signal, preventing degradation on prompts that are already easy. On the RPG benchmark and four multi-region categories of T2I-CompBench this produces the highest cross-region coherence while matching the best baselines on CLIP-Score and CLIP-IQA, with roughly 67 percent user preference and only 2 million added parameters plus 0.6 seconds inference

What carries the argument

The per-region FiLM adapter and Region Cross-Attention together reshape the global noise prediction locally, while the Confidence-Adaptive Blending head decides the override strength to preserve quality on simple prompts.

If this is right

Highest cross-region coherence scores on every tested category of the RPG and T2I-CompBench benchmarks.
Matching performance on absolute CLIP-Score and CLIP-IQA with the strongest baselines.
Approximately 67 percent user preference over the strongest baseline in paired studies.
Only about 2 million trainable parameters and 0.6 seconds added inference time on top of SDXL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regional noise reshaping could be tested on video or 3D generation tasks where spatial separation across frames is also a problem.
Prompt engineering effort might decrease if models can internally handle unordered or overlapping region descriptions more reliably.
Similar adaptive blending heads could be applied to other conditioning signals such as depth or segmentation maps.
The method's low overhead makes it practical to combine with future base models without full fine-tuning.

Load-bearing premise

The global text embedding is the main bottleneck for prompts with spatially separated entities, and the proposed regional adapters and blending can be added without introducing new artifacts or requiring extensive retraining.

What would settle it

An experiment on the same RPG and T2I-CompBench prompts where cross-region coherence scores do not rise or where CLIP-Score and user preference drop when the regional conditioning is enabled.

Figures

Figures reproduced from arXiv: 2604.25314 by Hao Li.

**Figure 1.** Figure 1: Comparison of RPG [20], Golden Noise [21], and our Golden RPG on a 3-region prompt: “a beautiful landscape with mountains and lake, a girl in the foreground, the moon in the background” (regions: mountains — girl in red — moon, identical SDXL seed for all three methods). RPG places each subject but the foreground is gritty and the rocks dominate the girl; Golden Noise produces a more polished illustration … view at source ↗

**Figure 2.** Figure 2: Architecture of Region-Aware NPNet. The frozen NPNet (gray) maps an isotropic seed view at source ↗

**Figure 3.** Figure 3: Training dynamics of Golden RPG (v4) warm-started view at source ↗

**Figure 4.** Figure 4: Qualitative comparison on four T2I-CompBench prompts (rows) across four methods (columns). Each row’s prompt is picked by view at source ↗

read the original abstract

Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the \emph{starting noise} of a diffusion model carries significant semantic information: ``golden'' noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially-separated entities. We introduce \textbf{Golden RPG}, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region \textbf{FiLM adapter} that reshapes the predicted noise according to each sub-prompt; and (ii) a \textbf{Region Cross-Attention} layer injected between two stages of the Swin backbone, allowing different spatial locations to attend to different sub-prompt tokens. To prevent the regional conditioning from degrading samples whose prompts are already easy, we further propose a \textbf{Confidence-Adaptive Blending} head that dynamically predicts, per sample, how strongly the regional signal should override the global signal. We evaluate on the original RPG benchmark (20 prompts, 100 samples) and on four multi-region categories of T2I-CompBench (1{,}200 images, six competing methods). Golden RPG achieves the highest Cross-Region-Coherence score on every category, while matching the strongest baselines on absolute CLIP-Score and CLIP-IQA. A paired user study further shows a $\boldsymbol{\sim}$67\% preference over the strongest baseline. The adapter contains $\sim$2M trainable parameters and adds only $0.6$\,s of inference overhead on top of SDXL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Golden RPG adds per-region FiLM adapters and cross-attention to frozen golden noise prediction plus an adaptive blend, delivering coherence gains on standard benchmarks with low overhead, but the blend's safety on easy prompts rests on untested assumptions.

read the letter

The main point is that this paper improves compositional text-to-image generation by making the starting noise prediction aware of different regions in the prompt, using a few added modules on a frozen base network, and it shows gains in coherence metrics plus user preference. The adaptive blending is meant to keep things from getting worse on simpler cases, but that part isn't fully checked. What is new is the combination of per-region FiLM adapters that reshape the noise for each sub-prompt, a Region Cross-Attention layer placed between Swin backbone stages to let locations attend to specific prompt tokens, and the confidence-adaptive blending head that decides the mix per sample. These are added to a frozen NPNet, keeping the parameter count low at around 2 million trainable ones and adding just 0.6 seconds to inference on SDXL. The paper does well by evaluating on the RPG benchmark with 20 prompts and on four categories from T2I-CompBench with 1200 images, comparing against six methods. It claims top scores on Cross-Region-Coherence across the board, while matching the best on CLIP-Score and CLIP-IQA, and a user study with about 67% preference over the strongest baseline. The low overhead is a practical advantage. The soft spots are in the validation of the blending mechanism. The abstract explains that the head dynamically predicts the strength of the regional signal to avoid degrading easy prompts, but there are no reported ablations on single-region or simple prompts, and no data on what confidence values the head actually outputs. This leaves open the possibility that the regional additions could introduce artifacts in cases not covered by the multi-region tests, even if the overall CLIP metrics look fine. This work is aimed at people developing or using diffusion-based T2I models who need better handling of prompts with separated objects. Readers who follow work on golden noise or adapter methods for diffusion will see the most direct value. The paper deserves a serious referee because it has a clear, implementable extension with benchmark results and human evaluation, even if additional experiments on the blending head would make the claims more robust. I would send it to peer review and suggest the reviewers look for those missing controls on the confidence-adaptive part.

Referee Report

2 major / 2 minor

Summary. The paper introduces Golden RPG, an extension to a frozen noise predictor (NPNet) for diffusion-based text-to-image models. It adds a per-region FiLM adapter and a Region Cross-Attention layer to make the initial noise prediction aware of spatially separated sub-prompts in compositional prompts. A learned Confidence-Adaptive Blending head modulates the strength of the regional signal per sample to avoid degrading global fidelity on easier prompts. Experiments on the RPG benchmark (20 prompts) and four multi-region categories of T2I-CompBench (1200 images) report the highest Cross-Region-Coherence scores across all categories while matching top baselines on CLIP-Score and CLIP-IQA; a paired user study shows ~67% preference over the strongest baseline. The method adds ~2M trainable parameters and 0.6s inference overhead.

Significance. If the confidence-adaptive blending reliably defaults to the global signal on non-compositional prompts, the approach offers a lightweight, training-efficient way to improve spatial coherence in T2I generation without retraining the base diffusion model. The low parameter count and reported inference cost are practical strengths, and the focus on the starting noise as a semantic carrier is a useful perspective. However, the significance is tempered by the lack of direct verification that the blending mechanism preserves global metrics outside the multi-region test sets.

major comments (2)

[§4] §4 (Experiments) and the description of Confidence-Adaptive Blending: the central claim that the method matches the strongest baselines on absolute CLIP-Score and CLIP-IQA while improving coherence requires that the blending head defaults to the global signal on easy/single-region prompts and does not introduce localized artifacts. No ablation results, confidence-value histograms, or evaluations on non-compositional prompts are provided to support this; without them the matching on global metrics could mask degradation that only appears outside the reported multi-region categories.
[§4] Table reporting Cross-Region-Coherence and CLIP metrics (presumably Table 1 or 2): the paper states highest coherence on every category with matched CLIP scores, but provides no error bars, statistical significance tests, or details on data exclusion rules and random seeds. This makes it difficult to assess whether the gains are robust or sensitive to the specific 20-prompt RPG set and 1200-image T2I-CompBench subset.

minor comments (2)

[Abstract] The abstract and §3.2 use the notation '1{,}200' for the image count; this is non-standard and should be written as 1,200 or 1200 for clarity.
[§4] The user-study protocol (number of participants, prompt sampling, presentation order, statistical test for the 67% preference) is only summarized; a brief appendix table with these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that stronger verification of the confidence-adaptive blending on non-compositional prompts and improved statistical reporting would strengthen the manuscript. We address each major comment below and will incorporate the suggested additions in the revised version.

read point-by-point responses

Referee: [§4] §4 (Experiments) and the description of Confidence-Adaptive Blending: the central claim that the method matches the strongest baselines on absolute CLIP-Score and CLIP-IQA while improving coherence requires that the blending head defaults to the global signal on easy/single-region prompts and does not introduce localized artifacts. No ablation results, confidence-value histograms, or evaluations on non-compositional prompts are provided to support this; without them the matching on global metrics could mask degradation that only appears outside the reported multi-region categories.

Authors: We agree that direct evidence for the blending head's behavior on non-compositional prompts is necessary to fully support the claim. In the revision we will add: (i) an evaluation of Golden RPG on single-region and non-compositional prompts drawn from standard benchmarks (e.g., a subset of COCO captions and DrawBench), (ii) histograms of the predicted per-sample confidence values stratified by prompt complexity, and (iii) an ablation that disables the blending head. These results will show that the head reliably defaults to the global signal for simpler prompts, preserving CLIP metrics without introducing artifacts. revision: yes
Referee: [§4] Table reporting Cross-Region-Coherence and CLIP metrics (presumably Table 1 or 2): the paper states highest coherence on every category with matched CLIP scores, but provides no error bars, statistical significance tests, or details on data exclusion rules and random seeds. This makes it difficult to assess whether the gains are robust or sensitive to the specific 20-prompt RPG set and 1200-image T2I-CompBench subset.

Authors: We acknowledge the need for greater statistical transparency. In the revised manuscript we will: (i) rerun all experiments with multiple random seeds (at least three) and report mean ± standard deviation for every metric, (ii) add explicit statements on random seeds, data exclusion rules (none were applied beyond the published benchmark definitions), and (iii) include paired statistical significance tests (Wilcoxon signed-rank) between Golden RPG and the strongest baseline for the Cross-Region-Coherence scores. These additions will allow readers to assess robustness directly. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural extension with external benchmark evaluation

full rationale

The paper introduces Golden RPG as a set of trainable additions (per-region FiLM adapter, Region Cross-Attention, and Confidence-Adaptive Blending head) to a frozen NPNet for compositional T2I generation. No equations, derivations, or first-principles results are claimed that reduce to fitted parameters or self-citations by construction. Performance claims rest on direct evaluation against external benchmarks (RPG benchmark and T2I-CompBench categories) and a user study, with no renaming of known results, no fitted-input predictions, and no load-bearing self-citations. The method is presented as an empirical architectural proposal whose validity is tested on independent data rather than derived tautologically from its own inputs.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to components explicitly named. The work adds trainable architectural modules rather than new physical entities. Axioms are standard diffusion-model assumptions plus the paper's stated observation about global embeddings.

free parameters (3)

FiLM adapter weights
Trainable per-region reshaping parameters added to frozen NPNet; total adapter size stated as ~2M parameters.
Region Cross-Attention weights
Injected layer between Swin backbone stages with its own trainable parameters.
Confidence-Adaptive Blending head weights
Learned head that predicts per-sample blending strength between regional and global signals.

axioms (2)

domain assumption The starting noise of a diffusion model carries significant semantic information from the text prompt.
Cited from recent golden-noise work and used as motivation for the noise predictor.
domain assumption A single global text embedding becomes the bottleneck for prompts describing spatially-separated entities.
Stated observation that motivates the region-aware extensions.

pith-pipeline@v0.9.0 · 5625 in / 1527 out tokens · 68700 ms · 2026-05-07T16:52:13.553340+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. InICML, 2023. 1, 3, 7, 9, 10

work page 2023
[2]

Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models. InSIG- GRAPH, 2023. 1, 3, 7, 9, 10

work page 2023
[3]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yan- nik Marek, and Robin Rombach. Scaling rectified flow trans- formers for high-resolution image synthesis. InICML, 2024. 1, 3

work page 2024
[4]

Training-free structured diffu- sion guidance for compositional text-to-image synthesis

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffu- sion guidance for compositional text-to-image synthesis. In ICLR, 2023. 1, 3

work page 2023
[5]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InEMNLP, 2021. 7, 8

work page 2021
[6]

GANs trained by a two time-scale update rule converge to a local nash equi- librium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equi- librium. InNeurIPS, 2017. 7

work page 2017
[7]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020. 3, 4, 10

work page 2020
[8]

T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xi- hui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. In NeurIPS Datasets and Benchmarks, 2023. 1, 2, 3, 6, 7, 8, 9, 10

work page 2023
[9]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022. 3, 4, 6

work page 2022
[10]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023. 1, 3, 12

work page 2023
[11]

Tenenbaum

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. InECCV, 2022. 1, 3

work page 2022
[12]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 2, 3, 4, 5, 7, 12

work page 2021
[13]

FiLM: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm de Vries, Vincent Du- moulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. InAAAI, 2018. 2, 3, 4, 5, 7

work page 2018
[14]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR, 2024. 1, 4, 6, 7, 9, 10, 12

work page 2024
[15]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021. 2, 3, 4, 5, 7, 12

work page 2021
[16]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InCVPR, 2022. 1, 4

work page 2022
[17]

Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. InICLR, 2021. 3

work page 2021
[18]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017. 3, 4

work page 2017
[19]

Chan, and Chen Change Loy

Jianyi Wang, Kelvin C.K. Chan, and Chen Change Loy. CLIP-IQA: Exploring clip for assessing the look and feel of images. InAAAI, 2023. 7

work page 2023
[20]

Mastering text-to-image diffu- sion: Recaptioning, planning, and generating with multi- modal LLMs

Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Ste- fano Ermon, and Bin Cui. Mastering text-to-image diffu- sion: Recaptioning, planning, and generating with multi- modal LLMs. InICML, 2024. 1, 2, 3, 4, 6, 7, 9, 10

work page 2024
[21]

Golden noise for diffusion models: A learning framework.arXiv preprint arXiv:2411.09502, 2024

Zikai Zhou, Shitong Wang, Lichen Du, Yifei Wang, Pengtao Liu, Lantao Yu, Ling Yang, Bingyi Liu, and Mengdi Wang. Golden noise for diffusion models: A learning framework. arXiv preprint arXiv:2411.09502, 2024. 1, 2, 3, 4, 7, 9, 10

work page arXiv 2024

[1] [1]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. InICML, 2023. 1, 3, 7, 9, 10

work page 2023

[2] [2]

Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models. InSIG- GRAPH, 2023. 1, 3, 7, 9, 10

work page 2023

[3] [3]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yan- nik Marek, and Robin Rombach. Scaling rectified flow trans- formers for high-resolution image synthesis. InICML, 2024. 1, 3

work page 2024

[4] [4]

Training-free structured diffu- sion guidance for compositional text-to-image synthesis

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffu- sion guidance for compositional text-to-image synthesis. In ICLR, 2023. 1, 3

work page 2023

[5] [5]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InEMNLP, 2021. 7, 8

work page 2021

[6] [6]

GANs trained by a two time-scale update rule converge to a local nash equi- librium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equi- librium. InNeurIPS, 2017. 7

work page 2017

[7] [7]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020. 3, 4, 10

work page 2020

[8] [8]

T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xi- hui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. In NeurIPS Datasets and Benchmarks, 2023. 1, 2, 3, 6, 7, 8, 9, 10

work page 2023

[9] [9]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022. 3, 4, 6

work page 2022

[10] [10]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023. 1, 3, 12

work page 2023

[11] [11]

Tenenbaum

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. InECCV, 2022. 1, 3

work page 2022

[12] [12]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 2, 3, 4, 5, 7, 12

work page 2021

[13] [13]

FiLM: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm de Vries, Vincent Du- moulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. InAAAI, 2018. 2, 3, 4, 5, 7

work page 2018

[14] [14]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR, 2024. 1, 4, 6, 7, 9, 10, 12

work page 2024

[15] [15]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021. 2, 3, 4, 5, 7, 12

work page 2021

[16] [16]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InCVPR, 2022. 1, 4

work page 2022

[17] [17]

Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. InICLR, 2021. 3

work page 2021

[18] [18]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017. 3, 4

work page 2017

[19] [19]

Chan, and Chen Change Loy

Jianyi Wang, Kelvin C.K. Chan, and Chen Change Loy. CLIP-IQA: Exploring clip for assessing the look and feel of images. InAAAI, 2023. 7

work page 2023

[20] [20]

Mastering text-to-image diffu- sion: Recaptioning, planning, and generating with multi- modal LLMs

Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Ste- fano Ermon, and Bin Cui. Mastering text-to-image diffu- sion: Recaptioning, planning, and generating with multi- modal LLMs. InICML, 2024. 1, 2, 3, 4, 6, 7, 9, 10

work page 2024

[21] [21]

Golden noise for diffusion models: A learning framework.arXiv preprint arXiv:2411.09502, 2024

Zikai Zhou, Shitong Wang, Lichen Du, Yifei Wang, Pengtao Liu, Lantao Yu, Ling Yang, Bingyi Liu, and Mengdi Wang. Golden noise for diffusion models: A learning framework. arXiv preprint arXiv:2411.09502, 2024. 1, 2, 3, 4, 7, 9, 10

work page arXiv 2024