MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

Anzhe Cheng; Jiahao Chen; Paul Bogdan; Yu Chang

arxiv: 2509.15357 · v2 · pith:DLUMBWE4new · submitted 2025-09-18 · 💻 cs.CV · cs.LG

Yu Chang, Jiahao Chen, Anzhe Cheng, Paul Bogdan

Pith reviewed 2026-05-21 21:30 UTC · model grok-4.3

keywords text-to-image · diffusion models · cross-attention · spatial gating · controllable generation · multi-object prompts · SDXL

The pith

MaskAttn-SDXL injects token-conditioned spatial gating into SDXL cross-attention to enable better region-level control in text-to-image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MaskAttn-SDXL as a plug-in module for SDXL diffusion models. It adds token-conditioned spatial gating to the cross-attention logits before the softmax operation. This sparsifies the interactions between tokens and image locations to suppress bindings that are not relevant to the prompt. The approach aims to reduce common errors like mixing attributes between objects or violating spatial relations in multi-object scenes. A sympathetic reader would care because it addresses these issues without modifying the pretrained backbone, changing the sampling process, or requiring any external supervision or additional inference steps.

Core claim

MaskAttn-SDXL is a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax. The gating sparsifies token-to-location interactions to suppress irrelevant bindings while preserving the pretrained backbone and standard sampling process, requiring no external supervision or inference-time editing.

What carries the argument

Token-conditioned spatial gating applied to cross-attention logits to sparsify token-to-location interactions.
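The gating step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the hard additive form that the paper passage quoted on this page describes (0 where a gate is on, −∞ where it is off) and a softmax over text tokens at each image location. The function name `gated_cross_attention_weights`, the boolean `gate_on` input, and the all-zero fallback for a location with every gate off are hypothetical choices made for the example; how the gates themselves are produced is not covered.

```python
import math

NEG_INF = float("-inf")


def gated_cross_attention_weights(logits, gate_on):
    """Apply an additive 0 / -inf mask to cross-attention logits, then softmax.

    logits:  N rows (image locations), each with T floats (one per text token).
    gate_on: N x T booleans; True keeps the token-to-location interaction.
    Returns N x T attention weights.
    """
    weights = []
    for row_logits, row_gate in zip(logits, gate_on):
        # Additive mask: +0 where the gate is on, -inf where it is off.
        masked = [s + (0.0 if g else NEG_INF) for s, g in zip(row_logits, row_gate)]
        finite = [m for m in masked if m != NEG_INF]
        if not finite:
            # Assumption for this sketch only: a location with every gate off
            # gets all-zero weights, since the softmax is undefined there.
            weights.append([0.0] * len(masked))
            continue
        peak = max(finite)  # subtract the max for numerical stability
        exps = [math.exp(m - peak) if m != NEG_INF else 0.0 for m in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Gated-off tokens receive exactly zero weight, and the remaining weights are renormalized over the tokens that stay on, which is the sense in which the interaction pattern becomes sparse.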

If this is right

Multi-object prompts produce images with fewer attribute mixing errors and better spatial adherence.
The pretrained SDXL model remains unchanged and continues to use standard sampling.
Generation requires no external supervision or post-inference editing.
Global metrics like FID may not capture the improvements in compositional reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This gating approach might generalize to other attention-based diffusion models beyond SDXL.
Region-level control could enable more precise editing tasks in image synthesis pipelines.
Testing on prompts with increasing numbers of objects could reveal scalability limits of the sparsification.

Load-bearing premise

Injecting token-conditioned spatial gating into cross-attention logits will reliably reduce attribute mixing and spatial violations in multi-object prompts without introducing new artifacts or lowering overall image quality.

What would settle it

Generating images from multi-object prompts using both baseline SDXL and MaskAttn-SDXL and measuring the rate of attribute mixing or spatial violations through human evaluation or automated metrics; if the rates remain similar, the claim does not hold.
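One way to make "the rates remain similar" concrete is a two-proportion z-test on error counts from the two models. The sketch below is an assumption about how such a comparison could be run, not a procedure taken from the paper; the function name `two_proportion_z` and its arguments are hypothetical, and the counts would come from your own human or automated annotation of generated images.

```python
import math


def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """Compare two error rates with a pooled two-proportion z-test.

    errors_a / n_a: images with a compositional error, and images judged, for model A.
    errors_b / n_b: the same counts for model B.
    Returns (rate difference A - B, z statistic, two-sided p-value).
    """
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both rates are exactly 0 or exactly 1: no evidence of a difference.
        return p_a - p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_a - p_b, z, p_value
```

Calling it with baseline SDXL as model A and MaskAttn-SDXL as model B gives a positive difference when the gated model makes fewer errors. The normal approximation is only reasonable when each model has a fair number of both error and non-error images.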

Original abstract

Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and multi-object. On the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion models improve global interactions but at substantially higher compute and memory cost. In parallel, compositional reliability remains weak: models often mix attributes across objects, violate spatial relations, or omit requested entities, and these errors are not reliably reflected by global metrics such as FID or CLIP-based scores. To address these issues without changing the SDXL pipeline, we propose MaskAttn-SDXL, a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax. The gating sparsifies token-to-location interactions to suppress irrelevant bindings while preserving the pretrained backbone and standard sampling process, requiring no external supervision or inference-time editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MaskAttn-SDXL adds token-conditioned spatial gating to SDXL cross-attention logits as a plug-in to reduce attribute mixing in multi-object prompts, but the description supplies no results or targeted tests to confirm it works.

The paper's central proposal is a MaskAttn module that injects token-conditioned spatial gating directly into the cross-attention logits of SDXL before the softmax. The gating is meant to sparsify token-to-location interactions and suppress irrelevant bindings that produce attribute mixing or spatial errors, all while leaving the pretrained backbone and standard sampling process alone. No external supervision or inference-time edits are required, which keeps the change lightweight and easy to apply on top of existing SDXL setups. That preservation of the original pipeline is the clearest practical advantage here.

The motivation also lines up with a real limitation: U-Net locality makes global coordination difficult, and the paper correctly notes that FID and CLIP scores do not reliably flag the compositional failures it targets. The specific combination of token-conditioned gating applied at the logit level in SDXL appears distinct from earlier attention modifications referenced in the abstract.

The main weakness is the absence of evidence. The abstract describes the intended mechanism but provides no quantitative results, ablation studies, or comparisons against other methods. Because the paper itself states that global metrics miss attribute mixing and omitted entities, any evaluation would need custom compositional benchmarks or controlled qualitative tests to show the gating actually delivers the claimed suppression rather than trading one failure mode for another. Without those, it is hard to judge whether the approach improves reliability or introduces new artifacts.

This work would interest researchers extending diffusion models for design tools or media production who want a simple add-on rather than a full architecture change. A reader focused on practical compositionality fixes for SDXL could extract value from the idea if the experiments are solid. I would send it to peer review. The proposal is clear, the problem is well-motivated, and the plug-in framing makes it worth referee time even if the current evidence needs strengthening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MaskAttn-SDXL, a plug-in module for the SDXL diffusion model. It injects token-conditioned spatial gating into cross-attention logits before softmax to sparsify token-to-location interactions. The goal is to suppress irrelevant bindings and reduce attribute mixing and spatial violations in multi-object prompts, while preserving the pretrained backbone and standard sampling and requiring no external supervision or inference-time editing. The work targets limitations of U-Net locality versus Transformer cost and notes that global metrics like FID or CLIP scores fail to capture compositional errors.

Significance. If the gating mechanism can be shown to improve compositional reliability without degrading quality or introducing new artifacts, the contribution would be a lightweight, supervision-free enhancement to existing diffusion pipelines for region-level control. This would be valuable for practical multi-object generation tasks.

major comments (2)

[Abstract and §5 (Evaluation)] The manuscript explicitly states that global metrics such as FID or CLIP-based scores do not reliably reflect attribute mixing, spatial violations, or omitted entities. If the reported experiments rely primarily on these metrics or qualitative examples rather than targeted compositional benchmarks (e.g., attribute-binding accuracy or spatial-relation tests), the central claim of suppressed irrelevant bindings lacks direct verification and risks trading one failure mode for another.
[§4.1 (MaskAttn module)] The token-conditioned spatial gating is described as sparsifying interactions to suppress irrelevant bindings while leaving the SDXL backbone and sampling unchanged. The exact formulation of the gating function, its conditioning on tokens, and any introduced parameters or hyperparameters must be shown to preserve the original attention distribution properties; otherwise the claim that the method requires 'no external supervision' and maintains quality is not yet load-bearing.

minor comments (2)

[§4] Notation in §4: Define the gating mask variable consistently across equations and text to avoid ambiguity in how token conditioning is applied before the softmax.
[Figures] Figure captions: Ensure all qualitative examples include prompt text and highlight the specific compositional improvement being illustrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the manuscript will be updated to strengthen the presentation of our method and results.

Referee: [Abstract and §5 (Evaluation)] The manuscript explicitly states that global metrics such as FID or CLIP-based scores do not reliably reflect attribute mixing, spatial violations, or omitted entities. If the reported experiments rely primarily on these metrics or qualitative examples rather than targeted compositional benchmarks (e.g., attribute-binding accuracy or spatial-relation tests), the central claim of suppressed irrelevant bindings lacks direct verification and risks trading one failure mode for another.

Authors: We agree that global metrics such as FID and CLIP scores are insufficient for verifying compositional improvements. The manuscript explicitly notes this limitation and our §5 evaluation centers on qualitative results across diverse multi-object prompts to illustrate reduced attribute mixing and better spatial control. To directly address the concern and provide stronger verification of suppressed irrelevant bindings, we have added quantitative results from targeted compositional benchmarks (attribute-binding accuracy and spatial-relation tests) to the revised manuscript.
revision: yes

Referee: [§4.1 (MaskAttn module)] The token-conditioned spatial gating is described as sparsifying interactions to suppress irrelevant bindings while leaving the SDXL backbone and sampling unchanged. The exact formulation of the gating function, its conditioning on tokens, and any introduced parameters or hyperparameters must be shown to preserve the original attention distribution properties; otherwise the claim that the method requires 'no external supervision' and maintains quality is not yet load-bearing.

Authors: Section 4.1 presents the precise formulation of the token-conditioned spatial gating, which modulates cross-attention logits prior to softmax and is conditioned on the input text token embeddings. Only a small set of additional parameters is introduced and optimized without external supervision or masks. In the revision we have expanded this section with further derivation and analysis confirming that the gating preserves the original attention distribution properties (relative weight ordering and normalization) for relevant tokens while sparsifying irrelevant ones, thereby supporting the claims of unchanged backbone behavior and maintained quality.
revision: partial
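The preservation property claimed in the last response can be checked by hand for the simplest case. The derivation below is a sketch under a stated assumption, namely a hard additive mask with values 0 or −∞ and a softmax over text tokens at each image location; it is not reproduced from the paper. The symbols s_{i,t} (ungated logit), a_{i,t} (gated attention weight), and G_i (set of tokens gated on at location i) are introduced here for illustration.

```latex
% Assumption: hard additive mask M^l(i,t) \in \{0, -\infty\}; softmax over text tokens t.
\[
G_i = \{\, t : M^l(i,t) = 0 \,\}, \qquad
a_{i,t}
= \frac{\exp\!\big(s_{i,t} + M^l(i,t)\big)}{\sum_{t'=1}^{T} \exp\!\big(s_{i,t'} + M^l(i,t')\big)}
= \begin{cases}
\dfrac{\exp(s_{i,t})}{\sum_{t' \in G_i} \exp(s_{i,t'})} & t \in G_i, \\[1.5ex]
0 & t \notin G_i .
\end{cases}
\]
% Normalization holds, and ratios among retained tokens match the ungated softmax.
\[
\sum_{t=1}^{T} a_{i,t} = 1, \qquad
\frac{a_{i,t}}{a_{i,u}} = \exp\!\big(s_{i,t} - s_{i,u}\big)
\quad \text{for } t, u \in G_i .
\]
```

Under this assumption the weights still sum to one and the ordering among retained tokens is unchanged, although each retained weight grows because the denominator shrinks. The argument requires G_i to be non-empty, and a soft or learned relaxation of the mask would need a separate analysis.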

Circularity Check

0 steps flagged

Architectural addition with no reduction to fitted inputs or self-citation chains

The paper proposes MaskAttn-SDXL as a plug-in module that adds token-conditioned spatial gating to cross-attention logits in the existing SDXL pipeline. This is presented as a design choice whose intended effect (sparsifying interactions to suppress irrelevant bindings) is a claimed consequence rather than a quantity presupposed by the equations or fitted from target metrics. No load-bearing step reduces by construction to the performance claims, and global metrics like FID/CLIP are explicitly noted as insufficient, with the method relying on the architectural intervention itself. The derivation is self-contained as an engineering proposal without self-definitional loops, renamed known results, or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based solely on the abstract, the central addition is the new gating module; no explicit free parameters, ad-hoc axioms, or invented physical entities are described.

axioms (1)

standard math · Standard diffusion model assumptions and cross-attention behavior in U-Net architectures remain valid after the plug-in modification.
The paper states it preserves the pretrained backbone and sampling process.

invented entities (1)

MaskAttn module · no independent evidence
purpose: To inject token-conditioned spatial gating into cross-attention logits
Newly introduced component whose effectiveness is asserted in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: we introduce a learnable, additive mask matrix M^l ∈ R^{N×T} directly to the attention logits... M^l(i,t) = 0 if gate on, −∞ if gate off (Eq. 1–4)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: MaskAttn-SDXL... preserves the pretrained backbone and standard sampling process, requiring no external supervision

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
cs.CR 2026-05 unverdicted novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
cs.CV 2026-04 unverdicted novelty 4.0

A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

INTRODUCTION Despite remarkable advances in text-to-image generation, state-of-the-art models still struggle to compose multiple objects, attributes, and spatial constraints faithfully [1, 2, 3]. Recent studies report that a primary failure mode is generating images that do not accurately reflect the input prompt’s composition [4]. For example, a prom...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Architecture Overview Our MaskAttn-SDXL extends the SDXL latent diffusion pipeline with masked cross-attention gates

METHOD 2.1. Architecture Overview Our MaskAttn-SDXL extends the SDXL latent diffusion pipeline with masked cross-attention gates. The overview architecture is shown in Fig. 2. Similar to SDXL, our model first encodes an input image into a compact latent using a pretrained Variational AutoEncoder (VAE) [10]. Then, we introduce Gaussian noise into this la...

work page
[3]

The experiments are designed to validate the model’s effectiveness in mitigating cross-token interference and enhancing compositional control in text-to-image generation

EXPERIMENTS To rigorously assess our approach and enable meaningful comparisons with state-of-the-art diffusion models, we examine our MaskAttn-SDXL and baseline methods on both MS COCO 2014-30K [11] and Flickr30k [12] datasets. The experiments are designed to validate the model’s effectiveness in mitigating cross-token interference and enhancing co...

work page 2014
[4]

The approach adds small token-conditioned gate heads while leaving the pretrained backbone, text encoders, and sampling path unchanged

CONCLUSION We addressed a recurring weakness of text-to-image diffusion, namely cross-token interference under multi-entity prompts, by proposing MaskAttn-SDXL, injecting a simple yet effective gating mechanism that operates directly on cross-attention logits in SDXL’s mid-resolution blocks. The approach adds small token-conditioned gate heads while ...

work page
[5]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu, “T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 78723–78747, 2023

work page 2023
[6]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” in ACM SIGGRAPH Conference Proceedings. 2023, ACM

work page 2023
[7]

Geneval: An object-focused framework for evaluating text-to-image alignment,

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt, “Geneval: An object-focused framework for evaluating text-to-image alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 52132–52152, 2023

work page 2023
[8]

Improving compositional attribute binding in text-to-image generative models via enhanced text embeddings,

Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, and Soheil Feizi, “Improving compositional attribute binding in text-to-image generative models via enhanced text embeddings,” arXiv preprint arXiv:2406.07844, 2024

work page arXiv 2024
[9]

Sdxl: Improving latent diffusion models for high-resolution image synthesis,

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” 2023

work page 2023
[10]

Gligen: Open-set grounded text-to-image generation,

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee, “Gligen: Open-set grounded text-to-image generation,” arXiv preprint arXiv:2301.07093, 2023

work page arXiv 2023
[11]

Adding conditional control to text-to-image diffusion models,

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, “Adding conditional control to text-to-image diffusion models,” in CVPR, 2023

work page 2023
[12]

High-resolution image synthesis with latent diffusion models,

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695

work page 2022
[13]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Auto-encoding variational bayes,

Diederik P. Kingma and Max Welling, “Auto-encoding variational bayes,” in Proceedings of the International Conference on Learning Representations (ICLR), 2014

work page 2014
[15]

Microsoft coco: Common objects in context,

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV). 2014, pp. 740–755, Springer, Cham

work page 2014
[16]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2641–2649

work page 2015
[17]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017

work page 2017
[18]

Assessing generative models via precision and recall,

Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly, “Assessing generative models via precision and recall,” Advances in neural information processing systems, vol. 31, 2018

work page 2018
[19]

Clipscore: A reference-free evaluation metric for image captioning,

Jack Hessel et al., “Clipscore: A reference-free evaluation metric for image captioning,” 2021

work page 2021
[20]

High-resolution image synthesis with latent diffusion models,

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[21]

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,” arXiv preprint arXiv:2305.01569, 2023

work page arXiv 2023
