MaskAttn-SDXL: Controllable region-level text-to-image generation

2025 · cs.CV · arXiv 2509.15357

3 Pith papers cite this work. Polarity classification is still indexing.



abstract

Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and multi-object. On the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion models improve global interactions but at substantially higher compute and memory cost. In parallel, compositional reliability remains weak: models often mix attributes across objects, violate spatial relations, or omit requested entities, and these errors are not reliably reflected by global metrics such as FID or CLIP-based scores. To address these issues without changing the SDXL pipeline, we propose MaskAttn-SDXL, a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax. The gating sparsifies token-to-location interactions to suppress irrelevant bindings while preserving the pretrained backbone and standard sampling process, requiring no external supervision or inference-time editing.
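The gating step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the gates are computed, so the binary `gate` array here is a hypothetical input, and the function name `gated_cross_attention` and the single-head, unbatched shapes are assumptions made for illustration. The sketch only shows the general idea of masking cross-attention logits before the softmax so that gated-off token-to-location pairs receive zero attention weight.

```python
import numpy as np

def gated_cross_attention(q, k, v, gate, neg=-1e9):
    """Single-head cross-attention with a gate applied to the logits.

    q:    (N, d) queries, one per image location
    k, v: (T, d) keys and values, one per text token
    gate: (N, T) hypothetical binary gate; 1 lets a token bind to a location
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # (N, T) raw attention logits
    logits = np.where(gate > 0, logits, neg)      # gate BEFORE the softmax
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Note: a row whose gate is all zeros falls back to uniform weights.
    return weights @ v, weights

# Toy example: 4 locations, 3 tokens; token 2 is blocked at locations 0 and 1.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(3, 8))
v = rng.normal(size=(3, 8))
gate = np.ones((4, 3))
gate[:2, 2] = 0
out, weights = gated_cross_attention(q, k, v, gate)
```

Because the gate acts on the logits rather than on the output, the remaining tokens at each location are renormalized by the softmax, and nothing in the surrounding attention layer or the sampling loop has to change.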

representative citing papers

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

cs.CR · 2026-05-11 · unverdicted · novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

cs.CV · 2025-09-18 · unverdicted · novelty 6.0

MaskAttn-SDXL adds token-conditioned spatial gating to SDXL cross-attention to sparsify irrelevant token-to-location bindings and improve region-level controllability without retraining or inference edits.

Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

cs.CV · 2026-04-10 · unverdicted · novelty 4.0

A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.

citing papers explorer

Showing 3 of 3 citing papers.

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing
cs.CR · 2026-05-11 · unverdicted · none · ref 42 · internal anchor
MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation
cs.CV · 2025-09-18 · unverdicted · none · ref 1 · internal anchor
Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
cs.CV · 2026-04-10 · unverdicted · none · ref 8 · internal anchor
