RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., Canny edge maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules to achieve a better balance between structural alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free controllable generation that is both structure-rich and appearance-rich. Extensive experiments demonstrate that our method achieves state-of-the-art performance under complex and diverse conditions. Owing to its generality, our framework naturally supports compositional conditional generation and generalizes across architectures in a plug-and-play manner, from UNet-based diffusion models to modern DiT backbones such as FLUX.
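To make the idea of decoupling concrete, the following is a toy sketch, not the paper's implementation. It only illustrates what it means for the timestep at which condition features are sampled to differ from the current denoising timestep. All names (`denoising_timesteps`, `coupled`, `make_fixed`, `make_scaled`, `injection_plan`) and the three example schedules are hypothetical and chosen for illustration; the abstract does not specify which schedules the authors actually use.

```python
from typing import Callable, List, Tuple

# Hypothetical type: maps a denoising timestep t to the timestep s at which
# condition features would be sampled for injection at that step.
Schedule = Callable[[int], int]


def denoising_timesteps(num_steps: int, t_max: int = 1000) -> List[int]:
    """Evenly spaced denoising timesteps, from high noise to low noise."""
    stride = t_max // num_steps
    return [t_max - 1 - i * stride for i in range(num_steps)]


def coupled(t: int) -> int:
    """Coupled baseline: condition features share the denoising timestep."""
    return t


def make_fixed(s: int) -> Schedule:
    """Decoupled example: always sample condition features at timestep s."""
    return lambda t: s


def make_scaled(ratio: float) -> Schedule:
    """Decoupled example: sample condition features at a scaled timestep."""
    return lambda t: int(t * ratio)


def injection_plan(steps: List[int], schedule: Schedule) -> List[Tuple[int, int]]:
    """Pair each denoising timestep with its condition-feature timestep."""
    return [(t, schedule(t)) for t in steps]


if __name__ == "__main__":
    steps = denoising_timesteps(4)
    for name, sched in [("coupled", coupled),
                        ("fixed", make_fixed(0)),
                        ("scaled", make_scaled(0.5))]:
        print(name, injection_plan(steps, sched))
```

In a real pipeline, each pair `(t, s)` would presumably drive two forward passes: one extracting features from the condition image noised to level `s`, and one denoising the generated latent at level `t` with those features injected. That wiring is an assumption here and is omitted.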