UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

Guohao Li; Hua Wu; Hu Yang; Jiachen Liu; Qiaoqiao She; Wei Li; Xinyan Xiao; Xue Xu; Yajuan Lyu; Zhanpeng Wang

arxiv: 2210.16031 · v3 · pith:KC6YMKX7new · submitted 2022-10-28 · 💻 cs.CV · cs.CL


Wei Li, Xue Xu, Xinyan Xiao, Jiachen Liu, Hu Yang, Guohao Li, Zhanpeng Wang, Zhifan Feng, Qiaoqiao She, Yajuan Lyu, Hua Wu



keywords: generation, image, upainting, model, complex, diffusion, models, simple



Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include text-conditional diffusion models and cross-modal guided diffusion models, which are good at small-scene image generation and complex-scene image generation, respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify simple and complex scene image generation, as shown in Figure 1. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates cross-modal guidance from a pretrained image-text matching model into a text-conditional diffusion model that uses a pretrained Transformer language model as the text encoder. Our key finding is that combining the power of a large-scale Transformer language model in understanding language with that of an image-text matching model in capturing cross-modal semantics and style is effective for improving the sample fidelity and image-text alignment of generated images. In this way, UPainting has a more general image generation capability and can generate images of both simple and complex scenes more effectively. To comprehensively compare text-to-image models, we further create a more general benchmark, UniBench, with well-written Chinese and English prompts in both simple and complex scenes. We compare UPainting with recent models and find that UPainting greatly outperforms other models in terms of caption similarity and image fidelity in both simple and complex scenes. UPainting project page: https://upainting.github.io/
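
The idea of adding cross-modal guidance to a text-conditional diffusion model can be illustrated with a small sketch. It is a generic, textbook-style formulation and not UPainting's actual implementation: it applies classifier-free guidance to the noise prediction, then shifts the result along the gradient of an image-text similarity score, as in classifier-style guidance. All function names, parameters, and numbers below are hypothetical, and the "gradient" is a made-up vector standing in for what an image-text matching model would provide.

```python
import math


def classifier_free_guidance(eps_cond, eps_uncond, weight):
    """Extrapolate from the unconditional prediction toward the text-conditional one."""
    return [u + weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def cross_modal_guidance(eps, sim_grad, alpha_bar_t, scale):
    """Shift the noise prediction along the gradient of an image-text similarity.

    Follows the classifier-guidance form eps - scale * sqrt(1 - alpha_bar_t) * grad,
    with a similarity score standing in for the classifier log-probability
    (an assumption made for this sketch).
    """
    coeff = scale * math.sqrt(1.0 - alpha_bar_t)
    return [e - coeff * g for e, g in zip(eps, sim_grad)]


def guided_eps(eps_cond, eps_uncond, sim_grad, alpha_bar_t, cfg_weight, guidance_scale):
    """Combine both guidance signals into one noise prediction."""
    eps = classifier_free_guidance(eps_cond, eps_uncond, cfg_weight)
    return cross_modal_guidance(eps, sim_grad, alpha_bar_t, guidance_scale)


# Toy 2-D example with made-up numbers (not outputs of any real model).
result = guided_eps(
    eps_cond=[0.5, -0.2],
    eps_uncond=[0.1, 0.0],
    sim_grad=[1.0, -1.0],   # hypothetical gradient of similarity w.r.t. the noisy image
    alpha_bar_t=0.75,       # hypothetical cumulative noise-schedule value at step t
    cfg_weight=2.0,
    guidance_scale=0.5,
)
print(result)
```

In a real sampler, the guidance scale could vary across timesteps, which is one plausible reading of the "diverse guidance schedules" mentioned in the abstract. The abstract itself does not specify the schedule.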



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
cs.CV 2023-07 unverdicted novelty 7.0

A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
cs.CV 2026-05 unverdicted novelty 6.0

StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
cs.LG 2026-03 unverdicted novelty 6.0

EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...

Directly Fine-Tuning Diffusion Models on Differentiable Rewards
cs.CV 2023-09 conditional novelty 6.0

DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.