ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

Haifeng Wang; Hao Tian; Hua Wu; Jiaxiang Liu; Lanxin Li; Li Chen; Shikun Feng; Weichong Yin; Xintong Yu; Xuyi Chen

arxiv: 2210.15257 · v2 · pith:OLB7QIXLnew · submitted 2022-10-27 · 💻 cs.CV · cs.AI

Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Li Chen, Hao Tian, Hua Wu, Haifeng Wang

keywords: diffusion, ernie-vilg, text-to-image, denoising, different, fidelity, image, images

Recent progress in diffusion models has revolutionized the popular technology of text-to-image generation. While existing approaches can produce photorealistic high-resolution images with text conditions, several open problems remain that limit further improvement of image fidelity and text relevancy. In this paper, we propose ERNIE-ViLG 2.0, a large-scale Chinese text-to-image diffusion model, which progressively upgrades the quality of generated images by: (1) incorporating fine-grained textual and visual knowledge of key elements in the scene, and (2) utilizing different denoising experts at different denoising stages. With the proposed mechanisms, ERNIE-ViLG 2.0 not only achieves a new state-of-the-art on MS-COCO with a zero-shot FID score of 6.75, but also significantly outperforms recent models in terms of image fidelity and image-text alignment in side-by-side human evaluation on the bilingual prompt set ViLG-300.
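
The second mechanism, using different denoising experts at different denoising stages, can be illustrated with a small routing sketch. This is only a sketch under stated assumptions: it assumes the timesteps are split into equal, contiguous stages with one expert per stage, and the function name `expert_for_timestep` and the defaults of 1000 timesteps and 10 experts are hypothetical, not taken from the abstract.

```python
from collections import Counter


def expert_for_timestep(t: int, num_timesteps: int = 1000, num_experts: int = 10) -> int:
    """Return the index of the expert assumed to handle denoising step t.

    Assumption (not stated in the abstract): the timesteps 0..num_timesteps-1
    are split into num_experts contiguous stages of (nearly) equal size.
    """
    if not 1 <= num_experts <= num_timesteps:
        raise ValueError("num_experts must be between 1 and num_timesteps")
    if not 0 <= t < num_timesteps:
        raise ValueError("t must satisfy 0 <= t < num_timesteps")
    # Integer arithmetic maps each contiguous block of steps to one expert.
    return t * num_experts // num_timesteps


# Count how many denoising steps each hypothetical expert would handle.
steps_per_expert = Counter(expert_for_timestep(t) for t in range(1000))
```

With these assumed defaults, each expert handles one contiguous block of steps, so only one expert's parameters are active at any given denoising step.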

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Temporally Extended Mixture-of-Experts Models
cs.LG 2026-04 unverdicted novelty 6.0

Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

Shap-E: Generating Conditional 3D Implicit Functions
cs.CV 2023-05 accept novelty 6.0

Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.