CogView: Mastering Text-to-Image Generation via Transformers

Chang Zhou; Da Yin; Hongxia Yang; Jie Tang; Junyang Lin; Ming Ding; Wendi Zheng; Wenyi Hong; Xu Zou; Zhou Shao

arxiv: 2105.13290 · v3 · pith:ZJWEOS4Knew · submitted 2021-05-26 · 💻 cs.CV · cs.LG

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang

classification 💻 cs.CV cs.LG

keywords cogview · generation · problem · text-to-image · achieves · advance · been · billion-parameter

Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to make progress on this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Imagen Video: High Definition Video Generation with Diffusion Models
cs.CV 2022-10 unverdicted novelty 7.0

Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

Hierarchical Text-Conditional Image Generation with CLIP Latents
cs.CV 2022-04 accept novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

High-Resolution Image Synthesis with Latent Diffusion Models
cs.CV 2021-12 conditional novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
cs.CV 2025-08 unverdicted novelty 5.0

Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.

JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators
cs.CV 2026-06 unverdicted novelty 4.0

JuZhou 1.0 is a 0.387B-parameter T2I diffusion model with 4-step inference achieving 0.69 GenEval, trained on 9M Chinese pairs using Sugon K100 accelerators and deployable on Android/iOS devices.