pith. sign in

arxiv: 2105.13290 · v3 · pith:ZJWEOS4Knew · submitted 2021-05-26 · 💻 cs.CV · cs.LG

CogView: Mastering Text-to-Image Generation via Transformers

classification 💻 cs.CV cs.LG
keywords cogviewgenerationproblemtext-to-imageachievesadvancebeenbillion-parameter
0
0 comments X
read the original abstract

Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  2. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  3. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  4. Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

    cs.LG 2026-06 unverdicted novelty 6.0

    MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

  5. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  6. Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    cs.CV 2025-08 unverdicted novelty 5.0

    Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.

  7. JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

    cs.CV 2026-06 unverdicted novelty 4.0

    JuZhou 1.0 is a 0.387B-parameter T2I diffusion model with 4-step inference achieving 0.69 GenEval, trained on 9M Chinese pairs using Sugon K100 accelerators and deployable on Android/iOS devices.