Recognition: 2 theorem links · Lean Theorem
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Pith reviewed 2026-05-12 04:45 UTC · model grok-4.3
The pith
Scaling an autoregressive encoder-decoder Transformer to 20 billion parameters produces high-fidelity images from text prompts that include complex compositions and world knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating image synthesis as autoregressive token prediction in the same style as machine translation, and by scaling the underlying encoder-decoder Transformer to 20 billion parameters, the model reaches zero-shot FID of 7.23 and finetuned FID of 3.22 on MS-COCO while generating images that respect complex spatial arrangements and external knowledge.
What carries the argument
An encoder-decoder Transformer, scaled to 20 billion parameters, that receives text token sequences and autoregressively outputs image token sequences; images are first encoded into those discrete tokens by ViT-VQGAN, as sketched below.
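A minimal sketch of that two-stage pipeline, under stated assumptions: the names (`text_encoder`, `decoder`, `vit_vqgan`, the BOS token, the 1024-token sequence length) are illustrative, not the paper's actual interfaces. Training is then ordinary next-token cross-entropy on the image-token sequence, exactly as in machine translation.

```python
import torch

BOS = 0  # hypothetical start-of-image-sequence token

def generate_image(prompt_ids, text_encoder, decoder, vit_vqgan, seq_len=1024):
    """Sample discrete image tokens autoregressively, conditioned on the
    encoded text, then let the frozen tokenizer render pixels."""
    memory = text_encoder(prompt_ids)                  # text tokens -> encoder states
    tokens = torch.tensor([[BOS]], dtype=torch.long)   # shape (batch=1, time)
    for _ in range(seq_len):
        logits = decoder(tokens, memory)               # (1, t, image_vocab)
        probs = logits[0, -1].softmax(dim=-1)          # distribution over next token
        nxt = torch.multinomial(probs, num_samples=1)  # stochastic decoding
        tokens = torch.cat([tokens, nxt.view(1, 1)], dim=1)
    return vit_vqgan.decode(tokens[:, 1:])             # image tokens -> pixels
```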
If this is right
- Image quality and compositional accuracy rise consistently as the Transformer grows larger.
- The model demonstrates stronger handling of world knowledge and detailed scene descriptions than prior text-to-image systems.
- Performance can be measured holistically across prompt difficulty using the new PartiPrompts benchmark.
- Limitations in current outputs define concrete targets for the next round of improvements.
Where Pith is reading between the lines
- The same scaling laws observed in language modeling appear to transfer when images are represented as token sequences.
- A single autoregressive architecture could eventually support joint text-image generation and understanding tasks.
- Future work could test whether the same tokenizer-plus-Transformer recipe extends to video or 3D content without major redesign.
Load-bearing premise
That the discrete tokens from the image tokenizer retain enough visual information and that larger model sizes will continue to reduce error rates on intricate scenes without new failure modes.
What would settle it
If further increases in model size beyond 20 billion parameters produce no additional drop in FID on MS-COCO or fail to improve accuracy on prompts that require precise multi-object spatial relationships, the claim that scaling drives the gains would be refuted.
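One way to operationalize this test: fit a power law to FID versus parameter count at the scales the paper trained (350M, 750M, 3B, 20B) and check whether larger models keep tracking the fitted curve. In the sketch below only the 20B zero-shot FID of 7.23 comes from the abstract; the smaller-scale FID values are placeholders, not paper data.

```python
import numpy as np

params = np.array([350e6, 750e6, 3e9, 20e9])  # model sizes from the paper's scaling study
fid = np.array([14.0, 11.0, 8.8, 7.23])       # placeholders except the final 7.23

# Fit log(FID) = b * log(params) + log(a), i.e. FID ~ a * params**b with b < 0.
b, log_a = np.polyfit(np.log(params), np.log(fid), 1)
predict = lambda n: np.exp(log_a) * n ** b

# The claim is refuted in the stated sense if a hypothetical >20B model's
# measured FID fails to drop below the 20B value that this curve extrapolates.
print(f"exponent b = {b:.3f}; extrapolated FID at 100B params ~ {predict(100e9):.2f}")
```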
Original abstract
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Parti model for text-to-image generation, framing the task as sequence-to-sequence modeling. Images are first tokenized into discrete sequences via a fixed ViT-VQGAN encoder; an encoder-decoder Transformer is then scaled up to 20B parameters to autoregressively predict the image-token sequence conditioned on text. The central empirical claims are consistent quality gains from scaling, a new zero-shot SOTA FID of 7.23 and finetuned FID of 3.22 on MS-COCO, plus supporting evaluations on Localized Narratives and the new PartiPrompts benchmark of >1600 prompts.
Significance. If the reported scaling behavior holds, the work demonstrates that autoregressive modeling—directly importing techniques and scaling laws from large language models—can produce high-fidelity, content-rich images without diffusion-style iterative refinement. The concrete FID numbers on standard benchmarks, the introduction of PartiPrompts, and the explicit discussion of limitations constitute clear, falsifiable contributions that can guide subsequent research. The empirical nature of the results (direct measurements on held-out data) avoids circularity.
minor comments (3)
- The abstract states that scaling yields 'consistent quality improvements,' yet no table or figure in the provided summary shows the per-scale FID or human-preference curves that would make this claim directly verifiable; such a plot should be added or referenced by section number (a reference FID computation is sketched after this list).
- The ViT-VQGAN tokenizer is described as fixed; a brief quantitative statement of its reconstruction FID or perceptual loss on the training distribution would help readers assess whether observed generation gains are tokenizer-bounded.
- The new PartiPrompts benchmark is introduced as 'holistic,' but the manuscript should explicitly list the difficulty axes (e.g., counting, spatial relations, world knowledge) and the number of prompts per axis so that future work can replicate the evaluation protocol.
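For the first comment, the quantity requested per scale is the standard Fréchet Inception Distance, FID = ||mu_real − mu_gen||^2 + Tr(S_real + S_gen − 2(S_real S_gen)^(1/2)), computed between Gaussians fitted to Inception-v3 features of real and generated images. A minimal reference implementation (not the authors' evaluation pipeline) follows:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """real_feats, gen_feats: (n_samples, dim) arrays of Inception-v3 activations."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real  # drop tiny imaginary parts from roundoff
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2.0 * covmean))
```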
Simulated Author's Rebuttal
We thank the referee for the positive summary and significance assessment. No major objections were raised, so we respond briefly to the three minor comments: we agree that per-scale FID and human-preference curves, a quantitative statement of the ViT-VQGAN tokenizer's reconstruction fidelity, and an explicit listing of PartiPrompts' difficulty axes with prompt counts would each strengthen the presentation, and all three can be addressed in revision.
Circularity Check
No circularity: results are direct empirical measurements
full rationale
The paper reports empirical outcomes from training encoder-decoder Transformers of increasing size (up to 20B parameters) on sequences of discrete image tokens produced by a fixed ViT-VQGAN tokenizer. Performance is quantified via zero-shot and finetuned FID scores on held-out MS-COCO splits plus qualitative analysis on Localized Narratives and PartiPrompts. No derivation chain, equations, or self-citations are invoked to obtain these scores; the reported improvements are measured results on external benchmarks rather than quantities forced by construction from model parameters or prior self-citations. The tokenizer is treated as an independent preprocessing step whose reconstruction fidelity is not derived from the autoregressive model itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- Transformer model size (up to 20B)
axioms (1)
- domain assumption: images can be lossily but usefully represented as sequences of discrete tokens from a ViT-VQGAN tokenizer (the quantization step this presupposes is sketched below)
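To make the assumption concrete, here is the vector-quantization step it presupposes, reduced to its core: each patch embedding is replaced by the index of its nearest codebook entry, so an image becomes a token sequence and everything outside the codebook's span is discarded. Shapes and names are illustrative, not ViT-VQGAN's actual configuration.

```python
import torch

def quantize(patch_embeds, codebook):
    """patch_embeds: (n_patches, dim); codebook: (vocab_size, dim).
    Returns discrete token ids and the lossy reconstruction of the embeddings."""
    dists = torch.cdist(patch_embeds, codebook)  # (n_patches, vocab_size) L2 distances
    token_ids = dists.argmin(dim=1)              # nearest codebook entry per patch
    recon = codebook[token_ids]                  # all the downstream model can "see"
    return token_ids, recon
```

The axiom is exactly the bet that `recon` preserves enough of `patch_embeds` for photorealistic synthesis.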
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.DAlembert.Inevitability · bilinear_family_forced / washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Parti treats text-to-image generation as a sequence-to-sequence modeling problem... scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23"
- Foundation.DimensionForcing / Foundation.LedgerForcing · dimension_forced / conservation_from_balance · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 42 Pith papers
- Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation · Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
- Prompt-to-Prompt Image Editing with Cross Attention Control · Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion · Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
- Generating HDR Video from SDR Video · A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.
- Does Engram Do Memory Retrieval in Autoregressive Image Generation? · Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
- STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models · STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
- ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models · ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
- Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models · ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
- VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection · VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot...
- Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation · KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.
- Unified Reward Model for Multimodal Understanding and Generation · UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation · Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment · ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation · A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
- Visual Instruction Tuning · LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
- Scalable Diffusion Models with Transformers · DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
- LAION-5B: An open large-scale dataset for training next generation image-text models · LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
- Imagen Video: High Definition Video Generation with Diffusion Models · Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
- DreamFusion: Text-to-3D using 2D Diffusion · Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
- L2P: Unlocking Latent Potential for Pixel Generation · L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation · FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation · FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding · CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- Threshold-Guided Optimization for Visual Generative Models · A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
- Visual Implicit Autoregressive Modeling · VIAR embeds implicit equilibrium layers in visual autoregressive models to achieve ImageNet FID 2.16 with 38.4% of VAR parameters and controllable inference compute.
- ViPO: Visual Preference Optimization at Scale · Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations · VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
- Normalizing Flows with Iterative Denoising · iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.
- Closed-Form Concept Erasure via Double Projections · A training-free double-projection linear transformation erases target concepts from generative models by computing a proxy projection then applying a constrained update in the left null space of known directions.
- From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation · EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...
- FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space · FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
- Emu3: Next-Token Prediction is All You Need · Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model · A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis · Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation · Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models · IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers · An ensemble of stage-specialized text-to-image diffusion models improves prompt alignment over single shared-parameter models while preserving visual quality and inference speed.
- Make-A-Video: Text-to-Video Generation without Text-Video Data · Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
- Towards General Preference Alignment: Diffusion Models at Nash Equilibrium · Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
- ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance · ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.
- Galactica: A Large Language Model for Science · Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Qwen-Image-2.0 Technical Report · Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
Reference graph
Works this paper leans on
- [1] Jeff Dean. Introducing Pathways: A next-generation AI architecture, 2021.
- [2] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- [3] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 2021.
- [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [5] Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
- [6] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
- [7] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- [8] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
- [9] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research. PMLR, 2021.
- [10] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022.
- [11] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- [12] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [13] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
- [14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [15] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
- [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [17] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott... Language models are few-shot learners, 2020.
- [18] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent... LaMDA: Language models for dialog applications. arXiv preprint, 2022.
- [19] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021.
- [21] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. arXiv preprint arXiv:2110.04627, 2021.
- [22] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP, pages 7829–7833. IEEE, 2020.
- [23] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
- [24] Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
- [25] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- [26] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
- [27] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019.
- [28] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.
- [29] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020.
- [30] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
- [31] Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas Huang. Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718, 2018.
- [32] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- [33] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
- [34] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- [35] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- [36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [37] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [38] Katherine Crowson. Classifier-free guidance for autoregressive transformers. 2021.
- [39] Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: A modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019.
- [40] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism, 2018.
- [41] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM, 2021.
- [42] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
- [43] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- [44] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 2021.
- [45] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.
- [46] Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Text-to-image generation grounded by fine-grained user attention. WACV, 2021.
- [47] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021.
- [48] Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
- [49] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. arXiv preprint arXiv:2111.14822, 2021.
- [50] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [51] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, 2020.
- [52] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- [53] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- [54] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
- [55] Jaemin Cho, Abhay Zala, and Mohit Bansal. DALL-Eval: Probing the reasoning skills and social biases of text-to-image generative transformers. 2022.
- [56] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021.
- [57] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
- [58] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
- [59] Michael Denkowski and Alon Lavie. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
- [60] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In ECCV, 2016.
- [61] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers, 2022.
- [62] Saehoon Kim, Sanghun Cho, Chiheon Kim, Doyup Lee, and Woonhyuk Baek. minDALL-E on Conceptual Captions. https://github.com/kakaobrain/minDALL-E, 2021.
- [63] Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. X-LXMERT: Paint, caption and answer questions with multi-modal transformers. arXiv preprint arXiv:2009.11278, 2020.
- [64] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:664–676, 2017.
- [65] Katrin Erk and Aurélie Herbelot. How to marry a star: Probabilistic constraints for meaning in context. In Proceedings of the Society for Computation in Linguistics 2021, pages 451–453, Online, February 2021. Association for Computational Linguistics.
- [66] Bob Coyne and Richard Sproat. WordsEye: An automatic text-to-scene conversion system. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001.
- [67] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
- [68] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
- [69] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. TPAMI, 2018.
- [70] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. NeurIPS, 29, 2016.
- [71] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
- [72] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Generating multiple objects at spatially distinct locations. In ICLR, 2019.
- [73] Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham, Phuc Le Khac, Luke Melas, and Ritobrata Ghosh. DALL·E Mini, July 2021.
- [74] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. arXiv preprint arXiv:2202.04200, 2022.
- [75] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- [76] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [77] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [78] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.
- [79] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
- [80] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
- [81] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.