pith. machine review for the scientific record.

arxiv: 2110.04627 · v3 · submitted 2021-10-09 · 💻 cs.CV · cs.LG

Vector-quantized Image Modeling with Improved VQGAN

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 18:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords vector-quantized image modeling · ViT-VQGAN · autoregressive transformer · image generation · ImageNet · unsupervised representation learning · discrete tokens

The pith

An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores pretraining a Transformer to predict rasterized image tokens autoregressively, mirroring the success of next-token prediction in language models. It first improves the VQGAN encoder and codebook learning by adopting a Vision Transformer backbone, which raises reconstruction fidelity and efficiency. These higher-quality discrete tokens then support stronger unconditional and class-conditioned image generation. The same tokens also enable unsupervised pretraining of the Transformer, whose averaged intermediate features yield better linear-probe accuracy than previous image GPT models on ImageNet at 256×256 resolution.
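
To make the second stage concrete, here is a minimal sketch, assuming a frozen ViT-VQGAN has already mapped each image to a flat, raster-order sequence of integer code IDs, of training a small decoder-only Transformer with plain next-token cross-entropy. All sizes and names (vocab_size, seq_len, TinyImageGPT) are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    # Illustrative sizes only; not the paper's configuration.
    vocab_size, seq_len, d_model = 8192, 1024, 512

    class TinyImageGPT(nn.Module):
        """Decoder-only Transformer over flattened image-token IDs (sketch)."""
        def __init__(self):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, ids):  # ids: (B, L) integer tokens from the frozen tokenizer
            n = ids.size(1)
            causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
            h = self.tok_emb(ids) + self.pos_emb[:, :n]
            h = self.blocks(h, mask=causal)  # causal mask enforces raster-order AR
            return self.head(h)              # (B, L, vocab_size) next-token logits

    model = TinyImageGPT()
    ids = torch.randint(vocab_size, (2, seq_len))  # stand-in for real token grids
    logits = model(ids[:, :-1])                    # predict token t from tokens < t
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), ids[:, 1:].reshape(-1))
    loss.backward()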

Core claim

By replacing the convolutional backbone of vanilla VQGAN with a Vision Transformer and refining codebook learning, the resulting discrete tokens retain enough visual information for an autoregressive Transformer to model images effectively. When trained on ImageNet at 256×256, this vector-quantized image modeling approach achieves an Inception Score of 175.1 and a Fréchet Inception Distance of 4.17, compared with 70.6 and 17.04 for the original VQGAN. At a comparable model size, the same ImageNet-pretrained Transformer also raises linear-probe accuracy from iGPT-L's 60.3% to 73.2%.
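
The linear-probe protocol behind that 73.2% figure can be sketched generically: freeze the pretrained model, average its intermediate features per image, and train only a linear classifier on top. In the sketch below the random tensors stand in for activations captured from the frozen Transformer (in practice via forward hooks); every dimension and step count is an illustrative assumption, not the paper's recipe.

    import torch
    import torch.nn as nn

    n_images, n_layers, seq_len, d_model, n_classes = 512, 4, 1024, 512, 1000

    # Stand-in for frozen features: (images, layers, token positions, channels).
    feats = torch.randn(n_images, n_layers, seq_len, d_model)
    pooled = feats.mean(dim=(1, 2))      # average over layers and token positions

    probe = nn.Linear(d_model, n_classes)           # the only trainable parameters
    opt = torch.optim.SGD(probe.parameters(), lr=0.1)
    labels = torch.randint(n_classes, (n_images,))  # stand-in class labels

    for _ in range(10):                             # a few probe-training steps
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(pooled), labels)
        loss.backward()
        opt.step()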

What carries the argument

Improved ViT-VQGAN, a Vision Transformer-based vector-quantized generative adversarial network that encodes images into discrete tokens with higher reconstruction fidelity for use in autoregressive next-token prediction.
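
That machinery rests on a vector-quantization bottleneck. Below is a minimal sketch of the standard VQ-VAE-style quantizer the method builds on (refs. [52], [68]): nearest-codebook lookup, codebook and commitment losses, and a straight-through estimator so gradients reach the encoder. The sizes and the beta default are illustrative, not the paper's values.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Standard VQ-VAE-style bottleneck (sketch; sizes illustrative)."""
        def __init__(self, num_codes=8192, dim=32, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta  # commitment-loss weight

        def forward(self, z):  # z: (B, N, dim) encoder outputs
            w = self.codebook.weight
            # Squared distance to every code: ||z||^2 - 2 z.w + ||w||^2
            d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.t() + w.pow(2).sum(-1)
            ids = d.argmin(dim=-1)                           # nearest code per position
            zq = self.codebook(ids)
            codebook_loss = (zq - z.detach()).pow(2).mean()  # move codes toward encodings
            commit_loss = (z - zq.detach()).pow(2).mean()    # keep encodings near codes
            zq = z + (zq - z).detach()                       # straight-through gradients
            return zq, ids, codebook_loss + self.beta * commit_loss

    vq = VectorQuantizer()
    z = torch.randn(2, 1024, 32, requires_grad=True)
    zq, ids, vq_loss = vq(z)
    vq_loss.backward()  # the commitment term carries gradient back to the encoder side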

If this is right

  • Unconditional image generation on ImageNet reaches substantially higher quality metrics than prior VQGAN methods.
  • Class-conditioned generation benefits from the same token improvements.
  • Unsupervised pretraining on the tokens yields stronger transferable image representations than earlier autoregressive vision models.
  • The approach demonstrates that discrete token modeling can close much of the performance gap between vision and language pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Better discrete representations may allow vision models to adopt the same scaling laws that have driven language model progress.
  • The method could be extended to higher resolutions or additional modalities by keeping the same tokenization and autoregressive backbone.
  • Stronger tokens might reduce reliance on massive unlabeled web datasets for competitive representation learning.

Load-bearing premise

The discrete tokens retain enough visual detail that autoregressive modeling can succeed at both high-quality generation and useful representations without critical loss of information or mode collapse.

What would settle it

If swapping the vanilla VQGAN tokens for the improved ViT-VQGAN tokens fails to lift the Inception Score above 100, or leaves linear-probe accuracy below 65%, the central claim would be falsified.
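
For reference, the two metrics this criterion leans on have standard definitions in the generative-modeling literature (restated here, not quoted from the paper). With p(y|x) an Inception classifier's label posterior, p(y) its marginal over generated samples, and (μ_r, Σ_r), (μ_g, Σ_g) Gaussian fits to Inception features of real and generated images:

\[ \mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \Big), \qquad \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big) \]

Higher IS and lower FID are better, so the criterion sets a floor under generation quality and, via the probe threshold, under representation quality.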

read the original abstract

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy from 60.3% to 73.2% for a similar model size. VIM-L also outperforms iGPT-XL which is trained with extra web image data and larger model size.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces architectural and codebook-learning improvements to a Vision-Transformer-based VQGAN (ViT-VQGAN) to produce higher-fidelity discrete image tokens. These tokens are then used for autoregressive pretraining of a Transformer under the vector-quantized image modeling (VIM) framework. On ImageNet 256×256 the method reports IS 175.1 / FID 4.17 (vs. vanilla VQGAN 70.6 / 17.04) and raises linear-probe accuracy from iGPT-L's 60.3% to 73.2%.

Significance. If the empirical numbers hold, the work shows that modest, targeted changes to the tokenizer can materially improve both high-fidelity autoregressive image synthesis and unsupervised representation quality, offering a concrete route to scale language-model-style pretraining to vision without requiring continuous latent spaces.

major comments (2)
  1. [§4] §4 (Experiments): full training schedules, optimizer settings, and data-augmentation details for both the ViT-VQGAN stage and the subsequent autoregressive Transformer are omitted, preventing independent verification of the headline IS 175.1 / FID 4.17 numbers.
  2. [Table 2] Table 2 / §4.2: no ablation isolates the contribution of the ViT encoder versus the revised codebook loss; without this decomposition it is impossible to confirm that the reported 13-point FID reduction is attributable to the claimed VQGAN improvements rather than increased model capacity or longer training.
minor comments (2)
  1. [Abstract] Abstract: “Fr'echet” should be spelled “Fréchet”.
  2. [§3.1] §3.1: the precise definition of the new commitment-loss weighting schedule is stated only in the appendix; a one-sentence summary in the main text would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major point below and will update the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): full training schedules, optimizer settings, and data-augmentation details for both the ViT-VQGAN stage and the subsequent autoregressive Transformer are omitted, preventing independent verification of the headline IS 175.1 / FID 4.17 numbers.

    Authors: We agree that these implementation details are essential for reproducibility. In the revised manuscript we will add a dedicated subsection (and supplementary material if needed) that fully specifies the training schedules, optimizer choices (Adam with explicit learning rates, betas, and weight decay), batch sizes, number of epochs/steps, and all data-augmentation pipelines used for both the ViT-VQGAN tokenizer training and the subsequent autoregressive Transformer pretraining. This will enable independent verification of the reported IS 175.1 / FID 4.17 results. revision: yes

  2. Referee: [Table 2] Table 2 / §4.2: no ablation isolates the contribution of the ViT encoder versus the revised codebook loss; without this decomposition it is impossible to confirm that the reported 13-point FID reduction is attributable to the claimed VQGAN improvements rather than increased model capacity or longer training.

    Authors: We acknowledge the need for clearer isolation of contributions. The reported gains arise from the combination of the Vision-Transformer encoder/decoder architecture and the revised codebook learning (factorized codebook plus improved commitment loss). In the revision we will add an explicit ablation that compares (i) the original CNN-based VQGAN, (ii) a ViT-based encoder/decoder with the original codebook loss, and (iii) the full ViT-VQGAN with the revised codebook loss, keeping model capacity and training duration as comparable as possible. The new results will be inserted into Table 2 or presented as an additional table in §4.2 to demonstrate that the FID improvement is attributable to the proposed changes. revision: yes
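
For orientation on what that ablation would isolate: a minimal sketch of a factorized codebook lookup, consistent with the rebuttal's "factorized codebook" (project encoder outputs to a low-dimensional code space before lookup). The l2-normalized, cosine-style matching shown here is one common variant and an assumption, not something this page specifies; all dimensions and names are illustrative, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FactorizedQuantizer(nn.Module):
        """Lookup in a low-dim, l2-normalized code space (sketch; sizes illustrative)."""
        def __init__(self, enc_dim=768, code_dim=32, num_codes=8192):
            super().__init__()
            self.down = nn.Linear(enc_dim, code_dim)  # factorization: project before lookup
            self.up = nn.Linear(code_dim, enc_dim)    # re-expand the selected code
            self.codebook = nn.Embedding(num_codes, code_dim)

        def forward(self, z):                               # z: (B, N, enc_dim)
            zl = F.normalize(self.down(z), dim=-1)          # unit-norm latents ...
            cb = F.normalize(self.codebook.weight, dim=-1)  # ... and unit-norm codes
            ids = (zl @ cb.t()).argmax(dim=-1)              # nearest = max cosine similarity
            zq = cb[ids]                                    # (B, N, code_dim) chosen codes
            zq = zl + (zq - zl).detach()                    # straight-through, as usual
            return self.up(zq), ids

    q = FactorizedQuantizer()
    out, ids = q(torch.randn(2, 1024, 768))  # out: (B, N, 768), ready for the decoder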

Circularity Check

0 steps flagged

No circularity: all results are direct empirical measurements on held-out data

full rationale

The paper reports standard Inception Score and FID metrics obtained by training the improved ViT-VQGAN and the subsequent autoregressive Transformer on ImageNet 256×256 and evaluating on the official test split. No load-bearing step reduces by construction to a fitted parameter, self-citation, or ansatz; the claimed gains are measured quantities compared against a re-implemented vanilla VQGAN baseline under identical evaluation protocols. The derivation chain is therefore self-contained experimental work.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The work rests on standard assumptions of neural network optimization and the premise that raster-order autoregressive modeling on discrete tokens is sufficient for image distributions; no new entities are postulated.

free parameters (1)
  • codebook size and commitment-loss weight
    Hyperparameters of the vector quantizer, tuned to achieve the reported reconstruction and downstream performance; their role is shown in the loss sketch below.
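
For orientation, both hyperparameters sit in the standard VQ-VAE training objective (refs. [52], [68]), restated here from the literature rather than from this paper: the codebook size fixes how many embeddings e compete in the nearest-neighbor lookup, and the commitment weight β scales the term that keeps encoder outputs z_e(x) close to their selected codes (sg[·] denotes stop-gradient):

\[ \mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta\, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2 \]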

pith-pipeline@v0.9.0 · 5601 in / 1119 out tokens · 113119 ms · 2026-05-16T18:36:26.666226+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiveGesture Streamable Co-Speech Gesture Generation Model

    cs.CV 2026-04 unverdicted novelty 7.0

    LiveGesture introduces the first fully streamable zero-lookahead co-speech full-body gesture generation model using a causal vector-quantized tokenizer and hierarchical autoregressive transformers that matches offline...

  2. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  3. Finite Scalar Quantization: VQ-VAE Made Simple

    cs.CV 2023-09 conditional novelty 7.0

    Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.

  4. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  5. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  6. ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

    cs.CV 2026-05 unverdicted novelty 6.0

    ArcVQ-VAE constrains VQ-VAE codebook vectors inside a time-dependent ball and adds angular margin loss to increase separability and codebook utilization.

  7. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  8. Visual Implicit Autoregressive Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    VIAR embeds implicit equilibrium layers in visual autoregressive models to achieve ImageNet FID 2.16 with 38.4% of VAR parameters and controllable inference compute.

  9. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  10. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  11. Frequency-Aware Flow Matching for High-Quality Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    FreqFlow introduces frequency-aware conditioning and a two-branch architecture to flow matching, reaching FID 1.38 on ImageNet-256 and outperforming DiT and SiT.

  12. CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    CRAB mitigates popularity bias in generative recommenders by rebalancing the semantic token codebook through splitting popular tokens and applying a tree-structured regularizer to boost representations for unpopular items.

  13. Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    A shared codebook with cross-view reconstruction plus fused-teacher self-distillation improves classification accuracy on incomplete multi-view multi-label data.

  14. Mirai: Autoregressive Visual Generation Needs Foresight

    cs.CV 2026-01 conditional novelty 6.0

    Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.

  15. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    cs.CV 2024-09 unverdicted novelty 6.0

    VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

  16. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  17. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  18. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 18 Pith papers · 14 internal anchors

  1. [1]

    NCP-VAE: Variational Autoencoders with Noise Contrastive Priors

    Jyoti Aneja, Alexander G. Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: variational autoencoders with noise contrastive priors. CoRR, abs/2010.02917, 2020. URL https://arxiv.org/abs/2010.02917

  2. [2]

    Learning Representations by Maximizing Mutual Information Across Views

    Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019

  3. [3]

    Towards causal benchmarking of bias in face analysis algorithms, 2020

    Guha Balakrishnan, Yuanjun Xiong, Wei Xia, and Pietro Perona. Towards causal benchmarking of bias in face analysis algorithms, 2020

  4. [4]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, and Furu Wei. Beit: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

  5. [5]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019

  6. [6]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  7. [7]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021

  9. [9]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691--1703. PMLR, 2020a

  10. [10]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597--1607. PMLR, 2020b

  11. [11]

    Big self-supervised models are strong semi-supervised learners

    Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020c

  12. [12]

    Pixelsnail: An improved autoregressive generative model

    Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. In Jennifer G. Dy and Andreas Krause (eds.), ICML, 2018

  13. [13]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020d

  14. [14]

    Very deep VAEs generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=RLRXCV6DbEJ

  15. [15]

    Semi-supervised sequence learning

    Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. Advances in neural information processing systems, 28:3079--3087, 2015

  16. [16]

    On the genealogy of machine learning datasets: A critical history of imagenet

    Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole. On the genealogy of machine learning datasets: A critical history of imagenet. Big Data & Society, 8(2):20539517211035955, 2021. doi:10.1177/20539517211035955. URL https://doi.org/10.1177/20539517211035955

  17. [17]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), NAACL-HLT, 2019

  18. [18]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233, 2021

  19. [19]

    Unsupervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422--1430, 2015

  20. [20]

    Large scale adversarial representation learning

    Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), NeurIPS, 2019a

  21. [21]

    Large scale adversarial representation learning

    Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544, 2019b

  22. [22]

    Adversarial Feature Learning

    Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016

  23. [23]

    Adversarial Feature Learning

    Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017

  24. [24]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  25. [25]

    A note on data biases in generative models

    Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. In NeurIPS 2020 Workshop on Machine Learning for Creativity and Design, 2020. URL https://arxiv.org/abs/2012.02516

  26. [26]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

  27. [27]

    Unsupervised representation learning by predicting image rotations

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018a

  28. [28]

    Unsupervised Representation Learning by Predicting Image Rotations

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018b

  29. [29]

    Generative Adversarial Nets

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  30. [30]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020

  31. [31]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729--9738, 2020

  32. [32]

    Pioneer networks: Progressively growing generative autoencoder

    Ari Heljakka, Arno Solin, and Juho Kannala. Pioneer networks: Progressively growing generative autoencoder. In Asia conference on computer vision, 2018

  33. [33]

    Data-efficient image recognition with contrastive predictive coding

    Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182--4192. PMLR, 2020

  34. [34]

    beta-VAE: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017

  35. [35]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH

  36. [36]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967--5976, 2017. doi:10.1109/CVPR.2017.632

  37. [37]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694--711. Springer, 2016

  38. [38]

    Progressive growing of GANs for improved quality, stability, and variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb

  39. [39]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  40. [40]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110--8119, 2020

  41. [41]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  42. [42]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014

  43. [43]

    Glow: Generative flow with invertible 1x1 convolutions

    Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/d139db6a236200b21cc7f752979...

  44. [44]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097--1105, 2012

  45. [45]

    Principled hybrids of generative and discriminative models

    Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pp. 87--94. IEEE, 2006

  46. [46]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  47. [47]

    Pulse: Self-supervised photo upsampling via latent space exploration of generative models

    Sachit Menon, Alex Damian, McCourt Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  48. [48]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, S. Dieleman, and P. Battaglia. Generating images with sparse representations. ICML, abs/2103.03841, 2021

  49. [49]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8162--8171. PMLR, 18--24 Jul 2021. URL https://proceedings.mlr.press/v139/nichol21a.html

  50. [50]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (eds.), ECCV, 2016a

  51. [51]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69--84. Springer, 2016b

  52. [52]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017

  53. [53]

    Dual contradistinctive generative autoencoder

    Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 823--832, June 2021

  54. [54]

    Image transformer

    Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Jennifer G. Dy and Andreas Krause (eds.), ICML, 2018

  55. [55]

    Adversarial latent autoencoders

    Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [to appear]

  56. [56]

    Unsupervised representation learning with deep convolutional generative adversarial networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Yoshua Bengio and Yann LeCun (eds.), ICLR, 2016

  57. [57]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  58. [58]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

  59. [59]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang (eds.), ICML, 2021

  60. [60]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), NeurIPS, 2019

  61. [61]

    A u-net based discriminator for generative adversarial networks

    Edgar Schonfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8207--8216, 2020

  62. [62]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596--4604. PMLR, 2018

  63. [63]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  64. [64]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), NeurIPS, 2019

  65. [65]

    Image representations learned with unsupervised pre-training contain human-like biases

    Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In The 2021 ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2021), 2021. URL https://arxiv.org/abs/2010.15052

  66. [66]

    NVAE: A deep hierarchical variational autoencoder

    Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), NeurIPS, 2020

  67. [67]

    Conditional image generation with pixelcnn decoders

    Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), NeurIPS, 2016

  68. [68]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), NeurIPS, 2017

  69. [69]

    Representation Learning with Contrastive Predictive Coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  70. [70]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  71. [71]

    The emergence of deepfake technology: A review

    Mika Westerlund. The emergence of deepfake technology: A review. Technology Innovation Management Review, 9:40--53, 2019. ISSN 1927-0321. doi:10.22215/timreview/1282. URL timreview.ca/article/1282

  72. [72]

    VAEBM: A symbiosis between variational autoencoders and energy-based models

    Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=5m3SEczOV8L

  73. [73]

    Self-attention generative adversarial networks

    Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pp. 7354--7363. PMLR, 2019a

  74. [74]

    Self-attention generative adversarial networks

    Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019b

  75. [75]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586--595, 2018