pith. machine review for the scientific record.

arxiv: 2110.04627 · v3 · submitted 2021-10-09 · 💻 cs.CV · cs.LG

Vector-quantized Image Modeling with Improved VQGAN

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 18:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords vector-quantized image modeling · ViT-VQGAN · autoregressive transformer · image generation · ImageNet · unsupervised representation learning · discrete tokens

The pith

An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores pretraining a Transformer to predict rasterized image tokens autoregressively, mirroring the success of next-token prediction in language models. It first improves the VQGAN encoder and codebook learning by adopting a Vision Transformer backbone, which raises reconstruction fidelity and efficiency. These higher-quality discrete tokens then support stronger unconditional and class-conditioned image generation. The same tokens also enable unsupervised pretraining of the Transformer, whose averaged intermediate features yield better linear-probe accuracy than previous image GPT models on ImageNet at 256×256 resolution.
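
To make the second stage concrete, here is a minimal sketch, assuming a frozen ViT-VQGAN has already mapped each image to a flat, raster-order sequence of integer code IDs, of training a small decoder-only Transformer with plain next-token cross-entropy. All sizes and names (vocab_size, seq_len, TinyImageGPT) are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    # Illustrative sizes only; not the paper's configuration.
    vocab_size, seq_len, d_model = 8192, 1024, 512

    class TinyImageGPT(nn.Module):
        """Decoder-only Transformer over flattened image-token IDs (sketch)."""
        def __init__(self):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, ids):  # ids: (B, L) integer tokens from the frozen tokenizer
            n = ids.size(1)
            causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
            h = self.tok_emb(ids) + self.pos_emb[:, :n]
            h = self.blocks(h, mask=causal)  # causal mask enforces raster-order AR
            return self.head(h)              # (B, L, vocab_size) next-token logits

    model = TinyImageGPT()
    ids = torch.randint(vocab_size, (2, seq_len))  # stand-in for real token grids
    logits = model(ids[:, :-1])                    # predict token t from tokens < t
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), ids[:, 1:].reshape(-1))
    loss.backward()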

Core claim

By replacing the convolutional backbone of vanilla VQGAN with a Vision Transformer and refining codebook learning, the resulting discrete tokens retain enough visual information for an autoregressive Transformer to model images effectively. When trained on ImageNet at 256×256, this vector-quantized image modeling approach achieves an Inception Score of 175.1 and a Fréchet Inception Distance of 4.17, compared with 70.6 and 17.04 for the original VQGAN. At a comparable model size, the same ImageNet-pretrained Transformer also raises linear-probe accuracy from iGPT-L's 60.3% to 73.2%.
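
The linear-probe protocol behind that 73.2% figure can be sketched generically: freeze the pretrained model, average its intermediate features per image, and train only a linear classifier on top. In the sketch below the random tensors stand in for activations captured from the frozen Transformer (in practice via forward hooks); every dimension and step count is an illustrative assumption, not the paper's recipe.

    import torch
    import torch.nn as nn

    n_images, n_layers, seq_len, d_model, n_classes = 512, 4, 1024, 512, 1000

    # Stand-in for frozen features: (images, layers, token positions, channels).
    feats = torch.randn(n_images, n_layers, seq_len, d_model)
    pooled = feats.mean(dim=(1, 2))      # average over layers and token positions

    probe = nn.Linear(d_model, n_classes)           # the only trainable parameters
    opt = torch.optim.SGD(probe.parameters(), lr=0.1)
    labels = torch.randint(n_classes, (n_images,))  # stand-in class labels

    for _ in range(10):                             # a few probe-training steps
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(pooled), labels)
        loss.backward()
        opt.step()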

What carries the argument

Improved ViT-VQGAN, a Vision Transformer-based vector-quantized generative adversarial network that encodes images into discrete tokens with higher reconstruction fidelity for use in autoregressive next-token prediction.
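
That machinery rests on a vector-quantization bottleneck. Below is a minimal sketch of the standard VQ-VAE-style quantizer the method builds on (refs. [52], [68]): nearest-codebook lookup, codebook and commitment losses, and a straight-through estimator so gradients reach the encoder. The sizes and the beta default are illustrative, not the paper's values.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Standard VQ-VAE-style bottleneck (sketch; sizes illustrative)."""
        def __init__(self, num_codes=8192, dim=32, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta  # commitment-loss weight

        def forward(self, z):  # z: (B, N, dim) encoder outputs
            w = self.codebook.weight
            # Squared distance to every code: ||z||^2 - 2 z.w + ||w||^2
            d = z.pow(2).sum(-1, keepdim=True) - 2 * z @ w.t() + w.pow(2).sum(-1)
            ids = d.argmin(dim=-1)                           # nearest code per position
            zq = self.codebook(ids)
            codebook_loss = (zq - z.detach()).pow(2).mean()  # move codes toward encodings
            commit_loss = (z - zq.detach()).pow(2).mean()    # keep encodings near codes
            zq = z + (zq - z).detach()                       # straight-through gradients
            return zq, ids, codebook_loss + self.beta * commit_loss

    vq = VectorQuantizer()
    z = torch.randn(2, 1024, 32, requires_grad=True)
    zq, ids, vq_loss = vq(z)
    vq_loss.backward()  # the commitment term carries gradient back to the encoder side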

If this is right

  • Unconditional image generation on ImageNet reaches substantially higher quality metrics than prior VQGAN methods.
  • Class-conditioned generation benefits from the same token improvements.
  • Unsupervised pretraining on the tokens yields stronger transferable image representations than earlier autoregressive vision models.
  • The approach demonstrates that discrete token modeling can close much of the performance gap between vision and language pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Better discrete representations may allow vision models to adopt the same scaling laws that have driven language model progress.
  • The method could be extended to higher resolutions or additional modalities by keeping the same tokenization and autoregressive backbone.
  • Stronger tokens might reduce reliance on massive unlabeled web datasets for competitive representation learning.

Load-bearing premise

The discrete tokens retain enough visual detail that autoregressive modeling can succeed at both high-quality generation and useful representations without critical loss of information or mode collapse.

What would settle it

If swapping the vanilla VQGAN tokens for the improved ViT-VQGAN tokens fails to lift the Inception Score above 100, or leaves linear-probe accuracy below 65%, the central claim would be falsified.
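
For reference, the two metrics this criterion leans on have standard definitions in the generative-modeling literature (restated here, not quoted from the paper). With p(y|x) an Inception classifier's label posterior, p(y) its marginal over generated samples, and (μ_r, Σ_r), (μ_g, Σ_g) Gaussian fits to Inception features of real and generated images:

\[ \mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \Big), \qquad \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big) \]

Higher IS and lower FID are better, so the criterion sets a floor under generation quality and, via the probe threshold, under representation quality.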

read the original abstract

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy from 60.3% to 73.2% for a similar model size. VIM-L also outperforms iGPT-XL which is trained with extra web image data and larger model size.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces architectural and codebook-learning improvements to a Vision-Transformer-based VQGAN (ViT-VQGAN) to produce higher-fidelity discrete image tokens. These tokens are then used for autoregressive pretraining of a Transformer under the vector-quantized image modeling (VIM) framework. On ImageNet 256×256 the method reports IS 175.1 / FID 4.17 (vs. vanilla VQGAN 70.6 / 17.04) and raises linear-probe accuracy from iGPT-L's 60.3% to 73.2%.

Significance. If the empirical numbers hold, the work shows that modest, targeted changes to the tokenizer can materially improve both high-fidelity autoregressive image synthesis and unsupervised representation quality, offering a concrete route to scale language-model-style pretraining to vision without requiring continuous latent spaces.

major comments (2)
  1. [§4] §4 (Experiments): full training schedules, optimizer settings, and data-augmentation details for both the ViT-VQGAN stage and the subsequent autoregressive Transformer are omitted, preventing independent verification of the headline IS 175.1 / FID 4.17 numbers.
  2. [Table 2] Table 2 / §4.2: no ablation isolates the contribution of the ViT encoder versus the revised codebook loss; without this decomposition it is impossible to confirm that the reported 13-point FID reduction is attributable to the claimed VQGAN improvements rather than increased model capacity or longer training.
minor comments (2)
  1. [Abstract] Abstract: “Fr'echet” should be spelled “Fréchet”.
  2. [§3.1] §3.1: the precise definition of the new commitment-loss weighting schedule is stated only in the appendix; a one-sentence summary in the main text would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major point below and will update the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): full training schedules, optimizer settings, and data-augmentation details for both the ViT-VQGAN stage and the subsequent autoregressive Transformer are omitted, preventing independent verification of the headline IS 175.1 / FID 4.17 numbers.

    Authors: We agree that these implementation details are essential for reproducibility. In the revised manuscript we will add a dedicated subsection (and supplementary material if needed) that fully specifies the training schedules, optimizer choices (Adam with explicit learning rates, betas, and weight decay), batch sizes, number of epochs/steps, and all data-augmentation pipelines used for both the ViT-VQGAN tokenizer training and the subsequent autoregressive Transformer pretraining. This will enable independent verification of the reported IS 175.1 / FID 4.17 results. revision: yes

  2. Referee: [Table 2] Table 2 / §4.2: no ablation isolates the contribution of the ViT encoder versus the revised codebook loss; without this decomposition it is impossible to confirm that the reported 13-point FID reduction is attributable to the claimed VQGAN improvements rather than increased model capacity or longer training.

    Authors: We acknowledge the need for clearer isolation of contributions. The reported gains arise from the combination of the Vision-Transformer encoder/decoder architecture and the revised codebook learning (factorized codebook plus improved commitment loss). In the revision we will add an explicit ablation that compares (i) the original CNN-based VQGAN, (ii) a ViT-based encoder/decoder with the original codebook loss, and (iii) the full ViT-VQGAN with the revised codebook loss, keeping model capacity and training duration as comparable as possible. The new results will be inserted into Table 2 or presented as an additional table in §4.2 to demonstrate that the FID improvement is attributable to the proposed changes. revision: yes
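
For orientation on what that ablation would isolate: a minimal sketch of a factorized codebook lookup, consistent with the rebuttal's "factorized codebook" (project encoder outputs to a low-dimensional code space before lookup). The l2-normalized, cosine-style matching shown here is one common variant and an assumption, not something this page specifies; all dimensions and names are illustrative, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FactorizedQuantizer(nn.Module):
        """Lookup in a low-dim, l2-normalized code space (sketch; sizes illustrative)."""
        def __init__(self, enc_dim=768, code_dim=32, num_codes=8192):
            super().__init__()
            self.down = nn.Linear(enc_dim, code_dim)  # factorization: project before lookup
            self.up = nn.Linear(code_dim, enc_dim)    # re-expand the selected code
            self.codebook = nn.Embedding(num_codes, code_dim)

        def forward(self, z):                               # z: (B, N, enc_dim)
            zl = F.normalize(self.down(z), dim=-1)          # unit-norm latents ...
            cb = F.normalize(self.codebook.weight, dim=-1)  # ... and unit-norm codes
            ids = (zl @ cb.t()).argmax(dim=-1)              # nearest = max cosine similarity
            zq = cb[ids]                                    # (B, N, code_dim) chosen codes
            zq = zl + (zq - zl).detach()                    # straight-through, as usual
            return self.up(zq), ids

    q = FactorizedQuantizer()
    out, ids = q(torch.randn(2, 1024, 768))  # out: (B, N, 768), ready for the decoder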

Circularity Check

0 steps flagged

No circularity: all results are direct empirical measurements on held-out data

full rationale

The paper reports standard Inception Score and FID metrics obtained by training the improved ViT-VQGAN and the subsequent autoregressive Transformer on ImageNet 256×256 and evaluating on the official test split. No load-bearing step reduces by construction to a fitted parameter, self-citation, or ansatz; the claimed gains are measured quantities compared against a re-implemented vanilla VQGAN baseline under identical evaluation protocols. The derivation chain is therefore self-contained experimental work.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The work rests on standard assumptions of neural network optimization and the premise that raster-order autoregressive modeling on discrete tokens is sufficient for image distributions; no new entities are postulated.

free parameters (1)
  • codebook size and commitment-loss weight
    Hyperparameters of the vector quantizer, tuned to achieve the reported reconstruction and downstream performance; their role is shown in the loss sketch below.
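
For orientation, both hyperparameters sit in the standard VQ-VAE training objective (refs. [52], [68]), restated here from the literature rather than from this paper: the codebook size fixes how many embeddings e compete in the nearest-neighbor lookup, and the commitment weight β scales the term that keeps encoder outputs z_e(x) close to their selected codes (sg[·] denotes stop-gradient):

\[ \mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta\, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2 \]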

pith-pipeline@v0.9.0 · 5601 in / 1119 out tokens · 113119 ms · 2026-05-16T18:36:26.666226+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiveGesture Streamable Co-Speech Gesture Generation Model

    cs.CV 2026-04 unverdicted novelty 7.0

    LiveGesture introduces the first fully streamable zero-lookahead co-speech full-body gesture generation model using a causal vector-quantized tokenizer and hierarchical autoregressive transformers that matches offline...

  2. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  3. Finite Scalar Quantization: VQ-VAE Made Simple

    cs.CV 2023-09 conditional novelty 7.0

    Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.

  4. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  5. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  6. ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

    cs.CV 2026-05 unverdicted novelty 6.0

    ArcVQ-VAE constrains VQ-VAE codebook vectors inside a time-dependent ball and adds angular margin loss to increase separability and codebook utilization.

  7. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  8. Visual Implicit Autoregressive Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    VIAR embeds implicit equilibrium layers in visual autoregressive models to achieve ImageNet FID 2.16 with 38.4% of VAR parameters and controllable inference compute.

  9. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  10. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  11. Frequency-Aware Flow Matching for High-Quality Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    FreqFlow introduces frequency-aware conditioning and a two-branch architecture to flow matching, reaching FID 1.38 on ImageNet-256 and outperforming DiT and SiT.

  12. CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    CRAB mitigates popularity bias in generative recommenders by rebalancing the semantic token codebook through splitting popular tokens and applying a tree-structured regularizer to boost representations for unpopular items.

  13. Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    A shared codebook with cross-view reconstruction plus fused-teacher self-distillation improves classification accuracy on incomplete multi-view multi-label data.

  14. Mirai: Autoregressive Visual Generation Needs Foresight

    cs.CV 2026-01 conditional novelty 6.0

    Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.

  15. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    cs.CV 2024-09 unverdicted novelty 6.0

    VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

  16. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  17. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  18. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 18 Pith papers · 14 internal anchors

  1. [1]

    NCP-VAE: Variational Autoencoders with Noise Contrastive Priors

    Jyoti Aneja, Alexander G. Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: variational autoencoders with noise contrastive priors. CoRR, abs/2010.02917, 2020. URL https://arxiv.org/abs/2010.02917

  2. [2]

    Learning Representations by Maximizing Mutual Information Across Views

    Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019

  3. [3]

    Towards causal benchmarking of bias in face analysis algorithms, 2020

    Guha Balakrishnan, Yuanjun Xiong, Wei Xia, and Pietro Perona. Towards causal benchmarking of bias in face analysis algorithms, 2020

  4. [4]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, and Furu Wei. Beit: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

  5. [5]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019

  6. [6]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  7. [7]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021

  9. [9]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691--1703. PMLR, 2020a

  10. [10]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597--1607. PMLR, 2020b

  11. [11]

    Big self-supervised models are strong semi-supervised learners

    Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020c

  12. [12]

    Pixelsnail: An improved autoregressive generative model

    Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. In Jennifer G. Dy and Andreas Krause (eds.), ICML, 2018

  13. [13]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020d

  14. [14]

    Very deep VAEs generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=RLRXCV6DbEJ

  15. [15]

    Semi-supervised sequence learning

    Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. Advances in neural information processing systems, 28:3079--3087, 2015

  16. [16]

    On the genealogy of machine learning datasets: A critical history of imagenet

    Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole. On the genealogy of machine learning datasets: A critical history of imagenet. Big Data & Society, 8(2):20539517211035955, 2021. doi:10.1177/20539517211035955. URL https://doi.org/10.1177/20539517211035955

  17. [17]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), NAACL-HLT, 2019

  18. [18]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233, 2021

  19. [19]

    Unsupervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422--1430, 2015

  20. [20]

    Large scale adversarial representation learning

    Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), NeurIPS, 2019a

  21. [21]

    Large scale adversarial representation learning

    Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544, 2019b

  22. [22]

    Adversarial Feature Learning

    Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016

  23. [23]

    Adversarial Feature Learning

    Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017

  24. [24]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  25. [25]

    A note on data biases in generative models

    Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. In NeurIPS 2020 Workshop on Machine Learning for Creativity and Design, 2020. URL https://arxiv.org/abs/2012.02516

  26. [26]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

  27. [27]

    Unsupervised representation learning by predicting image rotations

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018a

  28. [28]

    Unsupervised Representation Learning by Predicting Image Rotations

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018b

  29. [29]

    Generative Adversarial Nets

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  30. [30]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020

  31. [31]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729--9738, 2020

  32. [32]

    Pioneer networks: Progressively growing generative autoencoder

    Ari Heljakka, Arno Solin, and Juho Kannala. Pioneer networks: Progressively growing generative autoencoder. In Asia conference on computer vision, 2018

  33. [33]

    Data-efficient image recognition with contrastive predictive coding

    Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182--4192. PMLR, 2020

  34. [34]

    beta-VAE: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017

  35. [35]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH

  36. [36]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967--5976, 2017. doi:10.1109/CVPR.2017.632

  37. [37]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694--711. Springer, 2016

  38. [38]

    Progressive growing of GANs for improved quality, stability, and variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb

  39. [39]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  40. [40]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110--8119, 2020

  41. [41]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  42. [42]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014

  43. [43]

    Glow: Generative flow with invertible 1x1 convolutions

    Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/d139db6a236200b21cc7f752979...

  44. [44]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097--1105, 2012

  45. [45]

    Principled hybrids of generative and discriminative models

    Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pp. 87--94. IEEE, 2006

  46. [46]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  47. [47]

    Pulse: Self-supervised photo upsampling via latent space exploration of generative models

    Sachit Menon, Alex Damian, McCourt Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  48. [48]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, S. Dieleman, and P. Battaglia. Generating images with sparse representations. ICML, abs/2103.03841, 2021

  49. [49]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8162--8171. PMLR, 18--24 Jul 2021. URL https://proceedings.mlr.press/v139/nichol21a.html

  50. [50]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (eds.), ECCV, 2016a

  51. [51]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69--84. Springer, 2016b

  52. [52]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017

  53. [53]

    Dual contradistinctive generative autoencoder

    Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 823--832, June 2021

  54. [54]

    Image transformer

    Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Jennifer G. Dy and Andreas Krause (eds.), ICML, 2018

  55. [55]

    Adversarial latent autoencoders

    Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [to appear]

  56. [56]

    Unsupervised representation learning with deep convolutional generative adversarial networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Yoshua Bengio and Yann LeCun (eds.), ICLR, 2016

  57. [57]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  58. [58]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

  59. [59]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang (eds.), ICML, 2021

  60. [60]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), NeurIPS, 2019

  61. [61]

    A u-net based discriminator for generative adversarial networks

    Edgar Schonfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8207--8216, 2020

  62. [62]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596--4604. PMLR, 2018

  63. [63]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  64. [64]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), NeurIPS, 2019

  65. [65]

    Image representations learned with unsupervised pre-training contain human-like biases

    Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In The 2021 ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2021), 2021. URL https://arxiv.org/abs/2010.15052

  66. [66]

    NVAE: A deep hierarchical variational autoencoder

    Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), NeurIPS, 2020

  67. [67]

    Conditional image generation with pixelcnn decoders

    Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), NeurIPS, 2016

  68. [68]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), NeurIPS, 2017

  69. [69]

    Representation Learning with Contrastive Predictive Coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  70. [70]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  71. [71]

    The emergence of deepfake technology: A review

    Mika Westerlund. The emergence of deepfake technology: A review. Technology Innovation Management Review, 9:40--53, 2019. ISSN 1927-0321. doi:10.22215/timreview/1282. URL timreview.ca/article/1282

  72. [72]

    VAEBM: A symbiosis between variational autoencoders and energy-based models

    Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=5m3SEczOV8L

  73. [73]

    Self-attention generative adversarial networks

    Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pp. 7354--7363. PMLR, 2019a

  74. [74]

    Self-attention generative adversarial networks

    Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019b

  75. [75]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586--595, 2018