Vector-quantized Image Modeling with Improved VQGAN
Pith reviewed 2026-05-16 18:36 UTC · model grok-4.3
The pith
An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing the convolutional backbone of vanilla VQGAN with a Vision Transformer and refining codebook learning yields discrete tokens that retain enough visual information for an autoregressive Transformer to model images effectively. Trained on ImageNet at 256×256 resolution, this vector-quantized image modeling approach achieves an Inception Score of 175.1 and a Fréchet Inception Distance of 4.17, compared with 70.6 and 17.04 for the original VQGAN. The same ImageNet-pretrained Transformer also raises linear-probe accuracy from iGPT-L's 60.3 percent to 73.2 percent at a comparable model size.
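Once the token grid is rasterized, the autoregressive stage reduces to ordinary next-token prediction. A minimal sketch of that objective in NumPy; `next_token_nll` is a hypothetical helper for illustration, not the paper's implementation:

```python
import numpy as np

def next_token_nll(logits, tokens):
    """Average negative log-likelihood of predicting each image token
    from the tokens before it, after rasterizing the token grid.

    logits: (T-1, K) predictions for positions 1..T-1; tokens: (T,) ids.
    """
    # Softmax over the K possible codebook ids at each position.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    targets = tokens[1:]  # token t is predicted from tokens < t
    return float(-np.log(p[np.arange(len(targets)), targets]).mean())

grid = np.array([[3, 1], [0, 2]])   # 2x2 grid of discrete token ids
tokens = grid.reshape(-1)           # raster order: [3, 1, 0, 2]
logits = np.zeros((3, 4))           # uniform predictions over K=4 ids
print(round(next_token_nll(logits, tokens), 4))  # prints 1.3863 (ln 4)
```

A trained Transformer lowers this loss by concentrating probability mass on the actual next token; sampling from the same model generates new token grids for the ViT-VQGAN decoder.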
What carries the argument
Improved ViT-VQGAN, a Vision Transformer-based vector-quantized generative adversarial network that encodes images into discrete tokens with higher reconstruction fidelity for use in autoregressive next-token prediction.
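Token encoding itself is a nearest-neighbor lookup into a learned codebook. A minimal sketch assuming a plain codebook; the paper's improvements (a factorized, l2-normalized codebook) change the lookup space but not the basic mechanism:

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous encoder output to its nearest codebook entry.

    z: (n, d) array of encoder outputs; codebook: (K, d) array.
    Returns (token ids, quantized vectors).
    """
    # Pairwise squared distances between latents and codebook entries.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)      # discrete token ids
    return idx, codebook[idx]    # quantized latents fed to the decoder

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, zq = quantize(z, codebook)
print(idx.tolist())  # prints [0, 1]
```

In training, gradients are passed through the non-differentiable lookup with a straight-through estimator; at pretraining time only the integer ids matter, since they are the vocabulary of the autoregressive Transformer.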
If this is right
- Unconditional image generation on ImageNet reaches substantially higher quality metrics than prior VQGAN methods.
- Class-conditioned generation benefits from the same token improvements.
- Unsupervised pretraining on the tokens yields stronger transferable image representations than earlier autoregressive vision models.
- The approach demonstrates that discrete token modeling can close much of the performance gap between vision and language pretraining.
Where Pith is reading between the lines
- Better discrete representations may allow vision models to adopt the same scaling laws that have driven language model progress.
- The method could be extended to higher resolutions or additional modalities by keeping the same tokenization and autoregressive backbone.
- Stronger tokens might reduce reliance on massive unlabeled web datasets for competitive representation learning.
Load-bearing premise
The discrete tokens retain enough visual detail that autoregressive modeling can succeed at both high-quality generation and useful representations without critical loss of information or mode collapse.
What would settle it
If replacing the vanilla VQGAN tokens with the improved ViT-VQGAN tokens fails to lift the Inception Score above 100, or leaves linear-probe accuracy below 65 percent, the central claim would be falsified.
Original abstract
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fr'echet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy from 60.3% to 73.2% for a similar model size. VIM-L also outperforms iGPT-XL which is trained with extra web image data and larger model size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces architectural and codebook-learning improvements to a Vision-Transformer-based VQGAN (ViT-VQGAN) to produce higher-fidelity discrete image tokens. These tokens are then used for autoregressive pretraining of a Transformer under the vector-quantized image modeling (VIM) framework. On ImageNet 256×256 the method reports IS 175.1 / FID 4.17 (vs. vanilla VQGAN 70.6 / 17.04) and raises linear-probe accuracy from iGPT-L’s 60.3 % to 73.2 %.
Significance. If the empirical numbers hold, the work shows that modest, targeted changes to the tokenizer can materially improve both high-fidelity autoregressive image synthesis and unsupervised representation quality, offering a concrete route to scale language-model-style pretraining to vision without requiring continuous latent spaces.
major comments (2)
- [§4] §4 (Experiments): full training schedules, optimizer settings, and data-augmentation details for both the ViT-VQGAN stage and the subsequent autoregressive Transformer are omitted, preventing independent verification of the headline IS 175.1 / FID 4.17 numbers.
- [Table 2] Table 2 / §4.2: no ablation isolates the contribution of the ViT encoder versus the revised codebook loss; without this decomposition it is impossible to confirm that the reported 13-point FID reduction is attributable to the claimed VQGAN improvements rather than increased model capacity or longer training.
minor comments (2)
- [Abstract] Abstract: “Fr'echet” should be spelled “Fréchet”.
- [§3.1] §3.1: the precise definition of the new commitment-loss weighting schedule is stated only in the appendix; a one-sentence summary in the main text would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive comments. We address each major point below and will update the manuscript accordingly to improve clarity and reproducibility.
Point-by-point responses
- Referee: [§4] §4 (Experiments): full training schedules, optimizer settings, and data-augmentation details for both the ViT-VQGAN stage and the subsequent autoregressive Transformer are omitted, preventing independent verification of the headline IS 175.1 / FID 4.17 numbers.
  Authors: We agree that these implementation details are essential for reproducibility. In the revised manuscript we will add a dedicated subsection (and supplementary material if needed) that fully specifies the training schedules, optimizer choices (Adam with explicit learning rates, betas, and weight decay), batch sizes, number of epochs/steps, and all data-augmentation pipelines used for both ViT-VQGAN tokenizer training and the subsequent autoregressive Transformer pretraining. This will enable independent verification of the reported IS 175.1 / FID 4.17 results. Revision: yes.
- Referee: [Table 2] Table 2 / §4.2: no ablation isolates the contribution of the ViT encoder versus the revised codebook loss; without this decomposition it is impossible to confirm that the reported 13-point FID reduction is attributable to the claimed VQGAN improvements rather than increased model capacity or longer training.
  Authors: We acknowledge the need for clearer isolation of contributions. The reported gains arise from the combination of the Vision-Transformer encoder/decoder architecture and the revised codebook learning (a factorized codebook plus an improved commitment loss). In the revision we will add an explicit ablation comparing (i) the original CNN-based VQGAN, (ii) a ViT-based encoder/decoder with the original codebook loss, and (iii) the full ViT-VQGAN with the revised codebook loss, keeping model capacity and training duration as comparable as possible. The new results will be inserted into Table 2 or presented as an additional table in §4.2 to demonstrate that the FID improvement is attributable to the proposed changes. Revision: yes.
Circularity Check
No circularity: all results are direct empirical measurements on held-out data
Full rationale
The paper reports standard Inception Score and FID metrics obtained by training the improved ViT-VQGAN and the subsequent autoregressive Transformer on ImageNet 256×256 and evaluating on the official test split. No load-bearing step reduces by construction to a fitted parameter, self-citation, or ansatz; the claimed gains are measured quantities compared against a re-implemented vanilla VQGAN baseline under identical evaluation protocols. The derivation chain is therefore self-contained experimental work.
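For reference, the FID cited throughout is the Fréchet distance between two Gaussians fitted to Inception-v3 features of real and generated images. A sketch restricted to diagonal covariances, where the matrix square root becomes elementwise (real FID uses full covariance matrices):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between Gaussians with diagonal covariance.

    mu*, var*: (d,) feature means and per-dimension variances.
    Illustrative simplification of the general formula
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    """
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = (var1 + var2 - 2.0 * np.sqrt(var1 * var2)).sum()
    return float(mean_term + cov_term)

# Identical feature statistics give distance 0; a mean shift does not.
mu, var = np.zeros(3), np.ones(3)
print(fid_diagonal(mu, var, mu, var))        # prints 0.0
print(fid_diagonal(mu, var, mu + 1.0, var))  # prints 3.0
```

Lower is better: a drop from 17.04 to 4.17 means the generated feature distribution moved much closer to the real one under this metric.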
Axiom & Free-Parameter Ledger
free parameters (1)
- codebook size and commitment loss weight
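Both free parameters enter the standard VQ-VAE objective that ViT-VQGAN builds on. A sketch of the vanilla formulation of van den Oord et al.; the paper's factorized, l2-normalized codebook changes the lookup but these knobs remain:

```latex
\mathcal{L}_{\mathrm{VQ}}
  = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \operatorname{sg}[z_e(x)] - e \rVert_2^2}_{\text{codebook update}}
  + \beta\,\underbrace{\lVert z_e(x) - \operatorname{sg}[e] \rVert_2^2}_{\text{commitment}},
\qquad e \in \{e_1, \dots, e_K\}
```

Here sg is the stop-gradient operator, e is the codebook entry nearest to the encoder output z_e(x), β is the commitment-loss weight, and K is the codebook size.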
Forward citations
Cited by 18 Pith papers
- LiveGesture Streamable Co-Speech Gesture Generation Model
  LiveGesture introduces the first fully streamable zero-lookahead co-speech full-body gesture generation model using a causal vector-quantized tokenizer and hierarchical autoregressive transformers that matches offline...
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- Finite Scalar Quantization: VQ-VAE Made Simple
  Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
- InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
  InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
- ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin
  ArcVQ-VAE constrains VQ-VAE codebook vectors inside a time-dependent ball and adds angular margin loss to increase separability and codebook utilization.
- MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
  MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
- Visual Implicit Autoregressive Modeling
  VIAR embeds implicit equilibrium layers in visual autoregressive models to achieve ImageNet FID 2.16 with 38.4% of VAR parameters and controllable inference compute.
- End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
  An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
  VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
- Frequency-Aware Flow Matching for High-Quality Image Generation
  FreqFlow introduces frequency-aware conditioning and a two-branch architecture to flow matching, reaching FID 1.38 on ImageNet-256 and outperforming DiT and SiT.
- CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation
  CRAB mitigates popularity bias in generative recommenders by rebalancing the semantic token codebook through splitting popular tokens and applying a tree-structured regularizer to boost representations for unpopular items.
- Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation
  A shared codebook with cross-view reconstruction plus fused-teacher self-distillation improves classification accuracy on incomplete multi-view multi-label data.
- Mirai: Autoregressive Visual Generation Needs Foresight
  Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
  VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
- Make-A-Video: Text-to-Video Generation without Text-Video Data
  Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.