{"paper":{"title":"Vector-quantized Image Modeling with Improved VQGAN","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Alexander Ku, Han Zhang, James Qin, Jason Baldridge, Jiahui Yu, Jing Yu Koh, Ruoming Pang, Xin Li, Yonghui Wu, Yuanzhong Xu","submitted_at":"2021-10-09T18:36:00Z","abstract_excerpt":"Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learnin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the discrete tokens produced by the improved ViT-VQGAN retain enough visual information for autoregressive modeling to succeed at both high-quality generation and strong unsupervised representations without critical loss of detail or mode collapse.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b4daf46dd03dca15b66d07073bf100f3c56513e2ce4f99fd67b364bc3e1ab25b"},"source":{"id":"2110.04627","kind":"arxiv","version":3},"verdict":{"id":"68ed06e5-14f0-41ec-ba4d-07d5e8bc5c83","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T18:36:26.666226Z","strongest_claim":"When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively.","one_line_summary":"Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the discrete tokens produced by the improved ViT-VQGAN retain enough visual information for autoregressive modeling to succeed at both high-quality generation and strong unsupervised representations without critical loss of detail or mode collapse.","pith_extraction_headline":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet."},"references":{"count":75,"sample":[{"doi":"","year":2010,"title":"Schwing, Jan Kautz, and Arash Vahdat","work_id":"0fe7dc60-401a-4c98-a434-f44dc197d8f0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1906,"title":"Learning representations by maximizing mutual information across views","work_id":"65bca3d5-78f4-42d4-a6b1-cec1b702a646","ref_index":2,"cited_arxiv_id":"1906.00910","is_internal_anchor":true},{"doi":"","year":2020,"title":"Towards causal benchmarking of bias in face analysis algorithms, 2020","work_id":"6501ce5d-4889-498d-9bfe-6701c3b26b5b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"BEiT: BERT Pre-Training of Image Transformers","work_id":"d74eda3c-bf7e-45f1-a8f1-a0137ecca3f4","ref_index":4,"cited_arxiv_id":"2106.08254","is_internal_anchor":true},{"doi":"","year":2019,"title":"Large scale gan training for high fidelity natural image synthesis","work_id":"c8a7029d-c609-4cc5-8c21-654c100ca506","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":75,"snapshot_sha256":"35425ef17774a3871e1cc345ffc0fd40dfeb6c986012deae26f725c5ac88b615","internal_anchors":14},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0eec636ccc08a7e63b3e7b5fb0dd1306cee67f0d27686e029740c6322eb95b4e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}