{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:FMR6TAKPY4GK7B2J6PENHAJL7N","short_pith_number":"pith:FMR6TAKP","schema_version":"1.0","canonical_sha256":"2b23e9814fc70caf8749f3c8d3812bfb427c20561b0eb78e59e0f5adcfb00686","source":{"kind":"arxiv","id":"2110.04627","version":3},"attestation_state":"computed","paper":{"title":"Vector-quantized Image Modeling with Improved VQGAN","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Alexander Ku, Han Zhang, James Qin, Jason Baldridge, Jiahui Yu, Jing Yu Koh, Ruoming Pang, Xin Li, Yonghui Wu, Yuanzhong Xu","submitted_at":"2021-10-09T18:36:00Z","abstract_excerpt":"Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learnin"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2110.04627","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2021-10-09T18:36:00Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"a7b3bfcc264af0f8477ff69f4872107cbe3567a6ffb64ac00190bf727d3d480c","abstract_canon_sha256":"1281174dc868bc9ba29a7618a8026ba8d4e9294520a78257df36a7f3417a8909"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.953492Z","signature_b64":"3v/dICYmc+w1JF4DUHwwfybQw4G3iJkbUnOd6djYbfNMvhtQtP6pJjrs1wRO+hBEWqijG4GR8yj1tBDBQMeMDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"2b23e9814fc70caf8749f3c8d3812bfb427c20561b0eb78e59e0f5adcfb00686","last_reissued_at":"2026-05-17T23:38:46.953044Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.953044Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Vector-quantized Image Modeling with Improved VQGAN","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Alexander Ku, Han Zhang, James Qin, Jason Baldridge, Jiahui Yu, Jing Yu Koh, Ruoming Pang, Xin Li, Yonghui Wu, Yuanzhong Xu","submitted_at":"2021-10-09T18:36:00Z","abstract_excerpt":"Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learnin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the discrete tokens produced by the improved ViT-VQGAN retain enough visual information for autoregressive modeling to succeed at both high-quality generation and strong unsupervised representations without critical loss of detail or mode collapse.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b4daf46dd03dca15b66d07073bf100f3c56513e2ce4f99fd67b364bc3e1ab25b"},"source":{"id":"2110.04627","kind":"arxiv","version":3},"verdict":{"id":"68ed06e5-14f0-41ec-ba4d-07d5e8bc5c83","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T18:36:26.666226Z","strongest_claim":"When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively.","one_line_summary":"Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the discrete tokens produced by the improved ViT-VQGAN retain enough visual information for autoregressive modeling to succeed at both high-quality generation and strong unsupervised representations without critical loss of detail or mode collapse.","pith_extraction_headline":"An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet."},"references":{"count":75,"sample":[{"doi":"","year":2010,"title":"Schwing, Jan Kautz, and Arash Vahdat","work_id":"0fe7dc60-401a-4c98-a434-f44dc197d8f0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1906,"title":"Learning representations by maximizing mutual information across views","work_id":"65bca3d5-78f4-42d4-a6b1-cec1b702a646","ref_index":2,"cited_arxiv_id":"1906.00910","is_internal_anchor":true},{"doi":"","year":2020,"title":"Towards causal benchmarking of bias in face analysis algorithms, 2020","work_id":"6501ce5d-4889-498d-9bfe-6701c3b26b5b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"BEiT: BERT Pre-Training of Image Transformers","work_id":"d74eda3c-bf7e-45f1-a8f1-a0137ecca3f4","ref_index":4,"cited_arxiv_id":"2106.08254","is_internal_anchor":true},{"doi":"","year":2019,"title":"Large scale gan training for high fidelity natural image synthesis","work_id":"c8a7029d-c609-4cc5-8c21-654c100ca506","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":75,"snapshot_sha256":"35425ef17774a3871e1cc345ffc0fd40dfeb6c986012deae26f725c5ac88b615","internal_anchors":14},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0eec636ccc08a7e63b3e7b5fb0dd1306cee67f0d27686e029740c6322eb95b4e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2110.04627","created_at":"2026-05-17T23:38:46.953118+00:00"},{"alias_kind":"arxiv_version","alias_value":"2110.04627v3","created_at":"2026-05-17T23:38:46.953118+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2110.04627","created_at":"2026-05-17T23:38:46.953118+00:00"},{"alias_kind":"pith_short_12","alias_value":"FMR6TAKPY4GK","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"FMR6TAKPY4GK7B2J","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"FMR6TAKP","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2406.16042","citing_title":"Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2503.14324","citing_title":"DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2601.01593","citing_title":"Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16384","citing_title":"Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18390","citing_title":"Vision Foundation Models as Generalist Tokenizers for Image Generation","ref_index":88,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20170","citing_title":"KoRe: Compact Knowledge Representations for Large Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2505.24437","citing_title":"SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18457","citing_title":"VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2309.15505","citing_title":"Finite Scalar Quantization: VQ-VAE Made Simple","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14671","citing_title":"Mirai: Autoregressive Visual Generation Needs Foresight","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2409.04429","citing_title":"VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14333","citing_title":"InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13517","citing_title":"ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04170","citing_title":"Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2205.11487","citing_title":"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2206.10789","citing_title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2406.06525","citing_title":"Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24885","citing_title":"VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05646","citing_title":"MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality","ref_index":157,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01220","citing_title":"Visual Implicit Autoregressive Modeling","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00503","citing_title":"End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2410.13720","citing_title":"Movie Gen: A Cast of Media Foundation Models","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10927","citing_title":"LiveGesture Streamable Co-Speech Gesture Generation Model","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2209.14792","citing_title":"Make-A-Video: Text-to-Video Generation without Text-Video Data","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05113","citing_title":"CRAB: Codebook Rebalancing for Bias Mitigation in Generative Recommendation","ref_index":20,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N","json":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N.json","graph_json":"https://pith.science/api/pith-number/FMR6TAKPY4GK7B2J6PENHAJL7N/graph.json","events_json":"https://pith.science/api/pith-number/FMR6TAKPY4GK7B2J6PENHAJL7N/events.json","paper":"https://pith.science/paper/FMR6TAKP"},"agent_actions":{"view_html":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N","download_json":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N.json","view_paper":"https://pith.science/paper/FMR6TAKP","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2110.04627&json=true","fetch_graph":"https://pith.science/api/pith-number/FMR6TAKPY4GK7B2J6PENHAJL7N/graph.json","fetch_events":"https://pith.science/api/pith-number/FMR6TAKPY4GK7B2J6PENHAJL7N/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N/action/timestamp_anchor","attest_storage":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N/action/storage_attestation","attest_author":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N/action/author_attestation","sign_citation":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N/action/citation_signature","submit_replication":"https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N/action/replication_record"}},"created_at":"2026-05-17T23:38:46.953118+00:00","updated_at":"2026-05-17T23:38:46.953118+00:00"}