{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:U5KALI76GX5OZF7IVTQ3GLLULE","short_pith_number":"pith:U5KALI76","schema_version":"1.0","canonical_sha256":"a75405a3fe35faec97e8ace1b32d7459196712903675423b00ec02d36aade776","source":{"kind":"arxiv","id":"2309.15505","version":2},"attestation_state":"computed","paper":{"title":"Finite Scalar Quantization: VQ-VAE Made Simple","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"David Minnen, Eirikur Agustsson, Fabian Mentzer, Michael Tschannen","submitted_at":"2023-09-27T09:13:40Z","abstract_excerpt":"We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-V"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2309.15505","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2023-09-27T09:13:40Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"62083644e49d78ab01722dcc2435bd6d5edf475e93cf1896c3bc04031e788647","abstract_canon_sha256":"38f0f105f4217052b88ab6b5dbfc6a4369c298813925934c25e84d837197197d"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.967607Z","signature_b64":"svzM7nOEdKc2Tqqgc6BBB6uwpvkfbvCzbsbthsixs/MxGVWSajxEgoD4o3MZb6wMudOCsgWKIJpFOmxduBnkCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a75405a3fe35faec97e8ace1b32d7459196712903675423b00ec02d36aade776","last_reissued_at":"2026-05-17T23:38:46.967163Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.967163Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Finite Scalar Quantization: VQ-VAE Made Simple","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"David Minnen, Eirikur Agustsson, Fabian Mentzer, Michael Tschannen","submitted_at":"2023-09-27T09:13:40Z","abstract_excerpt":"We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-V"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That projecting the VAE latent to a small number of dimensions (typically less than 10) and quantizing each independently to fixed levels preserves sufficient representational capacity for the downstream tasks to match VQ performance.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"424a47ba721b65628e105cf5f81374a2933f1b44348becba0bca40b1a56ee31d"},"source":{"id":"2309.15505","kind":"arxiv","version":2},"verdict":{"id":"ffeb40f3-390e-4869-b3d4-2e702891f5db","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T18:28:56.392306Z","strongest_claim":"Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.","one_line_summary":"Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That projecting the VAE latent to a small number of dimensions (typically less than 10) and quantizing each independently to fixed levels preserves sufficient representational capacity for the downstream tasks to match VQ performance.","pith_extraction_headline":"FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels."},"references":{"count":22,"sample":[{"doi":"","year":null,"title":"Cm3: A causal masked multimodal model of the internet","work_id":"a4a6d3b6-13f5-437f-8081-765dd23198b9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scaling laws for generative mixed-modal language models.arXiv preprint arXiv:2301.03728","work_id":"22042a59-e502-4dd1-8288-7f2de7d5f7af","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"High Quality Monocular Depth Estimation via Transfer Learning","work_id":"9919fef4-288f-41ca-84ef-aec0a2134c9c","ref_index":3,"cited_arxiv_id":"1812.11941","is_internal_anchor":true},{"doi":"","year":null,"title":"End-to-end optimized image compression","work_id":"84c3cd9a-bc02-4db1-b13c-eb3806b8477f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation","work_id":"1fe8c7c8-aff7-4b94-9096-e549d7e60789","ref_index":5,"cited_arxiv_id":"1308.3432","is_internal_anchor":true}],"resolved_work":22,"snapshot_sha256":"00eeb833771c1a217378391d710e5dd9be137029044fe49d21b5e7d798026324","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2923d75a6040f9a72fdeb579d88de63a444019396d5de2a7cb5c07eacc20f03e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2309.15505","created_at":"2026-05-17T23:38:46.967224+00:00"},{"alias_kind":"arxiv_version","alias_value":"2309.15505v2","created_at":"2026-05-17T23:38:46.967224+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2309.15505","created_at":"2026-05-17T23:38:46.967224+00:00"},{"alias_kind":"pith_short_12","alias_value":"U5KALI76GX5O","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"U5KALI76GX5OZF7I","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"U5KALI76","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":28,"internal_anchor_count":28,"sample":[{"citing_arxiv_id":"2511.07820","citing_title":"SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2512.01537","citing_title":"Two-Dimensional Quantization for Geometry-Aware Audio Coding","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16384","citing_title":"Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09040","citing_title":"UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17085","citing_title":"Taming Audio VAEs via Target-KL Regularization","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17486","citing_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18390","citing_title":"Vision Foundation Models as Generalist Tokenizers for Image Generation","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2509.04072","citing_title":"Computational Narrative Understanding for Expressive Text-to-Speech","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21760","citing_title":"fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14333","citing_title":"InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09040","citing_title":"UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27607","citing_title":"JaiTTS: A Thai Voice Cloning Model","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09040","citing_title":"UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09981","citing_title":"Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10438","citing_title":"Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09693","citing_title":"Do multimodal models imagine electric sheep?","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27607","citing_title":"JaiTTS: A Thai Voice Cloning Model","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23522","citing_title":"Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06582","citing_title":"PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05646","citing_title":"MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality","ref_index":129,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01418","citing_title":"TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00503","citing_title":"End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13030","citing_title":"Generative Refinement Networks for Visual Synthesis","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2501.03575","citing_title":"Cosmos World Foundation Model Platform for Physical AI","ref_index":133,"is_internal_anchor":true},{"citing_arxiv_id":"2505.14683","citing_title":"Emerging Properties in Unified Multimodal Pretraining","ref_index":51,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE","json":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE.json","graph_json":"https://pith.science/api/pith-number/U5KALI76GX5OZF7IVTQ3GLLULE/graph.json","events_json":"https://pith.science/api/pith-number/U5KALI76GX5OZF7IVTQ3GLLULE/events.json","paper":"https://pith.science/paper/U5KALI76"},"agent_actions":{"view_html":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE","download_json":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE.json","view_paper":"https://pith.science/paper/U5KALI76","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2309.15505&json=true","fetch_graph":"https://pith.science/api/pith-number/U5KALI76GX5OZF7IVTQ3GLLULE/graph.json","fetch_events":"https://pith.science/api/pith-number/U5KALI76GX5OZF7IVTQ3GLLULE/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE/action/timestamp_anchor","attest_storage":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE/action/storage_attestation","attest_author":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE/action/author_attestation","sign_citation":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE/action/citation_signature","submit_replication":"https://pith.science/pith/U5KALI76GX5OZF7IVTQ3GLLULE/action/replication_record"}},"created_at":"2026-05-17T23:38:46.967224+00:00","updated_at":"2026-05-17T23:38:46.967224+00:00"}