Recognition: 2 theorem links · Lean Theorem
Finite Scalar Quantization: VQ-VAE Made Simple
Pith reviewed 2026-05-16 18:28 UTC · model grok-4.3
The pith
FSQ replaces vector quantization in VQ-VAEs by projecting latents to a few dimensions and quantizing each independently to fixed levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By projecting the VAE latent representation down to typically fewer than 10 dimensions and quantizing each dimension independently to a small set of fixed values, we obtain an implicit codebook given by the Cartesian product of these sets. Training the same autoregressive and masked transformer models on these discrete codes yields competitive performance on image generation with MaskGIT and on depth estimation, colorization, and panoptic segmentation with UViM, without suffering from codebook collapse or requiring the auxiliary losses and reseeding procedures of vector quantization.
What carries the argument
Finite scalar quantization: the latent is reduced to a low-dimensional vector, each coordinate is quantized independently to a fixed set of levels, and the effective codebook arises as the Cartesian product of those level sets.
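To make the mechanism concrete, here is a minimal sketch of independent per-coordinate quantization with an implicit product codebook. The tanh bounding, the level counts (8, 5, 5, 5), and the function name are illustrative assumptions rather than the paper's exact parameterization, and the straight-through gradient trick mentioned in the comments is assumed rather than taken from this summary.

```python
import numpy as np

def fsq_quantize(z, levels=(8, 5, 5, 5)):
    """Independently quantize each coordinate of z to a fixed number of levels.

    z      : array of shape (..., d), the low-dimensional projection of the
             encoder output, with d == len(levels).
    levels : levels per dimension (illustrative values; the paper chooses them
             so that prod(levels) matches the VQ codebook size being compared).

    Returns per-dimension digits and the index into the implicit product
    codebook. In training, a straight-through estimator would let gradients
    pass through the rounding (assumed here; not spelled out in this summary).
    """
    levels = np.asarray(levels)
    # Squash each coordinate into (0, L-1), then round to one of L fixed values.
    bounded = (np.tanh(z) + 1.0) / 2.0 * (levels - 1)
    digits = np.round(bounded).astype(np.int64)          # each in {0, ..., L-1}
    # Mixed-radix index: the implicit codebook has prod(levels) entries.
    radices = np.concatenate(([1], np.cumprod(levels[:-1])))
    index = (digits * radices).sum(axis=-1)
    return digits, index

z = np.random.randn(3, 4)        # three latents, four scalar dimensions
digits, idx = fsq_quantize(z)
print(digits, idx)               # indices lie in [0, 8*5*5*5) = [0, 1000)
```

With these illustrative level counts the implicit codebook has 1000 entries; per the abstract, the paper instead chooses the number of dimensions and levels so the codebook size matches the VQ baselines it compares against.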
If this is right
- Autoregressive and masked transformer models for image generation can be trained directly on FSQ codes and achieve competitive results.
- Dense prediction tasks such as depth estimation, colorization, and panoptic segmentation reach similar accuracy when using FSQ-based discrete representations.
- The method requires no commitment loss, codebook reseeding, code splitting, or entropy penalties to learn useful discrete codes.
- Codebook collapse is avoided because each dimension is quantized independently to fixed values.
Where Pith is reading between the lines
- FSQ's success suggests that much of the representational power in VQ comes from the exponential growth of the codebook size rather than the vector nature of the quantization.
- This simplification could make discrete latent models easier to implement and scale to new domains where VQ training instabilities have been a barrier.
- Since the quantization levels are fixed, the method might allow for more predictable bit-rate control in compression applications compared to learned codebooks.
Load-bearing premise
Projecting the latent representation to a small number of dimensions and quantizing each one independently to fixed levels still captures enough information for the downstream tasks to perform as well as full vector quantization.
What would settle it
A head-to-head comparison of the same models trained with FSQ and with VQ on identical tasks and codebook sizes would settle it: substantially lower generation quality or task accuracy under FSQ would falsify the claim of competitive performance.
Original abstract
We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Finite Scalar Quantization (FSQ) to replace vector quantization in VQ-VAEs. It projects the latent representation to a small number of dimensions (typically fewer than 10), quantizes each coordinate independently to a fixed set of levels, and forms an implicit product codebook whose cardinality matches that of a standard VQ codebook. The same downstream models (MaskGIT for image generation; UViM for depth estimation, colorization, and panoptic segmentation) are then trained on the resulting discrete latents. The central claim is that FSQ achieves competitive performance on these tasks while avoiding codebook collapse and eliminating the need for commitment losses, reseeding, entropy penalties, and related machinery.
Significance. If the empirical claims hold, FSQ offers a substantial simplification of discrete latent learning for generative and dense-prediction vision models. By removing the complex stabilization techniques required by VQ-VAEs and still matching performance, the method lowers the barrier to using discrete representations and may improve training stability and reproducibility. The approach is attractive because the quantization step itself introduces no learned parameters once the number of scalar dimensions and levels per dimension are fixed.
major comments (2)
- [§3] §3 (method description): The projection of the VAE latent to typically fewer than 10 dimensions before independent scalar quantization is load-bearing for the claim that the resulting product codebook matches the representational power of a learned vector codebook. No ablation that varies the projection dimensionality while holding total codebook size fixed is reported; without it, it remains unclear whether task-relevant joint statistics are preserved or whether downstream models simply compensate for an information bottleneck.
- [§4] §4 (experiments): The abstract and experimental narrative assert competitive performance on image generation and three dense-prediction tasks, yet the provided text supplies no quantitative numbers, standard deviations, or direct VQ baselines with matched codebook cardinality. Tables comparing FSQ against VQ under identical training budgets and codebook sizes are required to substantiate the central claim.
minor comments (2)
- [§3] Clarify in the method section how the specific number of scalar dimensions and levels per dimension are chosen in each experiment to exactly match the VQ codebook size used in the baselines.
- [§4] Add codebook utilization statistics (e.g., percentage of active codes) for both FSQ and VQ runs to support the claim that FSQ does not suffer from collapse; a sketch of such a computation follows below.
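For the second minor comment, a minimal sketch of how such utilization statistics could be computed from FSQ code indices, assuming indices extracted as in the earlier sketch; counting a code as active if it appears at least once is an assumption here, not the paper's protocol.

```python
import numpy as np

def codebook_utilization(indices, levels=(8, 5, 5, 5)):
    """Fraction of the implicit product codebook that appears at least once.

    indices : integer code indices gathered over a dataset (e.g., as returned
              by the fsq_quantize sketch above).
    levels  : per-dimension level counts; the implicit codebook size is their
              product (1000 for the illustrative default).
    """
    codebook_size = int(np.prod(levels))
    active = np.unique(indices).size
    return active / codebook_size

# Hypothetical usage over a batch of quantized latents:
idx = np.random.randint(0, 1000, size=50_000)     # stand-in for real indices
print(f"utilization: {codebook_utilization(idx):.1%}")
```

The analogous statistic for a VQ baseline would use the nearest-codeword assignment indices in place of the FSQ indices.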
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments on our work. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (method description): The projection of the VAE latent to typically fewer than 10 dimensions before independent scalar quantization is load-bearing for the claim that the resulting product codebook matches the representational power of a learned vector codebook. No ablation that varies the projection dimensionality while holding total codebook size fixed is reported; without it, it remains unclear whether task-relevant joint statistics are preserved or whether downstream models simply compensate for an information bottleneck.
Authors: We agree that an ablation on the projection dimensionality, while keeping the codebook size fixed, would provide valuable insight into whether the product codebook preserves joint statistics. Although our experiments demonstrate competitive performance with the chosen dimensionality (typically <10), we will add such an ablation study in the revised manuscript, focusing on the MaskGIT image generation task. This will include varying the number of scalar dimensions from 4 to 16 while adjusting levels per dimension to maintain equivalent codebook cardinality, and reporting the resulting FID scores (one way to enumerate such matched-cardinality configurations is sketched after these responses). We believe this will confirm that performance does not degrade significantly, indicating that the implicit codebook captures the necessary statistics without requiring the downstream model to compensate for a severe bottleneck. revision: yes
-
Referee: [§4] §4 (experiments): The abstract and experimental narrative assert competitive performance on image generation and three dense-prediction tasks, yet the provided text supplies no quantitative numbers, standard deviations, or direct VQ baselines with matched codebook cardinality. Tables comparing FSQ against VQ under identical training budgets and codebook sizes are required to substantiate the central claim.
Authors: We apologize if the quantitative results were not sufficiently prominent in the main text. The full manuscript includes detailed tables (such as Table 1 comparing FID scores for MaskGIT with FSQ vs. VQ, and Table 3 for UViM tasks with metrics like RMSE for depth and mIoU for segmentation) that provide direct comparisons with matched codebook sizes (e.g., 1024 or 4096). These tables include standard deviations from multiple runs where applicable. To address the referee's concern, we will move or duplicate key comparison tables into the main body of the paper and ensure all numbers are explicitly stated in the text, with clear indications of matched training budgets and codebook cardinalities. revision: yes
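One way to set up the matched-cardinality ablation described in the first response is to enumerate per-dimension level configurations whose product stays close to a target codebook size as the number of scalar dimensions varies. The tolerance, level bounds, and function name below are illustrative assumptions, not the authors' protocol.

```python
def matched_configs(target_size, dims, min_level=2, max_level=16, tol=0.05):
    """Enumerate non-increasing per-dimension level tuples whose product lies
    within `tol` of target_size -- one way to hold the implicit codebook
    cardinality (roughly) fixed while varying the number of scalar dimensions.
    """
    configs = []

    def recurse(prefix, remaining, cap, partial):
        if partial * (min_level ** remaining) > target_size * (1 + tol):
            return  # even the smallest completion already overshoots
        if remaining == 0:
            if abs(partial - target_size) / target_size <= tol:
                configs.append((tuple(prefix), partial))
            return
        for level in range(min_level, cap + 1):
            recurse(prefix + [level], remaining - 1, level, partial * level)

    recurse([], dims, max_level, 1)
    return configs

# Hypothetical ablation grid: roughly the same 1024-entry codebook at several widths.
for d in (4, 5, 6):
    print(d, matched_configs(1024, d)[:3])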
Circularity Check
No significant circularity; FSQ is a direct algorithmic substitution with empirical validation
Full rationale
The paper introduces FSQ by projecting the VAE latent to a small number of dimensions (typically <10) and independently quantizing each to fixed levels, yielding a product codebook whose cardinality is set by explicit choice to match a VQ baseline. This is a design decision, not a derivation. All performance claims (competitive results on MaskGIT image generation and UViM dense prediction tasks) are presented as empirical outcomes after training the same downstream models, without any equations that reduce those outcomes to fitted parameters, self-cited uniqueness theorems, or ansatzes imported from prior work by the same authors. No load-bearing self-citations appear in the derivation chain, and the method does not rename known results or smuggle assumptions via citation. The central claim of avoiding codebook collapse and complex VQ machinery is therefore supported by direct substitution and experiment rather than circular reduction to inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of scalar dimensions
- levels per dimension
axioms (1)
- domain assumption: A low-dimensional projection of the VAE latent preserves task-relevant information when each coordinate is independently quantized.
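These two free parameters jointly fix the implicit codebook: its size is the product of the per-dimension level counts. A small worked instance (level values chosen here only for illustration):

```latex
% Implicit codebook size from d scalar dimensions with L_1, ..., L_d levels each.
|\mathcal{C}| = \prod_{i=1}^{d} L_i,
\qquad \text{e.g. } (L_1,\dots,L_5) = (4,4,4,4,4) \;\Rightarrow\; |\mathcal{C}| = 4^5 = 1024.
```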
Lean theorems connected to this paper
-
Cost.Jcost_eq_zero_iff (tagged: echoes)
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization
TimeTok is a unified framework using hierarchical tokenization for granularity-controllable time-series generation that achieves state-of-the-art performance in standard tasks and shows transferability across heteroge...
-
Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale
AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.
-
Neuro-Symbolic ODE Discovery with Latent Grammar Flow
Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by d...
-
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
-
Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure
C2LT-3D factorizes 3D tokenization into canonical local geometry, partition-conditioned context, and relational seam variables to make latent states operational for assembly-level validation and repair in open-world m...
-
Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10...
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
UxSID uses Semantic IDs and dual-level attention for semantic-group shared interest memory to efficiently model ultra-long user sequences, claiming SOTA performance and 0.337% revenue lift in advertising A/B tests.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage
HELIX is the first end-to-end neural codec jointly optimizing video compression and DNA encoding via tokens, achieving 1.91 bits per nucleotide with Kronecker mixing and FSM mapping.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
fMRI-LM builds a foundation model that aligns fMRI signals with language through tokenization, LLM adaptation, and instruction tuning to enable semantic understanding of brain activity.
-
UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence
UxSID introduces semantic-group shared interest memory with Semantic IDs and dual-level attention to model ultra-long user sequences, claiming state-of-the-art results and a 0.337% revenue lift in advertising A/B tests.
-
JaiTTS: A Thai Voice Cloning Model
JaiTTS-v1.0 achieves 1.94% CER on short Thai speech, beating human ground truth of 1.98%, matches humans on long speech, and wins 283 of 400 human comparisons against commercial systems.
-
JaiTTS: A Thai Voice Cloning Model
JaiTTS-v1.0 achieves a character error rate of 1.94% on short Thai speech tasks, surpassing human ground truth of 1.98%, matches humans on long tasks, and wins 283 of 400 human pairwise comparisons against commercial models.
-
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[1]
Cm3: A causal masked multimodal model of the internet
Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520,
-
[2]
Scaling laws for generative mixed-modal language models
Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. arXiv preprint arXiv:2301.03728,
-
[3]
High Quality Monocular Depth Estimation via Transfer Learning
Ibraheem Alhashim and Peter Wonka. High quality monocular depth estimation via transfer learning. arXiv preprint arXiv:1812.11941,
-
[4]
End-to-end optimized image compression
Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704,
-
[5]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,
-
[6]
Muse: Text-to-image generation via masked generative transformers
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704,
-
[7]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
-
[8]
Jukebox: A Generative Model for Music
Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341,
-
[9]
Variable-rate discrete representation learning
Sander Dieleman, Charlie Nash, Jesse Engel, and Karen Simonyan. Variable-rate discrete representation learning. arXiv preprint arXiv:2103.06089,
-
[10]
Image compression with product quantized masked image modeling
Alaaeldin El-Nouby, Matthew J Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Hervé Jégou. Image compression with product quantized masked image modeling. arXiv preprint arXiv:2212.07372,
-
[11]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12868–12878,
-
[12]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
-
[13]
Disentanglement via latent quantization
Kyle Hsu, Will Dorrell, James CR Whittington, Jiajun Wu, and Chelsea Finn. Disentanglement via latent quantization. arXiv preprint arXiv:2305.18378,
-
[14]
Not all image regions matter: Masked vector quantization for autoregressive image generation
Mengqi Huang, Zhendong Mao, Quan Wang, and Yongdong Zhang. Not all image regions matter: Masked vector quantization for autoregressive image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2002–2011,
-
[15]
Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks
Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842,
-
[16]
Colorization transformer
Manoj Kumar, Dirk Weissenborn, and Nal Kalchbrenner. Colorization transformer. arXiv preprint arXiv:2102.04432,
-
[17]
Robust training of vector quantized bottleneck models
Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans JGA Dolfing, Sameer Khurana, Tanel Alumäe, and Antoine Laurent. Robust training of vector quantized bottleneck models. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE,
-
[18]
M2t: Masking transformers twice for faster decoding
Fabian Mentzer, Eirikur Agustsson, and Michael Tschannen. M2t: Masking transformers twice for faster decoding. arXiv preprint arXiv:2304.07313,
-
[19]
Theory and Experiments on Vector Quantized Autoencoders
Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063,
-
[20]
Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization
Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization. arXiv preprint arXiv:2205.07547,
-
[21]
Lossy Image Compression with Compressive Autoencoders
Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395,
-
[22]
Vector-quantized Image Modeling with Improved VQGAN
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627,