pith. machine review for the scientific record.

arxiv: 2605.06870 · v2 · submitted 2026-05-07 · 💻 cs.LG

Recognition: no theorem link

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords VQ-VAE · dimensional collapse · warm-up phase · autoencoder · vector quantization · latent representations · codebook dimension · reconstruction loss

The pith

Training VQ-VAEs first as continuous autoencoders prevents dimensional collapse and raises final performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that VQ-VAEs routinely collapse to extremely low-dimensional latent subspaces, creating an irreducible lower bound on reconstruction loss that adjustments to codebook size or utilization cannot overcome. It traces this collapse to the quantization step suppressing directions of lower variance in the encoder output, using an extension of sequential-learning analysis combined with rate-distortion ideas. The central proposal is a warm-up interval of unquantized autoencoder training that lets the encoder first occupy the full latent space before quantization begins. Experiments on both image and audio models show this restores effective codebook dimension from single digits to the high teens and delivers measurable gains in reconstruction and perceptual metrics at the same total training steps. The theory further supplies a functional relation between warm-up length and achievable loss, supporting an adaptive switch criterion.

Core claim

Dimensional collapse in VQ-VAEs is caused by the quantizer suppressing lower-variance directions, which imposes a hard loss floor independent of codebook improvements. A warm-up phase that first optimizes the model as an unquantized autoencoder allows the encoder to learn higher-rank representations; switching to VQ-VAE training afterward preserves much of that rank, yielding higher effective dimension and lower final loss.

What carries the argument

An AE warm-up phase trains the model as a continuous autoencoder before vector quantization is switched on, letting the encoder occupy the full latent space before discrete suppression can set in.
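
As a concrete rendering of the recipe, here is a minimal two-phase training sketch. It assumes PyTorch-style `encoder`, `decoder`, and nearest-neighbor `quantizer` modules and a fixed warm-up length; the codebook machinery the paper's experiments actually use (k-means initialization from warm-up latents, dead-code respawn, VQGAN's GAN and perceptual losses) is omitted, and none of the names below are the authors' code.

```python
# Minimal two-phase sketch: continuous AE warm-up, then standard VQ-VAE training.
# `encoder`, `decoder`, `quantizer`, and `warmup_steps` are illustrative stand-ins.
import torch.nn.functional as F

def train(encoder, decoder, quantizer, loader, opt,
          total_steps=100_000, warmup_steps=20_000, beta=0.25):
    step = 0
    while step < total_steps:
        for x in loader:
            z = encoder(x)
            if step < warmup_steps:
                # Phase 1: plain autoencoder. No quantizer in the path, so all
                # latent directions can grow and d_eff rises mode by mode.
                x_hat = decoder(z)
                loss = F.mse_loss(x_hat, x)
            else:
                # Phase 2: vector quantization with the straight-through
                # estimator: z_q equals the nearest code in the forward pass,
                # while gradients flow to z as if the quantizer were identity.
                z_q = quantizer(z)                 # nearest codebook entries
                z_st = z + (z_q - z).detach()      # straight-through surrogate
                x_hat = decoder(z_st)
                loss = F.mse_loss(x_hat, x) \
                       + beta * F.mse_loss(z, z_q.detach())  # commitment term
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= total_steps:
                break
```

The only change from a standard VQ-VAE loop is the `step < warmup_steps` branch, which is the paper's point: the intervention is a schedule, not an architecture change.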

If this is right

  • Effective codebook dimension rises from 3-5 to 17-19 on VQGAN and from 4 to 17-19 on WavTokenizer across tested sizes.
  • Reconstruction and perceptual losses improve, with rFID dropping 17-35 percent on images and PESQ rising 11-14 percent on audio at fixed training budget.
  • Downstream performance becomes predictable from warm-up duration, enabling an adaptive rule for when to introduce quantization.
  • The benefit appears consistently across codebook cardinalities from 2^10 to 2^16.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same continuous-first ordering could be tested in other discrete latent architectures such as vector-quantized diffusion models to check whether collapse is similarly mitigated.
  • If the rate-distortion account generalizes, analogous warm-up periods might address rank collapse in non-quantized high-dimensional representation learners.
  • The functional dependence of final loss on warm-up length supplies a practical knob for trading compute between continuous and discrete stages without exhaustive search; a water-filling sketch of that dependence follows this list.
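
The knob is computable with standard Gaussian reverse water-filling on the AE latent PCA spectrum, which is how the paper's Figures 5, 10, and 11 predict the active-mode count m_wu(T_wu, R) at rate R = log2 K. A sketch under those assumptions; the geometric bisection and iteration count are editorial choices, not the authors' procedure:

```python
# Reverse water-filling on a PCA spectrum: find the water level D* at which
# the eigenvalues support `rate_bits` of rate, and count the modes above it.
# Mirrors the paper's predicted active-mode count m_wu(T_wu, R); the search
# scheme here is an assumption, not the authors' procedure.
import numpy as np

def reverse_water_filling(eigvals, rate_bits):
    eig = np.sort(np.asarray(eigvals, dtype=float))[::-1]

    def rate(level):
        # Bits supported at a given water level: sum of 0.5*log2(eig/level)
        # over the modes whose eigenvalue exceeds the level.
        return 0.5 * np.log2(np.maximum(eig / level, 1.0)).sum()

    lo, hi = 1e-20 * eig[0], eig[0]    # rate(lo) is huge, rate(hi) is zero
    for _ in range(200):               # rate is decreasing in the level
        mid = np.sqrt(lo * hi)         # geometric steps suit log-scaled levels
        if rate(mid) > rate_bits:
            lo = mid
        else:
            hi = mid
    d_star = np.sqrt(lo * hi)
    return float(d_star), int((eig > d_star).sum())

# e.g. predicted codebook dimension at K = 2**14:
# d_star, m = reverse_water_filling(pca_eigenvalues, rate_bits=14)
```

Running this on AE checkpoints at different warm-up lengths gives the predicted dimension-versus-T_wu curve directly, without training the VQ stage to convergence for each candidate schedule.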

Load-bearing premise

Dimensional collapse remains the dominant source of the observed loss bound, and the switch to quantization after warm-up introduces no new optimization failures or hyperparameter-retuning needs.

What would settle it

Measure the effective dimension of the codebook after a sufficiently long warm-up; if it stays below roughly 10 and the reconstruction loss fails to drop below the previously reported floor on the same architecture and data, the mechanism does not hold.
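
A minimal sketch of that measurement, using the 99%-variance PCA threshold the paper's figure captions report; `vectors` stands in for the trained codebook entries (or pooled encoder latents), and the helper name is hypothetical:

```python
# d_eff as the 99%-variance PCA threshold used in the paper's figures.
# `vectors` is an (N, D) array: codebook entries for codebook d_eff,
# or encoder outputs for latent d_eff. Illustrative, not the authors' code.
import numpy as np

def effective_dim(vectors, var_threshold=0.99):
    z = np.asarray(vectors, dtype=float)
    z = z - z.mean(axis=0)                       # center before PCA
    eig = np.linalg.eigvalsh(np.cov(z, rowvar=False))
    eig = np.clip(eig, 0.0, None)[::-1]          # descending, numerically >= 0
    frac = np.cumsum(eig) / eig.sum()            # cumulative explained variance
    return int(np.searchsorted(frac, var_threshold) + 1)
```

If the codebook's effective_dim stays in single digits after a long warm-up while the loss sits at the prior floor, that is the falsifying outcome.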

Figures

Figures reproduced from arXiv: 2605.06870 by Hamed Hassani, Nikita Karagodin, Paul Pu Liang, Sinan Hersek, Xinyu Zhao, Yury Polyanskiy.

Figure 1: Dimensional collapse in VQ-VAEs and the warm-up fix. During plain autoencoder training (gray), the latent effective dimension d_eff grows as directions of data variance ("modes") are learned sequentially. Turning on the quantizer too early collapses d_eff; reconstruction loss hits a floor that is independent of codebook size |C| (blue). Training the encoder–decoder as a plain autoencoder before introducing t…

Figure 2: Sequential learning in AE and VQ-VAE. A 2-layer linear AE learns latent modes sequentially (Saxe et al. [2014], left). In VQ-VAEs, however, quantizing the bottleneck freezes lower modes (those under the water level D⋆), thus severely constraining effective dimension (right). See Section 3 for details.

Figure 3: The linear RD-AE. The linear encoder W1 maps data to the pre-quantization latent z; the (potentially stochastic) quantizer Q assigns z to code z_q; the linear decoder W2 reconstructs x̂. The backward pass through the quantizer uses the straight-through estimator.

Figure 4: Plain AE vs. RD-AE flow. d = 64, Σ = diag(j⁻¹), β = 1. W1, W2 are d × d matrices with i.i.d. entries (W_i)_jk ∼ N(0, s²/(4d)), s = 0.01, evolved under the RD-AE flow (6)–(7). Each curve is the median over 128 seeds; cross-seed variance is below 10⁻³. (Left) Effective dimension d_eff. Plain AE (black) dips and recovers; RD-AE (colored by rate R) dips and plateaus. (Right) Reconstruction loss. Dashed lines …

Figure 5: VQGAN on ImageNet-100 (20k), |C| = 2¹⁴. Codebook d_eff (left) and validation reconstruction loss (center) during training for AE warm-up durations T_wu ∈ {0, 1k, …, 40k}; longer warm-up raises codebook dimension and lowers the loss floor, with returns diminishing past T_wu ≈ 20k. Right: reverse water-filling on the AE checkpoint's PCA spectrum at rate R = 14 is an empirical upper bound on the final trai…

Figure 6: Warm-up theorem: predictions vs. simulation. Balanced initialization W1(0) = W2(0) = εI with ε = 0.1, β = 1, d = 64, σ_j² = j⁻¹, three rates R ∈ {12, 14, 16} bits. (a) Active-mode count at convergence. Solid lines: predicted m_wu(T_wu, R) from (16). Dashed lines: observed k_∞ from the simulated RD-AE flow. (b) Reconstruction loss. Solid lines: loss bound from (17). Dashed lines: observed L_rec^∞. Dotted horizont…

Figure 7: Without VQ, real architectures exhibit Saxe-like sequential activation when trained with Adam; VQ causes dimensional collapse. Latent effective dimension (d_eff, 99%-variance threshold) during autoencoder and standard VQ-VAE training. In both (a) VQGAN and (b) WavTokenizer, standard VQ training pins latent dimension at 2–4, while dimension rises as modes are learned sequentially when training as an autoenc…

Figure 8: VQGAN on ImageNet-100: codebook size × warm-up duration. Training runs spanning K ∈ {1k, 16k, 65k} and T_wu ∈ {0, 1k, 5k, 10k, 20k, 30k, 40k} steps of AE warm-up, all with k-means init and respawning of the codebook (solid lines), compared with the vanilla VQGAN training recipe with random codebook init and no respawn (dashed line). Top row: validation reconstruction loss (L1 + LPIPS) versus training step. …

Figure 9: AE warm-up vs. codebook-size scaling in VQGAN. Final reconstruction metrics on the ImageNet-100 validation set at step 100,000: L1 (left), LPIPS (center), and rFID (right), as a function of AE warm-up length T_wu ∈ {0, 1k, 5k, 10k, 15k, 20k, 30k, 40k}, for three codebook sizes K ∈ {2¹⁰, 2¹⁴, 2¹⁶}. T_wu = 0 corresponds to the VQGAN w/ Respawn baseline (k-means initialization plus dead-code respawn, no AE p…

Figure 10: AE latent PCA spectrum and water levels at three warm-up durations. Gray bars: PCA eigenvalues of the pre-quantization latent at AE warm-up steps 1k, 10k, and 40k (left to right), computed over the full ImageNet-100 (20k) training set. Dashed horizontal lines: Shannon water level D⋆ at rates R = log2 K for codebook sizes K ∈ {2¹⁰, 2¹⁴, 2¹⁶}, computed by reverse water-filling (Eq. 8) on the displayed sp…

Figure 11: Water-filling on the AE latent spectrum upper-bounds the trained VQ-VAE codebook dimension. For each AE warm-up checkpoint T_wu ∈ {0, 1k, 5k, 10k, 15k, 20k, 30k, 40k} and each codebook size K ∈ {2¹⁰, 2¹⁴, 2¹⁶} (one panel per K), we compare the water-filling prediction m_wu(T_wu, R) from the AE PCA spectrum (light, "WF theory") against the codebook effective dimension of the corresponding VQ-VAE trained …

Figure 12: VQGAN on ImageNet-100 (20k) with the commitment term removed. Training curves at β = 0 (vanilla and respawn variants, dashed) overlaid on the standard β > 0 runs from …

Figure 13: WavTokenizer on LibriTTS: cold-start training fails to benefit from larger codebooks; AE warm-up restores codebook scaling. Cold-start WavTokenizer (blue curves, K ∈ {4k, 8k, 16k}) versus AE-VQ (warm curves, K ∈ {4k, 8k, 16k, 65k}). From left: validation mel loss, PESQ, codebook d_eff (99% variance). AE warm-up runs are trained for 43 epochs before introducing VQ. Cold-start runs show flat loss and low eff…
Original abstract

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimension collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in$ {$2^{10}, 2^{14}, 2^{16}$}, AE warm-up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in$ {$2^{13}, 2^{14}$}, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-up to VQ-VAE training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that dimensional collapse in VQ-VAEs—where learned representations occupy an extremely low-dimensional subspace—imposes a hard lower bound on reconstruction loss that codebook-focused techniques cannot overcome. It extends the sequential learning analysis of Saxe et al. (2014) using rate-distortion theory to argue that vector quantization suppresses lower-variance latent directions, and proposes a simple 'AE warm-up' phase (unquantized autoencoder training before introducing VQ) as a remedy. Large-scale experiments on VQGAN (images) and WavTokenizer (audio) show that warm-up increases effective codebook dimension (e.g., 3-5 to 17-19) and yields concrete gains (rFID reduced 17-35%, PESQ improved 11-14%) across codebook sizes, with the theory predicting final performance as a function of warm-up length.

Significance. If the theoretical mechanism is shown to hold beyond the linear case, the work provides a low-overhead, theoretically motivated fix for a pervasive issue in discrete latent models, backed by reproducible large-scale gains on standard benchmarks. The predictive link between warm-up duration and downstream loss, plus explicit dimension measurements, strengthens the contribution over purely empirical codebook fixes.

major comments (2)
  1. [analytic framework] The analytic framework section: the extension of Saxe et al. (2014) via rate-distortion ideas to nonlinear VQ-VAEs (including commitment loss, stop-gradient, and EMA/k-means codebook updates) lacks explicit equations deriving the hard loss lower bound or the quantitative dependence of effective dimension on warm-up length. This is load-bearing both for the central causal claim and for ruling out the alternative reading that the warm-up benefit is merely an optimization artifact.
  2. [experiments] Experimental results (abstract and § on VQGAN/WavTokenizer): reported rFID/PESQ gains and dimension jumps (3-5 to 17-19) are given without error bars, multiple random seeds, or ablation on whether warm-up requires retuning of other hyperparameters (e.g., learning rate, commitment weight). This weakens verification that the observed scaling matches the predicted warm-up dependence.
minor comments (1)
  1. [abstract] The abstract states the theory 'predicts downstream performance as a function of warm-up length' but does not reference the specific functional form or fitting procedure used to obtain the adaptive switching criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [analytic framework] The analytic framework section: the extension of Saxe et al. (2014) via rate-distortion ideas to nonlinear VQ-VAEs (including commitment loss, stop-gradient, and EMA/k-means codebook updates) lacks explicit equations deriving the hard loss lower bound or the quantitative dependence of effective dimension on warm-up length. This is load-bearing both for the central causal claim and for ruling out the alternative reading that the warm-up benefit is merely an optimization artifact.

    Authors: We agree that additional explicit derivations would improve the clarity and rigor of the analytic framework. In the revised manuscript, we will expand this section with step-by-step equations that derive the hard loss lower bound using rate-distortion principles applied to the VQ-VAE objective. We will also provide a quantitative characterization of how effective dimension scales with warm-up length, explicitly incorporating the nonlinear effects of the commitment loss, stop-gradient operation, and EMA/k-means codebook updates. These additions will more clearly distinguish the proposed mechanism from potential optimization artifacts. revision: yes

  2. Referee: [experiments] Experimental results (abstract and § on VQGAN/WavTokenizer): reported rFID/PESQ gains and dimension jumps (3-5 to 17-19) are given without error bars, multiple random seeds, or ablation on whether warm-up requires retuning of other hyperparameters (e.g., learning rate, commitment weight). This weakens verification that the observed scaling matches the predicted warm-up dependence.

    Authors: We concur that reporting statistical variability and conducting targeted ablations would strengthen the experimental claims. In the revision, we will augment the results with error bars computed over multiple random seeds (minimum of three), and we will include an ablation study that evaluates the warm-up procedure without retuning other hyperparameters such as learning rate or commitment weight. These changes will allow readers to better assess whether the observed performance scaling aligns with the theoretical predictions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation extends an external reference and does not reduce to its inputs by construction.

Full rationale

The paper's central analytic framework extends the sequential learning effect from the external citation Saxe et al. [2014] by incorporating rate-distortion theory to explain how VQ suppresses lower-variance directions, leading to dimensional collapse and a hard loss lower bound. This is used to justify the AE warm-up phase as a solution, with the theory claimed to predict downstream performance as a function of warm-up length. No load-bearing steps reduce by definition or by fitting parameters to the target outputs; the warm-up duration is treated as a tunable hyperparameter validated empirically rather than derived tautologically. The cited Saxe et al. work is independent (linear networks), and the extension introduces new rate-distortion elements without self-citation chains or ansatz smuggling. Empirical results on VQGAN and WavTokenizer are presented separately from the theory, with no evidence that predictions are forced by construction from the inputs. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on an extension of the Saxe et al. sequential learning result plus rate-distortion theory; the warm-up length is the main free parameter whose optimal value is characterized empirically.

free parameters (1)
  • warm-up duration
    Length of the initial continuous AE phase before quantization is introduced; the paper states it governs final loss and provides an adaptive switching criterion.
axioms (2)
  • domain assumption The sequential learning effect identified by Saxe et al. (2014) applies to the encoder-decoder dynamics under VQ.
    Invoked to explain why lower-variance directions are suppressed once quantization begins.
  • domain assumption Rate-distortion theory can be used to derive a hard lower bound on reconstruction loss once dimensional collapse occurs.
    Used to justify that codebook-size increases alone cannot overcome the collapse-induced bound; the standard relations behind this are spelled out after this list.
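
For reference, these are the standard Gaussian reverse water-filling relations that the second assumption builds on; identifying R with log2 |C| and the variances with the AE latent PCA spectrum follows the paper's figures (which cite its Eq. 8). The paper's own bound for the trained VQ-VAE lives in its Section 3 and is not reproduced here.

```latex
% Reverse water-filling for a Gaussian source with mode variances
% \sigma_1^2 \ge \dots \ge \sigma_d^2 at rate R = \log_2 |C| (cf. the paper's
% Eq. 8): water level D*, per-mode distortion, total distortion, and the
% active-mode count that upper-bounds codebook d_eff.
\[
R = \sum_{j:\,\sigma_j^2 > D^\star} \tfrac{1}{2}\log_2\!\frac{\sigma_j^2}{D^\star},
\qquad
D_j = \min\bigl(\sigma_j^2,\, D^\star\bigr),
\qquad
D(R) = \sum_{j=1}^{d} \min\bigl(\sigma_j^2,\, D^\star\bigr),
\qquad
m = \#\{\, j : \sigma_j^2 > D^\star \,\}.
\]
```

On this reading, once collapse freezes all but m encoder modes, the reconstruction loss cannot fall below the discarded variance, the sum of σ_j² over j > m, however large the codebook grows; that is the sense in which the bound is "hard".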

pith-pipeline@v0.9.0 · 5641 in / 1415 out tokens · 24670 ms · 2026-05-13T01:21:55.460549+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.

  2. [2]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341.

  3. [3]

    CREPE: A convolutional representation for pitch estimation

    Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A convolutional representation for pitch estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 161–165.

  4. [4]

    Contrastive multiview coding

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 776–794, Berlin, Heidelberg. URL https://doi.org/10.1007/978-3-030-58621-8_45.

  5. [5]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 30.

  6. [6]

    Bridging continuous and discrete tokens for autoregressive visual generation

    Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, and Xihui Liu. Bridging continuous and discrete tokens for autoregressive visual generation. arXiv preprint arXiv:2503.16430.

  7. [7]

    Early quantization shrinks codebook: A simple fix for diversity-preserving tokenization

    Wenhao Zhao, Qiran Zou, Rushi Shah, Yudi Wu, Zhouhan Lin, and Dianbo Liu. Early quantization shrinks codebook: A simple fix for diversity-preserving tokenization. arXiv preprint arXiv:2603.17052.

  8. [8]

    Representation collapsing problems in vector quantization

    Wentao Zhao, Zijie Liu, Songlin Chen, Xiangyun Cao, and Guanghui He. Representation collapsing problems in vector quantization. arXiv preprint arXiv:2411.16550.
