SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization

Jin Wang; Sheng Fang; Wenbin Jiang; Xiangbo Wang; Yubo You

arXiv: 2505.24437 · v4 · submitted 2025-05-30 · cs.SD · eess.AS

Pith reviewed 2026-05-19 13:05 UTC · model grok-4.3

keywords: neural audio codec · residual experts vector quantization · sparse quantization · audio compression · low bitrate · high fidelity · multi-tiered discriminator · spectral blur reduction

The pith

Residual experts vector quantization expands the embedding space for neural audio codecs to sustain high fidelity at 2.67 kbps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a neural audio compression approach that targets the sharp drop in quality when bitrates are severely restricted. Current methods lose fidelity because the embedding space shrinks dramatically, limiting how well speech, music, and general audio can be represented. The authors introduce Residual Experts Vector Quantization to enlarge that space without substantially increasing bandwidth, paired with a gentle load-balancing method to use the new representations fully and a multi-tiered discriminator that focuses training on important spectral details. A post-training step then enables the same model to handle several bitrates while cutting overall training time. If these elements work together as described, the result would be more efficient audio storage and transmission with less perceptual degradation than earlier codecs achieve at comparable rates.

Core claim

The central claim is that Residual Experts Vector Quantization substantially expands the embedding space with minimal bandwidth cost. Combined with gentle load balancing and a multi-tiered STFT discriminator, the model reaches PESQ and ViSQOL scores of 2.87 and 4.27 at 2.67 kbps while reducing the distance to the original mel-spectrogram by 13 percent. The post-training strategy further allows a single model to support multiple bitrates with performance comparable to fixed-rate models and half the training time.
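To make "minimal bandwidth cost" concrete, here is a minimal sketch of the bitrate bookkeeping for a residual quantizer that uses a routed subset of codebooks. The index-bit formula (frame rate times number of codebooks times log2 of the codebook size) is standard for residual vector quantization. How SwitchCodec actually signals its expert selection is not stated in this summary, so the combinatorial mask cost, the function names, and every number below are illustrative assumptions rather than the paper's configuration.

```python
import math

def index_bitrate_bps(frame_rate_hz, n_active_codebooks, codebook_size):
    """Bits per second spent on codeword indices in a residual quantizer."""
    return frame_rate_hz * n_active_codebooks * math.log2(codebook_size)

def selection_overhead_bps(n_experts, k, selections_per_second):
    """Bits per second needed to say which k of n_experts codebooks are active.

    Assumption: the selection is coded as one of C(n_experts, k) subsets.
    """
    bits_per_selection = math.ceil(math.log2(math.comb(n_experts, k)))
    return bits_per_selection * selections_per_second

# Hypothetical configuration, chosen only to keep the arithmetic readable.
index_bps = index_bitrate_bps(frame_rate_hz=50, n_active_codebooks=4, codebook_size=1024)
mask_bps = selection_overhead_bps(n_experts=8, k=3, selections_per_second=1.0)
overhead_fraction = mask_bps / (index_bps + mask_bps)
```

Under these toy numbers the selection signal is a small fraction of the total, which is the kind of trade the claim describes: a larger pool of codebooks to choose from, paid for with a few extra bits per selection.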

What carries the argument

Residual Experts Vector Quantization (REVQ), which enlarges the set of usable representations with only small bandwidth overhead and is kept active by a gentle load-balancing strategy plus a multi-tiered discriminator that stratifies STFT spectra.
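The idea can be sketched as residual quantization over a routed subset of a larger codebook pool. This is a toy illustration under stated assumptions, not the authors' implementation: the router scores are taken as given, the top-k selection rule is assumed, and the names `nearest`, `revq_encode`, and `revq_decode` are hypothetical.

```python
import random

def nearest(codebook, vec):
    """Return (index, codeword) of the codeword closest to vec in squared L2."""
    def sq_dist(cw):
        return sum((c - v) ** 2 for c, v in zip(cw, vec))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return idx, codebook[idx]

def revq_encode(z, experts, router_scores, k):
    """Quantize z residually with the k highest-scoring expert codebooks."""
    # Assumed routing rule: keep the top-k experts, then fix their order
    # so the decoder can replay the same sequence.
    top_k = sorted(range(len(experts)), key=lambda e: -router_scores[e])[:k]
    chosen = sorted(top_k)
    residual = list(z)
    recon = [0.0] * len(z)
    codes = []
    for e in chosen:
        idx, cw = nearest(experts[e], residual)
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, cw)]
        residual = [r - c for r, c in zip(residual, cw)]
    return chosen, codes, recon

def revq_decode(chosen, codes, experts):
    """Rebuild the latent from the expert selection and the codeword indices."""
    recon = [0.0] * len(experts[0][0])
    for e, idx in zip(chosen, codes):
        recon = [r + c for r, c in zip(recon, experts[e][idx])]
    return recon

# Toy data: 8 expert codebooks of 16 codewords each, 4-dimensional latents.
random.seed(0)
dim, n_experts, size = 4, 8, 16
experts = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(size)]
           for _ in range(n_experts)]
z = [random.gauss(0, 1) for _ in range(dim)]
scores = [random.random() for _ in range(n_experts)]
chosen, codes, recon = revq_encode(z, experts, scores, k=3)
decoded = revq_decode(chosen, codes, experts)
```

Only the three chosen codebooks contribute indices to the bitstream, while the encoder can draw on all eight, which is the sense in which the usable representation set grows faster than the bandwidth.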

If this is right

The same model can handle multiple bitrates without quality loss at the lower end.
Training time for supporting several rates drops by half relative to training separate fixed-bitrate models.
Reconstructed audio exhibits less spectral blur and lies closer to the original mel-spectrogram.
Ablation results indicate the full combination outperforms the tested baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same expansion of representation space might be tested on image or video compression tasks that also face tight bitrate limits.
Low-bitrate audio with these scores could improve streaming quality on mobile networks or in regions with constrained connectivity.
Further experiments on music and environmental sounds would clarify whether the multi-tiered discriminator generalizes beyond the speech-heavy tests reported.

Load-bearing premise

The gentle load-balancing strategy fully utilizes the expanded embedding space created by REVQ without introducing new artifacts or needing extensive hyperparameter tuning that would erase the reported quality gains.

What would settle it

Reproducing the evaluation at 2.67 kbps and observing a PESQ score below 2.87 together with a mel-spectrogram distance reduction smaller than 13 percent would show the central performance claim does not hold.

Figures

Figures reproduced from arXiv: 2505.24437 by Jin Wang, Sheng Fang, Wenbin Jiang, Xiangbo Wang, Yubo You.

**Figure 1.** The overall architecture of the proposed SwitchCodec. An input audio waveform is first segmented into windows. The encoder then maps each window to … (caption truncated)

**Figure 2.** Boxplot visualization of encoded latent Z reconstruction for fixed … (caption truncated)

**Figure 3.** Comparison of STFT spectrogram segmentation strategies for dis… (caption truncated)

**Figure 4.** Architecture of the Multi-Tiered STFT Discriminator (MTSD). The discriminator takes an input waveform and first computes its STFT, separating it into … (caption truncated)

**Figure 5.** Comparison of mel spectrograms: (a) natural mel spectrogram; (b), (c), (d) mel spectrograms generated by SwitchCodec, DAC, and EnCodec, respectively.

**Figure 6.** Subjective listening tests for SwitchCodec, DAC, EnCodec and the … (caption truncated)

**Figure 7.** E… (caption truncated)

**Figure 8.** PESQ scores for the model using dropout, models trained with di… (caption truncated)

Original abstract

Neural audio compression has emerged as a promising technology for efficiently representing speech, music, and general audio. However, existing methods suffer from significant performance degradation at limited bitrates, where the available embedding space is sharply constrained. To address this, we propose a universal high-fidelity neural audio compression algorithm featuring Residual Experts Vector Quantization (REVQ), which substantially expands the embedding space with minimal impact on bandwidth. A gentle load-balancing strategy is introduced to ensure the full utilization of this expanded space. Furthermore, we develop a novel multi-tiered discriminator that periodically stratifies STFT spectra, guiding the generator to focus on critical spectral regions. To support multiple bitrates without quality loss at the lower end, we adopt an efficient post-training strategy. Our proposed model achieves impressive performance, with PESQ and ViSQOL scores of 2.87 and 4.27, respectively, at 2.67 kbps bandwidth. The approach effectively reduces spectral blur, decreasing the distance to the original mel-spectrogram by 13%. Notably, our post-training strategy achieves performance comparable to dedicated fixed-bitrate models while reducing the required training time by half. Extensive ablation studies confirm the superiority of our method over baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SwitchCodec adds REVQ to expand codebook space at low bitrates plus a stratified multi-tier discriminator, delivering reported gains at 2.67 kbps but resting on unshown utilization stats for the balancing step.

The main point is that this work targets low-bitrate neural audio compression by expanding the embedding space through Residual Experts Vector Quantization while keeping bandwidth cost low. They pair it with a gentle load-balancing approach and a multi-tiered discriminator that periodically stratifies STFT spectra to cut spectral blur. A post-training step then lets the model handle variable bitrates without much quality drop at the low end and halves training time compared to fixed-rate versions.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces SwitchCodec, a neural audio codec featuring Residual Experts Vector Quantization (REVQ) to expand the embedding space with minimal bandwidth cost, a gentle load-balancing strategy to utilize this space, a multi-tiered discriminator that stratifies STFT spectra, and a post-training strategy to support multiple bitrates efficiently. It reports a PESQ of 2.87 and a ViSQOL of 4.27 at 2.67 kbps, a 13% reduction in mel-spectrogram distance to the original, and ablation studies in which the method outperforms the baselines.

Significance. If the results hold, the work advances low-bitrate neural audio compression by showing how expanded quantization can be leveraged without proportional bandwidth increases, with practical value in the post-training approach that halves training time while matching fixed-bitrate performance. The multi-tier discriminator provides a targeted way to address spectral blur, potentially benefiting applications in streaming and storage.

major comments (2)

[§3.2] (REVQ and load-balancing): The central claim attributes the PESQ 2.87 / ViSQOL 4.27 gains and 13% mel-spectrogram improvement at 2.67 kbps to REVQ expanding the embedding space plus the gentle load-balancing strategy that fully utilizes it. However, the manuscript provides no codebook activation histograms, utilization statistics, or sensitivity analysis to the load-balancing strength hyperparameter. Without these, it is difficult to confirm that the expanded space is effectively used without under-utilization or new artifacts, which is load-bearing for crediting the gains to REVQ rather than the multi-tier discriminator or post-training.
[§5] (Ablation studies and results): The reported 13% reduction in distance to the original mel-spectrogram is presented without specifying the exact distance metric (e.g., L1 or L2 on log-mel), the precise baseline model, or error bars across multiple runs. This weakens the ability to assess whether the improvement is robust and directly tied to the proposed components at the lowest bitrate.
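The utilization diagnostics requested in the first major comment can be computed from nothing more than the codeword indices emitted on an evaluation set. The following is a minimal sketch; the function name `codebook_usage` and the toy index list are hypothetical, and perplexity is used here as one common summary of how evenly a codebook is used, not as a statistic the paper reports.

```python
import math
from collections import Counter

def codebook_usage(indices, codebook_size):
    """Return (histogram, utilization, perplexity) for a list of code indices.

    utilization: fraction of codewords selected at least once.
    perplexity:  2 ** entropy of the empirical code distribution; it equals
                 codebook_size when usage is perfectly uniform.
    """
    if not indices:
        raise ValueError("indices must be non-empty")
    counts = Counter(indices)
    hist = [counts.get(i, 0) for i in range(codebook_size)]
    utilization = sum(1 for c in hist if c > 0) / codebook_size
    total = len(indices)
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in hist if c > 0)
    return hist, utilization, 2 ** entropy_bits

# Toy example: only 4 of 8 codewords are ever used, each equally often.
toy_indices = [0, 0, 1, 1, 2, 2, 3, 3]
hist, utilization, perplexity = codebook_usage(toy_indices, codebook_size=8)
```

A utilization well below 1.0, or a perplexity far below the codebook size, would indicate that the expanded space is only partly used, which is the failure mode the comment asks the authors to rule out.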

minor comments (3)

[Abstract] The abstract uses 'universal' but evaluations appear concentrated on speech and music; a brief clarification of the audio domains tested would improve scope clarity.
[§3.3] Notation for the multi-tier discriminator (e.g., how STFT strata are defined and combined) could be formalized in an equation to aid reproducibility.
[Table 1] Table 1 or equivalent results table: Ensure all baseline comparisons include the same bandwidth and training conditions for fair assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and indicate the revisions made to strengthen the presentation of our results.

Referee: [§3.2] (REVQ and load-balancing): The central claim attributes the PESQ 2.87 / ViSQOL 4.27 gains and 13% mel-spectrogram improvement at 2.67 kbps to REVQ expanding the embedding space plus the gentle load-balancing strategy that fully utilizes it. However, the manuscript provides no codebook activation histograms, utilization statistics, or sensitivity analysis to the load-balancing strength hyperparameter. Without these, it is difficult to confirm that the expanded space is effectively used without under-utilization or new artifacts, which is load-bearing for crediting the gains to REVQ rather than the multi-tier discriminator or post-training.

Authors: We agree that direct evidence of codebook utilization would make the contribution of REVQ and the load-balancing strategy more transparent. Our ablation studies in Section 5 already isolate the performance gains from these components, but we acknowledge the value of additional diagnostics. In the revised manuscript we have added codebook activation histograms and utilization statistics for the 2.67 kbps configuration in Section 3.2. We have also included a sensitivity analysis to the load-balancing strength hyperparameter in the supplementary material, showing that performance remains stable and that no new artifacts are introduced within the operating range used in our experiments. Revision: yes.

Referee: [§5] (Ablation studies and results): The reported 13% reduction in distance to the original mel-spectrogram is presented without specifying the exact distance metric (e.g., L1 or L2 on log-mel), the precise baseline model, or error bars across multiple runs. This weakens the ability to assess whether the improvement is robust and directly tied to the proposed components at the lowest bitrate.

Authors: We appreciate this request for greater precision. The reported 13% reduction is the relative decrease in L1 distance on log-mel spectrograms between our full model and the baseline model trained without REVQ and the multi-tiered discriminator. We have revised Section 5 to explicitly state both the metric (L1 on log-mel) and the baseline definition. We have also added error bars computed from three independent training runs to the relevant figures and tables, confirming that the observed improvement is consistent across runs. Revision: yes.
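The metric named in this response, a relative decrease in L1 distance on log-mel spectrograms, can be written down in a few lines. This is a minimal sketch that assumes the log-mel spectrograms have already been computed and are passed as equally shaped lists of frames; the mel settings, the averaging over files, and the helper names are not given in this summary and are assumptions. The numbers are toy values, not the paper's data.

```python
def l1_distance(ref, est):
    """Mean absolute difference between two equally shaped log-mel spectrograms."""
    if len(ref) != len(est) or any(len(r) != len(e) for r, e in zip(ref, est)):
        raise ValueError("spectrograms must have the same shape")
    total = sum(abs(r - e) for rf, ef in zip(ref, est) for r, e in zip(rf, ef))
    count = sum(len(rf) for rf in ref)
    return total / count

def relative_reduction(baseline_dist, model_dist):
    """Fractional decrease of model_dist relative to baseline_dist."""
    return (baseline_dist - model_dist) / baseline_dist

# Toy log-mel arrays: 2 frames x 2 mel bins each.
reference = [[0.0, 0.0], [0.0, 0.0]]
baseline_out = [[1.0, 1.0], [1.0, 1.0]]
model_out = [[0.5, 0.5], [1.0, 1.0]]

d_base = l1_distance(reference, baseline_out)
d_model = l1_distance(reference, model_out)
reduction = relative_reduction(d_base, d_model)
```

Reporting `d_base` and `d_model` alongside the percentage, with the spread across training runs, is what allows a reader to judge whether a figure such as 13% is robust.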

Circularity Check

0 steps flagged

No circularity: empirical results from proposed architecture and ablations

The paper introduces REVQ, a gentle load-balancing strategy, multi-tier discriminator, and post-training for a neural audio codec, then reports empirical PESQ/ViSQOL scores and mel-spectrogram distance reductions at specific bitrates. These are obtained via training and evaluation on audio data, with ablation studies confirming component contributions. No equations, derivations, or first-principles steps are present that reduce the reported performance metrics to fitted parameters or self-citations by construction. The central claims rest on external experimental benchmarks rather than internal redefinitions or forced predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claims rest on the effectiveness of the newly introduced REVQ and load-balancing components whose performance is demonstrated only through the reported metrics.

free parameters (1)

load-balancing strength
Gentle load-balancing strategy is introduced to utilize the expanded space; its exact weighting or schedule is a tunable element that affects utilization.

invented entities (1)

Residual Experts Vector Quantization (REVQ) · no independent evidence
purpose: Expand embedding space with minimal bandwidth increase via residual expert selection
Newly proposed quantization scheme presented as the key innovation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8 · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We propose Residual Experts Vector Quantization (REVQ)... A gentle load-balancing strategy is introduced... Multi-Tiered STFT Discriminator that segments spectrograms into hierarchical frequency bands... periods p to [2, 4, 8]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Van Den Oord, and O

A. Van Den Oord, and O. Vinyals, Neural discrete repre- sentation learning, Advances in Neural Information Pro- cessing Systems. 30 (2017)

work page 2017

[2] [2]

Guoet al., Recent Advances in Discrete Speech To- kens: A Review, 2025, arXiv preprint arXiv:2502.06490

Y .-W. Guoet al., Recent Advances in Discrete Speech To- kens: A Review, 2025, arXiv preprint arXiv:2502.06490

work page arXiv 2025

[3] [3]

Zeghidour, A

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, SoundStream: An End-to-End Neural Au- dio Codec, IEEE/ACM Trans. on Audio, Speech, and Lan- guage Processing. 30 (2022) 495-507

work page 2022

[4] [4]

Juang, and A

B.-H. Juang, and A. Gray, Multiple Stage Vector Quanti- zation for Speech Coding, in: Proc. IEEE ICASSP, 1982, pp. 597-600

work page 1982

[5] [5]

HiFi- Codec: Group-residual vector quantization for high fidelity audio codec,

D. Yang et al., HiFi-Codec: Group-Residual Vector Quan- tization for High Fidelity Audio Codec, 2023, arXiv preprint arXiv:2305.02765

work page arXiv 2023

[6] [6]

Chae et al., Variable Bitrate Residual Vector Quantiza- tion for Audio Coding, in: Proc

Y . Chae et al., Variable Bitrate Residual Vector Quantiza- tion for Audio Coding, in: Proc. IEEE ICASSP, 2025, pp. 1-5

work page 2025

[7] [7]

J. Pons, S. Pascual, G. Cengarle, and J. Serrà, Upsampling artifacts in neural audio synthesis, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3005-3009

work page 2021

[8] [8]

Chunked autoregressive gan for conditional waveform synthesis.arXiv preprint arXiv:2110.10139,

M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y . Bengio, Chunked autoregressive gan for conditional waveform synthesis, 2021, arXiv preprint arXiv:2110.10139

work page arXiv 2021

[9] [9]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton and J. Dean, Outrageously large neural net- works: The sparsely-gated mixture-of-experts layer, 2017, arXiv preprint arXiv:1701.06538

work page internal anchor Pith review Pith/arXiv arXiv 2017

[10] [10]

Lepikhin, H

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, Gshard: Scaling gi- ant models with conditional computation and automatic sharding, International Conference on Learning Repre- sentations. (2021)

work page 2021

[11] [11]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and e ffi- cient sparsity, Journal of Machine Learning Research. 23 (2022). 10

work page 2022

[12] [12]

DeepSeek-V3 Technical Report

A. Liu et al. , Deepseek-v3 technical report, 2024, arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Valin, K

J.-M. Valin, K. V os, and T. B. Terriberry, Definition of the opus audio codec, IETF RFC 6716, 2012, [Online]. Available: https://tools.ietf.org/ html/rfc6716

work page 2012

[14] [14]

Dietz et al., Overview of the EVS codec architecture, in: Proc

M. Dietz et al., Overview of the EVS codec architecture, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process, 2015, pp. 5698-5702

work page 2015

[15] [15]

Neuendorf et al., The ISO/MPEG Unified speech and audio coding standard - Consistent high quality for all content types and at all bit rates, J

M. Neuendorf et al., The ISO/MPEG Unified speech and audio coding standard - Consistent high quality for all content types and at all bit rates, J. Audio Eng. Soc. 61 (2013) 956-977

work page 2013

[16] [16]

Gârbacea et al., Low bit-rate speech coding with VQ- V AE and a WaveNet decoder, in: Proc

C. Gârbacea et al., Low bit-rate speech coding with VQ- V AE and a WaveNet decoder, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process, 2019, pp. 735-739

[17]

A. van den Oord et al., WaveNet: A generative model for raw audio, 2016, arXiv preprint arXiv:1609.03499

[18]

E. Jang, S. Gu, and B. Poole, Categorical reparameterization with gumbel-softmax, 2016, arXiv preprint arXiv:1611.01144

[19]

X. Jiang, X. Peng, H. Xue, Y. Zhang and Y. Lu, Latent-Domain Predictive Neural Speech Coding, IEEE/ACM Trans. on Audio, Speech, and Language Processing, IEEE, 2023, vol. 31, pp. 2111-2123, doi: 10.1109/TASLP.2023.3277693

[20]

X. Jiang, X. Peng, Y. Zhang, Y. Lu, Disentangled feature learning for real-time neural speech coding, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1-5

[21]

Y. Li, M. Tagliasacchi, O. Rybakov, V. Ungureanu, and D. Roblek, Realtime speech frequency bandwidth extension, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process, 2021, pp. 691-695

[22]

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, High fidelity neural audio compression, Transactions on Machine Learning Research. (2023)

[23]

S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, Bigvgan: A universal neural vocoder with large-scale training, 2022, arXiv preprint arXiv:2206.04658

[24]

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, High-fidelity audio compression with improved rvqgan, Advances in Neural Information Processing Systems. 36 (2023)

[25]

Z. Liu, T. Hartwig, and M. Ueda, Neural networks fail to learn periodic functions and how to fix it, Advances in Neural Information Processing Systems. 33 (2020) 1583-1594

[26]

J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu, Vector-quantized image modeling with improved vqgan, 2021, arXiv preprint arXiv:2110.04627

[27]

X. Bie, X. Liu, G. Richard, Learning Source Disentanglement in Neural Audio Codec, 2024, arXiv preprint arXiv:2409.11228

[28]

Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, M. Tagliasacchi, Soundstorm: Efficient parallel audio generation, 2023, arXiv preprint arXiv:2305.09636

[29]

H. Li, L. Xue, H. Guo, X. Zhu, Y. Lv, L. Xie, Y. Chen, H. Yin, and Z. Li, Single-codec: Single-codebook speech codec towards high-performance speech generation, 2024, arXiv preprint arXiv:2406.07422

[30]

S. Ji et al., WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, 2024, arXiv preprint arXiv:2408.16532

[31]

A. Vasuki, P.T. Vanathi, A review of vector quantization techniques, IEEE Potentials, vol. 25, 2006, pp. 39-47

[32]

R. Gray, Vector quantization, IEEE ASSP Magazine, vol. 1, 1984, pp. 4-29

[33]

A. Razavi, A. van den Oord, and O. Vinyals, Generating diverse high-fidelity images with VQ-VAE-2, 2019, arXiv preprint arXiv:1906.00446

[34]

P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, Jukebox: A generative model for music, 2020, arXiv preprint arXiv:2005.00341

[35]

J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. Math. Statist. Probability. (1967) 281-297

[36]

S. Welker, M. Le, R. T. Q. Chen, W. Hsu, T. Gerkmann, A. Richard, and Y. Wu, FlowDec: A flow-based full-band general audio codec with high perceptual quality, 2025, arXiv preprint arXiv:2503.01485

[37]

J. Yao, H. Liu, C. Chen, Y. Hu, E. S. Chng, L. Xie, GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling, 2025, arXiv preprint arXiv:2502.02942

[38]

W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation, in: Proc. Interspeech, 2021, pp. 2207-2211

[39]

Y. Bengio, N. Léonard, and A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, 2013, arXiv preprint arXiv:1308.3432

[40]

N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al., Mesh-tensorflow: Deep learning for supercomputers, Advances in Neural Information Processing Systems. 31 (2018) 10414-10423

[41]

L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai, Auxiliary-loss-free load balancing strategy for mixture-of-experts, 2024, arXiv preprint arXiv:2408.15664

[42]

J. Kong, J. Kim, and J. Bae, Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis, Advances in Neural Information Processing Systems. 33 (2020) 17022-17033

[43]

I. Loshchilov and F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations. (2019)

[44]

M. Schoeffler, F. Stöter, B. Edler, and J. Herre, Towards the next generation of web-based experiments: A case study assessing basic audio quality following the ITU-R recommendation BS.1534 (MUSHRA), in: 1st Web Audio Conference, 2015, pp. 1-6

[45]

A. Rix, J. Beerends, M. Hollier, and A. Hekstra, Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, in: 2001 IEEE international conference on acoustics, speech, and signal processing (ICASSP), IEEE, 2001, pp. 749-752

[46]

M. Chinen, F. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, Visqol v3: An open source production ready objective speech and audio metric, in: 2020 twelfth international conference on quality of multimedia experience (QoMEX), IEEE, 2020, pp. 1-6

[47]

C. Veaux, J. Yamagishi, K. MacDonald et al., Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit, University of Edinburgh. The Centre for Speech Technology Research (CSTR), vol. 6, 2017, p. 15

[48]

H. Zen, V. Dang, R. Clark, Y. Zhang, R. Weiss, Y. Jia, Z. Chen, and Y. Wu, LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech, in: Proc. Interspeech, 2019, pp. 1526-1530

[49]

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. Tyers, and G. Weber, Common voice: A massively-multilingual speech corpus, 2019, arXiv preprint arXiv:1912.06670

[50]

M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, FMA: A Dataset For Music Analysis, in: 18th International Society for Music Information Retrieval Conference, 2017.
