pith. sign in

arxiv: 2512.01537 · v3 · pith:HHUFCINOnew · submitted 2025-12-01 · 💻 cs.SD · cs.AI· cs.IT· cs.LG· eess.SP· math.IT

Two-Dimensional Quantization for Geometry-Aware Audio Coding

Pith reviewed 2026-05-21 18:10 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.ITcs.LGeess.SPmath.IT
keywords neural audio codecquantizationtwo-dimensional quantizationlatent space geometrycodebook utilizationtoken ratespeech compressionmusic compression
0
0 comments X

The pith

Projecting pairs of audio features onto fixed 2D grids yields an implicit codebook that raises codebook utilization and lowers token rates while matching state-of-the-art reconstruction quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard neural audio codecs rely on vector or residual vector quantization that restricts the geometric layout of the latent space and makes it harder to exploit correlations between features. The paper introduces Q2D2, a scheme that takes pairs of latent features, projects each pair onto one of several structured 2D grids (hexagonal, rhombic, or rectangular), and quantizes the pair to the nearest grid point. This construction produces an implicit codebook whose size equals the product of the number of levels along each grid axis, keeping the overall vocabulary comparable to conventional methods. Across speech, general audio, and music datasets, the resulting models reach competitive or better objective and subjective scores at lower token rates and with higher codebook utilization than current baselines. Ablation experiments isolate the contribution of the grid choice and the pairing step.

Core claim

Q2D2 projects each pair of latent features onto a chosen 2D tiling (hexagonal, rhombic, or rectangular) and replaces the pair with the nearest grid coordinates, thereby defining an implicit codebook whose cardinality is the product of the per-axis level counts; the decoder then reconstructs from these quantized coordinates.

What carries the argument

Two-Dimensional Quantization (Q2D2), which maps paired latent vectors onto a fixed 2D grid and quantizes them jointly to the nearest grid point.

If this is right

  • Token rate can be reduced while reconstruction quality stays at or above current neural codec levels.
  • Codebook utilization rises because every grid point is reachable and the grid geometry encourages even occupancy.
  • Correlations between adjacent latent dimensions are captured directly by the joint 2D quantization step.
  • The same grid-based scheme works across speech, general audio, and music without domain-specific tuning.
  • Ablation results show that grid shape and pairing strategy each contribute measurably to the efficiency gain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometric projection may make the latent space more interpretable, because each quantized pair corresponds to a visible location on a regular lattice.
  • Similar 2D-grid quantization could be applied to image or video codecs where spatial or temporal correlations dominate.
  • Because the codebook is implicit and generated on the fly, memory footprint for very large effective vocabularies may shrink compared with explicit lookup tables.
  • Testing the method on longer audio sequences or on streaming settings would reveal whether the fixed-grid assumption holds when temporal context grows.

Load-bearing premise

Projecting feature pairs onto fixed 2D grids preserves the correlations needed for high-fidelity reconstruction without introducing distortions that the decoder cannot compensate for.

What would settle it

Training an otherwise identical codec with Q2D2 on a standard speech or music benchmark and observing both lower codebook utilization and worse perceptual quality than an RVQ baseline of the same token rate would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.01537 by Eliya Nachmani, Tal Shuster.

Figure 1
Figure 1. Figure 1: Visualization of quantization grids used in Q2D2: [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Q2D2 (a): The final encoder layer is projected to d selected latent feature dimensions. Each projected dimension is first bounded between [−li/2, li/2], where li is the number of levels selected per dimension. Q2D2 then groups the dimensions into pairs (in example, 6 dimensions are reshaped into 3 pairs), and jointly quantizes each pair onto a structured 2D grid and finding the nearest point on the grid. F… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between different acoustic codec models. The y-axis UTMOS reflects re [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mutual information between the two coordinates of each 2D pair before and after quantiza [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
read the original abstract

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space, make it harder to capture correlations between features leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state of the art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in speech, audio and music domains compared to state of the art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Two-Dimensional Quantization (Q2D2) for neural audio coding. In this scheme, pairs of latent features are projected onto one of several structured 2D grids (hexagonal, rhombic, or rectangular) and assigned to the nearest grid point. This defines an implicit codebook whose cardinality is the product of the number of levels along each grid axis. The authors report that Q2D2 achieves high codebook utilization and low token rates while delivering reconstruction quality that is competitive with or better than existing RVQ-, VQ-, and FSQ-based codecs on speech, audio, and music tasks, backed by objective metrics, subjective listening tests, and ablation experiments.

Significance. Should the geometric projection prove robust across different latent distributions, the approach offers a lightweight, interpretable alternative to learned vector quantizers. Its simplicity and the reported high utilization could make it attractive for resource-constrained audio compression pipelines. The explicit use of tiling geometry also opens a line of inquiry into how latent-space structure can be imposed rather than discovered.

major comments (3)
  1. Section 3.1: The projection operator onto the 2D grid is described only at a high level. No explicit formula is given for the distance metric (Euclidean, Manhattan, or grid-specific) or for the coordinate transformation that maps paired features to the chosen tiling. Without this, it is impossible to verify whether the quantization is truly 'parameter-free' or whether the choice of grid introduces hidden hyperparameters.
  2. Section 4.3, Table 3: The cross-domain results show Q2D2 outperforming baselines on music but only matching on speech. The paper does not report whether the same grid type and level counts were used across domains or whether per-domain grid selection was performed; if the latter, the comparison to 'state of the art' models that use fixed RVQ configurations becomes less direct.
  3. Section 5: The central assumption that paired latent features naturally align with one of the fixed tilings is not empirically tested. No scatter plots of feature pairs, no measurement of alignment error, and no ablation on random versus structured pairing are provided. If the joint distribution deviates significantly from the grid geometry, the nearest-grid assignment can introduce irreducible quantization noise that the decoder cannot fully compensate, weakening the claim that gains stem from geometry-aware correlation capture rather than from dimensionality reduction via pairing.
minor comments (2)
  1. Abstract: The phrase 'implicit codebook defined by the product of grid levels' would benefit from a short parenthetical example (e.g., 8×8 = 64) to clarify the scaling.
  2. Figure 2: The caption does not indicate whether the visualized grids are to scale or schematic; adding a note on the actual spacing used in experiments would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications where appropriate and outlining planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: Section 3.1: The projection operator onto the 2D grid is described only at a high level. No explicit formula is given for the distance metric (Euclidean, Manhattan, or grid-specific) or for the coordinate transformation that maps paired features to the chosen tiling. Without this, it is impossible to verify whether the quantization is truly 'parameter-free' or whether the choice of grid introduces hidden hyperparameters.

    Authors: We agree that Section 3.1 would benefit from greater mathematical detail. In the revised manuscript we will add the explicit projection formulas for each grid (hexagonal, rhombic, rectangular), specifying that nearest-grid assignment uses Euclidean distance in the appropriately transformed coordinates. The transformations are deterministic and contain no learnable parameters, so Q2D2 remains parameter-free apart from the discrete design choices of grid type and level counts, which are directly analogous to codebook size in conventional VQ. revision: yes

  2. Referee: Section 4.3, Table 3: The cross-domain results show Q2D2 outperforming baselines on music but only matching on speech. The paper does not report whether the same grid type and level counts were used across domains or whether per-domain grid selection was performed; if the latter, the comparison to 'state of the art' models that use fixed RVQ configurations becomes less direct.

    Authors: A single fixed grid configuration (hexagonal tiling with identical level counts per axis) was used for all domains. We will update Section 4.3 and the Table 3 caption to state this explicitly, thereby preserving the direct comparability with fixed-configuration baselines such as RVQ. revision: yes

  3. Referee: Section 5: The central assumption that paired latent features naturally align with one of the fixed tilings is not empirically tested. No scatter plots of feature pairs, no measurement of alignment error, and no ablation on random versus structured pairing are provided. If the joint distribution deviates significantly from the grid geometry, the nearest-grid assignment can introduce irreducible quantization noise that the decoder cannot fully compensate, weakening the claim that gains stem from geometry-aware correlation capture rather than from dimensionality reduction via pairing.

    Authors: We acknowledge the value of direct empirical support for the alignment assumption. The revised manuscript will include scatter plots of paired latent features, quantitative alignment-error statistics, and an ablation contrasting structured versus random pairing. These additions will clarify that performance improvements arise from geometry-aware correlation capture. Existing objective and subjective results already indicate that any residual quantization error is adequately compensated by the decoder. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with independent experimental validation

full rationale

The paper introduces Q2D2 as a geometric quantization scheme that projects paired latent features onto fixed 2D grids (hexagonal, rhombic, rectangular) to form an implicit codebook. All load-bearing claims rest on direct experimental comparisons of reconstruction metrics, codebook utilization, and token rates against RVQ/VQ/FSQ baselines across speech, audio, and music domains. No equations, derivations, or self-citations are shown that reduce any prediction or uniqueness result to a fitted parameter or prior author result by construction. The method is presented as a simple ansatz whose value is measured externally via ablation studies and objective/subjective scores, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unstated premise that 2D geometric regularity captures latent correlations better than independent per-dimension quantization; no explicit free parameters, axioms, or invented entities are named.

axioms (1)
  • domain assumption Structured 2D grids preserve feature correlations sufficiently for reconstruction
    Invoked implicitly when claiming efficiency gains from hexagonal/rhombic/rectangular tilings without loss of fidelity.

pith-pipeline@v0.9.0 · 5741 in / 1311 out tokens · 54600 ms · 2026-05-21T18:10:33.207914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    feature pairs are projected onto structured 2D grids—such as hexagonal, rhombic, or rectangular tiling—and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels

  • Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Q2D2 groups channels into pairs and maps them onto structured two-dimensional grids... This design achieve high utilization and robustness while introducing geometric structure that captures correlations between features

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 11 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  3. [3]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  4. [4]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

  5. [5]

    MusicLM: Generating Music From Text

    Andrea Agostinelli, Timo I. Denk, Zal \'a n Borsos, Matthew Sharifi, Jesse Engel, Marco Tagliasacchi, Lukas B \"u rgener, Oleg Rybakov, Santiago Castro, Neil Zeghidour, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

  6. [6]

    Apcodec: Advanced perceptual audio codec with convnextv2

    Sunghwan Ahn, Beom Jun Woo, Min Hyun Han, and Nam Soo Kim. Apcodec: Advanced perceptual audio codec with convnextv2. arXiv preprint arXiv:2407.12345, 2024 a

  7. [7]

    Hilcodec: High fidelity and lightweight neural audio codec

    Sunghwan Ahn, Beom Jun Woo, Min Hyun Han, Chanyeong Moon, and Nam Soo Kim. Hilcodec: High fidelity and lightweight neural audio codec. arXiv preprint arXiv:2405.04752, 2024 b

  8. [8]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024

  9. [9]

    Sorokin, and et al

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019

  10. [10]

    Data2vec: A general framework for self-supervised learning in speech, vision and language

    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pp.\ 1298--1312. PMLR, 2022

  11. [11]

    Slurp: A spoken language understanding resource package

    Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. Slurp: A spoken language understanding resource package. arXiv preprint arXiv:2011.13205, 2020

  12. [12]

    Audiomnist: Exploring explainable artificial intelligence for audio analysis on a simple benchmark

    S \"o ren Becker. Audiomnist: Exploring explainable artificial intelligence for audio analysis on a simple benchmark. Pattern Recognition Letters, 361, 2024

  13. [13]

    The mtg-jamendo dataset for automatic music tagging

    Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. The mtg-jamendo dataset for automatic music tagging. In Proceedings of the International Conference on Machine Learning (ICML), 2019

  14. [14]

    A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals

    Juan J Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp.\ 559--564, 2012

  15. [15]

    Language Models are Few-Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  16. [16]

    Exploring product quantization for neural image compression

    Tianshi Chen, Xin Chen, et al. Exploring product quantization for neural image compression. In European Conference on Computer Vision (ECCV), 2020

  17. [17]

    Emovo corpus: An italian emotional speech database

    Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, Massimiliano Todisco, et al. Emovo corpus: An italian emotional speech database. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp.\ 3501--3504. European Language Resources Association (ELRA), 2014

  18. [18]

    FMA: A Dataset For Music Analysis

    Micha \"e l Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016

  19. [19]

    High Fidelity Neural Audio Compression

    Alexandre D \'e fossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022. URL https://arxiv.org/abs/2210.13438

  20. [20]

    Moshi: A speech-text foundation model for real-time dialogue

    Alexandre D \'e fossez, Sertan Girgin, Gabriel Synnaeve, and Yossi Adi. Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2403.14187, 2024

  21. [21]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020

  22. [22]

    Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

    Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng. Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec. arXiv preprint arXiv:2309.07405, 2023

  23. [23]

    Product quantization for transformers

    Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze, and Herv \'e Jegou. Product quantization for transformers. In International Conference on Machine Learning (ICML), 2022. URL https://arxiv.org/abs/2209.14509

  24. [24]

    FSD50K: An Open Dataset of Human-Labeled Sound Events, volume 30

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: An Open Dataset of Human-Labeled Sound Events, volume 30. IEEE, 2021

  25. [25]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 776--780. IEEE, 2017

  26. [26]

    Vector quantization

    Robert M Gray. Vector quantization. IEEE ASSP Magazine, 1 0 (2): 0 4--29, 1984

  27. [27]

    Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

    Haoran He, Zhuo Shang, Chunyu Wang, Xudong Li, Yan Gu, Hong Hua, Lin Liu, Chao Yang, Jie Li, Peng Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp.\ 885--890. IEEE, 2024

  28. [28]

    The Variably Intense Vocalizations of Affect and Emotion (VIVAE) Corpus Prompts New Perspective on Nonspeech Perception, volume 22

    Natalie Holz, Pauline Larrouy-Maestri, and David Poeppel. The Variably Intense Vocalizations of Affect and Emotion (VIVAE) Corpus Prompts New Perspective on Nonspeech Perception, volume 22. American Psychological Association, 2022

  29. [29]

    HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, volume 29

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, volume 29. IEEE, 2021

  30. [30]

    Fewertokens: Efficient discrete representations for neural audio models

    Rongjie Huang, Chunlei Zhang, et al. Fewertokens: Efficient discrete representations for neural audio models. arXiv preprint arXiv:2309.11416, 2023. URL https://arxiv.org/abs/2309.11416

  31. [31]

    Make-a-voice: Revisiting voice large language models as scalable multilingual and multitask learners

    Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang, Ziyue Jiang, Xuankai Chang, et al. Make-a-voice: Revisiting voice large language models as scalable multilingual and multitask learners. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  32. [32]

    Repcodec: A speech representation codec for speech tokenization

    Zhichao Huang, Chutong Meng, and Tom Ko. Repcodec: A speech representation codec for speech tokenization. arXiv preprint arXiv:2309.00169, 2024 b

  33. [33]

    Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks

    Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842, 2023 a . URL https://arxiv.org/abs/2305.08842

  34. [34]

    Improving vector quantization for neural generative models

    Minyoung Huh, Dongyoon Lee, Jongmin Kim, et al. Improving vector quantization for neural generative models. arXiv preprint arXiv:2306.00960, 2023 b . URL https://arxiv.org/abs/2306.00960

  35. [35]

    BS.1534-1: Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems (MUSHRA)

    ITU-R . BS.1534-1: Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems (MUSHRA) . Technical Report BS.1534-1, International Telecommunication Union, Radiocommunication Sector, 2001. URL https://www.itu.int/rec/R-REC-BS.1534-1-200111-S/en

  36. [36]

    Language-codec: Reducing the gaps between discrete codec representation and speech language models

    Shengpeng Ji, Minghui Fang, Ziyue Jiang, Rongjie Huang, Jialong Zuo, Shulei Wang, and Zhou Zhao. Language-codec: Reducing the gaps between discrete codec representation and speech language models. arXiv preprint arXiv:2402.12208, 2024 a

  37. [37]

    Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling

    Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, and Zhou Zhao. Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532, 2024 b . URL https://arxiv.org/abs/...

  38. [38]

    Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias

    Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, et al. Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv preprint arXiv:2306.03509, 2023

  39. [39]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

    Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In Forty-first International Conference on Machine Learning, 2024

  40. [40]

    Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

    Eugene Kharitonov, Damien Vincent, Zal \'a n Borsos, et al. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. In Advances in Neural Information Processing Systems, 2023

  41. [41]

    arXiv preprint arXiv:2209.15352 doi:10.48550/arXiv.2209.15352

    Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, and Alexandre D \'e fossez. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022

  42. [42]

    High fidelity audio compression with improved rvqgan

    Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High fidelity audio compression with improved rvqgan. arXiv preprint arXiv:2306.06546, 2023. URL https://arxiv.org/abs/2306.06546

  43. [43]

    Benchmarking representations for speech, music, and acoustic events

    Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, and Sabato Marco Siniscalchi. Benchmarking representations for speech, music, and acoustic events. arXiv preprint arXiv:2405.00934, 2024

  44. [44]

    Adrian a \'n cucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alum \"a e, and Antoine Laurent. Robust training of vector quantized bottleneck models. In 2020 International Joint Conference on Neural Networks (IJCNN), pp.\ 1--7. IEEE, 2020. doi:10.1109/IJCNN48605.2020.9206750

  45. [45]

    High-resolution image synthesis with latent diffusion models

    Jacek a \'n cutki et al. High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2012.12877, 2020

  46. [46]

    Evaluation of algorithms using games: The case of music tagging

    Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp.\ 387--392, 2009

  47. [47]

    Residual quantization for learned image compression

    Hyun Lee, Soonyoung Cho, Youngjoon Lee, et al. Residual quantization for learned image compression. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. URL https://arxiv.org/abs/2203.02012

  48. [48]

    Single-codec: Single-codebook speech codec towards high-performance speech generation

    Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, and Zhifei Li. Single-codec: Single-codebook speech codec towards high-performance speech generation. arXiv preprint arXiv:2406.07422, 2024

  49. [49]

    Semanticodec: An ultra low bitrate semantic audio codec for general sound

    Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, and Mark D Plumbley. Semanticodec: An ultra low bitrate semantic audio codec for general sound. arXiv preprint arXiv:2405.00233, 2024

  50. [50]

    Livingstone and Frank A

    Steven R. Livingstone and Frank A. Russo. The ryerson audio-visual database of emotional speech and song (ravdess). Funding Information Natural Sciences and Engineering Research Council of Canada, 341583, 2012

  51. [51]

    Finite Scalar Quantization: VQ-VAE Made Simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. In Proc.\ of ICLR, 2024. URL https://arxiv.org/abs/2309.15505

  52. [52]

    Promptcodec: High-fidelity neural speech codec using disentangled representation learning based adaptive feature-aware prompt encoders

    Yu Pan, Lei Ma, and Jianjun Zhao. Promptcodec: High-fidelity neural speech codec using disentangled representation learning based adaptive feature-aware prompt encoders. arXiv preprint arXiv:2404.02702, 2024

  53. [53]

    & Khudanpur, S

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 5206--5210. IEEE, 2015. doi:10.1109/ICASSP.2015.7178964

  54. [54]

    Parker, Anton Smirnov, Jordi Pons, C

    Julian D. Parker, Anton Smirnov, Jordi Pons, C. Carr, Z. Zukowski, Z. Evans, and X. Liu. Scaling transformers for low-bitrate high-quality speech coding. arXiv preprint arXiv:2411.19842, 2024. URL https://arxiv.org/abs/2411.19842

  55. [55]

    Esc: Dataset for environmental sound classification

    Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pp.\ 1015--1018, 2015

  56. [56]

    MLS: A Large-Scale Multilingual Dataset for Speech Research

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020

  57. [57]

    The musdb18 corpus for music separation, 2017

    Zafar Rafii, Antoine Liutkus, Fabian-Robert St \"o ter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The musdb18 corpus for music separation, 2017. URL https://doi.org/10.5281/zenodo.1117372

  58. [58]

    Fewer-token neural speech codec with time-invariant codes (ticodec)

    Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chu Yuan Zhang, and Junzuo Zhou. Fewer-token neural speech codec with time-invariant codes (ticodec). In ICASSP 2024 -- IEEE International Conference on Acoustics, Speech and Signal Processing, pp.\ 12737--12741. IEEE, 2024

  59. [59]

    Perceptual evaluation of speech quality (pesq) - a new method for speech quality assessment of telephone networks and codecs

    Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (pesq) - a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (ICASSP), pp.\ 749--752. IEEE, 2001

  60. [60]

    Theory and Experiments on Vector Quantized Autoencoders

    Aurko Roy, Ashish Vaswani, Niki Parmar, and Yoshua Bengio. Theory and experiments on vector quantized autoencoders. In Workshop on Bayesian Deep Learning, NeurIPS, 2018. URL https://arxiv.org/abs/1805.11063

  61. [61]

    Utmos: Utokyo-sarulab system for voicemos challenge 2022

    Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. arXiv preprint arXiv:2204.02152, 2022

  62. [62]

    A dataset and taxonomy for urban sound research

    Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, pp.\ 1041--1044. ACM, 2014

  63. [63]

    Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis

    Hubert Siuzdak. Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. arXiv preprint arXiv:2306.00814, 2023

  64. [64]

    Voice activity detection and classification using convolutional neural networks

    Hubert Siuzdak, Pawe Drozdowski, Christian Rathgeb, and Christoph Busch. Voice activity detection and classification using convolutional neural networks. IEEE Access, 6: 0 2441--2450, 2018. doi:10.1109/ACCESS.2017.2786642

  65. [65]

    Taal, Richard C

    Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. In IEEE Transactions on Audio, Speech, and Language Processing, volume 19, pp.\ 2125--2136, 2011. doi:10.1109/TASL.2011.2114881

  66. [66]

    Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization,

    Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization. arXiv preprint arXiv:2205.07547, 2022 a . URL https://arxiv.org/abs/2205.07547

  67. [67]

    Preventing codebook collapse in vector-quantized models

    Yuichiro Takida et al. Preventing codebook collapse in vector-quantized models. arXiv preprint arXiv:2204.XXXXX, 2022 b

  68. [68]

    Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms.arXiv preprint arXiv:2407.04051,

    Tongyi SpeechTeam . Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arXiv preprint arXiv:2407.04051, 2024. URL https://arxiv.org/abs/2407.04051

  69. [69]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  70. [70]

    Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit, 2016

    Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit, 2016. URL https://datashare.ed.ac.uk/handle/10283/2651

  71. [71]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023

  72. [72]

    Hierarchical quantized autoencoders

    Will Williams, Maximilian Riesenhuber, et al. Hierarchical quantized autoencoders. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2007.08088

  73. [73]

    Audiodec: An open-source streaming high-fidelity neural audio codec

    Yi-Chiao Wu, Israel D Gebru, Dejan Marković, and Alexander Richard. Audiodec: An open-source streaming high-fidelity neural audio codec. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2023

  74. [74]

    Bigcodec: Pushing the limits of low-bitrate neural speech codec

    Detai Xin, Xu Tan, Shinnosuke Takamichi, and Hiroshi Saruwatari. Bigcodec: Pushing the limits of low-bitrate neural speech codec. arXiv preprint arXiv:2409.05377, 2024. URL https://arxiv.org/abs/2409.05377

  75. [75]

    Hifi-codec: Group-residual vector quantization for high fidelity audio codec

    Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, and Yuexian Zou. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023

  76. [76]

    Codec does matter: Exploring the semantic shortcoming of codec for audio language model

    Zhen Ye, Peihao Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jiajun Chen, Jun Pan, Qing Liu, et al. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. arXiv preprint arXiv:2408.17175, 2024. URL https://arxiv.org/abs/2408.17175

  77. [77]

    Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis

    Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, and Wei Xue. Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis. arXiv preprint arXiv:2502.04128, 202...

  78. [78]

    SoundStream: An End-to-End Neural Audio Codec, volume 30

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An End-to-End Neural Audio Codec, volume 30. IEEE, 2021

  79. [79]

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019

  80. [80]

    Anygpt: Unified multimodal llm with discrete sequence modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226, 2024

Showing first 80 references.