Two-Dimensional Quantization for Geometry-Aware Audio Coding
Pith reviewed 2026-05-21 18:10 UTC · model grok-4.3
The pith
Projecting pairs of audio features onto fixed 2D grids yields an implicit codebook that raises codebook utilization and lowers token rates while matching state-of-the-art reconstruction quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Q2D2 projects each pair of latent features onto a chosen 2D tiling (hexagonal, rhombic, or rectangular) and replaces the pair with the nearest grid coordinates, thereby defining an implicit codebook whose cardinality is the product of the per-axis level counts; the decoder then reconstructs from these quantized coordinates.
What carries the argument
Two-Dimensional Quantization (Q2D2), which maps paired latent vectors onto a fixed 2D grid and quantizes them jointly to the nearest grid point.
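The mechanism can be illustrated with a minimal sketch for the rectangular case only. Everything here is an assumption made for illustration: the function name `quantize_pair_rect`, the tanh bounding (borrowed from FSQ-style quantizers), the [-1, 1] grid range, and the row-major token layout are not taken from the paper, which may define the projection differently.

```python
import math

def quantize_pair_rect(x, y, levels_x=8, levels_y=8):
    """Snap one feature pair to a levels_x-by-levels_y rectangular grid.

    Hypothetical sketch: returns the quantized pair and a single token id.
    """
    def snap(v, levels):
        # Bound the raw feature to (-1, 1); tanh is an assumed choice.
        bounded = math.tanh(v)
        # Rescale to [0, levels - 1] and round to the nearest level index.
        index = round((bounded + 1) / 2 * (levels - 1))
        index = min(max(index, 0), levels - 1)
        # Map the level index back to a coordinate in [-1, 1].
        return index, 2 * index / (levels - 1) - 1

    ix, qx = snap(x, levels_x)
    iy, qy = snap(y, levels_y)
    # Implicit codebook: the token id is computed, never looked up.
    token = ix * levels_y + iy
    return qx, qy, token
```

With the default 8 levels per axis, every pair maps to one of 64 token ids. On a rectangular grid, rounding each axis independently coincides with the Euclidean nearest grid point; that equivalence does not carry over to hexagonal or rhombic tilings.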
If this is right
- Token rate can be reduced while reconstruction quality stays at or above current neural codec levels.
- Codebook utilization rises because every grid point is reachable and the grid geometry encourages even occupancy.
- Correlations between adjacent latent dimensions are captured directly by the joint 2D quantization step.
- The same grid-based scheme works across speech, general audio, and music without domain-specific tuning.
- Ablation results show that grid shape and pairing strategy each contribute measurably to the efficiency gain.
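The utilization claim in the list above can be checked with a simple measurement. The sketch below is an assumption about how one might compute it, not the paper's evaluation code: `codebook_usage` is a hypothetical helper that reports the fraction of codes that appear at least once and the entropy of the empirical token distribution, normalized so that 1.0 means perfectly even occupancy.

```python
import math
from collections import Counter

def codebook_usage(tokens, codebook_size):
    """Return (utilization, normalized_entropy) for a sequence of token ids."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    counts = Counter(tokens)
    total = len(tokens)
    # Fraction of the codebook that is used at least once.
    utilization = len(counts) / codebook_size
    # Shannon entropy in bits of the empirical token distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Divide by the maximum possible entropy, log2(codebook_size).
    return utilization, entropy / math.log2(codebook_size)
```

Utilization alone can hide skew: a codec that touches every code once but uses a single code for most frames has utilization 1.0 and low normalized entropy, so reporting both numbers gives a fuller picture.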
Where Pith is reading between the lines
- The geometric projection may make the latent space more interpretable, because each quantized pair corresponds to a visible location on a regular lattice.
- Similar 2D-grid quantization could be applied to image or video codecs where spatial or temporal correlations dominate.
- Because the codebook is implicit and generated on the fly, memory footprint for very large effective vocabularies may shrink compared with explicit lookup tables.
- Testing the method on longer audio sequences or on streaming settings would reveal whether the fixed-grid assumption holds when temporal context grows.
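The memory point in the list above rests on the codebook being computable rather than stored. As a hedged illustration, the sketch below decodes a token id back to grid coordinates with integer arithmetic alone. The function name `token_to_point`, the row-major token layout, and the [-1, 1] coordinate range are assumptions made here, not details from the paper.

```python
def token_to_point(token, levels_x=8, levels_y=8):
    """Recover rectangular-grid coordinates in [-1, 1]^2 from a token id.

    No lookup table is stored; the coordinates are derived from the id.
    """
    if not 0 <= token < levels_x * levels_y:
        raise ValueError("token out of range")
    # Assumed row-major layout: token = ix * levels_y + iy.
    ix, iy = divmod(token, levels_y)
    qx = 2 * ix / (levels_x - 1) - 1
    qy = 2 * iy / (levels_y - 1) - 1
    return qx, qy
```

An explicit VQ codebook with K entries of dimension d stores K times d floats, whereas this decoder stores only the two level counts, whatever their product.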
Load-bearing premise
Projecting feature pairs onto fixed 2D grids preserves the correlations needed for high-fidelity reconstruction without introducing distortions that the decoder cannot compensate for.
What would settle it
Training an otherwise identical codec with Q2D2 on a standard speech or music benchmark and observing both lower codebook utilization and worse perceptual quality than an RVQ baseline at the same token rate would falsify the central claim.
Original abstract
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization, and token rate. In this paper we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization, while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance on various objective and subjective reconstruction metrics across extensive experiments in the speech, audio, and music domains, compared to state-of-the-art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Two-Dimensional Quantization (Q2D2) for neural audio coding. In this scheme, pairs of latent features are projected onto one of several structured 2D grids (hexagonal, rhombic, or rectangular) and assigned to the nearest grid point. This defines an implicit codebook whose cardinality is the product of the number of levels along each grid axis. The authors report that Q2D2 achieves high codebook utilization and low token rates while delivering reconstruction quality that is competitive with or better than existing RVQ-, VQ-, and FSQ-based codecs on speech, audio, and music tasks, backed by objective metrics, subjective listening tests, and ablation experiments.
Significance. Should the geometric projection prove robust across different latent distributions, the approach offers a lightweight, interpretable alternative to learned vector quantizers. Its simplicity and the reported high utilization could make it attractive for resource-constrained audio compression pipelines. The explicit use of tiling geometry also opens a line of inquiry into how latent-space structure can be imposed rather than discovered.
major comments (3)
- Section 3.1: The projection operator onto the 2D grid is described only at a high level. No explicit formula is given for the distance metric (Euclidean, Manhattan, or grid-specific) or for the coordinate transformation that maps paired features to the chosen tiling. Without this, it is impossible to verify whether the quantization is truly 'parameter-free' or whether the choice of grid introduces hidden hyperparameters.
- Section 4.3, Table 3: The cross-domain results show Q2D2 outperforming baselines on music but only matching on speech. The paper does not report whether the same grid type and level counts were used across domains or whether per-domain grid selection was performed; if the latter, the comparison to 'state of the art' models that use fixed RVQ configurations becomes less direct.
- Section 5: The central assumption that paired latent features naturally align with one of the fixed tilings is not empirically tested. No scatter plots of feature pairs, no measurement of alignment error, and no ablation on random versus structured pairing are provided. If the joint distribution deviates significantly from the grid geometry, the nearest-grid assignment can introduce irreducible quantization noise that the decoder cannot fully compensate, weakening the claim that gains stem from geometry-aware correlation capture rather than from dimensionality reduction via pairing.
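To make the first major comment concrete, here is one possible definition of nearest-point assignment on a hexagonal lattice under the Euclidean metric. It is a sketch under stated assumptions, not the paper's operator: the basis vectors a1 = (1, 0) and a2 = (1/2, sqrt(3)/2), the `spacing` parameter, and the function name `nearest_hex_point` are all introduced here, and the lattice is left unbounded, whereas a finite codebook would also need the indices clamped to the allowed levels.

```python
import math

SQRT3_OVER_2 = math.sqrt(3) / 2

def nearest_hex_point(x, y, spacing=1.0):
    """Return ((i, j), (px, py)): the Euclidean-nearest point of the lattice
    spacing * (i * a1 + j * a2), with a1 = (1, 0) and a2 = (1/2, sqrt(3)/2)."""
    # Express the input in lattice coordinates (u, v).
    v = y / (spacing * SQRT3_OVER_2)
    u = x / spacing - 0.5 * v
    i0, j0 = math.floor(u), math.floor(v)
    best = None
    # The input lies in the cell with corners (i0, j0) .. (i0 + 1, j0 + 1).
    # That cell is two equilateral triangles, so the nearest lattice point
    # is always one of its four corners.
    for i in (i0, i0 + 1):
        for j in (j0, j0 + 1):
            px = spacing * (i + 0.5 * j)
            py = spacing * SQRT3_OVER_2 * j
            d2 = (x - px) ** 2 + (y - py) ** 2
            if best is None or d2 < best[0]:
                best = (d2, (i, j), (px, py))
    return best[1], best[2]
```

Even this small sketch exposes choices that act like hyperparameters: the lattice spacing, the lattice orientation, and whether distance is measured before or after the change of coordinates. Rounding u and v independently would give a different, non-Euclidean assignment, which is why the referee's request for an explicit formula matters.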
minor comments (2)
- Abstract: The phrase 'implicit codebook defined by the product of grid levels' would benefit from a short parenthetical example (e.g., 8×8 = 64) to clarify the scaling.
- Figure 2: The caption does not indicate whether the visualized grids are to scale or schematic; adding a note on the actual spacing used in experiments would improve clarity.
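The scaling mentioned in the first minor comment can be spelled out as a short worked example. The symbols L_x and L_y (levels per grid axis) and the calligraphic C (implicit codebook) are notation introduced here for illustration and may not match the paper's own.

```latex
|\mathcal{C}| = L_x \cdot L_y, \qquad
L_x = L_y = 8 \;\Rightarrow\; |\mathcal{C}| = 8 \times 8 = 64, \qquad
\log_2 |\mathcal{C}| = \log_2 64 = 6 \ \text{bits per quantized pair}.
```

Doubling the levels on both axes quadruples the codebook (16 × 16 = 256) but adds only 2 bits per pair.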
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing clarifications where appropriate and outlining planned revisions to improve the manuscript.
- Referee: Section 3.1: The projection operator onto the 2D grid is described only at a high level. No explicit formula is given for the distance metric (Euclidean, Manhattan, or grid-specific) or for the coordinate transformation that maps paired features to the chosen tiling. Without this, it is impossible to verify whether the quantization is truly 'parameter-free' or whether the choice of grid introduces hidden hyperparameters.
  Authors: We agree that Section 3.1 would benefit from greater mathematical detail. In the revised manuscript we will add the explicit projection formulas for each grid (hexagonal, rhombic, rectangular), specifying that nearest-grid assignment uses Euclidean distance in the appropriately transformed coordinates. The transformations are deterministic and contain no learnable parameters, so Q2D2 remains parameter-free apart from the discrete design choices of grid type and level counts, which are directly analogous to codebook size in conventional VQ.
  Revision: yes
- Referee: Section 4.3, Table 3: The cross-domain results show Q2D2 outperforming baselines on music but only matching on speech. The paper does not report whether the same grid type and level counts were used across domains or whether per-domain grid selection was performed; if the latter, the comparison to 'state of the art' models that use fixed RVQ configurations becomes less direct.
  Authors: A single fixed grid configuration (hexagonal tiling with identical level counts per axis) was used for all domains. We will update Section 4.3 and the Table 3 caption to state this explicitly, thereby preserving the direct comparability with fixed-configuration baselines such as RVQ.
  Revision: yes
- Referee: Section 5: The central assumption that paired latent features naturally align with one of the fixed tilings is not empirically tested. No scatter plots of feature pairs, no measurement of alignment error, and no ablation on random versus structured pairing are provided. If the joint distribution deviates significantly from the grid geometry, the nearest-grid assignment can introduce irreducible quantization noise that the decoder cannot fully compensate, weakening the claim that gains stem from geometry-aware correlation capture rather than from dimensionality reduction via pairing.
  Authors: We acknowledge the value of direct empirical support for the alignment assumption. The revised manuscript will include scatter plots of paired latent features, quantitative alignment-error statistics, and an ablation contrasting structured versus random pairing. These additions will clarify that performance improvements arise from geometry-aware correlation capture. Existing objective and subjective results already indicate that any residual quantization error is adequately compensated by the decoder.
  Revision: yes
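One way the promised alignment-error statistic could be computed is sketched below. This is an assumption about what such a statistic might look like, not the authors' definition: `mean_squared_alignment_error` and the stand-in quantizer `snap_to_square_grid` (a uniform square grid with a hypothetical `step`) are names introduced here, and any other grid quantizer with the same call signature could be passed in instead.

```python
def snap_to_square_grid(x, y, step=0.25):
    """Stand-in quantizer: round each coordinate to a multiple of `step`."""
    return round(x / step) * step, round(y / step) * step

def mean_squared_alignment_error(pairs, quantize=snap_to_square_grid):
    """Average squared Euclidean distance between pairs and their grid points."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    total = 0.0
    for x, y in pairs:
        qx, qy = quantize(x, y)
        total += (x - qx) ** 2 + (y - qy) ** 2
    return total / len(pairs)
```

Computing this number for structured and for randomly shuffled channel pairings, on the same latents and the same grid, would give a like-for-like comparison of the kind the referee requests.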
Circularity Check
No circularity: empirical method proposal with independent experimental validation
The paper introduces Q2D2 as a geometric quantization scheme that projects paired latent features onto fixed 2D grids (hexagonal, rhombic, rectangular) to form an implicit codebook. All load-bearing claims rest on direct experimental comparisons of reconstruction metrics, codebook utilization, and token rates against RVQ/VQ/FSQ baselines across speech, audio, and music domains. No equations, derivations, or self-citations are shown that reduce any prediction or uniqueness result to a fitted parameter or prior author result by construction. The method is presented as a simple ansatz whose value is measured externally via ablation studies and objective/subjective scores, making the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Structured 2D grids preserve feature correlations sufficiently for reconstruction.
Lean theorems connected to this paper
- Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear. Paper passage:
feature pairs are projected onto structured 2D grids—such as hexagonal, rhombic, or rectangular tiling—and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels
- Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear. Paper passage:
Q2D2 groups channels into pairs and maps them onto structured two-dimensional grids... This design achieve high utilization and robustness while introducing geometric structure that captures correlations between features
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[3]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[4]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
-
[5]
MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I. Denk, Zal \'a n Borsos, Matthew Sharifi, Jesse Engel, Marco Tagliasacchi, Lukas B \"u rgener, Oleg Rybakov, Santiago Castro, Neil Zeghidour, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Apcodec: Advanced perceptual audio codec with convnextv2
Sunghwan Ahn, Beom Jun Woo, Min Hyun Han, and Nam Soo Kim. Apcodec: Advanced perceptual audio codec with convnextv2. arXiv preprint arXiv:2407.12345, 2024 a
-
[7]
Hilcodec: High fidelity and lightweight neural audio codec
Sunghwan Ahn, Beom Jun Woo, Min Hyun Han, Chanyeong Moon, and Nam Soo Kim. Hilcodec: High fidelity and lightweight neural audio codec. arXiv preprint arXiv:2405.04752, 2024 b
-
[8]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019
-
[10]
Data2vec: A general framework for self-supervised learning in speech, vision and language
Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pp.\ 1298--1312. PMLR, 2022
work page 2022
-
[11]
Slurp: A spoken language understanding resource package
Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. Slurp: A spoken language understanding resource package. arXiv preprint arXiv:2011.13205, 2020
-
[12]
Audiomnist: Exploring explainable artificial intelligence for audio analysis on a simple benchmark
S \"o ren Becker. Audiomnist: Exploring explainable artificial intelligence for audio analysis on a simple benchmark. Pattern Recognition Letters, 361, 2024
work page 2024
-
[13]
The mtg-jamendo dataset for automatic music tagging
Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. The mtg-jamendo dataset for automatic music tagging. In Proceedings of the International Conference on Machine Learning (ICML), 2019
work page 2019
-
[14]
Juan J Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp.\ 559--564, 2012
work page 2012
-
[15]
Language Models are Few-Shot Learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[16]
Exploring product quantization for neural image compression
Tianshi Chen, Xin Chen, et al. Exploring product quantization for neural image compression. In European Conference on Computer Vision (ECCV), 2020
work page 2020
-
[17]
Emovo corpus: An italian emotional speech database
Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, Massimiliano Todisco, et al. Emovo corpus: An italian emotional speech database. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp.\ 3501--3504. European Language Resources Association (ELRA), 2014
work page 2014
-
[18]
FMA: A Dataset For Music Analysis
Micha \"e l Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[19]
High Fidelity Neural Audio Compression
Alexandre D \'e fossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022. URL https://arxiv.org/abs/2210.13438
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
Moshi: A speech-text foundation model for real-time dialogue
Alexandre D \'e fossez, Sertan Girgin, Gabriel Synnaeve, and Yossi Adi. Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2403.14187, 2024
-
[21]
Jukebox: A Generative Model for Music
Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[22]
Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec
Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng. Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec. arXiv preprint arXiv:2309.07405, 2023
-
[23]
Product quantization for transformers
Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze, and Herv \'e Jegou. Product quantization for transformers. In International Conference on Machine Learning (ICML), 2022. URL https://arxiv.org/abs/2209.14509
-
[24]
FSD50K: An Open Dataset of Human-Labeled Sound Events, volume 30
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: An Open Dataset of Human-Labeled Sound Events, volume 30. IEEE, 2021
work page 2021
-
[25]
Audio set: An ontology and human-labeled dataset for audio events
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 776--780. IEEE, 2017
work page 2017
-
[26]
Robert M Gray. Vector quantization. IEEE ASSP Magazine, 1 0 (2): 0 4--29, 1984
work page 1984
-
[27]
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation
Haoran He, Zhuo Shang, Chunyu Wang, Xudong Li, Yan Gu, Hong Hua, Lin Liu, Chao Yang, Jie Li, Peng Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp.\ 885--890. IEEE, 2024
work page 2024
-
[28]
Natalie Holz, Pauline Larrouy-Maestri, and David Poeppel. The Variably Intense Vocalizations of Affect and Emotion (VIVAE) Corpus Prompts New Perspective on Nonspeech Perception, volume 22. American Psychological Association, 2022
work page 2022
-
[29]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, volume 29. IEEE, 2021
work page 2021
-
[30]
Fewertokens: Efficient discrete representations for neural audio models
Rongjie Huang, Chunlei Zhang, et al. Fewertokens: Efficient discrete representations for neural audio models. arXiv preprint arXiv:2309.11416, 2023. URL https://arxiv.org/abs/2309.11416
-
[31]
Make-a-voice: Revisiting voice large language models as scalable multilingual and multitask learners
Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Jinchuan Tian, Zhenhui Ye, Luping Liu, Zehan Wang, Ziyue Jiang, Xuankai Chang, et al. Make-a-voice: Revisiting voice large language models as scalable multilingual and multitask learners. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...
work page 2024
-
[32]
Repcodec: A speech representation codec for speech tokenization
Zhichao Huang, Chutong Meng, and Tom Ko. Repcodec: A speech representation codec for speech tokenization. arXiv preprint arXiv:2309.00169, 2024 b
-
[33]
Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842, 2023 a . URL https://arxiv.org/abs/2305.08842
-
[34]
Improving vector quantization for neural generative models
Minyoung Huh, Dongyoon Lee, Jongmin Kim, et al. Improving vector quantization for neural generative models. arXiv preprint arXiv:2306.00960, 2023 b . URL https://arxiv.org/abs/2306.00960
-
[35]
ITU-R . BS.1534-1: Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems (MUSHRA) . Technical Report BS.1534-1, International Telecommunication Union, Radiocommunication Sector, 2001. URL https://www.itu.int/rec/R-REC-BS.1534-1-200111-S/en
work page 2001
-
[36]
Language-codec: Reducing the gaps between discrete codec representation and speech language models
Shengpeng Ji, Minghui Fang, Ziyue Jiang, Rongjie Huang, Jialong Zuo, Shulei Wang, and Zhou Zhao. Language-codec: Reducing the gaps between discrete codec representation and speech language models. arXiv preprint arXiv:2402.12208, 2024 a
-
[37]
Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling
Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, and Zhou Zhao. Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532, 2024 b . URL https://arxiv.org/abs/...
-
[38]
Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias
Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, et al. Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv preprint arXiv:2306.03509, 2023
-
[39]
Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[40]
Speak, read and prompt: High-fidelity text-to-speech with minimal supervision
Eugene Kharitonov, Damien Vincent, Zal \'a n Borsos, et al. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[41]
arXiv preprint arXiv:2209.15352 doi:10.48550/arXiv.2209.15352
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, and Alexandre D \'e fossez. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022
-
[42]
High fidelity audio compression with improved rvqgan
Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High fidelity audio compression with improved rvqgan. arXiv preprint arXiv:2306.06546, 2023. URL https://arxiv.org/abs/2306.06546
-
[43]
Benchmarking representations for speech, music, and acoustic events
Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, and Sabato Marco Siniscalchi. Benchmarking representations for speech, music, and acoustic events. arXiv preprint arXiv:2405.00934, 2024
-
[44]
Adrian a \'n cucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alum \"a e, and Antoine Laurent. Robust training of vector quantized bottleneck models. In 2020 International Joint Conference on Neural Networks (IJCNN), pp.\ 1--7. IEEE, 2020. doi:10.1109/IJCNN48605.2020.9206750
-
[45]
High-resolution image synthesis with latent diffusion models
Jacek a \'n cutki et al. High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2012.12877, 2020
-
[46]
Evaluation of algorithms using games: The case of music tagging
Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp.\ 387--392, 2009
work page 2009
-
[47]
Residual quantization for learned image compression
Hyun Lee, Soonyoung Cho, Youngjoon Lee, et al. Residual quantization for learned image compression. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. URL https://arxiv.org/abs/2203.02012
-
[48]
Single-codec: Single-codebook speech codec towards high-performance speech generation
Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, and Zhifei Li. Single-codec: Single-codebook speech codec towards high-performance speech generation. arXiv preprint arXiv:2406.07422, 2024
-
[49]
Semanticodec: An ultra low bitrate semantic audio codec for general sound
Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, and Mark D Plumbley. Semanticodec: An ultra low bitrate semantic audio codec for general sound. arXiv preprint arXiv:2405.00233, 2024
-
[50]
Steven R. Livingstone and Frank A. Russo. The ryerson audio-visual database of emotional speech and song (ravdess). Funding Information Natural Sciences and Engineering Research Council of Canada, 341583, 2012
work page 2012
-
[51]
Finite Scalar Quantization: VQ-VAE Made Simple
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. In Proc.\ of ICLR, 2024. URL https://arxiv.org/abs/2309.15505
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Yu Pan, Lei Ma, and Jianjun Zhao. Promptcodec: High-fidelity neural speech codec using disentangled representation learning based adaptive feature-aware prompt encoders. arXiv preprint arXiv:2404.02702, 2024
-
[53]
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 5206--5210. IEEE, 2015. doi:10.1109/ICASSP.2015.7178964
-
[54]
Parker, Anton Smirnov, Jordi Pons, C
Julian D. Parker, Anton Smirnov, Jordi Pons, C. Carr, Z. Zukowski, Z. Evans, and X. Liu. Scaling transformers for low-bitrate high-quality speech coding. arXiv preprint arXiv:2411.19842, 2024. URL https://arxiv.org/abs/2411.19842
-
[55]
Esc: Dataset for environmental sound classification
Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pp.\ 1015--1018, 2015
work page 2015
-
[56]
MLS: A Large-Scale Multilingual Dataset for Speech Research
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[57]
The musdb18 corpus for music separation, 2017
Zafar Rafii, Antoine Liutkus, Fabian-Robert St \"o ter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The musdb18 corpus for music separation, 2017. URL https://doi.org/10.5281/zenodo.1117372
-
[58]
Fewer-token neural speech codec with time-invariant codes (ticodec)
Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chu Yuan Zhang, and Junzuo Zhou. Fewer-token neural speech codec with time-invariant codes (ticodec). In ICASSP 2024 -- IEEE International Conference on Acoustics, Speech and Signal Processing, pp.\ 12737--12741. IEEE, 2024
work page 2024
-
[59]
Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (pesq) - a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (ICASSP), pp.\ 749--752. IEEE, 2001
work page 2001
-
[60]
Theory and Experiments on Vector Quantized Autoencoders
Aurko Roy, Ashish Vaswani, Niki Parmar, and Yoshua Bengio. Theory and experiments on vector quantized autoencoders. In Workshop on Bayesian Deep Learning, NeurIPS, 2018. URL https://arxiv.org/abs/1805.11063
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[61]
Utmos: Utokyo-sarulab system for voicemos challenge 2022
Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. arXiv preprint arXiv:2204.02152, 2022
-
[62]
A dataset and taxonomy for urban sound research
Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, pp.\ 1041--1044. ACM, 2014
work page 2014
-
[63]
Hubert Siuzdak. Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. arXiv preprint arXiv:2306.00814, 2023
-
[64]
Voice activity detection and classification using convolutional neural networks
Hubert Siuzdak, Pawe Drozdowski, Christian Rathgeb, and Christoph Busch. Voice activity detection and classification using convolutional neural networks. IEEE Access, 6: 0 2441--2450, 2018. doi:10.1109/ACCESS.2017.2786642
-
[65]
Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. In IEEE Transactions on Audio, Speech, and Language Processing, volume 19, pp.\ 2125--2136, 2011. doi:10.1109/TASL.2011.2114881
-
[66]
Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization,
Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. Sq-vae: Variational bayes on discrete representation with self-annealed stochastic quantization. arXiv preprint arXiv:2205.07547, 2022 a . URL https://arxiv.org/abs/2205.07547
-
[67]
Preventing codebook collapse in vector-quantized models
Yuichiro Takida et al. Preventing codebook collapse in vector-quantized models. arXiv preprint arXiv:2204.XXXXX, 2022b
-
[68]
Tongyi SpeechTeam. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arXiv preprint arXiv:2407.04051, 2024. URL https://arxiv.org/abs/2407.04051
-
[69]
Neural discrete representation learning
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017
-
[70]
Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit, 2016
Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit, 2016. URL https://datashare.ed.ac.uk/handle/10283/2651
-
[71]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023
-
[72]
Hierarchical quantized autoencoders
Will Williams, Maximilian Riesenhuber, et al. Hierarchical quantized autoencoders. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2007.08088
-
[73]
Audiodec: An open-source streaming high-fidelity neural audio codec
Yi-Chiao Wu, Israel D Gebru, Dejan Marković, and Alexander Richard. Audiodec: An open-source streaming high-fidelity neural audio codec. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5. IEEE, 2023
-
[74]
Bigcodec: Pushing the limits of low-bitrate neural speech codec
Detai Xin, Xu Tan, Shinnosuke Takamichi, and Hiroshi Saruwatari. Bigcodec: Pushing the limits of low-bitrate neural speech codec. arXiv preprint arXiv:2409.05377, 2024. URL https://arxiv.org/abs/2409.05377
-
[75]
Hifi-codec: Group-residual vector quantization for high fidelity audio codec
Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, and Yuexian Zou. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023
-
[76]
Codec does matter: Exploring the semantic shortcoming of codec for audio language model
Zhen Ye, Peihao Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jiajun Chen, Jun Pan, Qing Liu, et al. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. arXiv preprint arXiv:2408.17175, 2024. URL https://arxiv.org/abs/2408.17175
-
[77]
Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis
Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, and Wei Xue. Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis. arXiv preprint arXiv:2502.04128, 202...
-
[78]
SoundStream: An End-to-End Neural Audio Codec
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 30, 2021
-
[79]
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019
-
[80]
Anygpt: Unified multimodal llm with discrete sequence modeling
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226, 2024