FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Chaoren Wang; Haizhou Li; Jiaqi Li; Junwen Qiu; Jun Zhang; Lu Lu; Mingjie Chen; Xiaohai Tian; Xinyu Liang; Xu Li

arxiv: 2606.31247 · v1 · pith:HLBQ7NJAnew · submitted 2026-06-30 · 💻 cs.SD · eess.AS

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Jiaqi Li , Chaoren Wang , Xiaohai Tian , Mingjie Chen , Xinyu Liang , Xu Li , Yufan Lin , Junwen Qiu

show 4 more authors

Jun Zhang Lu Lu Haizhou Li Zhizheng Wu

This is my paper

Pith reviewed 2026-07-01 03:46 UTC · model grok-4.3

classification 💻 cs.SD eess.AS

keywords spoken language modelsdynamic frame ratespeech codingcontrollable inferenceaudio tokenizersspeech-to-speechinference efficiency

0 comments

The pith

FlexiSLM is the first spoken language model with dynamic and controllable frame rates for both speech input and output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlexiSLM as a spoken language model that adapts its frame rate to the varying information density in speech. Existing models fix the rate at values like 25 Hz or 12.5 Hz, missing opportunities for efficiency and control. By incorporating dynamic frame rate techniques previously used only in audio tokenizers, FlexiSLM achieves better performance than other 7B models at high-quality settings and allows users to lower the rate for faster inference. This matters for applications needing a balance between speech quality and processing speed.

Core claim

FlexiSLM supports dynamic and controllable frame rates on both speech input and output. Using these representations, it outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at high-quality operating points. It can be steered down to 4.0 Hz, and at 6.25 Hz it halves inference time relative to 12.5 Hz while keeping strong speech-to-speech quality.

What carries the argument

Dynamic frame rate representations integrated into the spoken language model that enable variable rates on input and output.

If this is right

Outperforms fixed-frame-rate 7B models at high-quality operating points.
Can be steered accurately down to frame rates as low as 4.0 Hz.
At 6.25 Hz roughly halves inference time relative to 12.5 Hz while retaining strong quality.
Provides inference-time controllability to trade quality for speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dynamic-rate approach may extend to other multimodal language models that process non-uniform signals.
Inference controllability could support new real-time applications where latency targets are set by the user.

Load-bearing premise

Dynamic frame rate speech coding techniques from audio tokenizers transfer effectively when integrated into spoken language models.

What would settle it

Experiments in which FlexiSLM fails to outperform fixed-frame-rate models at equivalent high-quality operating points or in which lowering the frame rate produces no reduction in inference time.

Figures

Figures reproduced from arXiv: 2606.31247 by Chaoren Wang, Haizhou Li, Jiaqi Li, Junwen Qiu, Jun Zhang, Lu Lu, Mingjie Chen, Xiaohai Tian, Xinyu Liang, Xu Li, Yufan Lin, Zhizheng Wu.

**Figure 2.** Figure 2: Overall architecture of FlexiSLM. into 5 Hz or 6.25 Hz sequences). 3 Method 3.1 Architecture Overview [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Talker Transformer input-output structure. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of FlexiSLM audio outputs for the response “The Hubble Space Telescope was launched in [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

read the original abstract

Spoken language models (SLMs) extend LLMs to speech input and output. Existing SLMs represent speech at fixed frame rates (e.g., 25 or 12.5 Hz), ignoring the time-varying information density of speech and offering no flexibility to trade off quality for speed at inference time. Recent audio tokenizer research has proposed dynamic frame rate speech coding, which exploits this non-uniformity and enables two new capabilities: very low average frame rates and frame rate controllability. However, this technique has not yet been applied to SLMs. We introduce Flexible Spoken Language Model (FlexiSLM), the first SLM that supports dynamic and controllable frame rates on both speech input and output. Using dynamic frame rate representations, FlexiSLM outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at its high-quality operating points. We further verify that FlexiSLM can be accurately steered down to 4.0 Hz; at 6.25 Hz, it roughly halves inference time relative to 12.5 Hz while retaining strong speech-to-speech quality. Audio samples are available at https://flexislm.github.io .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlexiSLM shows how to move SLMs from fixed to dynamic frame rates on both input and output, but the abstract leaves the performance claims without enough supporting detail to judge their strength.

read the letter

The core advance is taking dynamic frame rate coding that already exists in audio tokenizers and wiring it into a full spoken language model so the rate can vary on the fly for both sides. This directly tackles the fact that speech does not carry information at a constant pace, and it adds a knob for trading quality against speed during inference.

The paper does a straightforward job of stating the problem with existing fixed-rate SLMs and showing that the tokenizer technique can be lifted over. The controllability result, down to 4 Hz with usable quality and roughly half the inference time at 6.25 Hz, is the part that would matter most to practitioners.

The soft spot is the lack of concrete evidence in the summary. It says FlexiSLM beats Qwen2.5-Omni and Kimi-Audio at high-quality points, yet gives no metrics, no training details, and no description of how the baselines were run. That makes it impossible to tell whether the gains come from the dynamic rates or from other differences in the model. If the full paper has the tables and ablations, this concern shrinks; if not, the central claim stays hard to evaluate.

The work is aimed at researchers building or optimizing SLMs who already follow the tokenizer literature. A reader who wants to try variable-rate speech modeling would get value from the idea even if the numbers need checking.

I would send it for peer review. The direction is clear and the transfer from tokenizers to SLMs is worth testing in public.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces FlexiSLM, the first spoken language model supporting dynamic and controllable frame rates on both speech input and output by integrating dynamic frame rate speech coding from audio tokenizers. It claims outperformance over fixed-frame-rate 7B models (Qwen2.5-Omni, Kimi-Audio) at high-quality operating points and demonstrates controllability down to 4.0 Hz, with 6.25 Hz halving inference time while retaining strong speech-to-speech quality. Audio samples are linked.

Significance. If the empirical results hold, the work is significant for spoken language models by enabling inference-time trade-offs between quality and speed via dynamic rates, extending tokenizer techniques to SLMs in a novel way and addressing a limitation of fixed-rate models.

minor comments (2)

Abstract lacks any mention of specific metrics, baselines, datasets, or experimental setup used to support the outperformance and controllability claims; this should be summarized briefly even in the abstract for clarity.
The manuscript should include a dedicated section or table detailing the exact frame rate operating points tested, corresponding quality metrics, and inference time measurements to allow direct comparison with the cited 7B baselines.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of FlexiSLM and the recommendation for minor revision. The review correctly identifies the core contributions around dynamic frame-rate controllability and its potential for quality-speed trade-offs in spoken language models. No major comments were provided in the report, so we have no specific points requiring response or rebuttal at this stage. We remain available to address any minor revisions or clarifications the editor may request.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents FlexiSLM as the first integration of dynamic frame rate speech coding (previously developed for audio tokenizers) into spoken language models for both input and output. Central claims rest on empirical outperformance versus fixed-rate baselines and controllability verification at specific frame rates (e.g., 4.0 Hz, 6.25 Hz), with no equations, fitted parameters, or self-citations that reduce any prediction or uniqueness claim to the inputs by construction. No self-definitional steps, fitted-input-as-prediction, or ansatz smuggling via citation are present in the provided abstract or description. The work is a straightforward architectural application with external benchmarks, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no specific free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5783 in / 1176 out tokens · 53678 ms · 2026-07-01T03:46:53.941828+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

241 extracted references · 6 canonical work pages · 1 internal anchor

[2]

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the twelfth language resources and evaluation conference, pages 4218--4222

2020
[5]

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. https://aclanthology.org/D13-1160/ Semantic parsing on F reebase from question-answer pairs . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533--1544, Seattle, Washington, USA. Association for Computational Linguistics

2013
[6]

Zal \'a n Borsos, Rapha \"e l Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and 1 others. 2023. Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31:2523--2533

2023
[7]

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377--390

2014
[17]

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio synthesis of musical notes with wavenet autoencoders. In International conference on machine learning, pages 1068--1077. PMLR

2017
[18]

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, and 1 others. 2024. E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 682--689. IEEE

2024
[20]

Yuan Gong, Jin Yu, and James Glass. 2022. Vocalsound: A dataset for improving human vocal sounds recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 151--155. IEEE

2022
[22]

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451--3460

2021
[23]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3

2022
[28]

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and 1 others. 2023. Voicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36:14005--14034

2023
[29]

Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, and Zhizheng Wu. 2025 a . Dualcodec: A low-frame-rate, semantically-enhanced neural audio codec for speech generation. In Proceedings of Interspeech 2025

2025
[32]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval

2023
[36]

Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Librispeech-pc: Benchmark for evaluation of punctuation and capitalization capabilities of end-to-end asr models. In 2023 IEEE automatic speech recognition and understanding workshop (ASRU), pages 1--7. IEEE

2023
[38]

Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. 2024. https://openreview.net/forum?id=izrOLJov5y Spoken question answering and speech continuation using spectrogram-powered LLM . In The Twelfth International Conference on Learning Repres...

2024
[39]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

2022
[40]

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206--5210. IEEE

2015
[41]

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 527--536

2019
[43]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

2023
[44]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728--53741

2023
[45]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505--3506

2020
[50]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

2023
[51]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

2017
[55]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025 a . https://arxiv.org/abs/2503.20215 Qwen2.5-omni technical report . Preprint, arXiv:2503.20215

Pith/arXiv arXiv 2025
[57]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2025 c . Magpie: Alignment data synthesis from scratch by prompting aligned LLM s with nothing. In The Thirteenth International Conference on Learning Representations

2025
[58]

Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. 2019. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92). The Rainbow Passage which the speakers read out can be found in the International Dialects of English Archive:(http://web. ku. edu/\ idea/readings/rainbow. htm)

2019
[60]

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495--507

2021
[65]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat : 1m ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations

2024
[69]

arXiv preprint arXiv:2504.08528 , year=

On the landscape of spoken language models: A comprehensive survey , author=. arXiv preprint arXiv:2504.08528 , year=

Pith/arXiv arXiv
[70]

arXiv preprint arXiv:2601.05543 , year=

Closing the Modality Reasoning Gap for Speech Large Language Models , author=. arXiv preprint arXiv:2601.05543 , year=

Pith/arXiv arXiv
[71]

arXiv preprint arXiv:2410.07168 , year=

Sylber: Syllabic embedding representation of speech from raw audio , author=. arXiv preprint arXiv:2410.07168 , year=

arXiv
[72]

arXiv preprint arXiv:2510.00981 , year=

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates , author=. arXiv preprint arXiv:2510.00981 , year=

arXiv
[73]

arXiv preprint arXiv:1308.3432 , year=

Estimating or propagating gradients through stochastic neurons for conditional computation , author=. arXiv preprint arXiv:1308.3432 , year=

Pith/arXiv arXiv
[74]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[75]

(No Title) , year=

TIMIT acoustic-phonetic continuous speech corpus , author=. (No Title) , year=
[76]

arXiv preprint arXiv:2406.06992 , year=

Scaling up masked audio encoder learning for general audio classification , author=. arXiv preprint arXiv:2406.06992 , year=

arXiv
[77]

arXiv preprint arXiv:2512.23808 , year=

MiMo-Audio: Audio Language Models are Few-Shot Learners , author=. arXiv preprint arXiv:2512.23808 , year=

arXiv
[78]

Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages=

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters , author=. Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages=
[79]

arXiv preprint arXiv:2411.18138 , year=

Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation , author=. arXiv preprint arXiv:2411.18138 , year=

arXiv
[80]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=
[81]

arXiv preprint arXiv:2012.03411 , year=

Mls: A large-scale multilingual dataset for speech research , author=. arXiv preprint arXiv:2012.03411 , year=

Pith/arXiv arXiv 2012
[82]

arXiv preprint arXiv:2508.15418 , year=

LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model , author=. arXiv preprint arXiv:2508.15418 , year=

arXiv
[83]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[84]

arXiv preprint arXiv:2505.16845 , year=

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate , author=. arXiv preprint arXiv:2505.16845 , year=

arXiv
[85]

2) , author=

Overview of the Amphion Toolkit (v0. 2) , author=. arXiv preprint arXiv:2501.15442 , year=

arXiv
[86]

2024 IEEE Spoken Language Technology Workshop (SLT) , pages=

Investigating neural audio codecs for speech language model-based speech generation , author=. 2024 IEEE Spoken Language Technology Workshop (SLT) , pages=. 2024 , organization=

2024
[87]

Advances in neural information processing systems , volume=

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis , author=. Advances in neural information processing systems , volume=
[88]

Advances in Neural Information Processing Systems , volume=

Neural networks fail to learn periodic functions and how to fix it , author=. Advances in Neural Information Processing Systems , volume=
[89]

arXiv preprint arXiv:2509.04685 , year=

Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding , author=. arXiv preprint arXiv:2509.04685 , year=

arXiv
[90]

2023 IEEE automatic speech recognition and understanding workshop (ASRU) , pages=

Librispeech-pc: Benchmark for evaluation of punctuation and capitalization capabilities of end-to-end asr models , author=. 2023 IEEE automatic speech recognition and understanding workshop (ASRU) , pages=. 2023 , organization=

2023
[91]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024
[92]

2024 IEEE Spoken Language Technology Workshop (SLT) , pages=

E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts , author=. 2024 IEEE Spoken Language Technology Workshop (SLT) , pages=. 2024 , organization=

2024
[93]

arXiv preprint arXiv:2505.17589 , year=

Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training , author=. arXiv preprint arXiv:2505.17589 , year=

Pith/arXiv arXiv
[94]

arXiv preprint arXiv:2504.07053 , year=

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling , author=. arXiv preprint arXiv:2504.07053 , year=

arXiv
[95]

arXiv preprint arXiv:2411.01156 , year=

Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis , author=. arXiv preprint arXiv:2411.01156 , year=

arXiv
[96]

ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Libri-light: A benchmark for asr with limited or no supervision , author=. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

2020
[97]

arXiv preprint arXiv:2508.16790 , year=

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling , author=. arXiv preprint arXiv:2508.16790 , year=

arXiv
[98]

arXiv preprint arXiv:2410.04029 , year=

Syllablelm: Learning coarse semantic units for speech language models , author=. arXiv preprint arXiv:2410.04029 , year=

arXiv
[99]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Sd-hubert: Sentence-level self-distillation induces syllabic organization in hubert , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024
[100]

arXiv preprint arXiv:2407.04051 , year=

Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms , author=. arXiv preprint arXiv:2407.04051 , year=

arXiv
[101]

1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings , volume=

Unit selection in a concatenative speech synthesis system using a large speech database , author=. 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings , volume=. 1996 , organization=

1996
[102]

arXiv preprint arXiv:2502.07243 , year=

Vevo: Controllable zero-shot voice imitation with self-supervised disentanglement , author=. arXiv preprint arXiv:2502.07243 , year=

arXiv
[103]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[104]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

A convnet for the 2020s , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[105]

arXiv preprint arXiv:1703.10135 , year=

Tacotron: Towards end-to-end speech synthesis , author=. arXiv preprint arXiv:1703.10135 , year=

Pith/arXiv arXiv
[106]

arXiv preprint arXiv:2406.07422 , year=

Single-codec: Single-codebook speech codec towards high-performance speech generation , author=. arXiv preprint arXiv:2406.07422 , year=

arXiv
[107]

arXiv preprint arXiv:1609.03499 , volume=

Wavenet: A generative model for raw audio , author=. arXiv preprint arXiv:1609.03499 , volume=

Pith/arXiv arXiv
[108]

arXiv preprint arXiv:2411.18803 , year=

Ts3-codec: Transformer-based simple streaming single codec , author=. arXiv preprint arXiv:2411.18803 , year=

arXiv
[109]

arXiv preprint arXiv:2506.23325 , year=

XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs , author=. arXiv preprint arXiv:2506.23325 , year=

arXiv
[110]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[111]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[112]

arXiv preprint arXiv:2402.08093 , year=

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data , author=. arXiv preprint arXiv:2402.08093 , year=

arXiv
[113]

arXiv preprint arXiv:2305.07243 , year=

Better speech synthesis through scaling , author=. arXiv preprint arXiv:2305.07243 , year=

arXiv
[114]

arXiv preprint arXiv:2207.12598 , year=

Classifier-free diffusion guidance , author=. arXiv preprint arXiv:2207.12598 , year=

Pith/arXiv arXiv
[115]

arXiv preprint arXiv:2306.17806 , year=

Stay on topic with classifier-free guidance , author=. arXiv preprint arXiv:2306.17806 , year=

arXiv
[116]

arXiv preprint arXiv:2306.00814 , year=

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis , author=. arXiv preprint arXiv:2306.00814 , year=

arXiv
[117]

Advances in Neural Information Processing Systems , volume=

High-fidelity audio compression with improved rvqgan , author=. Advances in Neural Information Processing Systems , volume=
[118]

arXiv preprint arXiv:2206.04658 , year=

Bigvgan: A universal neural vocoder with large-scale training , author=. arXiv preprint arXiv:2206.04658 , year=

arXiv
[119]

arXiv preprint arXiv:2307.09288 , year=

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

Pith/arXiv arXiv
[120]

arXiv preprint arXiv:2312.09911 , year=

Amphion: An Open-Source Audio, Music and Speech Generation Toolkit , author=. arXiv preprint arXiv:2312.09911 , year=

arXiv
[121]

arXiv preprint arXiv:2210.13438 , year=

High fidelity neural audio compression , author=. arXiv preprint arXiv:2210.13438 , year=

Pith/arXiv arXiv

Showing first 80 references.

[1] [2]

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the twelfth language resources and evaluation conference, pages 4218--4222

2020

[2] [5]

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. https://aclanthology.org/D13-1160/ Semantic parsing on F reebase from question-answer pairs . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533--1544, Seattle, Washington, USA. Association for Computational Linguistics

2013

[3] [6]

Zal \'a n Borsos, Rapha \"e l Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and 1 others. 2023. Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31:2523--2533

2023

[4] [7]

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377--390

2014

[5] [17]

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio synthesis of musical notes with wavenet autoencoders. In International conference on machine learning, pages 1068--1077. PMLR

2017

[6] [18]

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, and 1 others. 2024. E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 682--689. IEEE

2024

[7] [20]

Yuan Gong, Jin Yu, and James Glass. 2022. Vocalsound: A dataset for improving human vocal sounds recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 151--155. IEEE

2022

[8] [22]

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29:3451--3460

2021

[9] [23]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others. 2022. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3

2022

[10] [28]

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and 1 others. 2023. Voicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36:14005--14034

2023

[11] [29]

Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, and Zhizheng Wu. 2025 a . Dualcodec: A low-frame-rate, semantically-enhanced neural audio codec for speech generation. In Proceedings of Interspeech 2025

2025

[12] [32]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval

2023

[13] [36]

Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Librispeech-pc: Benchmark for evaluation of punctuation and capitalization capabilities of end-to-end asr models. In 2023 IEEE automatic speech recognition and understanding workshop (ASRU), pages 1--7. IEEE

2023

[14] [38]

Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. 2024. https://openreview.net/forum?id=izrOLJov5y Spoken question answering and speech continuation using spectrogram-powered LLM . In The Twelfth International Conference on Learning Repres...

2024

[15] [39]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

2022

[16] [40]

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206--5210. IEEE

2015

[17] [41]

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 527--536

2019

[18] [43]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

2023

[19] [44]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728--53741

2023

[20] [45]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505--3506

2020

[21] [50]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

2023

[22] [51]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30

2017

[23] [55]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025 a . https://arxiv.org/abs/2503.20215 Qwen2.5-omni technical report . Preprint, arXiv:2503.20215

Pith/arXiv arXiv 2025

[24] [57]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2025 c . Magpie: Alignment data synthesis from scratch by prompting aligned LLM s with nothing. In The Thirteenth International Conference on Learning Representations

2025

[25] [58]

Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. 2019. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92). The Rainbow Passage which the speakers read out can be found in the International Dialects of English Archive:(http://web. ku. edu/\ idea/readings/rainbow. htm)

2019

[26] [60]

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495--507

2021

[27] [65]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat : 1m ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations

2024

[28] [69]

arXiv preprint arXiv:2504.08528 , year=

On the landscape of spoken language models: A comprehensive survey , author=. arXiv preprint arXiv:2504.08528 , year=

Pith/arXiv arXiv

[29] [70]

arXiv preprint arXiv:2601.05543 , year=

Closing the Modality Reasoning Gap for Speech Large Language Models , author=. arXiv preprint arXiv:2601.05543 , year=

Pith/arXiv arXiv

[30] [71]

arXiv preprint arXiv:2410.07168 , year=

Sylber: Syllabic embedding representation of speech from raw audio , author=. arXiv preprint arXiv:2410.07168 , year=

arXiv

[31] [72]

arXiv preprint arXiv:2510.00981 , year=

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates , author=. arXiv preprint arXiv:2510.00981 , year=

arXiv

[32] [73]

arXiv preprint arXiv:1308.3432 , year=

Estimating or propagating gradients through stochastic neurons for conditional computation , author=. arXiv preprint arXiv:1308.3432 , year=

Pith/arXiv arXiv

[33] [74]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

[34] [75]

(No Title) , year=

TIMIT acoustic-phonetic continuous speech corpus , author=. (No Title) , year=

[35] [76]

arXiv preprint arXiv:2406.06992 , year=

Scaling up masked audio encoder learning for general audio classification , author=. arXiv preprint arXiv:2406.06992 , year=

arXiv

[36] [77]

arXiv preprint arXiv:2512.23808 , year=

MiMo-Audio: Audio Language Models are Few-Shot Learners , author=. arXiv preprint arXiv:2512.23808 , year=

arXiv

[37] [78]

Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages=

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters , author=. Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining , pages=

[38] [79]

arXiv preprint arXiv:2411.18138 , year=

Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation , author=. arXiv preprint arXiv:2411.18138 , year=

arXiv

[39] [80]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

[40] [81]

arXiv preprint arXiv:2012.03411 , year=

Mls: A large-scale multilingual dataset for speech research , author=. arXiv preprint arXiv:2012.03411 , year=

Pith/arXiv arXiv 2012

[41] [82]

arXiv preprint arXiv:2508.15418 , year=

LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model , author=. arXiv preprint arXiv:2508.15418 , year=

arXiv

[42] [83]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025

[43] [84]

arXiv preprint arXiv:2505.16845 , year=

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate , author=. arXiv preprint arXiv:2505.16845 , year=

arXiv

[44] [85]

2) , author=

Overview of the Amphion Toolkit (v0. 2) , author=. arXiv preprint arXiv:2501.15442 , year=

arXiv

[45] [86]

2024 IEEE Spoken Language Technology Workshop (SLT) , pages=

Investigating neural audio codecs for speech language model-based speech generation , author=. 2024 IEEE Spoken Language Technology Workshop (SLT) , pages=. 2024 , organization=

2024

[46] [87]

Advances in neural information processing systems , volume=

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis , author=. Advances in neural information processing systems , volume=

[47] [88]

Advances in Neural Information Processing Systems , volume=

Neural networks fail to learn periodic functions and how to fix it , author=. Advances in Neural Information Processing Systems , volume=

[48] [89]

arXiv preprint arXiv:2509.04685 , year=

Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding , author=. arXiv preprint arXiv:2509.04685 , year=

arXiv

[49] [90]

2023 IEEE automatic speech recognition and understanding workshop (ASRU) , pages=

Librispeech-pc: Benchmark for evaluation of punctuation and capitalization capabilities of end-to-end asr models , author=. 2023 IEEE automatic speech recognition and understanding workshop (ASRU) , pages=. 2023 , organization=

2023

[50] [91]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024

[51] [92]

2024 IEEE Spoken Language Technology Workshop (SLT) , pages=

E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts , author=. 2024 IEEE Spoken Language Technology Workshop (SLT) , pages=. 2024 , organization=

2024

[52] [93]

arXiv preprint arXiv:2505.17589 , year=

Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training , author=. arXiv preprint arXiv:2505.17589 , year=

Pith/arXiv arXiv

[53] [94]

arXiv preprint arXiv:2504.07053 , year=

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling , author=. arXiv preprint arXiv:2504.07053 , year=

arXiv

[54] [95]

arXiv preprint arXiv:2411.01156 , year=

Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis , author=. arXiv preprint arXiv:2411.01156 , year=

arXiv

[55] [96]

ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Libri-light: A benchmark for asr with limited or no supervision , author=. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

2020

[56] [97]

arXiv preprint arXiv:2508.16790 , year=

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling , author=. arXiv preprint arXiv:2508.16790 , year=

arXiv

[57] [98]

arXiv preprint arXiv:2410.04029 , year=

Syllablelm: Learning coarse semantic units for speech language models , author=. arXiv preprint arXiv:2410.04029 , year=

arXiv

[58] [99]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Sd-hubert: Sentence-level self-distillation induces syllabic organization in hubert , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

2024

[59] [100]

arXiv preprint arXiv:2407.04051 , year=

Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms , author=. arXiv preprint arXiv:2407.04051 , year=

arXiv

[60] [101]

1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings , volume=

Unit selection in a concatenative speech synthesis system using a large speech database , author=. 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings , volume=. 1996 , organization=

1996

[61] [102]

arXiv preprint arXiv:2502.07243 , year=

Vevo: Controllable zero-shot voice imitation with self-supervised disentanglement , author=. arXiv preprint arXiv:2502.07243 , year=

arXiv

[62] [103]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[63] [104]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

A convnet for the 2020s , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[64] [105]

arXiv preprint arXiv:1703.10135 , year=

Tacotron: Towards end-to-end speech synthesis , author=. arXiv preprint arXiv:1703.10135 , year=

Pith/arXiv arXiv

[65] [106]

arXiv preprint arXiv:2406.07422 , year=

Single-codec: Single-codebook speech codec towards high-performance speech generation , author=. arXiv preprint arXiv:2406.07422 , year=

arXiv

[66] [107]

arXiv preprint arXiv:1609.03499 , volume=

Wavenet: A generative model for raw audio , author=. arXiv preprint arXiv:1609.03499 , volume=

Pith/arXiv arXiv

[67] [108]

arXiv preprint arXiv:2411.18803 , year=

Ts3-codec: Transformer-based simple streaming single codec , author=. arXiv preprint arXiv:2411.18803 , year=

arXiv

[68] [109]

arXiv preprint arXiv:2506.23325 , year=

XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs , author=. arXiv preprint arXiv:2506.23325 , year=

arXiv

[69] [110]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

[70] [111]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016

[71] [112]

arXiv preprint arXiv:2402.08093 , year=

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data , author=. arXiv preprint arXiv:2402.08093 , year=

arXiv

[72] [113]

arXiv preprint arXiv:2305.07243 , year=

Better speech synthesis through scaling , author=. arXiv preprint arXiv:2305.07243 , year=

arXiv

[73] [114]

arXiv preprint arXiv:2207.12598 , year=

Classifier-free diffusion guidance , author=. arXiv preprint arXiv:2207.12598 , year=

Pith/arXiv arXiv

[74] [115]

arXiv preprint arXiv:2306.17806 , year=

Stay on topic with classifier-free guidance , author=. arXiv preprint arXiv:2306.17806 , year=

arXiv

[75] [116]

arXiv preprint arXiv:2306.00814 , year=

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis , author=. arXiv preprint arXiv:2306.00814 , year=

arXiv

[76] [117]

Advances in Neural Information Processing Systems , volume=

High-fidelity audio compression with improved rvqgan , author=. Advances in Neural Information Processing Systems , volume=

[77] [118]

arXiv preprint arXiv:2206.04658 , year=

Bigvgan: A universal neural vocoder with large-scale training , author=. arXiv preprint arXiv:2206.04658 , year=

arXiv

[78] [119]

arXiv preprint arXiv:2307.09288 , year=

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

Pith/arXiv arXiv

[79] [120]

arXiv preprint arXiv:2312.09911 , year=

Amphion: An Open-Source Audio, Music and Speech Generation Toolkit , author=. arXiv preprint arXiv:2312.09911 , year=

arXiv

[80] [121]

arXiv preprint arXiv:2210.13438 , year=

High fidelity neural audio compression , author=. arXiv preprint arXiv:2210.13438 , year=

Pith/arXiv arXiv