pith. the verified trust layer for science.

arxiv: 2509.22220 · v2 · submitted 2025-09-26 · 💻 cs.CL · cs.SD

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Pith reviewed 2026-05-18 12:53 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords semantic speech tokenizer · noise robustness · token stability · SpeechLLM · bit-wise voting · multi-branch architecture · Unit Edit Distance

The pith

StableToken uses multi-branch audio processing and bit-wise voting to produce semantic speech tokens that stay consistent under noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing semantic speech tokenizers change their output sequences sharply even under mild acoustic noise that leaves speech perfectly intelligible to humans, which raises the learning burden for downstream language models. StableToken addresses this by running audio through multiple parallel branches and merging the results with bit-wise voting into a single stable token sequence. The paper reports lower Unit Edit Distance (UED) under diverse noise conditions and says this stability carries over to improved SpeechLLM robustness across tasks. The design targets two flaws the authors identify: a brittle single-path quantization architecture and a distant training signal that is indifferent to intermediate token stability.

Core claim

The central claim is that a consensus-driven tokenizer built from parallel processing branches merged by bit-wise voting produces markedly more stable semantic token sequences than prior single-path designs, even at high signal-to-noise ratios, and that this token-level stability improves the noise resilience of SpeechLLMs on multiple downstream tasks.

What carries the argument

Multi-branch architecture merged by bit-wise voting, which aggregates parallel audio representations to form one consistent token sequence.
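
As a rough illustration of bit-wise voting, the sketch below merges per-frame token indices from several branches by taking a majority on each bit. It rests on assumptions that the abstract does not state: each token index is treated as a fixed-width binary code, the number of branches is odd, and the branch sequences are already frame-aligned. The names `bitwise_majority_vote` and `vote_sequence` are hypothetical and are not taken from the StableToken code release.

```python
from typing import List


def bitwise_majority_vote(branch_tokens: List[int], num_bits: int) -> int:
    """Merge one frame's token indices from several branches by per-bit majority.

    Assumption: each token index is a fixed-width binary code of `num_bits` bits.
    """
    if len(branch_tokens) % 2 == 0:
        raise ValueError("use an odd number of branches so every bit has a strict majority")
    voted = 0
    for bit in range(num_bits):
        # Count how many branches set this bit.
        ones = sum((tok >> bit) & 1 for tok in branch_tokens)
        if ones * 2 > len(branch_tokens):
            voted |= 1 << bit
    return voted


def vote_sequence(branches: List[List[int]], num_bits: int) -> List[int]:
    """Apply the per-frame vote across frame-aligned branch sequences."""
    return [bitwise_majority_vote(list(frame), num_bits) for frame in zip(*branches)]


# Three hypothetical branches with 4-bit tokens; in each frame one branch has a flipped bit.
branches = [
    [0b1010, 0b0111, 0b0001],
    [0b1010, 0b0101, 0b0001],
    [0b1000, 0b0111, 0b0011],
]
stable = vote_sequence(branches, num_bits=4)
print([format(t, "04b") for t in stable])  # ['1010', '0111', '0001']
```

In this toy case a single flipped bit in one branch is outvoted by the other two, so the merged sequence matches what the unperturbed branches agree on. Whether the real system votes on hard bits or on pre-quantization values is not specified in the text above.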

If this is right

  • SpeechLLMs trained on StableToken sequences become more robust to noisy real-world inputs.
  • Token edit distance drops substantially across a range of signal-to-noise ratios without harming intelligibility.
  • The stability gain reduces the effective learning burden for the language model on speech data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consensus approach could be tested on other modalities where tokenization instability hurts downstream models.
  • Optimizing branch count and voting rules might further improve efficiency without losing the stability benefit.
  • Explicit consistency terms could be added to tokenizer training objectives in related audio or multimodal work.

Load-bearing premise

The multi-branch paths and voting step preserve the original semantic content while adding only stability.

What would settle it

An experiment showing that StableToken tokens still exhibit high Unit Edit Distance under the tested noise conditions, or that SpeechLLM task accuracy fails to rise despite the reported token stability gains.
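
To make the stability metric concrete, here is a minimal sketch of a Unit Edit Distance computation. It assumes UED is the Levenshtein distance between the token sequence of clean audio and that of its noisy counterpart, divided by the clean sequence length; the exact normalization used in the paper is not given in the text above, so treat that choice as an assumption. The function names and token values are illustrative only.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """Edit distance normalized by the clean sequence length (assumed normalization)."""
    if len(clean) == 0:
        raise ValueError("clean token sequence must be non-empty")
    return edit_distance(clean, noisy) / len(clean)


# Made-up token sequences: one substitution (7 -> 8) and one deletion (4).
clean_tokens = [12, 7, 7, 93, 4, 51]
noisy_tokens = [12, 7, 8, 93, 51]
print(unit_edit_distance(clean_tokens, noisy_tokens))
```

A tokenizer that is perfectly stable under a given perturbation would score 0 on this measure; the example pair above differs by two edits over six clean tokens.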

Figures

Figures reproduced from arXiv: 2509.22220 by Aiwei Liu, Chuhan Wu, Houfeng Wang, Linhao Zhang, Wei Jia, Xiao Zhou, Yuhan Song.

Figure 1: Illustration of StableToken: unlike traditional methods, StableToken yields consistent token ...
Figure 2: The architecture of StableToken. Our model replaces the standard single-path quantizer ...
Figure 3: Performance of downstream SpeechLLMs under various noise conditions and SNR levels.
Original abstract

Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
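
For readers who want to probe the high-SNR claim themselves, the following sketch shows the standard way to mix noise into a signal at a target SNR, where SNR in dB is 10·log10 of the speech-to-noise power ratio. It uses plain Python lists as stand-in waveforms; the synthetic sine "speech" and Gaussian "noise" are placeholders, and the helper names are hypothetical. Nothing here is taken from the paper's evaluation protocol, which is not described in the text above.

```python
import math
import random
from typing import List


def mean_power(x: List[float]) -> float:
    """Average squared amplitude of a waveform."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: List[float], mixture: List[float]) -> float:
    """Recover the SNR of a mixture given the clean reference."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Placeholder signals: a 220 Hz sine at 16 kHz and seeded Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
noisy = mix_at_snr(speech, noise, snr_db=20.0)
print(round(measured_snr_db(speech, noisy), 3))
```

Tokenizing `speech` and `noisy` with the same tokenizer and comparing the two sequences is one way to check how much the tokens move at a given SNR.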

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies fragility in existing semantic speech tokenizers to acoustic noise even at high SNR, attributing it to single-path quantization and distant training signals. It proposes StableToken, which uses a multi-branch parallel processing architecture merged by bit-wise voting to produce stable token sequences. The work claims new SOTA token stability via drastically reduced Unit Edit Distance (UED) under diverse noise, with direct downstream gains in SpeechLLM robustness on various tasks; code and models are released publicly.

Significance. If the stability improvements hold without semantic degradation, the approach could meaningfully improve resilience of speech LLMs in real-world noisy conditions by reducing the learning burden on downstream models. The public code release supports reproducibility and is a positive contribution.

major comments (2)
  1. [Abstract] The abstract asserts clear improvements, SOTA results, and direct translation of stability to downstream benefits, but provides no quantitative numbers, experimental details, baselines, or error analysis. This makes it impossible to assess whether the central claims are supported by data.
  2. [Method / Experiments] The bit-wise voting is presented as preserving semantic content while only suppressing noise-induced flips. However, no results demonstrate that the voted tokens yield equivalent or better downstream semantic metrics (e.g., ASR WER or spoken-language understanding accuracy) on clean audio versus the single-branch baseline; if voting overrides fine-grained decisions, UED gains could mask a hidden semantic cost that undermines the 'foundational stability translates directly' claim.
minor comments (1)
  1. [Method] Notation for the voting mechanism and multi-branch fusion should be formalized with equations to clarify how consensus is computed without introducing new parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions made to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts clear improvements, SOTA results, and direct translation of stability to downstream benefits, but provides no quantitative numbers, experimental details, baselines, or error analysis. This makes it impossible to assess whether the central claims are supported by data.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we have inserted the key measured improvements (UED reductions under multiple noise types and SNR levels, plus average gains on the downstream SpeechLLM tasks) together with a brief reference to the baselines and evaluation protocol. The body of the paper already contains the full experimental details and error analysis; the abstract update simply makes these results immediately visible. revision: yes

  2. Referee: [Method / Experiments] The bit-wise voting is presented as preserving semantic content while only suppressing noise-induced flips. However, no results demonstrate that the voted tokens yield equivalent or better downstream semantic metrics (e.g., ASR WER or spoken-language understanding accuracy) on clean audio versus the single-branch baseline; if voting overrides fine-grained decisions, UED gains could mask a hidden semantic cost that undermines the 'foundational stability translates directly' claim.

    Authors: This is a fair and important point. We have added a dedicated clean-audio ablation in the revised experiments section that directly compares StableToken against the single-branch baseline on ASR WER and spoken-language-understanding accuracy using the same clean test sets. The results show no measurable semantic degradation (and in some cases a small improvement), confirming that bit-wise voting does not override linguistically relevant decisions. We have also inserted a short discussion clarifying why the observed stability gains therefore translate to downstream robustness without hidden semantic cost. revision: yes

Circularity Check

0 steps flagged

No circularity: stability claim rests on explicit architectural mechanism rather than self-referential definition or fitted prediction

Full rationale

The paper introduces StableToken via a multi-branch architecture whose outputs are merged by bit-wise voting; the resulting UED reduction is presented as an empirical consequence of this consensus design rather than a quantity defined in terms of itself or recovered from a fitted parameter. No equations appear that equate the stability metric to the voting rule by construction, and no load-bearing step reduces to a self-citation whose content is itself unverified. The central claim therefore remains an independent architectural proposal whose validity can be checked against external benchmarks (clean-data semantic metrics, downstream task accuracy) without circular reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based only on the abstract; no specific free parameters, axioms, or invented entities are detailed in the available text. The approach appears to rest on standard assumptions from neural audio processing and quantization literature.



Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 17 internal anchors

  1. [1]

    The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

    Adaeze Adigwe, No \'e Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514, 2018

  2. [2]

    Hubert-vic: Improving noise-robust automatic speech recognition of speech foundation model via variance-invariance-covariance regularization

    Hyebin Ahn, Kangwook Jang, and Hoirin Kim. Hubert-vic: Improving noise-robust automatic speech recognition of speech foundation model via variance-invariance-covariance regularization. arXiv preprint arXiv:2508.12292, 2025

  3. [3]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024

  4. [4]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019

  5. [5]

    vq-wav2vec: Self-supervised learning of discrete speech representations

    Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019

  6. [6]

    wav2vec 2.0: A framework for self-supervised learning of speech representations

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33: 0 12449--12460, 2020

  7. [7]

    Hi-fi multi-speaker english tts dataset

    Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, and Yang Zhang. Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497, 2021

  8. [8]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas L \'e onard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  9. [9]

    Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline

    Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), pp.\ 1--5. IEEE, 2017

  10. [10]

    Iemocap: Interactive emotional dyadic motion capture database

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42 0 (4): 0 335--359, 2008

  11. [11]

    Crema-d: Crowd-sourced emotional multimodal actors dataset

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5 0 (4): 0 377--390, 2014

  12. [12]

    R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

    Heng-Jui Chang and James Glass. R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 642--662. Association for Computational Linguistics, 2024

  13. [13]

    Self-supervised fine-tuning for improved content representations by speaker-invariant clustering

    Heng-Jui Chang, Alexander H Liu, and James Glass. Self-supervised fine-tuning for improved content representations by speaker-invariant clustering. In Proc. Interspeech 2023, pp.\ 2983--2987, 2023

  14. [14]

    Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio

    Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021

  15. [15]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16 0 (6): 0 1505--1518, 2022

  16. [16]

    Self-supervised learning with random-projection quantizer for speech recognition

    Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning, pp.\ 3915--3924. PMLR, 2022

  17. [17]

    W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.\ 244--250. IEEE, 2021

  18. [18]

    Unsupervised cross-lingual representation learning for speech recognition

    Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. 2021

  19. [19]

    Fleurs: Few-shot learning evaluation of universal representations of speech

    Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp.\ 798--805. IEEE, 2023

  20. [20]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre D \'e fossez, Laurent Mazar \'e , Manu Orsini, Am \'e lie Royer, Patrick P \'e rez, Herv \'e J \'e gou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024

  21. [21]

    Kimi-Audio Technical Report

    Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report. arXiv preprint arXiv:2504.18425, 2025

  22. [22]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024 a

  23. [23]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024 b

  24. [24]

    Toronto emotional speech set (tess)

    Kate Dupuis and M Kathleen Pichora-Fuller. Toronto emotional speech set (tess). 2010

  25. [25]

    Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition

    Patrick Eickhoff, Matthias M \"o ller, Theresa Pekarek Rosin, Johannes Twiefel, and Stefan Wermter. Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition . In Artificial Neural Networks and Machine Learning -- ICANN 2023, pp.\ 381--392. Springer Nature Switzerland, 2023

  26. [26]

    Llama-omni: Seamless speech interaction with large language models

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666, 2024

  27. [27]

    Llama-omni2: Llm-based real-time spoken chatbot with autoregressive streaming speech synthesis

    Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. Llama-omni2: Llm-based real-time spoken chatbot with autoregressive streaming speech synthesis. arXiv preprint arXiv:2505.02625, 2025

  28. [28]

    Fsd50k: an open dataset of human-labeled sound events

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 0 829--852, 2021

  29. [29]

    The people's speech: A large-scale diverse english speech recognition dataset for commercial usage

    Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cer \'o n, Keith Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, and Vijay Janapa Reddi. The people's speech: A large-scale diverse english speech recognition dataset for commercial usage. arXiv preprint arXiv:2111.09344, 2021

  30. [30]

    Augmentation invariant discrete representation for generative spoken language modeling

    Itai Gat, Felix Kreuk, Tu-Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, and Yossi Adi. Augmentation invariant discrete representation for generative spoken language modeling. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pp.\ 465--477, 2023

  31. [31]

    Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

    Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers . In Proc. Interspeech 2023, pp.\ 2358--2362, 2023. doi:10.21437/Interspeech.2023-1511

  32. [32]

    Recent advances in discrete speech tokens: A review

    Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, and Kai Yu. Recent advances in discrete speech tokens: A review. arXiv preprint arXiv:2502.06490, 2025

  33. [33]

    Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

    Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp.\ 885--890. IEEE, 2024

  34. [34]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29: 0 3451--3460, 2021

  35. [35]

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

  36. [36]

    Spiral: Self-supervised perturbation-invariant representation learning for speech pre-training

    Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, and Qun Liu. Spiral: Self-supervised perturbation-invariant representation learning for speech pre-training. In ICLR, 2022

  37. [37]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  38. [38]

    Surrey audio-visual expressed emotion (savee) database

    Philip Jackson and SJUoSG Haq. Surrey audio-visual expressed emotion (savee) database. University of Surrey: Guildford, UK, 2014

  39. [39]

    do you follow me?

    L \'e o Jacqmin, Lina M Rojas-Barahona, and Benoit Favre. " do you follow me?": A survey of recent approaches in dialogue state tracking. arXiv preprint arXiv:2207.14627, 2022

  40. [40]

    An open source emotional speech corpus for human robot interaction applications

    Jesin James, Li Tian, and Catherine Inez Watson. An open source emotional speech corpus for human robot interaction applications. In Interspeech, pp.\ 2768--2772, 2018

  41. [41]

    Survey of adversarial robustness in multimodal large language models

    Chengze Jiang, Zhuangzhuang Wang, Minjing Dong, and Jie Gui. Survey of adversarial robustness in multimodal large language models. arXiv preprint arXiv:2503.13962, 2025

  42. [42]

    Libri-light: A benchmark for asr with limited or no supervision

    Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazar \'e , Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 76...

  43. [43]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 119--132, 2019

  44. [44]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33: 0 17022--17033, 2020

  45. [45]

    Dialogue state tracking with a language model using schema-driven prompting

    Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. Dialogue state tracking with a language model using schema-driven prompting. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 4937--4949, Online and Punta Cana, Dominican Republic, ...

  46. [46]

    Dailytalk: Spoken dialogue dataset for conversational text-to-speech

    Keon Lee, Kyumin Park, and Daeyoung Kim. Dailytalk: Spoken dialogue dataset for conversational text-to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2023

  47. [47]

    Yodas: Youtube-oriented dataset for audio and speech

    Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, and Shinji Watanabe. Yodas: Youtube-oriented dataset for audio and speech. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.\ 1--8. IEEE, 2023

  48. [48]

    Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

    Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM international conference on multimedia, pp.\ 9610--9614, 2023

  49. [49]

    Dinosr: Self-distillation and online clustering for self-supervised speech representation learning

    Alexander H Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, and Jim Glass. Dinosr: Self-distillation and online clustering for self-supervised speech representation learning. Advances in Neural Information Processing Systems, 36: 0 58346--58362, 2023

  50. [50]

    Analyzing and mitigating inconsistency in discrete speech tokens for neural codec language models

    Wenrui Liu, Zhifang Guo, Jin Xu, Yuanjun Lv, Yunfei Chu, Zemin Liu, and Junyang Lin. Analyzing and mitigating inconsistency in discrete speech tokens for neural codec language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 31035--31046, 2025

  51. [51]

    The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english

    Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13 0 (5): 0 e0196391, 2018

  52. [52]

    Ccc-wav2vec 2.0: Clustering aided cross contrastive self-supervised learning of speech representations

    Vasista Sai Lodagala, Sreyan Ghosh, and Srinivasan Umesh. Ccc-wav2vec 2.0: Clustering aided cross contrastive self-supervised learning of speech representations. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp.\ 1--8. IEEE, 2023

  53. [53]

    Unitok: A unified tokenizer for visual generation and understanding

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025 a

  54. [54]

    Language model can listen while speaking

    Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and Xie Chen. Language model can listen while speaking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 24831--24839, 2025 b

  55. [55]

    Nast: Noise aware speech tokenization for speech language models

    Shoval Messica and Yossi Adi. Nast: Noise aware speech tokenization for speech language models. In Proc. Interspeech 2024, pp.\ 4169--4173, 2024

  56. [56]

    DASB - Discrete Audio and Speech Benchmark

    Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, and Mirco Ravanelli. Dasb-discrete audio and speech benchmark. arXiv preprint arXiv:2406.14294, 2024

  57. [57]

    [TARGET]

    Tu Anh Nguyen, Wei-Ning Hsu, Antony d'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, et al. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. arXiv preprint arXiv:2308.05725, 2023

  58. [58]

    Emns/imz/corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels

    Kari Ali Noriy, Xiaosong Yang, and Jian Jun Zhang. Emns/imz/corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels. arXiv preprint arXiv:2305.13137, 2023

  59. [59]

    Librispeech: an asr corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.\ 5206--5210. IEEE, 2015

  60. [60]

    Esc: Dataset for environmental sound classification

    Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp.\ 1015--1018, 2015

  61. [61]

    Speech resynthesis from discrete disentangled self-supervised representations

    Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355, 2021

  62. [62]

    MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018

  63. [63]

    MLS: A Large-Scale Multilingual Dataset for Speech Research

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020

  64. [64]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492-28518. PMLR, 2023

  65. [65]

    Ultra-low-bitrate speech coding with pretrained transformers

    Ali Siahkoohi, Michael Chinen, Tom Denton, W Bastiaan Kleijn, and Jan Skoglund. Ultra-low-bitrate speech coding with pretrained transformers. In Proc. Interspeech 2022, pp. 4421-4425, 2022

  66. [66]

    Analysing discrete self supervised speech representation for spoken language modeling

    Amitay Sicherman and Yossi Adi. Analysing discrete self supervised speech representation for spoken language modeling. In ICASSP, 2023

  67. [67]

    Marco-voice technical report

    Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, et al. Marco-voice technical report. arXiv preprint arXiv:2508.02038, 2025

  68. [68]

    deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

    Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition. arXiv preprint arXiv:2302.14597, 2023

  69. [69]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  70. [70]

    CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit

    Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. 2017

  71. [71]

    An analysis of environment, microphone and data simulation mismatches in robust speech recognition

    Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language, 46:535-557, 2017

  72. [72]

    Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

    Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021

  73. [73]

    Mead: A large-scale audio-visual dataset for emotional talking-face generation

    Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision, pp. 700-717. Springer, 2020

  74. [74]

    Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

    Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774, 2024

  75. [75]

    Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

    Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, and Yu Wu. Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7632-7636. IEEE, 2022

  76. [76]

    Step-Audio 2 Technical Report

    Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-Audio 2 technical report. arXiv preprint arXiv:2507.16632, 2025

  77. [77]

    Mini-omni: Language models can hear, talk while thinking in streaming

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024

  78. [78]

    Qwen2.5 technical report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a

  79. [79]

    Uniaudio 1.5: Large language model-driven audio codec is a few-shot audio task learner

    Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, and Helen Meng. Uniaudio 1.5: Large language model-driven audio codec is a few-shot audio task learner. Advances in Neural Information Processing Systems, 37:56802-56827, 2024b

  80. [80]

    Who can withstand chat-audio attacks? an evaluation benchmark for large language models

    Wanqi Yang, Yanda Li, Meng Fang, Yunchao Wei, Tianyi Zhou, and Ling Chen. Who can withstand chat-audio attacks? An evaluation benchmark for large language models. arXiv preprint arXiv:2411.14842, 2024c

Showing first 80 references.