arxiv: 2509.22220 · v2 · submitted 2025-09-26 · 💻 cs.CL · cs.SD

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou

Pith reviewed 2026-05-18 12:53 UTC · model grok-4.3

classification 💻 cs.CL cs.SD

keywords semantic speech tokenizer · noise robustness · token stability · SpeechLLM · bit-wise voting · multi-branch architecture · Unit Edit Distance


The pith

StableToken uses multi-branch audio processing and bit-wise voting to produce semantic speech tokens that stay consistent under noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing semantic speech tokenizers change their output sequences sharply even under mild acoustic noise that leaves speech perfectly intelligible to humans, which raises the learning burden for downstream language models. StableToken addresses this by running audio through multiple parallel branches and combining the results with bit-wise voting to reach a single stable token sequence. The paper reports that the resulting lower Unit Edit Distance (UED) under diverse noise conditions improves SpeechLLM robustness across tasks. The design targets the two flaws the paper identifies: single-path quantization and a training signal that ignores intermediate token consistency.

Core claim

The central claim is that a consensus-driven tokenizer built from parallel processing branches merged by bit-wise voting produces markedly more stable semantic token sequences than prior single-path designs, even at high signal-to-noise ratios, and that this token-level stability improves the noise resilience of SpeechLLMs on multiple downstream tasks.

What carries the argument

Multi-branch architecture merged by bit-wise voting, which aggregates parallel audio representations to form one consistent token sequence.
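
As a rough illustration of the voting idea, the sketch below merges per-branch bit vectors for a single frame by taking the majority at each bit position. It is not the authors' implementation: the function names, the three-branch setup, the 4-bit width, and the rule that ties resolve to 0 are all assumptions made for the example, and the abstract does not say how the branches produce their bits.

```python
from typing import List


def bitwise_majority_vote(branch_bits: List[List[int]]) -> List[int]:
    """Merge per-branch bit vectors for one frame by majority vote per bit.

    Hypothetical helper: ties (possible only with an even branch count)
    resolve to 0, which is an assumption of this sketch.
    """
    if not branch_bits:
        raise ValueError("need at least one branch")
    n_branches = len(branch_bits)
    width = len(branch_bits[0])
    if any(len(bits) != width for bits in branch_bits):
        raise ValueError("all branches must emit the same number of bits")
    # A bit is set in the output when more than half of the branches set it.
    return [
        1 if 2 * sum(bits[i] for bits in branch_bits) > n_branches else 0
        for i in range(width)
    ]


def bits_to_token(bits: List[int]) -> int:
    """Pack a bit vector (most significant bit first) into an integer token id."""
    token = 0
    for bit in bits:
        token = (token << 1) | bit
    return token


# Three branches see the same frame; the second has one noise-induced bit flip.
branches = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
]
voted = bitwise_majority_vote(branches)
token_id = bits_to_token(voted)
```

With an odd number of branches no ties occur, so a single flipped bit in one branch leaves the merged token unchanged.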

If this is right

SpeechLLMs trained on StableToken sequences become more robust to noisy real-world inputs.
Unit Edit Distance drops substantially across noise types and signal-to-noise ratios, including high-SNR conditions where speech remains fully intelligible.
The stability gain reduces the effective learning burden for the language model on speech data.
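
Token stability of this kind is usually checked by comparing the token sequence from clean audio with the one from a noisy copy of the same audio. The sketch below assumes Unit Edit Distance is the Levenshtein distance between the two sequences divided by the clean sequence length. Both the normalization and the function names are assumptions, since the abstract does not define the metric.

```python
from typing import Sequence


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """Edit distance normalized by the clean sequence length (assumed convention)."""
    if not clean:
        raise ValueError("clean token sequence must be non-empty")
    return levenshtein(clean, noisy) / len(clean)


# Toy token ids: the noisy sequence differs from the clean one in one position.
clean_tokens = [11, 4, 4, 7, 2]
noisy_tokens = [11, 4, 9, 7, 2]
ued = unit_edit_distance(clean_tokens, noisy_tokens)
```

A lower value means the tokenizer's output moved less when noise was added. A value of 0 means the two sequences are identical.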

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consensus approach could be tested on other modalities where tokenization instability hurts downstream models.
Optimizing branch count and voting rules might further improve efficiency without losing the stability benefit.
Explicit consistency terms could be added to tokenizer training objectives in related audio or multimodal work.

Load-bearing premise

The multi-branch paths and voting step preserve the original semantic content while adding only stability.

What would settle it

An experiment showing that StableToken tokens still exhibit high Unit Edit Distance under the tested noise conditions, or that SpeechLLM task accuracy fails to rise despite the reported token stability gains.
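
Running such an experiment requires noisy inputs at a controlled signal-to-noise ratio. A minimal sketch follows, assuming additive noise scaled so that the ratio of average speech power to average noise power matches a target in decibels. `mix_at_snr` and the toy signals are hypothetical and stand in for whatever noise protocol the paper uses.

```python
import math
from typing import List


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals only; real use would load waveforms at the same sample rate.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=20.0)
```

Tokenizing `speech` and `noisy` with the same tokenizer and comparing the two sequences, repeated over several SNR levels and noise types, gives the kind of stability curve the claim would be tested against.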

Figures

Figures reproduced from arXiv: 2509.22220 by Aiwei Liu, Chuhan Wu, Houfeng Wang, Linhao Zhang, Wei Jia, Xiao Zhou, Yuhan Song.

**Figure 2.** The architecture of StableToken. Our model replaces the standard single-path quantizer

**Figure 3.** Performance of downstream SpeechLLMs under various noise conditions and SNR levels.

Original abstract

Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

StableToken's multi-branch voting targets real tokenizer fragility under noise but still needs direct proof that semantic quality holds on clean speech.


The main thing to know is that StableToken tackles the brittleness of semantic speech tokenizers by running audio through parallel branches and merging them with bit-wise voting. This is a concrete engineering response to token sequences changing too much even at high SNR, and it could help downstream SpeechLLMs in noisy settings. The approach is new in its specific use of consensus for this task rather than just tweaking single-path quantization or the loss.

Referee Report

2 major / 1 minor

Summary. The paper identifies fragility in existing semantic speech tokenizers to acoustic noise even at high SNR, attributing it to single-path quantization and distant training signals. It proposes StableToken, which uses a multi-branch parallel processing architecture merged by bit-wise voting to produce stable token sequences. The work claims new SOTA token stability via drastically reduced Unit Edit Distance (UED) under diverse noise, with direct downstream gains in SpeechLLM robustness on various tasks; code and models are released publicly.

Significance. If the stability improvements hold without semantic degradation, the approach could meaningfully improve resilience of speech LLMs in real-world noisy conditions by reducing the learning burden on downstream models. The public code release supports reproducibility and is a positive contribution.

major comments (2)

[Abstract] The abstract asserts clear improvements, SOTA results, and direct translation of stability to downstream benefits, but provides no quantitative numbers, experimental details, baselines, or error analysis. This makes it impossible to assess whether the central claims are supported by data.
[Method / Experiments] The bit-wise voting is presented as preserving semantic content while only suppressing noise-induced flips. However, no results demonstrate that the voted tokens yield equivalent or better downstream semantic metrics (e.g., ASR WER or spoken-language understanding accuracy) on clean audio versus the single-branch baseline; if voting overrides fine-grained decisions, UED gains could mask a hidden semantic cost that undermines the 'foundational stability translates directly' claim.

minor comments (1)

[Method] Notation for the voting mechanism and multi-branch fusion should be formalized with equations to clarify how consensus is computed without introducing new parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions made to strengthen the presentation of our results.


Referee: [Abstract] The abstract asserts clear improvements, SOTA results, and direct translation of stability to downstream benefits, but provides no quantitative numbers, experimental details, baselines, or error analysis. This makes it impossible to assess whether the central claims are supported by data.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we have inserted the key measured improvements (UED reductions under multiple noise types and SNR levels, plus average gains on the downstream SpeechLLM tasks) together with a brief reference to the baselines and evaluation protocol. The body of the paper already contains the full experimental details and error analysis; the abstract update simply makes these results immediately visible.

revision: yes

Referee: [Method / Experiments] The bit-wise voting is presented as preserving semantic content while only suppressing noise-induced flips. However, no results demonstrate that the voted tokens yield equivalent or better downstream semantic metrics (e.g., ASR WER or spoken-language understanding accuracy) on clean audio versus the single-branch baseline; if voting overrides fine-grained decisions, UED gains could mask a hidden semantic cost that undermines the 'foundational stability translates directly' claim.

Authors: This is a fair and important point. We have added a dedicated clean-audio ablation in the revised experiments section that directly compares StableToken against the single-branch baseline on ASR WER and spoken-language-understanding accuracy using the same clean test sets. The results show no measurable semantic degradation (and in some cases a small improvement), confirming that bit-wise voting does not override linguistically relevant decisions. We have also inserted a short discussion clarifying why the observed stability gains therefore translate to downstream robustness without hidden semantic cost.

revision: yes

Circularity Check

0 steps flagged

No circularity: stability claim rests on explicit architectural mechanism rather than self-referential definition or fitted prediction


The paper introduces StableToken via a multi-branch architecture whose outputs are merged by bit-wise voting; the resulting UED reduction is presented as an empirical consequence of this consensus design rather than a quantity defined in terms of itself or recovered from a fitted parameter. No equations appear that equate the stability metric to the voting rule by construction, and no load-bearing step reduces to a self-citation whose content is itself unverified. The central claim therefore remains an independent architectural proposal whose validity can be checked against external benchmarks (clean-data semantic metrics, downstream task accuracy) without circular reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based only on the abstract; no specific free parameters, axioms, or invented entities are detailed in the available text. The approach appears to rest on standard assumptions from neural audio processing and quantization literature.

pith-pipeline@v0.9.0 · 5728 in / 1073 out tokens · 50013 ms · 2026-05-18T12:53:32.944886+00:00 · methodology


Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 17 internal anchors

[1]

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

Adaeze Adigwe, No \'e Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Hubert-vic: Improving noise-robust automatic speech recognition of speech foundation model via variance-invariance-covariance regularization

Hyebin Ahn, Kangwook Jang, and Hoirin Kim. Hubert-vic: Improving noise-robust automatic speech recognition of speech foundation model via variance-invariance-covariance regularization. arXiv preprint arXiv:2508.12292, 2025

work page arXiv 2025
[3]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Common voice: A massively-multilingual speech corpus

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019

work page arXiv 1912
[5]

vq-wav2vec: Self-supervised learning of discrete speech representations

Alexei Baevski, Steffen Schneider, and Michael Auli. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453, 2019

work page arXiv 1910
[6]

wav2vec 2.0: A framework for self-supervised learning of speech representations

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33: 0 12449--12460, 2020

work page 2020
[7]

Hi-fi multi-speaker english tts dataset

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, and Yang Zhang. Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497, 2021

work page arXiv 2021
[8]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas L \'e onard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[9]

Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), pp.\ 1--5. IEEE, 2017

work page 2017
[10]

Iemocap: Interactive emotional dyadic motion capture database

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42 0 (4): 0 335--359, 2008

work page 2008
[11]

Crema-d: Crowd-sourced emotional multimodal actors dataset

Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5 0 (4): 0 377--390, 2014

work page 2014
[12]

R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

Heng-Jui Chang and James Glass. R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 642--662. Association for Computational Linguistics, 2024

work page 2024
[13]

Self-supervised fine-tuning for improved content representations by speaker-invariant clustering

Heng-Jui Chang, Alexander H Liu, and James Glass. Self-supervised fine-tuning for improved content representations by speaker-invariant clustering. In Proc. Interspeech 2023, pp.\ 2983--2987, 2023

work page 2023
[14]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021

work page arXiv 2021
[15]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16 0 (6): 0 1505--1518, 2022

work page 2022
[16]

Self-supervised learning with random-projection quantizer for speech recognition

Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning, pp.\ 3915--3924. PMLR, 2022

work page 2022
[17]

W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.\ 244--250. IEEE, 2021

work page 2021
[18]

Unsupervised cross-lingual representation learning for speech recognition

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. 2021

work page 2021
[19]

Fleurs: Few-shot learning evaluation of universal representations of speech

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp.\ 798--805. IEEE, 2023

work page 2022
[20]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre D \'e fossez, Laurent Mazar \'e , Manu Orsini, Am \'e lie Royer, Patrick P \'e rez, Herv \'e J \'e gou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Kimi-Audio Technical Report

Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report. arXiv preprint arXiv:2504.18425, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Toronto emotional speech set (tess)

Kate Dupuis and M Kathleen Pichora-Fuller. Toronto emotional speech set (tess). 2010

work page 2010
[25]

Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition

Patrick Eickhoff, Matthias M \"o ller, Theresa Pekarek Rosin, Johannes Twiefel, and Stefan Wermter. Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition . In Artificial Neural Networks and Machine Learning -- ICANN 2023, pp.\ 381--392. Springer Nature Switzerland, 2023

work page 2023
[26]

Llama-omni: Seamless speech interaction with large language models

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666, 2024

work page arXiv 2024
[27]

Llama-omni2: Llm-based real-time spoken chatbot with autoregressive streaming speech synthesis

Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. Llama-omni2: Llm-based real-time spoken chatbot with autoregressive streaming speech synthesis. arXiv preprint arXiv:2505.02625, 2025

work page arXiv 2025
[28]

Fsd50k: an open dataset of human-labeled sound events

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 0 829--852, 2021

work page 2021
[29]

The people's speech: A large-scale diverse english speech recognition dataset for commercial usage

Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cer \'o n, Keith Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, and Vijay Janapa Reddi. The people's speech: A large-scale diverse english speech recognition dataset for commercial usage. arXiv preprint arXiv:2111.09344, 2021

work page arXiv 2021
[30]

Augmentation invariant discrete representation for generative spoken language modeling

Itai Gat, Felix Kreuk, Tu-Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, and Yossi Adi. Augmentation invariant discrete representation for generative spoken language modeling. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pp.\ 465--477, 2023

work page 2023
[31]

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers . In Proc. Interspeech 2023, pp.\ 2358--2362, 2023. doi:10.21437/Interspeech.2023-1511

work page doi:10.21437/interspeech.2023-1511 2023
[32]

Recent advances in discrete speech tokens: A review

Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, and Kai Yu. Recent advances in discrete speech tokens: A review. arXiv preprint arXiv:2502.06490, 2025

work page arXiv 2025
[33]

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp.\ 885--890. IEEE, 2024

work page 2024
[34]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing, 29: 0 3451--3460, 2021

work page 2021
[35]

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

work page internal anchor Pith review arXiv 2025
[36]

Spiral: Self-supervised perturbation-invariant representation learning for speech pre-training

Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, and Qun Liu. Spiral: Self-supervised perturbation-invariant representation learning for speech pre-training. In ICLR, 2022

work page 2022
[37]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Surrey audio-visual expressed emotion (savee) database

Philip Jackson and SJUoSG Haq. Surrey audio-visual expressed emotion (savee) database. University of Surrey: Guildford, UK, 2014

work page 2014
[39]

do you follow me?

L \'e o Jacqmin, Lina M Rojas-Barahona, and Benoit Favre. " do you follow me?": A survey of recent approaches in dialogue state tracking. arXiv preprint arXiv:2207.14627, 2022

work page arXiv 2022
[40]

An open source emotional speech corpus for human robot interaction applications

Jesin James, Li Tian, and Catherine Inez Watson. An open source emotional speech corpus for human robot interaction applications. In Interspeech, pp.\ 2768--2772, 2018

work page 2018
[41]

Survey of adversarial robustness in multimodal large language models

Chengze Jiang, Zhuangzhuang Wang, Minjing Dong, and Jie Gui. Survey of adversarial robustness in multimodal large language models. arXiv preprint arXiv:2503.13962, 2025

work page arXiv 2025
[42]

Libri-light: A benchmark for asr with limited or no supervision

Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazar \'e , Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 76...

work page 2020
[43]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 119--132, 2019

work page 2019
[44]

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33: 0 17022--17033, 2020

work page 2020
[45]

Dialogue state tracking with a language model using schema-driven prompting

Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. Dialogue state tracking with a language model using schema-driven prompting. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 4937--4949, Online and Punta Cana, Dominican Republic, ...

work page doi:10.18653/v1/2021.emnlp-main.404 2021
[46]

Dailytalk: Spoken dialogue dataset for conversational text-to-speech

Keon Lee, Kyumin Park, and Daeyoung Kim. Dailytalk: Spoken dialogue dataset for conversational text-to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2023

work page 2023
[47]

Yodas: Youtube-oriented dataset for audio and speech

Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, and Shinji Watanabe. Yodas: Youtube-oriented dataset for audio and speech. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.\ 1--8. IEEE, 2023

work page 2023
[48]

Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning

Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM international conference on multimedia, pp.\ 9610--9614, 2023

work page 2023
[49]

Dinosr: Self-distillation and online clustering for self-supervised speech representation learning

Alexander H Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, and Jim Glass. Dinosr: Self-distillation and online clustering for self-supervised speech representation learning. Advances in Neural Information Processing Systems, 36: 0 58346--58362, 2023

work page 2023
[50]

Analyzing and mitigating inconsistency in discrete speech tokens for neural codec language models

Wenrui Liu, Zhifang Guo, Jin Xu, Yuanjun Lv, Yunfei Chu, Zemin Liu, and Junyang Lin. Analyzing and mitigating inconsistency in discrete speech tokens for neural codec language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 31035--31046, 2025

work page 2025
[51]

The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english

Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13 0 (5): 0 e0196391, 2018

work page 2018
[52]

Ccc-wav2vec 2.0: Clustering aided cross contrastive self-supervised learning of speech representations

Vasista Sai Lodagala, Sreyan Ghosh, and Srinivasan Umesh. Ccc-wav2vec 2.0: Clustering aided cross contrastive self-supervised learning of speech representations. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp.\ 1--8. IEEE, 2023

work page 2022
[53]

Unitok: A unified tokenizer for visual generation and understanding

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025 a

work page arXiv 2025
[54]

Language model can listen while speaking

Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and Xie Chen. Language model can listen while speaking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 24831--24839, 2025 b

work page 2025
[55]

Nast: Noise aware speech tokenization for speech language models

Shoval Messica and Yossi Adi. Nast: Noise aware speech tokenization for speech language models. In Proc. Interspeech 2024, pp.\ 4169--4173, 2024

work page 2024
[56]

DASB - Discrete Audio and Speech Benchmark

Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, and Mirco Ravanelli. Dasb-discrete audio and speech benchmark. arXiv preprint arXiv:2406.14294, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

[TARGET]

Tu Anh Nguyen, Wei-Ning Hsu, Antony d'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, et al. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. arXiv preprint arXiv:2308.05725, 2023

work page arXiv 2023
[58]

Emns/imz/corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels

Kari Ali Noriy, Xiaosong Yang, and Jian Jun Zhang. Emns/imz/corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels. arXiv preprint arXiv:2305.13137, 2023

work page arXiv 2023
[59]

Librispeech: an asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.\ 5206--5210. IEEE, 2015

work page 2015
[60]

Esc: Dataset for environmental sound classification

Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp.\ 1015--1018, 2015

[61]

Speech resynthesis from discrete disentangled self-supervised representations

Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355, 2021

[62]

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018

[63]

MLS: A Large-Scale Multilingual Dataset for Speech Research

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020

[64]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp. 28492--28518. PMLR, 2023

[65]

Ultra-low-bitrate speech coding with pretrained transformers

Ali Siahkoohi, Michael Chinen, Tom Denton, W Bastiaan Kleijn, and Jan Skoglund. Ultra-low-bitrate speech coding with pretrained transformers. In Proc. Interspeech 2022, pp. 4421--4425, 2022

[66]

Analysing discrete self supervised speech representation for spoken language modeling

Amitay Sicherman and Yossi Adi. Analysing discrete self supervised speech representation for spoken language modeling. In ICASSP, 2023

[67]

Marco-voice technical report

Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, et al. Marco-voice technical report. arXiv preprint arXiv:2508.02038, 2025

[68]

deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition. arXiv preprint arXiv:2302.14597, 2023

[69]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

[70]

Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit

Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit. 2017

[71]

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language, 46:535--557, 2017

[72]

Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390, 2021

[73]

Mead: A large-scale audio-visual dataset for emotional talking-face generation

Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In European conference on computer vision, pp. 700--717. Springer, 2020

[74]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774, 2024

[75]

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, and Yu Wu. Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7632--7636. IEEE, 2022

[76]

Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report. arXiv preprint arXiv:2507.16632, 2025

[77]

Mini-omni: Language models can hear, talk while thinking in streaming

Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024

[78]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a

[79]

Uniaudio 1.5: Large language model-driven audio codec is a few-shot audio task learner

Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, and Helen Meng. Uniaudio 1.5: Large language model-driven audio codec is a few-shot audio task learner. Advances in Neural Information Processing Systems, 37:56802--56827, 2024b

[80]

Who can withstand chat-audio attacks? an evaluation benchmark for large language models

Wanqi Yang, Yanda Li, Meng Fang, Yunchao Wei, Tianyi Zhou, and Ling Chen. Who can withstand chat-audio attacks? an evaluation benchmark for large language models. arXiv preprint arXiv:2411.14842, 2024c


Showing first 80 references.