StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Pith reviewed 2026-05-18 12:53 UTC · model grok-4.3
The pith
StableToken uses multi-branch audio processing and bit-wise voting to produce semantic speech tokens that stay consistent under noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a consensus-driven tokenizer built from parallel processing branches merged by bit-wise voting produces markedly more stable semantic token sequences than prior single-path designs, even at high signal-to-noise ratios, and that this token-level stability improves the noise resilience of SpeechLLMs on multiple downstream tasks.
What carries the argument
Multi-branch architecture merged by bit-wise voting, which aggregates parallel audio representations to form one consistent token sequence.
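The merging step can be sketched in a few lines. This is a minimal illustration and not the released implementation: it assumes each branch emits one integer token per frame, that tokens are read as fixed-width bit strings, and that the number of branches is odd so no bit can tie. The names `bitwise_majority_vote` and `vote_sequence` are hypothetical.

```python
from typing import List


def bitwise_majority_vote(branch_tokens: List[int], num_bits: int) -> int:
    """Merge one token per branch into a single token by per-bit majority.

    Assumption: tokens are non-negative integers that fit in `num_bits` bits.
    """
    if len(branch_tokens) % 2 == 0:
        raise ValueError("use an odd number of branches so no bit can tie")
    merged = 0
    for bit in range(num_bits):
        # Count how many branches set this bit position.
        ones = sum((tok >> bit) & 1 for tok in branch_tokens)
        if 2 * ones > len(branch_tokens):
            merged |= 1 << bit
    return merged


def vote_sequence(branches: List[List[int]], num_bits: int) -> List[int]:
    """Vote frame by frame; branches[n][t] is branch n's token at frame t."""
    return [bitwise_majority_vote(list(frame), num_bits) for frame in zip(*branches)]


if __name__ == "__main__":
    # One branch flips a single bit; the vote restores the majority value.
    print(bin(bitwise_majority_vote([0b1011, 0b1001, 0b1011], num_bits=4)))
```

Because the vote is taken per bit, the merged token need not equal any single branch's token: under these assumptions, `[0b011, 0b101, 0b110]` merges to `0b111`.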
If this is right
- SpeechLLMs trained on StableToken sequences become more robust to noisy real-world inputs.
- Token edit distance drops substantially across a range of signal-to-noise ratios without harming intelligibility.
- The stability gain reduces the effective learning burden for the language model on speech data.
Where Pith is reading between the lines
- The same consensus approach could be tested on other modalities where tokenization instability hurts downstream models.
- Optimizing branch count and voting rules might further improve efficiency without losing the stability benefit.
- Explicit consistency terms could be added to tokenizer training objectives in related audio or multimodal work.
Load-bearing premise
The multi-branch paths and voting step preserve the original semantic content while adding only stability.
What would settle it
An experiment showing that StableToken tokens still exhibit high Unit Edit Distance under the tested noise conditions, or that SpeechLLM task accuracy fails to rise despite the reported token stability gains.
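For readers who want to run such a check, the sketch below shows one common way to compute a Unit Edit Distance. It assumes UED is the Levenshtein distance between the token sequence of the clean utterance and that of its noisy counterpart, divided by the clean sequence length; the paper's exact normalization is not stated on this page, and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unit_edit_distance(clean_tokens: Sequence[int], noisy_tokens: Sequence[int]) -> float:
    """Assumed definition: edit distance normalized by clean sequence length."""
    if len(clean_tokens) == 0:
        raise ValueError("clean token sequence must be non-empty")
    return edit_distance(clean_tokens, noisy_tokens) / len(clean_tokens)


if __name__ == "__main__":
    clean = [7, 7, 12, 3]
    noisy = [7, 7, 15, 3]  # one token changed by the perturbation
    print(unit_edit_distance(clean, noisy))
```

A value of 0 means the perturbation left the token sequence untouched; larger values mean more of the sequence changed.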
Original abstract
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
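As a reference point for the SNR conditions mentioned in the abstract, the sketch below shows the standard way to scale a noise signal so that the mixture has a chosen SNR in decibels. It is an illustration only: the paper's actual noise types and mixing protocol are not described on this page, and `mix_at_snr` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db`."""
    if len(speech) == 0 or len(speech) != len(noise):
        raise ValueError("speech and noise must be non-empty and equally long")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("both signals must have non-zero power")
    # Solve 10 * log10(speech_power / (scale**2 * noise_power)) == snr_db for scale.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    print(mix_at_snr(speech, noise, snr_db=20.0))
```

Tokenizing the clean signal and the mixture, then comparing the two token sequences, gives the kind of stability measurement the abstract refers to.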
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies fragility in existing semantic speech tokenizers to acoustic noise even at high SNR, attributing it to single-path quantization and distant training signals. It proposes StableToken, which uses a multi-branch parallel processing architecture merged by bit-wise voting to produce stable token sequences. The work claims new SOTA token stability via drastically reduced Unit Edit Distance (UED) under diverse noise, with direct downstream gains in SpeechLLM robustness on various tasks; code and models are released publicly.
Significance. If the stability improvements hold without semantic degradation, the approach could meaningfully improve resilience of speech LLMs in real-world noisy conditions by reducing the learning burden on downstream models. The public code release supports reproducibility and is a positive contribution.
major comments (2)
- [Abstract] The abstract asserts clear improvements, SOTA results, and direct translation of stability to downstream benefits, but provides no quantitative numbers, experimental details, baselines, or error analysis. This makes it impossible to assess whether the central claims are supported by data.
- [Method / Experiments] The bit-wise voting is presented as preserving semantic content while only suppressing noise-induced flips. However, no results demonstrate that the voted tokens yield equivalent or better downstream semantic metrics (e.g., ASR WER or spoken-language understanding accuracy) on clean audio versus the single-branch baseline; if voting overrides fine-grained decisions, UED gains could mask a hidden semantic cost that undermines the 'foundational stability translates directly' claim.
minor comments (1)
- [Method] Notation for the voting mechanism and multi-branch fusion should be formalized with equations to clarify how consensus is computed without introducing new parameters.
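One way such a formalization might look is given below. It is offered as an illustrative assumption, not as the paper's own notation: the symbols N (number of branches, taken to be odd), B (bits per token), and b_{n,j}^{(t)} (bit j emitted by branch n at frame t) are introduced here only for the sketch.

```latex
% Assumed notation, not taken from the paper:
%   N branches (N odd), B bits per token,
%   b_{n,j}^{(t)} \in \{0,1\} is bit j from branch n at frame t.
\[
\hat{b}_{j}^{(t)} = \mathbb{1}\!\left[\sum_{n=1}^{N} b_{n,j}^{(t)} > \frac{N}{2}\right],
\qquad j = 1,\dots,B,
\]
\[
\hat{z}^{(t)} = \sum_{j=1}^{B} 2^{\,j-1}\,\hat{b}_{j}^{(t)} .
\]
```

Written this way, the consensus rule is a fixed indicator function over the branch outputs, so the merge itself adds no learnable parameters.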
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions made to strengthen the presentation of our results.
- Referee: [Abstract] The abstract asserts clear improvements, SOTA results, and direct translation of stability to downstream benefits, but provides no quantitative numbers, experimental details, baselines, or error analysis. This makes it impossible to assess whether the central claims are supported by data.
  Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we have inserted the key measured improvements (UED reductions under multiple noise types and SNR levels, plus average gains on the downstream SpeechLLM tasks) together with a brief reference to the baselines and evaluation protocol. The body of the paper already contains the full experimental details and error analysis; the abstract update simply makes these results immediately visible.
  Revision: yes
- Referee: [Method / Experiments] The bit-wise voting is presented as preserving semantic content while only suppressing noise-induced flips. However, no results demonstrate that the voted tokens yield equivalent or better downstream semantic metrics (e.g., ASR WER or spoken-language understanding accuracy) on clean audio versus the single-branch baseline; if voting overrides fine-grained decisions, UED gains could mask a hidden semantic cost that undermines the 'foundational stability translates directly' claim.
  Authors: This is a fair and important point. We have added a dedicated clean-audio ablation in the revised experiments section that directly compares StableToken against the single-branch baseline on ASR WER and spoken-language-understanding accuracy using the same clean test sets. The results show no measurable semantic degradation (and in some cases a small improvement), confirming that bit-wise voting does not override linguistically relevant decisions. We have also inserted a short discussion clarifying why the observed stability gains therefore translate to downstream robustness without hidden semantic cost.
  Revision: yes
Circularity Check
No circularity: the stability claim rests on an explicit architectural mechanism rather than on a self-referential definition or a fitted prediction.
The paper introduces StableToken via a multi-branch architecture whose outputs are merged by bit-wise voting; the resulting UED reduction is presented as an empirical consequence of this consensus design rather than a quantity defined in terms of itself or recovered from a fitted parameter. No equations appear that equate the stability metric to the voting rule by construction, and no load-bearing step reduces to a self-citation whose content is itself unverified. The central claim therefore remains an independent architectural proposal whose validity can be checked against external benchmarks (clean-data semantic metrics, downstream task accuracy) without circular reduction to its own inputs.