An Exploration of Mamba for Speech Self-Supervised Models
Pith reviewed 2026-05-19 09:10 UTC · model grok-4.3
The pith
Mamba-based models can replace Transformers in HuBERT-style speech self-supervised learning, cutting compute for long audio and improving streaming ASR results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mamba-based HuBERT models achieve competitive performance on SUPERB benchmarks, especially in causal settings, while yielding higher-quality quantized representations and more distinct speaker features than Transformer counterparts. Because the Selective State Space model runs in linear time, the same pre-training recipe supports fine-tuning on long-context ASR at much lower compute cost, and the resulting models give superior results when adapted for streaming ASR.
What carries the argument
The Selective State Space model inside Mamba, substituted directly into the HuBERT encoder stack to replace self-attention layers.
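As a rough illustration of why this substitution gives linear-time, causal processing, the sketch below runs a plain discrete state-space recurrence, h_t = A h_{t-1} + B x_t and y_t = C h_t, as a sequential scan in NumPy. It is a simplification and not the paper's implementation: Mamba's selective variant makes the parameters depend on the input, whereas here A, B, and C are fixed. The function name `ssm_scan` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def ssm_scan(A, B, C, x, h0=None):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a 1-D input sequence.

    A: (N, N) state matrix, B: (N,) input vector, C: (N,) output vector,
    x: (T,) scalar inputs. Returns (y, h_T). Work grows linearly with T.
    """
    h = np.zeros(A.shape[0]) if h0 is None else h0
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A @ h + B * x_t   # state update sees only past and current input
        y[t] = C @ h          # readout from the current state
    return y, h

rng = np.random.default_rng(0)
N, T = 4, 12                              # toy state size and sequence length
A = np.diag(rng.uniform(0.1, 0.9, N))     # stable diagonal state matrix (assumed)
B = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=T)

# Full-sequence pass.
y_full, _ = ssm_scan(A, B, C, x)

# Streaming pass: two chunks, carrying the hidden state across the boundary.
y_a, h_mid = ssm_scan(A, B, C, x[:5])
y_b, _ = ssm_scan(A, B, C, x[5:], h0=h_mid)
y_stream = np.concatenate([y_a, y_b])

print(np.allclose(y_full, y_stream))
```

Because each output depends only on the carried state and the current input, chunked processing reproduces the full-sequence output, which is the property that makes this family of models a natural fit for streaming.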
If this is right
- Fine-tuning for long-context ASR requires significantly lower compute than Transformer models.
- Streaming ASR fine-tuning yields higher accuracy than the Transformer baseline.
- Probing results remain competitive on SUPERB tasks and improve in causal configurations.
- Quantized speech units extracted from the model are of higher quality.
- Speaker-related information is captured more distinctly in the learned representations.
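The compute claim in the first bullet follows from how cost grows with sequence length. The sketch below is a back-of-the-envelope count, not a measurement from the paper: it ignores projections, feed-forward layers, constants, and hardware effects. The function names are hypothetical, the hidden size of 768 is an assumed value, and the state size of 16 matches the default quoted in the authors' rebuttal.

```python
def attention_mult_adds(T, d):
    """Approximate multiply-adds for the T x T score matrix plus the weighted sum."""
    return 2 * T * T * d

def ssm_scan_mult_adds(T, d, n_state=16):
    """Approximate multiply-adds for a per-channel state update and readout."""
    return 2 * T * d * n_state

# Doubling the number of frames quadruples the attention term
# but only doubles the state-space term.
for T in (1_000, 2_000, 4_000):
    print(T, attention_mult_adds(T, 768), ssm_scan_mult_adds(T, 768))
```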
Where Pith is reading between the lines
- The same linear scaling could let researchers pre-train on entire long-form audio files instead of fixed short segments.
- Real-time speech systems might adopt these models for lower latency without sacrificing accuracy.
- The clearer speaker separation could simplify downstream tasks such as diarization or voice conversion.
Load-bearing premise
Mamba can be dropped straight into the existing HuBERT pre-training and fine-tuning recipe without any extra architectural changes or hyper-parameter retuning and still produce equal or better speech representations.
What would settle it
Running the identical HuBERT pre-training schedule on a standard speech corpus and finding that the Mamba version scores lower than the Transformer version on a held-out long-context ASR test set.
Original abstract
While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction. The codebase is available at https://github.com/hckuo145/Mamba-based-HuBERT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores Mamba-based alternatives to Transformer-based HuBERT models for speech self-supervised learning. It claims that these models achieve competitive performance on SUPERB probing benchmarks, especially in causal settings, produce higher-quality quantized representations, capture speaker-related features more distinctly, and enable efficient fine-tuning for long-context and streaming ASR due to the linear-time complexity of selective state space models.
Significance. If the results are confirmed, this work is significant as it introduces an efficient architecture for speech SSL that scales better to long sequences and supports real-time applications. The open-source codebase at the provided GitHub link is a strength that facilitates reproducibility and further research in the community.
major comments (3)
- [Section 3] The description of the Mamba integration into the HuBERT pre-training pipeline does not include ablations on key hyperparameters such as state size or discretization step; without these, it is unclear whether the reported improvements require speech-specific retuning or hold under direct substitution as assumed.
- [Section 4.2] The SUPERB benchmark results are presented without reporting the number of independent runs, standard deviations, or statistical significance tests; this weakens the claim of competitive or superior performance in causal settings.
- [Section 5] The analysis of quantized units and speaker feature separation relies on qualitative observations or specific metrics; quantitative comparisons with baselines should be expanded to confirm higher quality.
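The reporting requested in the Section 4.2 comment could look like the following sketch. The scores are placeholder values invented purely so the code runs; they are not results from the paper. The function name and model labels are hypothetical.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation, number of runs)."""
    return mean(scores), stdev(scores), len(scores)

# Placeholder error rates for three seeds per model (NOT from the paper).
placeholder = {
    "model_a": [10.0, 10.4, 9.8],
    "model_b": [10.9, 10.5, 11.2],
}
for name, scores in placeholder.items():
    m, s, n = summarize_runs(scores)
    print(f"{name}: {m:.2f} ± {s:.2f} (n={n})")
```

With only a handful of seeds the standard deviation is itself noisy, so a table of this kind supports a claim of "competitive" more readily than a claim of strict superiority.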
minor comments (2)
- [Abstract] The abstract mentions 'significantly lower compute' but does not quantify the savings; a brief mention of FLOPs or training time reduction would strengthen the claim.
- [Figure 2] Ensure that all axes are clearly labeled and legends are legible for the streaming ASR performance plots.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review of our manuscript. We address each of the major comments below and have incorporated revisions to improve the paper's clarity and completeness.
Point-by-point responses
- Referee: [Section 3] The description of the Mamba integration into the HuBERT pre-training pipeline does not include ablations on key hyperparameters such as state size or discretization step; without these, it is unclear whether the reported improvements require speech-specific retuning or hold under direct substitution as assumed.
  Authors: We appreciate this comment. In designing our experiments, we deliberately used the default Mamba hyperparameters (state size d_state=16 and the standard discretization parameters from the Mamba paper) to test the hypothesis that Mamba can serve as a direct substitute for Transformers in speech SSL without requiring extensive speech-specific tuning. This approach aligns with our goal of exploring the architecture's potential in a straightforward manner. We have revised Section 3 to include explicit mention of these hyperparameter choices and a justification for not performing additional ablations in this initial exploration. We agree that further ablations would be beneficial and have added this as a suggested direction for future research.
  Revision: partial
- Referee: [Section 4.2] The SUPERB benchmark results are presented without reporting the number of independent runs, standard deviations, or statistical significance tests; this weakens the claim of competitive or superior performance in causal settings.
  Authors: We acknowledge the validity of this point. Given the high computational cost associated with pre-training large SSL models on speech data, our reported results are from single runs per configuration. We have updated Section 4.2 to clearly state that each result is from a single independent run and have included this information in the tables. While we did not conduct multiple runs or formal statistical tests, the trends observed across the various SUPERB tasks in causal settings are consistent and support our conclusions. We have moderated the language in the manuscript to reflect this and noted the limitation in the discussion section.
  Revision: yes
- Referee: [Section 5] The analysis of quantized units and speaker feature separation relies on qualitative observations or specific metrics; quantitative comparisons with baselines should be expanded to confirm higher quality.
  Authors: Thank you for this suggestion. We have expanded Section 5 with additional quantitative analyses, including comparisons of codebook utilization rates and speaker identification accuracy using the quantized representations from both Mamba and Transformer models. These new metrics provide stronger quantitative evidence for the higher quality of the Mamba-based quantized units and better separation of speaker features. The revised section now includes direct numerical comparisons to the baseline.
  Revision: yes
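One of the metrics named in the last response, codebook utilization, can be computed along the lines below. This is a minimal sketch under stated assumptions: utilization is taken to mean the fraction of codebook entries assigned to at least one frame, perplexity is the exponentiated entropy of the unit histogram, and the unit IDs are toy values. The paper's exact definitions may differ, and `codebook_stats` is a hypothetical helper.

```python
import math
from collections import Counter

def codebook_stats(unit_ids, codebook_size):
    """Return (utilization, perplexity) for a sequence of discrete unit IDs."""
    counts = Counter(unit_ids)
    total = sum(counts.values())
    # Fraction of codebook entries that appear at least once.
    utilization = len(counts) / codebook_size
    # Entropy of the empirical unit distribution, in nats.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return utilization, math.exp(entropy)

# Toy unit sequence (assumed): 4 distinct units out of a codebook of 8.
toy_units = [0, 1, 1, 2, 2, 2, 5, 5]
util, ppl = codebook_stats(toy_units, codebook_size=8)
print(f"utilization={util:.2f}, perplexity={ppl:.2f}")
```

Utilization counts how many units are used at all, while perplexity also reflects how evenly they are used, so reporting both gives a fuller picture of quantization quality.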
Circularity Check
No circularity: empirical benchmark comparisons with no self-referential derivations
The paper reports direct empirical results from substituting Mamba blocks into the HuBERT pipeline and evaluating on SUPERB, ASR, and streaming tasks. No equations, fitted parameters, or self-citations are used to derive the claimed performance gains; results are measured against external public benchmarks. The central claims rest on observed metrics rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat induction and embed_strictMono_of_one_lt): unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Mamba is a state space model (SSMs) whose discrete-time formulas are expressed as $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.