pith. sign in

arxiv: 2506.12606 · v2 · submitted 2025-06-14 · 💻 cs.CL · cs.AI

An Exploration of Mamba for Speech Self-Supervised Models

Pith reviewed 2026-05-19 09:10 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords MambaHuBERTspeech self-supervised learningautomatic speech recognitionstreaming ASRSelective State Spacequantized representations
0
0 comments X

The pith

Mamba-based models can replace Transformers in HuBERT-style speech self-supervised learning and deliver lower compute for long audio plus stronger streaming results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that the Mamba architecture can be inserted into the standard HuBERT pre-training and fine-tuning pipeline for speech. A reader would care because the linear-time Selective State Space mechanism removes the quadratic cost that limits how much speech context Transformers can handle. The models match or beat Transformer baselines on SUPERB probes, produce cleaner quantized speech units, and separate speaker information more sharply. They also cut compute sharply when fine-tuned on long recordings and improve streaming ASR accuracy.

Core claim

Mamba-based HuBERT models achieve competitive performance on SUPERB benchmarks, especially in causal settings, while yielding higher-quality quantized representations and more distinct speaker features than Transformer counterparts. The linear-time property of the Selective State Space model lets the same pre-training recipe support fine-tuning on long-context ASR at much lower compute cost and produces superior results when the same models are adapted for streaming ASR.

What carries the argument

The Selective State Space model inside Mamba, substituted directly into the HuBERT encoder stack to replace self-attention layers.

If this is right

  • Fine-tuning for long-context ASR requires significantly lower compute than Transformer models.
  • Streaming ASR fine-tuning yields higher accuracy than the Transformer baseline.
  • Probing results remain competitive on SUPERB tasks and improve in causal configurations.
  • Quantized speech units extracted from the model are of higher quality.
  • Speaker-related information is captured more distinctly in the learned representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear scaling could let researchers pre-train on entire long-form audio files instead of fixed short segments.
  • Real-time speech systems might adopt these models for lower latency without sacrificing accuracy.
  • The clearer speaker separation could simplify downstream tasks such as diarization or voice conversion.

Load-bearing premise

Mamba can be dropped straight into the existing HuBERT pre-training and fine-tuning recipe without any extra architectural changes or hyper-parameter retuning and still produce equal or better speech representations.

What would settle it

Running the identical HuBERT pre-training schedule on a standard speech corpus and finding that the Mamba version scores lower than the Transformer version on a held-out long-context ASR test set.

Figures

Figures reproduced from arXiv: 2506.12606 by Chun Wei Chen, Heng-Cheng Kuo, Hsi-Chun Cheng, Hsien-Fu Hsiao, Hung-yi Lee, Tzu-Chieh Wei, Tzu-Quan Lin, Yu Tsao.

Figure 1
Figure 1. Figure 1: MACs (G/sec) and Real-Time Factor (RTF) of different HuBERT models at varying sequence lengths. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Layer-wise phone purity of HuBERT models under three k-means clustering granularities ( [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Layer-wise analysis results for different HuBERT models. Each plot below shows the CCA similarity to different label [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
read the original abstract

While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction. The codebase is available at https://github.com/hckuo145/Mamba-based-HuBERT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper explores Mamba-based alternatives to Transformer-based HuBERT models for speech self-supervised learning. It claims that these models achieve competitive performance on SUPERB probing benchmarks, especially in causal settings, produce higher-quality quantized representations, capture speaker-related features more distinctly, and enable efficient fine-tuning for long-context and streaming ASR due to the linear-time complexity of selective state space models.

Significance. If the results are confirmed, this work is significant as it introduces an efficient architecture for speech SSL that scales better to long sequences and supports real-time applications. The open-source codebase at the provided GitHub link is a strength that facilitates reproducibility and further research in the community.

major comments (3)
  1. [Section 3] The description of the Mamba integration into the HuBERT pre-training pipeline does not include ablations on key hyperparameters such as state size or discretization step; without these, it is unclear whether the reported improvements require speech-specific retuning or hold under direct substitution as assumed.
  2. [Section 4.2] The SUPERB benchmark results are presented without reporting the number of independent runs, standard deviations, or statistical significance tests; this weakens the claim of competitive or superior performance in causal settings.
  3. [Section 5] The analysis of quantized units and speaker feature separation relies on qualitative observations or specific metrics; quantitative comparisons with baselines should be expanded to confirm higher quality.
minor comments (2)
  1. [Abstract] The abstract mentions 'significantly lower compute' but does not quantify the savings; a brief mention of FLOPs or training time reduction would strengthen the claim.
  2. [Figure 2] Ensure that all axes are clearly labeled and legends are legible for the streaming ASR performance plots.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review of our manuscript. We address each of the major comments below and have incorporated revisions to improve the paper's clarity and completeness.

read point-by-point responses
  1. Referee: [Section 3] The description of the Mamba integration into the HuBERT pre-training pipeline does not include ablations on key hyperparameters such as state size or discretization step; without these, it is unclear whether the reported improvements require speech-specific retuning or hold under direct substitution as assumed.

    Authors: We appreciate this comment. In designing our experiments, we deliberately used the default Mamba hyperparameters (state size d_state=16 and the standard discretization parameters from the Mamba paper) to test the hypothesis that Mamba can serve as a direct substitute for Transformers in speech SSL without requiring extensive speech-specific tuning. This approach aligns with our goal of exploring the architecture's potential in a straightforward manner. We have revised Section 3 to include explicit mention of these hyperparameter choices and a justification for not performing additional ablations in this initial exploration. We agree that further ablations would be beneficial and have added this as a suggested direction for future research. revision: partial

  2. Referee: [Section 4.2] The SUPERB benchmark results are presented without reporting the number of independent runs, standard deviations, or statistical significance tests; this weakens the claim of competitive or superior performance in causal settings.

    Authors: We acknowledge the validity of this point. Given the high computational cost associated with pre-training large SSL models on speech data, our reported results are from single runs per configuration. We have updated Section 4.2 to clearly state that each result is from a single independent run and have included this information in the tables. While we did not conduct multiple runs or formal statistical tests, the trends observed across the various SUPERB tasks in causal settings are consistent and support our conclusions. We have moderated the language in the manuscript to reflect this and noted the limitation in the discussion section. revision: yes

  3. Referee: [Section 5] The analysis of quantized units and speaker feature separation relies on qualitative observations or specific metrics; quantitative comparisons with baselines should be expanded to confirm higher quality.

    Authors: Thank you for this suggestion. We have expanded Section 5 with additional quantitative analyses, including comparisons of codebook utilization rates and speaker identification accuracy using the quantized representations from both Mamba and Transformer models. These new metrics provide stronger quantitative evidence for the higher quality of the Mamba-based quantized units and better separation of speaker features. The revised section now includes direct numerical comparisons to the baseline. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparisons with no self-referential derivations

full rationale

The paper reports direct empirical results from substituting Mamba blocks into the HuBERT pipeline and evaluating on SUPERB, ASR, and streaming tasks. No equations, fitted parameters, or self-citations are used to derive the claimed performance gains; results are measured against external public benchmarks. The central claims rest on observed metrics rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5712 in / 1058 out tokens · 35497 ms · 2026-05-19T09:10:46.239890+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 3 internal anchors

  1. [1]

    Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model,

    L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model,” in International Conference on Machine Learning (ICML), 2024

  2. [2]

    MambaMOT: State-Space Model as Motion Predictor for Multi-Object Tracking,

    H.-W. Huang, C.-Y . Yang, W. Chai, Z. Jiang, and J.-N. Hwang, “MambaMOT: State-Space Model as Motion Predictor for Multi-Object Tracking,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , 2025, pp. 1–5

  3. [3]

    Jamba: Hybrid Transformer-Mamba Language Models,

    B. Lenz et al., “Jamba: Hybrid Transformer-Mamba Language Models,” in The Thirteenth International Conference on Learning Representa- tions, 2025

  4. [4]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” arXiv preprint arXiv:2312.00752 , 2023

  5. [5]

    Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis,

    X. Jiang, Y . A. Li, A. N. Florea, C. Han, and N. Mesgarani, “Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Pro- cessing (ICASSP). IEEE, 2025, pp. 1–5

  6. [6]

    Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation,

    X. Jiang, C. Han, and N. Mesgarani, “Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation,” in Proceedings of the IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP) , 2025, pp. 1–5

  7. [7]

    Speech-Mamba: Long-Context Speech Recog- nition with Selective State Spaces Models,

    X. Gao and N. F. Chen, “Speech-Mamba: Long-Context Speech Recog- nition with Selective State Spaces Models,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1–8

  8. [8]

    SPMamba: State-space model is all you need in speech separation,

    K. Li, G. Chen, R. Yang, and X. Hu, “SPMamba: State-space model is all you need in speech separation,” arXiv preprint arXiv:2404.02063, 2024

  9. [9]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing , vol. 29, pp. 3451–3460, 2021

  10. [10]

    Mamba in Speech: Towards an Alternative to Self-Attention,

    X. Zhang, Q. Zhang, H. Liu, T. Xiao, X. Qian, B. Ahmed, E. Ambikaira- jah, H. Li, and J. Epps, “Mamba in Speech: Towards an Alternative to Self-Attention,” arXiv preprint arXiv:2405.12609 , 2024

  11. [11]

    Relations between two sets of variates,

    H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936

  12. [12]

    SUPERB: Speech Processing Universal PERformance Benchmark,

    S. wen Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y . Y . Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K. tik Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H. yi Lee, “SUPERB: Speech Processing Universal PERformance Benchmark,” in Interspeech, 2021, pp. 1194–1198

  13. [13]

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Advances in Neural Information Processing Systems , vol. 33, 2020, pp. 12 449–12 460

  14. [14]

    An Investigation of Incorporating Mamba For Speech Enhancement,

    R. Chao, W.-H. Cheng, M. L. Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y . Tsao, “An Investigation of Incorporating Mamba For Speech Enhancement,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 302–308

  15. [15]

    Mamba for Streaming ASR Combined with Unimodal Aggregation,

    Y . Fang and X. Li, “Mamba for Streaming ASR Combined with Unimodal Aggregation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) . IEEE, 2025, pp. 1–5

  16. [16]

    Rethinking Mamba in Speech Processing by Self-Supervised Models,

    X. Zhang, J. Ma, M. Shahin, B. Ahmed, and J. Epps, “Rethinking Mamba in Speech Processing by Self-Supervised Models,” in Proceed- ings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2025, pp. 1–5

  17. [17]

    Audio Mamba: Selective State Spaces for Self- Supervised Audio Representations,

    S. Yadav and Z.-H. Tan, “Audio Mamba: Selective State Spaces for Self- Supervised Audio Representations,” in Interspeech, 2024, pp. 552–556

  18. [18]

    Superb@ slt 2022: Challenge on generalization and efficiency of self-supervised speech representation learning,

    T.-h. Feng, A. Dong, C.-F. Yeh, S.-w. Yang, T.-Q. Lin, J. Shi, K.-W. Chang, Z. Huang, H. Wu, X. Chang et al., “Superb@ slt 2022: Challenge on generalization and efficiency of self-supervised speech representation learning,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1096–1103

  19. [19]

    Compressing transformer-based self-supervised models for speech processing,

    T.-Q. Lin, T.-H. Yang, C.-Y . Chang, K.-M. Chen, T.-h. Feng, H.-y. Lee, and H. Tang, “Compressing transformer-based self-supervised models for speech processing,” arXiv preprint arXiv:2211.09949 , 2022

  20. [20]

    TED-LIUM 3: Twice as much data and corpus repartition for experi- ments on speaker adaptation,

    F. Hernandez, V . Nguyen, S. Ghannay, N. Tomashenko, and Y . Esteve, “TED-LIUM 3: Twice as much data and corpus repartition for experi- ments on speaker adaptation,” in Proceedings of the 20th International Conference on Speech and Computer (SPECOM) . Springer, 2018, pp. 198–208

  21. [21]

    On gener- ative spoken language modeling from raw audio,

    K. Lakhotia, E. Kharitonov, W.-N. Hsu, Y . Adi, A. Polyak, B. Bolte, T.-A. Nguyen, J. Copet, A. Baevski, A. Mohamed et al. , “On gener- ative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021

  22. [22]

    Textually Pretrained Speech Language Models,

    M. Hassid, T. Remez, T. A. Nguyen, I. Gat, A. CONNEAU, F. Kreuk, J. Copet, A. Defossez, G. Synnaeve, E. Dupoux, R. Schwartz, and Y . Adi, “Textually Pretrained Speech Language Models,” in Advances in Neural Information Processing Systems , vol. 36, 2023, pp. 63 483– 63 501

  23. [23]

    On The Landscape of Spoken Language Models: A Comprehensive Survey

    S. Arora, K.-W. Chang, C.-M. Chien, Y . Peng, H. Wu, Y . Adi, E. Dupoux, H.-Y . Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,” arXiv preprint arXiv:2504.08528, 2025

  24. [24]

    Building a taiwanese mandarin spoken language model: A first attempt,

    C.-K. Yang et al. , “Building a taiwanese mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111 , 2024

  25. [25]

    Dynamic-SUPERB Phase-2: A Collaboratively Ex- panding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks,

    C.-y. Huang et al., “Dynamic-SUPERB Phase-2: A Collaboratively Ex- panding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks,” in The Thirteen International Conference on Learning Representations, 2024

  26. [26]

    Layer-wise analysis of a self-supervised speech representation model,

    A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921

  27. [27]

    DAISY: Data Adaptive Self- Supervised Early Exit for Speech Representation Models,

    T.-Q. Lin, H.-y. Lee, and H. Tang, “DAISY: Data Adaptive Self- Supervised Early Exit for Speech Representation Models,” in Inter- speech 2024, 2024, pp. 4513–4517

  28. [28]

    What do self- supervised speech models know about words?

    A. Pasad, C.-M. Chien, S. Settle, and K. Livescu, “What do self- supervised speech models know about words?” Transactions of the Association for Computational Linguistics , vol. 12, pp. 372–391, 2024

  29. [29]

    Property Neurons in Self-Supervised Speech Transformers,

    T.-Q. Lin, G.-T. Lin, H. yi Lee, and H. Tang, “Property Neurons in Self-Supervised Speech Transformers,” inProceedings of the 2024 IEEE Spoken Language Technology Workshop (SLT) . IEEE, 2024, pp. 401– 408

  30. [30]

    MelHuBERT: A Simplified Hubert on Mel Spectrograms,

    T.-Q. Lin, H.-Y . Lee, and H. Tang, “MelHuBERT: A Simplified Hubert on Mel Spectrograms,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8

  31. [31]

    Generalized end-to-end loss for speaker verification,

    L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) . IEEE, 2018, pp. 4879–4883

  32. [32]

    Titanet: Neural model for speaker representation with 1d depth-wise separable convolutions and global context,

    N. R. Koluguri, T. Park, and B. Ginsburg, “Titanet: Neural model for speaker representation with 1d depth-wise separable convolutions and global context,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) . IEEE, 2022, pp. 8102–8106

  33. [33]

    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” arXiv preprint arXiv:2005.07143 , 2020

  34. [34]

    emotion2vec: Self-Supervised Pre-Training for Speech Emotion Repre- sentation,

    Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-Supervised Pre-Training for Speech Emotion Repre- sentation,” in Findings of the Association for Computational Linguistics: ACL 2024 . Association for Computational Linguistics, 2024, pp. 15 747–15 760

  35. [35]

    MiniSu- PEBR: Lightweight benchmark for self-supervised speech models,

    Y .-H. Wang, H.-Y . Chen, K.-W. Chang, W. Hsu, and H.-y. Lee, “MiniSu- PEBR: Lightweight benchmark for self-supervised speech models,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8

  36. [36]

    ML- SUPERB: Multilingual Speech Universal PERformance Benchmark,

    J. Shi, D. Berrebbi, W. Chen, E.-P. Hu, W.-P. Huang, H.-L. Chung, X. Chang, S.-W. Li, A. Mohamed, H. yi Lee, and S. Watanabe, “ML- SUPERB: Multilingual Speech Universal PERformance Benchmark,” in Interspeech, 2023, pp. 884–888

  37. [37]

    Findings of the 2023 ML-SUPERB challenge: Pre-training and evaluation over more languages and beyond,

    J. Shi, W. Chen, D. Berrebbi, H.-H. Wang, W.-P. Huang, E.-P. Hu, H.-L. Chuang, X. Chang, Y . Tang, S.-W. Li et al. , “Findings of the 2023 ML-SUPERB challenge: Pre-training and evaluation over more languages and beyond,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) . IEEE, 2023, pp. 1–8

  38. [38]

    Multi-resolution Hu- BERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction,

    J. Shi, H. Inaguma, X. Ma, I. Kulikov, and A. Sun, “Multi-resolution Hu- BERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction,” in The Twelfth International Conference on Learning Representations, 2024

  39. [39]

    Task- Agnostic Structured Pruning of Speech Representation Models,

    H. Wang, S. Wang, W.-Q. Zhang, H. Suo, and Y . Wan, “Task- Agnostic Structured Pruning of Speech Representation Models,” in Interspeech, 2023, p. 231–235

  40. [40]

    Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

    T.-Q. Lin, W.-P. Huang, H. Tang, and H.-y. Lee, “Speech-FT: A Fine-tuning Strategy for Enhancing Speech Representation Models Without Compromising Generalization Ability,” arXiv preprint arXiv:2502.12672 , 2025. [Online]. Available: https://arxiv.org/abs/2502.12672

  41. [41]

    Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

    S. Zaiem, Y . Kemiche, T. Parcollet, S. Essid, and M. Ravanelli, “Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?” in Interspeech, 2023, pp. 2873–2877

  42. [42]

    Towards a Unified Representation Evaluation Framework Beyond Downstream Tasks,

    C. Plachouras, J. Guinot, G. Fazekas, E. Quinton, E. Benetos, and J. Pauwels, “Towards a Unified Representation Evaluation Framework Beyond Downstream Tasks,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN) . IEEE, 2025

  43. [43]

    What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model,

    M. Yang, R. C. M. C. Shekar, O. Kang, and J. H. L. Hansen, “What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model,” inInterspeech, 2023, pp. 1923–1927

  44. [44]

    What Do Self-Supervised Vision Transformers Learn?

    N. Park, W. Kim, B. Heo, T. Kim, and S. Yun, “What Do Self-Supervised Vision Transformers Learn?” in The Eleventh International Conference on Learning Representations , 2023