pith. sign in

arxiv: 2607.00387 · v1 · pith:DGWC243Onew · submitted 2026-07-01 · 📡 eess.AS

From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning

Pith reviewed 2026-07-02 05:51 UTC · model grok-4.3

classification 📡 eess.AS
keywords audio self-supervised learninginductive biasespretraining objectivesarchitectural alignmentdownstream tasksspeech processingmultimodal learning
0
0 comments X

The pith

Audio SSL pretraining objectives align with the inductive biases of different architectures to determine success on downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that audio self-supervised learning should be analyzed through the lens of how specific pretraining objectives interact with model architectures rather than viewing methods as a sequence of tasks or architectures. It groups objectives into auxiliary tasks, contrastive learning, generative reconstruction, discrete token prediction, and multimodal alignment, each imposing unique requirements like local sensitivity or semantic abstraction. These requirements match the biases of CNNs for local patterns, recurrent models for sequences, Transformers for global context, and hybrids for combined capabilities. Understanding this alignment helps explain generalization in applications from speech to music and medical audio analysis.

Core claim

The paper establishes that the five SSL paradigms impose demands on models including local structural sensitivity, contrastive invariance, contextual inference, discrete semantic abstraction, and multimodal grounding, which correspond to the inductive biases of CNNs, recurrent and state space models, Transformers, and hybrid architectures, thereby explaining performance differences across downstream applications in speech processing, environmental sound analysis, music information retrieval, medical and bioacoustic analysis, and multimodal audio understanding.

What carries the argument

The alignment between the five SSL paradigms (auxiliary tasks, contrastive learning, generative reconstruction, discrete token prediction, multimodal alignment) and architectural inductive biases of CNNs, recurrent/SSMs, Transformers, and hybrids.

If this is right

  • CNNs support local acoustic compression needed for auxiliary tasks.
  • Recurrent and state space models enable sequential state propagation for contextual tasks.
  • Transformers provide content-dependent global routing for inference and abstraction.
  • Hybrid architectures integrate local and global processing for multimodal alignment.
  • Downstream applications serve as tests of whether these alignments hold across domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could use this mapping to select or create architectures for new audio domains without extensive trial and error.
  • Extending the framework to emerging objectives like those in audio-language models could reveal new alignment patterns.
  • Addressing open challenges such as tokenization bottlenecks may require architectures tuned to discrete prediction paradigms.
  • Long-context efficiency issues might be resolved by state space models if their alignment with generative objectives is confirmed.

Load-bearing premise

The five paradigms comprehensively capture all audio SSL objectives and their demands can be mapped to architectural biases using only existing literature without new experiments.

What would settle it

A controlled experiment training multiple architectures on one objective from each paradigm and testing on a downstream task where the alignment predicts poor performance but the model succeeds anyway.

Figures

Figures reproduced from arXiv: 2607.00387 by Boda Zhou, Cheng Yang, Huaimin Wang, Jin Zhang, Kele Xu, Qisheng Xu, Qiya Song, Yulin Sun, Yulu Fang.

Figure 1
Figure 1. Figure 1: Evolution of research activity in self-supervised audio representation [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Framework of this paper. recent safety-oriented evaluations. Section VI analyzes key challenges, including privacy risks, computational efficiency, robustness, and polyphonic complexity. Section VII outlines future directions toward general-purpose audio intelligence. Finally, Section VIII concludes the paper [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Objective-demand alignment in audio SSL. Beyond a chronological progression, the figure maps each supervisory paradigm to its characteristic [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative structural mechanisms in audio SSL architectures. CNNs support local acoustic encoding, RNNs and SSM-based models support [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Hybrid architectures for local-global integration in audio SSL. Hybrid designs combine local acoustic encoding, content-dependent global interaction, [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

This paper examines audio self-supervised learning (SSL) through the alignment between pretraining objectives, architectural inductive biases, and downstream applications. Rather than treating SSL methods as a chronological sequence of pretext tasks or model families, we ask how different supervisory signals shape the representations that models are expected to learn. The discussion is organized around five paradigms: auxiliary tasks, contrastive learning, generative reconstruction, discrete token prediction, and multimodal alignment. These objectives place different demands on the model, from local structural sensitivity and contrastive invariance to contextual inference, discrete semantic abstraction, and multimodal grounding. We relate these demands to the biases of CNNs, recurrent and State Space Models, Transformers, and hybrid architectures, showing how local acoustic compression, sequential state propagation, content-dependent global routing, and local--global integration support different forms of audio SSL. The same view is then used to interpret downstream applications in speech processing, environmental sound analysis, music information retrieval, medical and bioacoustic analysis, and multimodal audio understanding as practical tests of whether learned representations and architectural choices generalize across domains. We also review benchmark protocols and open challenges, including tokenization bottlenecks, long-context efficiency, robustness, and secure multimodal deployment, and discuss how codec-based tokenization and audio-language modeling extend this objective--architecture--application pipeline. The accompanying repository is released at https://github.com/colaudiolab/Awesome-Self-Supervised-Audio-Learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes an organizational framework for audio self-supervised learning (SSL) by aligning five pretraining paradigms—auxiliary tasks, contrastive learning, generative reconstruction, discrete token prediction, and multimodal alignment—with the inductive biases of architectures such as CNNs, recurrent and state space models, Transformers, and hybrids. It uses this alignment to interpret how these objectives shape representations for downstream applications in speech, environmental sound, music, medical, and multimodal domains, while reviewing benchmarks and open challenges.

Significance. If the proposed mappings between objectives, architectures, and applications hold, the paper offers a valuable interpretive lens for the audio SSL community. It synthesizes existing literature into a coherent structure that could guide architecture selection and highlight generalization issues. The accompanying repository enhances reproducibility and accessibility of the reviewed resources.

minor comments (2)
  1. The abstract refers to 'the accompanying repository' but does not specify its contents or how it supports the framework; this could be clarified to better inform readers.
  2. Some terminology, such as 'content-dependent global routing', would benefit from a brief definition or example upon first use to aid readers unfamiliar with the specific architectural biases discussed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive summary and positive assessment of the manuscript. The recommendation for minor revision is noted, and we will address any editorial or minor points in the revised version. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a survey paper that proposes an organizational framework for existing audio SSL literature around five paradigms (auxiliary tasks, contrastive, generative, discrete token, multimodal) and relates their demands to architectural biases of CNNs, SSMs, Transformers, and hybrids. No new derivations, equations, quantitative predictions, or proofs are introduced. The mapping is presented as an interpretive lens rather than a deductive chain, with all content supported by external references and no load-bearing steps that reduce to self-citations or fitted inputs by construction. The work is self-contained as a review without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a conceptual survey with no new mathematical derivations, fitted parameters, or postulated entities; it synthesizes existing concepts from the literature.

pith-pipeline@v0.9.1-grok · 5811 in / 1140 out tokens · 28198 ms · 2026-07-02T05:51:01.147762+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

166 extracted references · 51 canonical work pages · 11 internal anchors

  1. [1]

    Deep learning,

    Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,”nature, vol. 521, no. 7553, pp. 436–444, 2015

  2. [2]

    Deep learning for audio signal processing,

    H. Purwins, B. Li, T. Virtanen, J. Schl ¨uter, S.-Y . Chang, and T. Sainath, “Deep learning for audio signal processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 206–219, 2019. 21

  3. [3]

    Audio signal processing in the 21st century: The important outcomes of the past 25 years,

    G. Richard, P. Smaragdis, S. Gannot, P. A. Naylor, S. Makino, W. Kellermann, and A. Sugiyama, “Audio signal processing in the 21st century: The important outcomes of the past 25 years,”IEEE Signal Processing Magazine, vol. 40, no. 5, pp. 12–26, 2023

  4. [4]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780

  5. [5]

    Mat-sed: A masked audio transformer with masked-reconstruction based pre- training for sound event detection,

    P. Cai, Y . Song, K. Li, H. Song, and I. McLoughlin, “Mat-sed: A masked audio transformer with masked-reconstruction based pre- training for sound event detection,”arXiv preprint arXiv:2408.08673, 2024

  6. [6]

    Taming data and transformers for audio generation,

    M. Haji-Ali, W. Menapace, A. Siarohin, G. Balakrishnan, and V . Or- donez, “Taming data and transformers for audio generation,”Interna- tional Journal of Computer Vision, vol. 134, no. 3, p. 87, 2026

  7. [7]

    Deep convolutional neural networks and data augmentation for environmental sound classification,

    J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,”IEEE Signal processing letters, vol. 24, no. 3, pp. 279–283, 2017

  8. [8]

    Synthio: Augmenting small-scale audio classification datasets with synthetic data,

    S. Ghosh, S. Kumar, Z. Kong, R. Valle, B. Catanzaro, and D. Manocha, “Synthio: Augmenting small-scale audio classification datasets with synthetic data,”arXiv preprint arXiv:2410.02056, 2024

  9. [9]

    Explaining and Harnessing Adversarial Examples

    I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,”arXiv preprint arXiv:1412.6572, 2014

  10. [10]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, and et al., “Intriguing properties of neural networks,”arXiv preprint arXiv:1312.6199, 2013

  11. [11]

    Understanding deep learning requires rethinking generalization

    C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand- ing deep learning requires rethinking generalization,”arXiv preprint arXiv:1611.03530, 2016

  12. [12]

    wav2vec: Unsupervised pre-training for speech recognition,

    S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,”arXiv preprint arXiv:1904.05862, 2019

  13. [13]

    Specaugment: A simple data augmentation method for automatic speech recognition,

    D. S. Park, W. Chan, Y . Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V . Le, “Specaugment: A simple data augmentation method for automatic speech recognition,”arXiv preprint arXiv:1904.08779, 2019

  14. [14]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,”Ad- vances in neural information processing systems, vol. 33, pp. 12 449– 12 460, 2020

  15. [15]

    Musical genre classification of audio signals,

    G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,”IEEE Transactions on speech and audio processing, vol. 10, no. 5, pp. 293–302, 2002

  16. [16]

    Dynamic attention-asymmetric perceptron network for overlapping sound event detection,

    Y . Miao, J. Zhu, and Y . Li, “Dynamic attention-asymmetric perceptron network for overlapping sound event detection,”IEEE Transactions on Audio, Speech and Language Processing, vol. 34, pp. 636–649, 2026

  17. [17]

    SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

    X. Yang, Y . Yang, Z. Jin, Z. Cui, W. Wu, B. Li, C. Zhang, and P. Woodland, “Spear: A unified ssl framework for learning speech and audio representations,”arXiv preprint arXiv:2510.25955, 2025

  18. [18]

    Scaling up masked audio encoder learning for general audio classification,

    H. Dinkel, Z. Yan, Y . Wang, J. Zhang, Y . Wang, and B. Wang, “Scaling up masked audio encoder learning for general audio classification,” arXiv preprint arXiv:2406.06992, 2024

  19. [19]

    A survey on contrastive self-supervised learning,

    A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,”Technologies, vol. 9, no. 1, p. 2, 2020

  20. [20]

    Audio self-supervised learning: A survey,

    S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, and B. W. Schuller, “Audio self-supervised learning: A survey,”Patterns, vol. 3, no. 12, 2022

  21. [21]

    Scaling bioacoustic signal pre-training with million samples via mask- modeling,

    X. Deng, T. Wan, K. Xu, T. Gao, P. Qiao, D. Feng, and Y . Dou, “Scaling bioacoustic signal pre-training with million samples via mask- modeling,” inICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  22. [22]

    Hearsay benchmark: Do audio llms leak what they hear?

    J. Wang, L. Lin, K. Luo, W. Wang, Y . Chen, M. Aloqaily, X. Tang, Z. Zhou, K. Wang, L. Sunet al., “Hearsay benchmark: Do audio llms leak what they hear?”arXiv preprint arXiv:2601.03783, 2026

  23. [23]

    A survey on self-supervised learning: Algorithms, applications, and future trends,

    J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: Algorithms, applications, and future trends,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 9052–9071, 2024

  24. [24]

    Self-supervised speech representation learning: A review,

    A. Mohamed, H.-y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff, S.-W. Li, K. Livescu, L. Maaløeet al., “Self-supervised speech representation learning: A review,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179–1210, 2022

  25. [25]

    Ssast: Self- supervised audio spectrogram transformer,

    Y . Gong, C.-I. J. Lai, Y .-A. Chung, and J. Glass, “Ssast: Self- supervised audio spectrogram transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10 699– 10 709

  26. [26]

    Unsupervised feature learning via non-parametric instance discrimination,

    Z. Wu, Y . Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3733–3742

  27. [27]

    Unsupervised representation learning by predicting image rotations,

    S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” inInternational Conference on Learning Representations, 2018

  28. [28]

    Unsupervised learning of visual representa- tions by solving jigsaw puzzles,

    M. Noroozi and P. Favaro, “Unsupervised learning of visual representa- tions by solving jigsaw puzzles,” inEuropean conference on computer vision. Springer, 2016, pp. 69–84

  29. [29]

    A simple frame- work for contrastive learning of visual representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple frame- work for contrastive learning of visual representations,” inInternational conference on machine learning. PmLR, 2020, pp. 1597–1607

  30. [30]

    Contrastive learning of general-purpose audio representations,

    A. Saeed, D. Grangier, and N. Zeghidour, “Contrastive learning of general-purpose audio representations,” inICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3875–3879

  31. [31]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

  32. [32]

    Masked autoencoders that listen,

    P.-Y . Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked autoencoders that listen,”Advances in neural information processing systems, vol. 35, pp. 28 708–28 720, 2022

  33. [33]

    Masked spectrogram prediction for self-supervised audio pre-training,

    D. Chong, H. Wang, P. Zhou, and Q. Zeng, “Masked spectrogram prediction for self-supervised audio pre-training,” inICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  34. [34]

    Recent advances in discrete speech tokens: A review,

    Y . Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu, “Recent advances in discrete speech tokens: A review,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2025

  35. [35]

    Beats: Audio pre-training with acoustic tokenizers,

    S. Chen, Y . Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei, “Beats: Audio pre-training with acoustic tokenizers,”arXiv preprint arXiv:2212.09058, 2022

  36. [36]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y .-C. Tsai, K. Lakhotia, R. Salakhutdinov, M. Ma, and J. Glass, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

  37. [37]

    Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text,

    H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y . Cui, and B. Gong, “Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text,”Advances in neural information processing systems, vol. 34, pp. 24 206–24 221, 2021

  38. [38]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  39. [39]

    Clap learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” inICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  40. [40]

    Pre-training audio representations with self-supervision,

    M. Tagliasacchi, B. Gfeller, F. de Chaumont Quitry, and D. Roblek, “Pre-training audio representations with self-supervision,”IEEE Signal Processing Letters, vol. 27, pp. 600–604, 2020

  41. [41]

    Shuffle and learn: unsupervised learning using temporal order verification,

    I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” inEuropean conference on computer vision. Springer, 2016, pp. 527–544

  42. [42]

    Self-supervised learning of audio representations from permutations with differentiable ranking,

    A. N. Carr, Q. Berthet, M. Blondel, O. Teboul, and N. Zeghidour, “Self-supervised learning of audio representations from permutations with differentiable ranking,”IEEE Signal Processing Letters, vol. 28, pp. 708–712, 2021

  43. [43]

    Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

    S. Pascual, M. Ravanelli, J. Serra, A. Bonafonte, and Y . Bengio, “Learning problem-agnostic speech representations from multiple self- supervised tasks,”arXiv preprint arXiv:1904.03416, 2019

  44. [44]

    Multi-task self-supervised learning for robust speech recognition,

    M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y . Bengio, “Multi-task self-supervised learning for robust speech recognition,” inICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6989–6993

  45. [45]

    Clar: Contrastive learning of auditory representations,

    H. Al-Tahan and Y . Mohsenzadeh, “Clar: Contrastive learning of auditory representations,” inInternational conference on artificial intelligence and statistics. PMLR, 2021, pp. 2530–2538

  46. [46]

    Byol for audio: Exploring pre-trained general-purpose audio repre- 22 sentations,

    D. Niizumi, D. Takeuchi, Y . Ohishi, N. Harada, and K. Kashino, “Byol for audio: Exploring pre-trained general-purpose audio repre- 22 sentations,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 137–151, 2022

  47. [47]

    Representation Learning with Contrastive Predictive Coding

    A. v. d. Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

  48. [48]

    Byol for audio: Self-supervised learning for general-purpose audio representation,

    D. Niizumi, D. Takeuchi, Y . Ohishi, N. Harada, and K. Kashino, “Byol for audio: Self-supervised learning for general-purpose audio representation,” in2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8

  49. [49]

    An Unsupervised Autoregressive Model for Speech Representation Learning

    Y .-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised au- toregressive model for speech representation learning,”arXiv preprint arXiv:1904.03240, 2019

  50. [50]

    Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,

    A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” inICASSP 2020-2020 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6419–6423

  51. [51]

    Audio albert: A lite bert for self-supervised learning of audio representation,

    P.-H. Chi, P.-H. Chung, T.-H. Wu, C.-C. Hsieh, Y .-H. Chen, S.-W. Li, and H.-y. Lee, “Audio albert: A lite bert for self-supervised learning of audio representation,” in2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 344–350

  52. [52]

    Audioldm 2: Learning holistic audio generation with self-supervised pretraining,

    H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2871–2883, 2024

  53. [53]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer,

    Y . Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,”arXiv preprint arXiv:2409.00750, 2024

  54. [54]

    w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,

    Y .-A. Chung, Y . Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y . Wu, “w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” inIEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250

  55. [55]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  56. [56]

    Discrete audio tokens: More than a survey!

    P. Mousavi, G. Maimon, A. Moumen, D. Petermann, J. Shi, H. Wu, H. Yang, A. Kuznetsova, A. Ploujnikov, R. Marxeret al., “Discrete audio tokens: More than a survey!”arXiv preprint arXiv:2506.10274, 2025

  57. [57]

    Data2vec: A general framework for self-supervised learning in speech, vision and language,

    A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: A general framework for self-supervised learning in speech, vision and language,” inInternational conference on machine learning. PMLR, 2022, pp. 1298–1312

  58. [58]

    Eat: Self- supervised pre-training with efficient audio transformer,

    W. Chen, Y . Liang, Z. Ma, Z. Zheng, and X. Chen, “Eat: Self- supervised pre-training with efficient audio transformer,”arXiv preprint arXiv:2401.03497, 2024

  59. [59]

    Look, listen and learn,

    R. Arandjelovic and A. Zisserman, “Look, listen and learn,” inProceed- ings of the IEEE international conference on computer vision, 2017, pp. 609–617

  60. [60]

    Soundnet: Learning sound representations from unlabeled video,

    Y . Aytar, C. V ondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,”Advances in neural information processing systems, vol. 29, 2016

  61. [61]

    Robust audio-visual in- stance discrimination,

    P. Morgado, I. Misra, and N. Vasconcelos, “Robust audio-visual in- stance discrimination,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 934–12 945

  62. [62]

    Audioclip: Extending clip to image, text and audio,

    A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip: Extending clip to image, text and audio,” inICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 976–980

  63. [63]

    Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks,

    X. Li, N. Shao, and X. Li, “Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1336–1351, 2024

  64. [64]

    Audio mamba: Selective state spaces for self- supervised audio representations,

    S. Yadav and Z.-H. Tan, “Audio mamba: Selective state spaces for self- supervised audio representations,”arXiv preprint arXiv:2406.02178, 2024

  65. [65]

    Ssamba: Self-supervised audio representation learning with mamba state space model,

    S. Shams, S. S. Dindar, X. Jiang, and N. Mesgarani, “Ssamba: Self-supervised audio representation learning with mamba state space model,”arXiv preprint arXiv:2405.11831, 2024

  66. [66]

    Mamba in speech: Towards an alternative to self-attention,

    X. Zhang, Q. Zhang, H. Liu, T. Xiao, X. Qian, B. Ahmed, E. Am- bikairajah, H. Li, and J. Epps, “Mamba in speech: Towards an alternative to self-attention,”IEEE Transactions on Audio, Speech and Language Processing, 2025

  67. [67]

    xlstm: Extended long short-term memory,

    M. Beck, K. P ¨oppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, “xlstm: Extended long short-term memory,”Advances in Neural Information Processing Systems, vol. 37, pp. 107 547–107 603, 2024

  68. [68]

    Axlstms: learning self-supervised audio representations with xlstms,

    S. Yadav, S. Theodoridis, and Z.-H. Tan, “Axlstms: learning self-supervised audio representations with xlstms,”arXiv preprint arXiv:2408.16568, 2024

  69. [69]

    Tera: Self-supervised learning of transformer encoder representation for speech,

    A. T. Liu, S.-W. Li, and H.-y. Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,”IEEE/ACM Transac- tions on Audio, Speech, and Language Processing, vol. 29, pp. 2351– 2366, 2021

  70. [70]

    Ast: Audio spectrogram trans- former,

    Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spectrogram trans- former,”arXiv preprint arXiv:2104.01778, 2021

  71. [71]

    AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

    K. Yamamoto and K. Okusa, “Aape: Aliasing-aware patch embed- ding for self-supervised audio representation learning,”arXiv preprint arXiv:2512.03637, 2025

  72. [72]

    Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition,

    Y . Zhang, D. S. Park, W. Han, J. Qin, A. Gulati, J. Shor, A. Jansen, Y . Xu, Y . Huang, S. Wanget al., “Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1519–1532, 2022

  73. [73]

    Hypercon- former: Multi-head hypermixer for efficient speech recognition,

    F. Mai, J. Zuluaga-Gomez, T. Parcollet, and P. Motlicek, “Hypercon- former: Multi-head hypermixer for efficient speech recognition,”arXiv preprint arXiv:2305.18281, 2023

  74. [74]

    Branchformer: Parallel mlp-attention architectures to capture local and global context for speech recognition and understanding,

    Y . Peng, S. Dalmia, I. Lane, and S. Watanabe, “Branchformer: Parallel mlp-attention architectures to capture local and global context for speech recognition and understanding,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 17 627–17 643

  75. [75]

    E-branchformer: Branchformer with enhanced merging for speech recognition,

    K. Kim, F. Wu, Y . Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, “E-branchformer: Branchformer with enhanced merging for speech recognition,” in2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 84–91

  76. [76]

    Zipformer: A faster and better encoder for automatic speech recognition,

    Z. Yao, L. Guo, X. Yang, W. Kang, F. Kuang, Y . Yang, Z. Jin, L. Lin, and D. Povey, “Zipformer: A faster and better encoder for automatic speech recognition,”arXiv preprint arXiv:2310.11230, 2023

  77. [77]

    Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,

    K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” inICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650

  78. [78]

    Joint semantic knowl- edge distillation and masked acoustic modeling for full-band speech restoration with improved intelligibility,

    X. Liu, X. Li, J. Serr `a, and S. Pascual, “Joint semantic knowl- edge distillation and masked acoustic modeling for full-band speech restoration with improved intelligibility,” inICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  79. [79]

    Distilhubert: Speech represen- tation learning by layer-wise distillation of hidden-unit bert,

    H.-J. Chang, S.-w. Yang, and H.-y. Lee, “Distilhubert: Speech represen- tation learning by layer-wise distillation of hidden-unit bert,” inICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7087–7091

  80. [80]

    Skill: Similarity-aware knowledge distillation for speech self-supervised learning,

    L. Zampierin, G. B. Hacene, B. Nguyen, and M. Ravanelli, “Skill: Similarity-aware knowledge distillation for speech self-supervised learning,” in2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 2024, pp. 675– 679

Showing first 80 references.