AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Kohei Yamamoto; Kosuke Okusa

arXiv: 2512.03637 · v2 · pith:QSKAJ67Dnew · submitted 2025-12-03 · 💻 cs.SD · cs.LG · stat.ML

Pith reviewed 2026-05-17 02:26 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · stat.ML

keywords self-supervised learning · audio spectrogram transformers · aliasing-aware patch embedding · masked modeling · audio classification · high-frequency cues · pre-training

The pith

AaSP uses input-estimated kernels to fuse high-frequency cues lost to aliasing in audio spectrogram pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that convolutional patchification with temporal downsampling in spectrogram transformers lowers the effective Nyquist frequency and introduces aliasing that discards task-relevant high-frequency information. It introduces AaSP, which augments standard patch tokens through an aliasing-aware embedding that adaptively analyzes alias-prone modulation bands with a band-limited complex sinusoidal kernel whose parameters are estimated from each input. If correct, this produces representations that integrate those cues while staying stable across different masked views. The approach pairs the embedding with teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization. A reader would care because the resulting features transfer better to environmental sound, speech, and music tasks without relying on naive low-pass filters that remove useful detail.
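To make the Nyquist argument concrete, here is a minimal NumPy sketch of how a temporal modulation above the post-downsampling Nyquist frequency folds to a lower frequency. The frame rate (100 Hz), stride (16), and modulation frequency (5 Hz) are assumed values chosen for illustration, not settings from the paper, and plain strided sampling stands in for the learned convolutional patchification.

```python
import numpy as np


def aliased_frequency(f_hz, fs_hz):
    """Apparent frequency of an f_hz tone after sampling at fs_hz."""
    folded = f_hz % fs_hz
    return min(folded, fs_hz - folded)


# Assumed values for illustration only (not taken from the paper).
frame_rate = 100.0                  # spectrogram frames per second (10 ms hop)
stride = 16                         # temporal stride of the patch embedding
patch_rate = frame_rate / stride    # patches per second after downsampling
nyquist_after = patch_rate / 2      # effective Nyquist frequency after striding

modulation_hz = 5.0                 # temporal modulation above the new Nyquist
t = np.arange(1600) / frame_rate    # 16 seconds of frames
envelope = np.cos(2 * np.pi * modulation_hz * t)

# Strided sampling with no anti-aliasing filter: a simplification of
# convolutional patchification, which applies a learned filter first.
patches = envelope[::stride]
spectrum = np.abs(np.fft.rfft(patches))
freqs = np.fft.rfftfreq(len(patches), d=1.0 / patch_rate)
observed_hz = freqs[np.argmax(spectrum)]

print(nyquist_after, aliased_frequency(modulation_hz, patch_rate), observed_hz)
```

With these assumed numbers the patch rate is 6.25 Hz, so the new Nyquist frequency is 3.125 Hz and the 5 Hz modulation shows up at about 1.25 Hz, where it cannot be told apart from a genuine 1.25 Hz modulation.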

Core claim

The central claim is that an aliasing-aware self-supervised pre-training framework learns representations that integrate features from alias-prone modulation bands while remaining stable across masked views, achieved by combining an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization.
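The summary does not spell out the form of the multi-mask contrastive term. As one plausible reading, assuming a generic InfoNCE loss over clip-level embeddings from two masked views of the same batch, consistency across views could be scored as in the sketch below. The function name `info_nce`, the temperature, and the toy embeddings are illustrative assumptions, not the authors' loss.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss where row i of z_a and row i of z_b form a positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy clip-level embeddings (hypothetical): 4 clips with orthonormal features.
view_a = np.eye(4)

# Second masked view agrees with the first: loss is close to zero.
consistent = info_nce(view_a, view_a.copy())

# Second view is matched to the wrong clips: loss is large.
inconsistent = info_nce(view_a, np.roll(view_a, 1, axis=0))

print(consistent, inconsistent)
```

The loss is small when each clip's two masked views map to nearby embeddings and large when they do not, which is the sense in which such a term would reward stability across masked views.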

What carries the argument

The Aliasing-aware Patch Embedding (AaPE) module, which augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window and frequency and decay parameters estimated from the input.
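As a rough illustration of the kernel described above, the following sketch builds a complex sinusoid under a two-sided exponential window, k[n] = exp(-α|n|)·exp(jβn), with decay α and frequency β, and checks that it responds most strongly to content near β. The paper's exact parameterization, its band limits, and the network that estimates α and β from the input are not reproduced here; `two_sided_exp_kernel`, `band_energy`, and the fixed values of `alpha` and `beta` are hypothetical stand-ins.

```python
import numpy as np


def two_sided_exp_kernel(alpha, beta, half_width=32):
    """Complex sinusoid at beta rad/sample under an exp(-alpha * |n|) window."""
    n = np.arange(-half_width, half_width + 1)
    return np.exp(-alpha * np.abs(n)) * np.exp(1j * beta * n)


def band_energy(x, kernel):
    """Mean magnitude of x after filtering with the complex kernel."""
    return float(np.mean(np.abs(np.convolve(x, kernel, mode="valid"))))


# Hypothetical fixed values; in AaPE these are estimated from each input.
alpha, beta = 0.2, 1.0

n = np.arange(512)
on_target = np.cos(beta * n)    # tone at the kernel frequency
off_target = np.cos(2.0 * n)    # tone one radian per sample away

kernel = two_sided_exp_kernel(alpha, beta)
e_on = band_energy(on_target, kernel)
e_off = band_energy(off_target, kernel)

print(e_on, e_off)
```

A smaller `alpha` gives a longer window and a narrower passband around `beta`, while a larger `alpha` widens it; that trade-off is what an input-dependent decay estimate would be adjusting.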

If this is right

Under fine-tuning, the full framework reaches state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines.
It remains competitive on other acoustic, environmental, speech, and music recognition benchmarks.
Linear evaluation shows gains on US8K and NSynth.
The learned representations are more stable under aliasing-sensitive temporal perturbations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The adaptive kernel idea might apply to other patch-based transformers that process downsampled time-frequency data, such as certain video or radar models.
One could test whether sharing the estimated kernel parameters across similar audio clips reduces computation while preserving the reported gains.
The stability under temporal perturbations suggests the method could help in streaming or low-latency audio applications where aliasing varies with input rate.

Load-bearing premise

That aliasing from convolutional patchification is a primary performance bottleneck and that an input-estimated kernel can fuse useful high-frequency cues without introducing instabilities or overfitting to the estimation.

What would settle it

An ablation that removes the aliasing-aware kernel components while leaving the rest of the framework unchanged, then measures whether performance drops on high-frequency-sensitive tasks such as music pitch or certain speech benchmarks, would settle it. If results stay the same or improve, the claimed contribution of the aliasing fix would be falsified.

Figures

Figures reproduced from arXiv: 2512.03637 by Kohei Yamamoto, Kosuke Okusa.

**Figure 1.** Overview of the Teacher-student Self-supervised Learning Scheme with AaPE. Aliasing-aware Patch Embedding (AaPE) replaces the ViT patch embedding, performing dynamic subband frequency analysis to extract aliasing-aware features from aliasing-prone bands and fusing them with the standard patch tokens. The class (CLS) token predicts a global summary of the teacher outputs, while a cross-attention predictor p…

**Figure 2.** Spectra and Gradient Magnitudes for Various Window Functions. The two-sided window sustains usable gradients over wider frequency offsets, while one-sided and Gaussian windows suffer rapid off-target vanishing; this supports our choice of a two-sided exponential window in SBLU for stable subband-wise estimation in aliasing-prone bands. The window parameters α and σg are normalized so that the peak gradient…

**Figure 3.** Architecture of the Aliasing-aware Patch Embedding (AaPE). AaPE augments the standard ViT patch embedding with aliasing-aware features via three components: (1) Lambda Encoder: takes log-mel spectrogram patches and estimates input-dependent complex kernel parameters Λalias with decay α and frequency β per subband using a narrow-and-shallow Transformer and linear projections; (2) Adaptive SBLU: emphasizes ali…

**Figure 4.** Example of Patch-wise Adaptive Estimation of SBLU Kernel Parameters. Top: input spectrogram prior to patchification, which is subsequently patchified and fed into the Lambda Encoder. Middle/Bottom: per-time-patch distributions of the estimated SBLU kernel parameters (decay and frequency) produced by the pre-trained Lambda Encoder, showing adaptive variation in response to the input. the AudioSet [16] data…

Original abstract

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. It introduces the Aliasing-aware Patch Embedding (AaPE) module that augments standard convolutional patch tokens with features from alias-prone modulation bands via a band-limited complex sinusoidal kernel with a two-sided exponential window; the kernel's frequency and decay parameters are estimated from the input for adaptive subband fusion. The framework bundles this with teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization. Pre-trained on AudioSet, the model is evaluated via fine-tuning and linear probing on acoustic, speech, and music benchmarks, claiming state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines (with competitive performance elsewhere) and improved stability under aliasing-sensitive temporal perturbations.

Significance. If the performance gains are shown to arise specifically from the adaptive aliasing-aware kernel rather than the bundled SSL components, the work could meaningfully advance spectrogram-based audio transformers by preserving task-relevant high-frequency cues without introducing instability. The input-estimated kernel offers a potentially elegant, adaptive mechanism for subband integration that aligns with the paper's emphasis on stability across masked views. However, the significance hinges on rigorous isolation of contributions and quantitative validation of the aliasing-specific benefits.

major comments (2)

[evaluation section] The central claim that AaPE's input-estimated kernel drives the reported fine-tuning gains on AS-20K, ESC-50, and NSynth is not supported by ablation studies that isolate its contribution from the teacher-student masked modeling, cross-attention predictor, and multi-mask contrastive regularization. Without such controls, attribution of SOTA results to aliasing awareness remains unverified (evaluation section and associated tables).
[§4] The abstract and results claim SOTA performance and improved stability under aliasing-sensitive perturbations, yet no quantitative details are provided on baseline scores, ablation results, error bars, or specific controls for the adaptive kernel estimation process. This absence undermines assessment of the method's load-bearing claims (abstract and §4).

minor comments (2)

[§3.2] Clarify the exact fusion mechanism between AaPE outputs and standard patch tokens, including any weighting or concatenation details, to improve reproducibility of the patch-embedding module.
[related work] Add explicit references to prior work on aliasing effects in convolutional spectrogram processing and adaptive filtering in audio SSL to better contextualize the novelty of the two-sided exponential window.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify the need for stronger isolation of the AaPE module's contributions and more detailed quantitative reporting to support the central claims. We will revise the evaluation section and §4 accordingly to address these points.

Point-by-point responses

Referee: [evaluation section] The central claim that AaPE's input-estimated kernel drives the reported fine-tuning gains on AS-20K, ESC-50, and NSynth is not supported by ablation studies that isolate its contribution from the teacher-student masked modeling, cross-attention predictor, and multi-mask contrastive regularization. Without such controls, attribution of SOTA results to aliasing awareness remains unverified (evaluation section and associated tables).

Authors: We agree that the current ablations do not fully isolate the adaptive kernel's role from the bundled SSL components. In the revised manuscript we will add controlled experiments that disable or replace the input-estimated subband fusion in AaPE while keeping the teacher-student masked modeling, cross-attention predictor, and multi-mask contrastive regularization unchanged. These results will be reported in the evaluation section and tables to enable clearer attribution of performance gains to aliasing awareness.
revision: yes
Referee: [§4] The abstract and results claim SOTA performance and improved stability under aliasing-sensitive perturbations, yet no quantitative details are provided on baseline scores, ablation results, error bars, or specific controls for the adaptive kernel estimation process. This absence undermines assessment of the method's load-bearing claims (abstract and §4).

Authors: We acknowledge that additional quantitative details are required. The revised version will expand §4 with complete tables of baseline scores, full ablation results, error bars computed over multiple random seeds, and explicit controls comparing the input-estimated kernel against fixed-parameter variants. These additions will strengthen the assessment of both the SOTA claims and the stability improvements under aliasing-sensitive perturbations.
revision: yes
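The error bars promised in the rebuttal amount to a mean and a standard deviation per configuration across seeds. A minimal sketch, assuming one scalar metric per seed; `summarize_runs` is a hypothetical helper and the input numbers are placeholders, not results from the paper.

```python
import numpy as np


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric across seeds."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # ddof=1 gives the sample standard deviation; undefined for a single run.
    std = scores.std(ddof=1) if scores.size > 1 else 0.0
    return float(mean), float(std)


# Placeholder values for three seeds; NOT results from the paper.
placeholder_scores = [1.0, 2.0, 3.0]
mean, std = summarize_runs(placeholder_scores)
print(f"{mean:.2f} ± {std:.2f}")
```

Reporting the same summary for the input-estimated kernel and for a fixed-parameter variant would show whether any gap between them exceeds seed-to-seed variation.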

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents AaSP as an empirical framework combining an input-estimated kernel in AaPE for aliasing-aware patching with standard SSL elements (teacher-student masking, cross-attention predictor, multi-mask contrastive loss). Kernel parameters are explicitly estimated per-input rather than globally fitted or derived by construction from the target metrics. Performance claims rely on pre-training on AudioSet followed by fine-tuning/linear evaluation on external benchmarks (AS-20K, ESC-50, NSynth, etc.), without any equation or step reducing the reported gains to a tautological fit, self-citation chain, or renamed known result. The central claims therefore retain independent content from the described architecture and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach relies on standard signal processing assumptions about spectrogram aliasing and the utility of subband fusion; no new physical entities are postulated and the adaptive parameters are input-derived rather than globally fitted constants.

axioms (2)

domain assumption: Convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing in spectrogram inputs.
Invoked in the opening problem statement to motivate the need for aliasing-aware processing.
domain assumption: Naive low-pass filtering may remove task-relevant high-frequency cues.
Used to justify avoiding simple filtering solutions.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
[2] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervision,” ... 2023.
[3] P.-Y. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked autoencoders that listen,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[4] A. Baade, P. Peng, and D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2022, pp. 2438–2442.
[5] Y. Gong, C.-I. Lai, Y.-A. Chung, and J. Glass, “SSAST: Self-Supervised Audio Spectrogram Transformer,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 36, no. 10, 2022, pp. 10 699–10 709.
[6] X. Li, N. Shao, and X. Li, “Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1336–1351, 2023.
[7] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, “Masked modeling duo: Towards a universal audio pre-training framework,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2391–2406, 2024.
[8] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio Spectrogram Transformer,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2021, pp. 571–575.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
[10] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 16 000–16 009.
[11] A. Bruguier, A. Misra, A. Narayanan, and R. Prabhavalkar, “Anti-aliasing regularization in stacking layers,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 314–318.
[12] E. Fonseca, A. Ferraro, and X. Serra, “Improving sound event classification by increasing shift invariance in convolutional neural networks,” arXiv:2107.00623, 2021.
[13] R. Zhang, “Making convolutional networks shift-invariant again,” in International Conference on Machine Learning (ICML). PMLR, 2019, pp. 7324–7334.
[14] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations (ICLR), 2022.
[15] A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization and initialization of diagonal state space models,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 35 971–35 983, 2022.
[16] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[17] K. J. Piczak, “Esc: Dataset for environmental sound classification,” in ACM International Conference on Multimedia (ACM MM), ser. MM ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1015–1018.
[18] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 1041–1044.
[19] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in International Conference on Machine Learning (ICML). PMLR, 2017, pp. 1068–1077.
[20] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv:1804.03209, 2018.
[21] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.
[22] D. Chong, H. Wang, P. Zhou, and Q. Zeng, “Masked spectrogram prediction for self-supervised audio pre-training,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[23] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: audio pre-training with acoustic tokenizers,” in International Conference on Machine Learning (ICML), 2023.
[24] S. A. A. Ahmed, M. Awais, W. Wang, M. D. Plumbley, and J. Kittler, “Asit: Local-global audio spectrogram vision transformer for event classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3684–3693, 2024.
[25] W. Chen, Y. Liang, Z. Ma, Z. Zheng, and X. Chen, “Eat: Self-supervised pre-training with efficient audio transformer,” in International Joint Conference on Artificial Intelligence (IJCAI), K. Larson, Ed. IJCAI, 8 2024, pp. 3807–3815, main Track.
[26] J. Wang, T. Wang, M. Ge, L. Wang, and J. Dang, “ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2025, pp. 5803–5807.
[27] T. Alex, S. Atito, A. Mustafa, M. Awais, and P. J. B. Jackson, “SSLAM: Enhancing self-supervised models with audio mixtures for polyphonic soundscapes,” in International Conference on Learning Representations (ICLR), 2025.
[28] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 12 449–12 460.
[29] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. rahman Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[30] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv:2312.00752, 2023.
[31] N. Zeghidour, O. Teboul, F. de Chaumont Quitry, and M. Tagliasacchi, “Leaf: A learnable frontend for audio classification,” in International Conference on Learning Representations (ICLR), 2021.
[32] J. Schlüter and G. Gutenbrunner, “Efficientleaf: A faster learnable audio frontend of questionable use,” in European Signal Processing Conference (EUSIPCO), 2022, pp. 205–208.
[33] V. Lostanlen, D. Haider, H. Han, M. Lagrange, P. Balazs, and M. Ehler, “Fitting auditory filterbanks with multiresolution neural networks,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023, pp. 1–5.
[34] M. Anderson, T. H. Kinnunen, and N. Harte, “Learnable frontends that do not learn: Quantifying sensitivity to filterbank initialisation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–5.
[35] B. Hayes, C. Saitis, and G. Fazekas, “Sinusoidal frequency estimation by gradient descent,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–5.
[36] L. Fu, L. Lian, R. Wang, B. Shi, X. Wang, A. Yala, T. Darrell, A. A. Efros, and K. Goldberg, “Rethinking patch dependence for masked autoencoders,” arXiv:2401.14391, 2024.
[37] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 1597–1607.
[38] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.
[39] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations (ICLR), 2017.
[40] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv:2106.08254, 2021.
[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations (ICLR), 2018.
[42] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European Conference on Computer Vision (ECCV), 2016.
[43] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2019.
[44] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,” in International Conference on Machine Learning (ICML), 2022, pp. 1298–1312.
[45] A. Quelennec, P. Chouteau, G. Peeters, and S. Essid, “Masked latent prediction and classification for self-supervised audio representation learning,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025, pp. 1–5.

APPENDIX A. We derive Eq. (2) under the notation and conventions specified in the main text (see Sec. III-B)....

[1] [1]

Bert: Pre- training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language understanding,” inConference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019

work page 2019

[2] [2]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y . Huang, H. Xu, V . Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features withou...

work page 2023

[3] [3]

Masked autoencoders that listen,

P.-Y . Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked autoencoders that listen,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[4] [4]

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer,

A. Baade, P. Peng, and D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer,” inAnnual Conference of the Interna- tional Speech Communication Association (INTERSPEECH), 2022, pp. 2438–2442

work page 2022

[5] [5]

SSAST: Self-Supervised Audio Spectrogram Transformer,

Y . Gong, C.-I. Lai, Y .-A. Chung, and J. Glass, “SSAST: Self-Supervised Audio Spectrogram Transformer,” inAAAI Conference on Artificial Intelligence (AAAI), vol. 36, no. 10, 2022, pp. 10 699–10 709

work page 2022

[6] [6]

Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks,

X. Li, N. Shao, and X. Li, “Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 32, pp. 1336– 1351, 2023

work page 2023

[7] [7]

Masked modeling duo: Towards a universal audio pre-training frame- work,

D. Niizumi, D. Takeuchi, Y . Ohishi, N. Harada, and K. Kashino, “Masked modeling duo: Towards a universal audio pre-training frame- work,”IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 32, pp. 2391–2406, 2024

work page 2024

[8] [8]

AST: Audio Spectrogram Trans- former,

Y . Gong, Y .-A. Chung, and J. Glass, “AST: Audio Spectrogram Trans- former,” inAnnual Conference of the International Speech Communica- tion Association (INTERSPEECH), 2021, pp. 571–575

work page 2021

[9] [9]

An image is worth 16x16 words: Trans- formers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Trans- formers for image recognition at scale,” inInternational Conference on Learning Representations (ICLR), 2021

work page 2021

[10] [10]

Masked Autoencoders Are Scalable Vision Learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 16 000–16 009

work page 2022

[11] A. Bruguier, A. Misra, A. Narayanan, and R. Prabhavalkar, “Anti-aliasing regularization in stacking layers,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 314–318.

[12] E. Fonseca, A. Ferraro, and X. Serra, “Improving sound event classification by increasing shift invariance in convolutional neural networks,” arXiv:2107.00623, 2021.

[13] R. Zhang, “Making convolutional networks shift-invariant again,” in International Conference on Machine Learning (ICML). PMLR, 2019, pp. 7324–7334.

[14] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations (ICLR), 2022.

[15] A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization and initialization of diagonal state space models,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 35971–35983, 2022.

[16] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.

[17] K. J. Piczak, “Esc: Dataset for environmental sound classification,” in ACM International Conference on Multimedia (ACM MM), ser. MM ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1015–1018.

[18] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in ACM International Conference on Multimedia (ACM MM), 2014, pp. 1041–1044.

[19] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in International Conference on Machine Learning (ICML). PMLR, 2017, pp. 1068–1077.

[20] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv:1804.03209, 2018.

[21] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.

[22] D. Chong, H. Wang, P. Zhou, and Q. Zeng, “Masked spectrogram prediction for self-supervised audio pre-training,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[23] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: audio pre-training with acoustic tokenizers,” in International Conference on Machine Learning (ICML), 2023.

[24] S. A. A. Ahmed, M. Awais, W. Wang, M. D. Plumbley, and J. Kittler, “Asit: Local-global audio spectrogram vision transformer for event classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3684–3693, 2024.

[25] W. Chen, Y. Liang, Z. Ma, Z. Zheng, and X. Chen, “Eat: Self-supervised pre-training with efficient audio transformer,” in International Joint Conference on Artificial Intelligence (IJCAI), K. Larson, Ed. IJCAI, Aug. 2024, pp. 3807–3815, main track.

[26] J. Wang, T. Wang, M. Ge, L. Wang, and J. Dang, “ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2025, pp. 5803–5807.

[27] T. Alex, S. Atito, A. Mustafa, M. Awais, and P. J. B. Jackson, “SSLAM: Enhancing self-supervised models with audio mixtures for polyphonic soundscapes,” in International Conference on Learning Representations (ICLR), 2025.

[28] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 12449–12460.

[29] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.

[30] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv:2312.00752, 2023.

[31] N. Zeghidour, O. Teboul, F. de Chaumont Quitry, and M. Tagliasacchi, “Leaf: A learnable frontend for audio classification,” in International Conference on Learning Representations (ICLR), 2021.

[32] J. Schlüter and G. Gutenbrunner, “Efficientleaf: A faster learnable audio frontend of questionable use,” in European Signal Processing Conference (EUSIPCO), 2022, pp. 205–208.

[33] V. Lostanlen, D. Haider, H. Han, M. Lagrange, P. Balazs, and M. Ehler, “Fitting auditory filterbanks with multiresolution neural networks,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023, pp. 1–5.

[34] M. Anderson, T. H. Kinnunen, and N. Harte, “Learnable frontends that do not learn: Quantifying sensitivity to filterbank initialisation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–5.

[35] B. Hayes, C. Saitis, and G. Fazekas, “Sinusoidal frequency estimation by gradient descent,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–5.

[36] L. Fu, L. Lian, R. Wang, B. Shi, X. Wang, A. Yala, T. Darrell, A. A. Efros, and K. Goldberg, “Rethinking patch dependence for masked autoencoders,” arXiv:2401.14391, 2024.

[37] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, H. Daumé III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul. 2020, pp. 1597–1607.

[38] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019.

[39] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations (ICLR), 2017.

[40] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv:2106.08254, 2021.

[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations (ICLR), 2018.

[42] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European Conference on Computer Vision (ECCV), 2016.

[43] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2019.

[44] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language,” in International Conference on Machine Learning (ICML), 2022, pp. 1298–1312.

[45] A. Quelennec, P. Chouteau, G. Peeters, and S. Essid, “Masked latent prediction and classification for self-supervised audio representation learning,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025, pp. 1–5.

APPENDIX A

We derive Eq. (2) under the notation and conventions specified in the main text (see Sec. III-B)....