AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers
Pith reviewed 2026-05-17 02:26 UTC · model grok-4.3
The pith
AaSP uses input-estimated kernels to fuse high-frequency cues lost to aliasing in audio spectrogram pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that AaSP, an aliasing-aware self-supervised pre-training framework, learns representations that integrate features from alias-prone modulation bands while remaining stable across masked views. It does so by combining four components: an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization.
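To make "masked views" concrete, here is a minimal sketch of drawing several random masks over the patch tokens of one clip. The function name `sample_masks`, the mask ratio, and the number of views are hypothetical illustrations; the page does not state the paper's masking strategy or its values.

```python
import random


def sample_masks(num_patches, mask_ratio, num_views, seed=0):
    """Return num_views boolean masks; True marks a masked patch.

    All arguments are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masks = []
    for _ in range(num_views):
        # Each view masks the same number of patches at different positions.
        masked = set(rng.sample(range(num_patches), num_masked))
        masks.append([i in masked for i in range(num_patches)])
    return masks


# Four masked views of a clip with 64 patch tokens, 75% masked in each view.
views = sample_masks(num_patches=64, mask_ratio=0.75, num_views=4)
```

A multi-mask regularizer would then compare the representations produced under these views; how AaSP does that comparison is not specified here.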
What carries the argument
The Aliasing-aware Patch Embedding (AaPE) module carries the argument. It augments standard patch tokens with features from alias-prone modulation bands, using a band-limited complex sinusoidal kernel with a two-sided exponential window whose frequency and decay parameters are estimated from the input.
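As a rough illustration of the kernel family described above, the sketch below builds a complex sinusoid under a two-sided exponential window and measures how strongly it responds to a matched versus a mismatched tone. The names `aape_like_kernel` and `band_response`, the units, and the fixed parameter values are assumptions; in the paper the frequency and decay are estimated from the input, and its exact formula and normalization are not given on this page.

```python
import cmath
import math


def aape_like_kernel(freq, decay, half_width):
    """Complex sinusoid under a two-sided exponential window.

    freq       -- centre frequency in cycles per sample (assumed unit)
    decay      -- non-negative decay rate per sample (assumed unit)
    half_width -- number of taps on each side of the centre tap
    """
    taps = []
    for n in range(-half_width, half_width + 1):
        window = math.exp(-decay * abs(n))            # two-sided exponential
        carrier = cmath.exp(2j * math.pi * freq * n)  # complex sinusoid
        taps.append(window * carrier)
    return taps


def band_response(kernel, signal):
    """Magnitude of the kernel's correlation with the centre of the signal."""
    half_width = len(kernel) // 2
    centre = len(signal) // 2
    acc = 0j
    for i, tap in enumerate(kernel):
        acc += tap.conjugate() * signal[centre + i - half_width]
    return abs(acc)


# Toy input: a pure tone at 0.30 cycles per sample.
tone = [math.cos(2 * math.pi * 0.30 * n) for n in range(129)]
matched = band_response(aape_like_kernel(0.30, 0.05, 32), tone)
mismatched = band_response(aape_like_kernel(0.10, 0.05, 32), tone)
```

The matched kernel responds far more strongly than the mismatched one, which is the sense in which such a kernel performs subband analysis. A larger `decay` shortens the window and widens the analysed band.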
If this is right
- Under fine-tuning the full framework reaches state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines.
- It remains competitive on other acoustic, environmental, speech, and music recognition benchmarks.
- Linear evaluation shows gains on US8K and NSynth.
- The learned representations are more stable under aliasing-sensitive temporal perturbations.
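One hypothetical way to quantify stability under temporal perturbations is the mean cosine similarity between the embedding of a clip and the embeddings of small time shifts of it. The metric, the function names, and the two toy "embeddings" below are assumptions for illustration; the page does not say how the paper measures stability.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def shift_stability(embed, frames, max_shift):
    """Mean cosine similarity between embed(frames) and circular shifts."""
    reference = embed(frames)
    scores = []
    for shift in range(1, max_shift + 1):
        shifted = frames[shift:] + frames[:shift]  # circular time shift
        scores.append(cosine_similarity(reference, embed(shifted)))
    return sum(scores) / len(scores)


def strided_pick(frames, stride=4):
    """Toy embedding: keep every stride-th frame (no anti-aliasing)."""
    return frames[::stride]


def strided_mean(frames, stride=4):
    """Toy embedding: average each block of stride frames, then decimate."""
    return [sum(frames[i:i + stride]) / stride
            for i in range(0, len(frames) - stride + 1, stride)]


# Slow sinusoid plus a fast alternating component that aliases under stride 4.
frames = [math.sin(2 * math.pi * n / 32) + 0.8 * (-1) ** n for n in range(64)]
pick_score = shift_stability(strided_pick, frames, max_shift=3)
mean_score = shift_stability(strided_mean, frames, max_shift=3)
```

On this toy signal the plain strided pick is much less stable than the averaged variant, because the fast component flips sign with odd shifts. The averaged variant buys that stability by discarding the fast component entirely.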
Where Pith is reading between the lines
- The adaptive kernel idea might apply to other patch-based transformers that process downsampled time-frequency data, such as certain video or radar models.
- One could test whether sharing the estimated kernel parameters across similar audio clips reduces computation while preserving the reported gains.
- The stability under temporal perturbations suggests the method could help in streaming or low-latency audio applications where aliasing varies with input rate.
Load-bearing premise
The premise is that aliasing from convolutional patchification is a primary performance bottleneck, and that an input-estimated kernel can fuse useful high-frequency cues without introducing instabilities or overfitting to the estimated parameters.
What would settle it
An ablation that removes the aliasing-aware kernel components and measures performance on high-frequency-sensitive tasks, such as music pitch or certain speech benchmarks, would settle it. If results stay the same or improve, the claimed contribution of the aliasing fix would be falsified.
Original abstract
Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naive low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
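The aliasing mechanism the abstract refers to can be checked with a few lines of arithmetic. The sketch below uses illustrative numbers only (a 100 frames-per-second spectrogram and a temporal stride of 16, neither taken from the paper) and a hypothetical helper `alias_frequency`.

```python
import math


def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a real sinusoid after sampling at sample_rate_hz."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


# Illustrative numbers only (not taken from the paper).
frame_rate_hz = 100.0                    # spectrogram frames per second
stride = 16                              # temporal stride of the patchifier
token_rate_hz = frame_rate_hz / stride   # 6.25 tokens per second
nyquist_hz = token_rate_hz / 2           # 3.125 Hz after downsampling

modulation_hz = 5.0                      # above the post-stride Nyquist
apparent_hz = alias_frequency(modulation_hz, token_rate_hz)

# Sampled at the token rate, the 5 Hz cosine and its alias coincide.
times = [k / token_rate_hz for k in range(32)]
fast = [math.cos(2 * math.pi * modulation_hz * t) for t in times]
slow = [math.cos(2 * math.pi * apparent_hz * t) for t in times]
max_gap = max(abs(a - b) for a, b in zip(fast, slow))
```

Here a 5 Hz temporal modulation becomes indistinguishable from a 1.25 Hz one once only every sixteenth frame is kept, which is the kind of confusion an aliasing-aware embedding is meant to address.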
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. It introduces the Aliasing-aware Patch Embedding (AaPE) module that augments standard convolutional patch tokens with features from alias-prone modulation bands via a band-limited complex sinusoidal kernel with a two-sided exponential window; the kernel's frequency and decay parameters are estimated from the input for adaptive subband fusion. The framework bundles this with teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization. Pre-trained on AudioSet, the model is evaluated via fine-tuning and linear probing on acoustic, speech, and music benchmarks, claiming state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines (with competitive performance elsewhere) and improved stability under aliasing-sensitive temporal perturbations.
Significance. If the performance gains are shown to arise specifically from the adaptive aliasing-aware kernel rather than the bundled SSL components, the work could meaningfully advance spectrogram-based audio transformers by preserving task-relevant high-frequency cues without introducing instability. The input-estimated kernel offers a potentially elegant, adaptive mechanism for subband integration that aligns with the paper's emphasis on stability across masked views. However, the significance hinges on rigorous isolation of contributions and quantitative validation of the aliasing-specific benefits.
major comments (2)
- [evaluation section] The central claim that AaPE's input-estimated kernel drives the reported fine-tuning gains on AS-20K, ESC-50, and NSynth is not supported by ablation studies that isolate its contribution from the teacher-student masked modeling, cross-attention predictor, and multi-mask contrastive regularization. Without such controls, attribution of SOTA results to aliasing awareness remains unverified (evaluation section and associated tables).
- [§4] The abstract and results claim SOTA performance and improved stability under aliasing-sensitive perturbations, yet no quantitative details are provided on baseline scores, ablation results, error bars, or specific controls for the adaptive kernel estimation process. This absence undermines assessment of the method's load-bearing claims (abstract and §4).
minor comments (2)
- [§3.2] Clarify the exact fusion mechanism between AaPE outputs and standard patch tokens, including any weighting or concatenation details, to improve reproducibility of the patch-embedding module.
- [related work] Add explicit references to prior work on aliasing effects in convolutional spectrogram processing and adaptive filtering in audio SSL to better contextualize the novelty of the two-sided exponential window.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify the need for stronger isolation of the AaPE module's contributions and more detailed quantitative reporting to support the central claims. We will revise the evaluation section and §4 accordingly to address these points.
Point-by-point responses
- Referee: [evaluation section] The central claim that AaPE's input-estimated kernel drives the reported fine-tuning gains on AS-20K, ESC-50, and NSynth is not supported by ablation studies that isolate its contribution from the teacher-student masked modeling, cross-attention predictor, and multi-mask contrastive regularization. Without such controls, attribution of SOTA results to aliasing awareness remains unverified (evaluation section and associated tables).
  Authors: We agree that the current ablations do not fully isolate the adaptive kernel's role from the bundled SSL components. In the revised manuscript we will add controlled experiments that disable or replace the input-estimated subband fusion in AaPE while keeping the teacher-student masked modeling, cross-attention predictor, and multi-mask contrastive regularization unchanged. These results will be reported in the evaluation section and tables to enable clearer attribution of performance gains to aliasing awareness.
  Revision: yes
- Referee: [§4] The abstract and results claim SOTA performance and improved stability under aliasing-sensitive perturbations, yet no quantitative details are provided on baseline scores, ablation results, error bars, or specific controls for the adaptive kernel estimation process. This absence undermines assessment of the method's load-bearing claims (abstract and §4).
  Authors: We acknowledge that additional quantitative details are required. The revised version will expand §4 with complete tables of baseline scores, full ablation results, error bars computed over multiple random seeds, and explicit controls comparing the input-estimated kernel against fixed-parameter variants. These additions will strengthen the assessment of both the SOTA claims and the stability improvements under aliasing-sensitive perturbations.
  Revision: yes
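The error bars and fixed-versus-estimated comparison promised in the rebuttal could be summarized along the following lines. This is a generic sketch: the helper names, the two-standard-deviation threshold, and the scores are placeholders chosen for illustration and are not results or procedures from the paper.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(variant_a, variant_b, num_std=2.0):
    """Crude check: is the mean gap larger than num_std pooled std devs?"""
    mean_a, std_a = summarize_seeds(variant_a)
    mean_b, std_b = summarize_seeds(variant_b)
    pooled = ((std_a ** 2 + std_b ** 2) / 2) ** 0.5
    return abs(mean_a - mean_b) > num_std * pooled


# Placeholder scores, NOT results from the paper.
input_estimated = [0.70, 0.71, 0.72]
fixed_kernel = [0.60, 0.61, 0.62]

mean_est, std_est = summarize_seeds(input_estimated)
clear_gap = gap_exceeds_noise(input_estimated, fixed_kernel)
```

A proper analysis would use more seeds and a paired statistical test. The point is only that a gap between variants should be read against seed-to-seed variation.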
Circularity Check
No significant circularity; derivation is self-contained
The paper presents AaSP as an empirical framework combining an input-estimated kernel in AaPE for aliasing-aware patching with standard SSL elements (teacher-student masking, cross-attention predictor, multi-mask contrastive loss). Kernel parameters are explicitly estimated per-input rather than globally fitted or derived by construction from the target metrics. Performance claims rely on pre-training on AudioSet followed by fine-tuning/linear evaluation on external benchmarks (AS-20K, ESC-50, NSynth, etc.), without any equation or step reducing the reported gains to a tautological fit, self-citation chain, or renamed known result. The central claims therefore retain independent content from the described architecture and evaluation protocol.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing in spectrogram inputs.
- domain assumption: Naive low-pass filtering may remove task-relevant high-frequency cues.
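The second assumption can be illustrated with the simplest anti-aliasing filter, a moving average applied before downsampling. The sketch below evaluates its magnitude response at a few modulation frequencies; the 100 frames-per-second rate, the 16-tap length, and the helper name `boxcar_gain` are illustrative assumptions, and the moving average stands in for "naive low-pass filtering" generally.

```python
import math


def boxcar_gain(freq_hz, frame_rate_hz, length):
    """Magnitude response of a length-tap moving average at freq_hz."""
    omega = math.pi * freq_hz / frame_rate_hz
    if math.isclose(math.sin(omega), 0.0, abs_tol=1e-12):
        return 1.0  # DC (and multiples of the frame rate) pass unchanged
    return abs(math.sin(length * omega) / (length * math.sin(omega)))


# Illustrative setting: 100 frames per second, 16-tap average (one stride).
slow_gain = boxcar_gain(0.5, 100.0, 16)    # slow modulation, nearly preserved
fast_gain = boxcar_gain(5.0, 100.0, 16)    # fast modulation, strongly attenuated
null_gain = boxcar_gain(6.25, 100.0, 16)   # falls on a null of the filter
```

The filter suppresses aliasing by attenuating exactly the fast modulations that a downstream task might need. That is the trade-off motivating an adaptive subband analysis instead of a fixed low-pass.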