S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
Pith reviewed 2026-05-08 03:29 UTC · model grok-4.3
The pith
S-SONDO distills general audio foundation models using only output embeddings to produce students up to 61 times smaller that retain up to 96 percent of teacher performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S-SONDO performs self-supervised knowledge distillation from general audio foundation models to smaller students by matching output embeddings alone, without logits, layer-level alignment, or task-specific supervision. This yields architecture-agnostic compression. Experiments produce three students up to 61 times smaller than the teachers while retaining up to 96 percent of performance across tasks, together with observations on suitable loss choices and balanced sampling via clustering.
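To make the mechanism concrete, here is a minimal PyTorch-style sketch of an embedding-matching update, assuming a frozen teacher and a trainable student that each map a batch of audio to one embedding per clip. The projection head, names, and training details are illustrative assumptions, not the released code; only the idea of matching output embeddings with a simple loss comes from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, proj, batch, optimizer):
    """One embedding-matching update (illustrative sketch, not the released code).

    student / teacher: modules mapping a batch of log-Mel inputs to one
    embedding per clip; the teacher is frozen. proj: an assumed linear head
    that maps the student's embedding width onto the teacher's so the two
    can be compared directly.
    """
    with torch.no_grad():
        target = teacher(batch)          # (B, d_teacher), no gradient
    pred = proj(student(batch))          # (B, d_teacher)
    loss = F.mse_loss(pred, target)      # match output embeddings only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```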
What carries the argument
S-SONDO, the embedding-matching distillation process that aligns teacher and student output vectors directly and incorporates clustering to create balanced training batches.
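A sketch of how clustering-based balanced sampling could work in practice: cluster the teacher's embeddings once, then draw each training batch evenly across clusters. The paper reports balanced sampling via clustering; the use of k-means and the specific cluster count below are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_batch_sampler(teacher_embeddings, batch_size, n_clusters=64, seed=0):
    """Yield index batches drawn evenly across k-means clusters of the
    teacher's embedding space (k-means and the cluster count are assumptions)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(teacher_embeddings)
    buckets = [np.flatnonzero(labels == c) for c in range(n_clusters)]
    buckets = [b for b in buckets if len(b) > 0]       # drop empty clusters
    per_cluster = max(1, batch_size // len(buckets))
    while True:
        picks = [rng.choice(b, size=per_cluster, replace=True) for b in buckets]
        yield rng.permutation(np.concatenate(picks))[:batch_size]
```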
If this is right
- Audio foundation models become deployable on edge devices through large size reductions with limited performance loss.
- Distillation now applies to self-supervised and metric-learning audio models that expose only embeddings.
- Loss selection and clustering-based sampling provide reusable practices for improving embedding-based distillation.
- Teacher-student size gaps can be bridged without architecture matching or internal feature access.
Where Pith is reading between the lines
- The same embedding-matching idea could be tested on foundation models from vision or language that also output only embeddings.
- Pairing S-SONDO with quantization or pruning might produce even smaller yet still capable audio models.
- Evaluating the students on real-world audio with background noise would test whether the retained performance holds outside clean benchmarks.
Load-bearing premise
Aligning output embeddings alone transfers enough knowledge for the smaller model to perform well across varied audio tasks.
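One standard way to check this premise is linear probing: fit a linear classifier on the frozen student embeddings for each downstream task and see how much task structure survives. Whether the paper's evaluation uses exactly this protocol is an assumption; the sketch below only illustrates the idea.

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Fit a linear classifier on frozen embeddings and report test accuracy.

    Linear probing is a common evaluation for self-supervised audio
    embeddings; using it here is an assumed protocol, not a claim from the paper.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)
    return clf.score(test_emb, test_labels)
```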
What would settle it
A student trained only to match teacher embeddings that nonetheless scores well below 90 percent of teacher accuracy on a broad audio benchmark spanning classification, retrieval, and tagging would show the method is insufficient.
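A minimal sketch of that falsification test: compute per-task retention as the ratio of student to teacher score and flag an average well below 0.9. The task names and numbers below are hypothetical placeholders, not results from the paper.

```python
def retention(student_scores, teacher_scores):
    """Per-task and average retention: student score divided by teacher score."""
    per_task = {t: student_scores[t] / teacher_scores[t] for t in teacher_scores}
    return per_task, sum(per_task.values()) / len(per_task)

# Hypothetical placeholder numbers, for illustration only.
per_task, avg = retention(
    {"classification": 0.78, "retrieval": 0.41, "tagging": 0.35},
    {"classification": 0.85, "retrieval": 0.50, "tagging": 0.40},
)
print(per_task, avg, "method insufficient" if avg < 0.9 else "retention holds")
```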
read the original abstract
General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces S-SONDO, a self-supervised knowledge distillation framework for general audio foundation models that operates solely on teacher and student output embeddings (plus clustering-based balanced sampling) without requiring logits, intermediate features, or architecture-specific alignment. It claims to be the first such method, demonstrates distillation of two large audio foundation models into three smaller student models achieving up to 61x size reduction while retaining up to 96% of teacher performance on downstream tasks, and offers practical insights on loss choice.
Significance. If the experimental results hold under detailed scrutiny, the work would be significant for enabling compression of embedding-only audio models (including self-supervised ones) in an architecture-agnostic manner, addressing a gap in prior audio distillation literature that relies on supervised signals. The public code release supports reproducibility and is a strength.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The central claim that output-embedding matching alone suffices to retain up to 96% teacher performance (and the 61x compression) is load-bearing but unsupported by reported baselines, downstream task definitions, data splits, error bars, or statistical tests; without these, it is impossible to confirm that the embedding space encodes recoverable task structure independent of the teacher's original training objective.
- [§3.1] §3.1 (Method): The loss formulation for embedding alignment is presented as sufficient, yet no ablation compares it against alternatives (e.g., contrastive or reconstruction losses on the same embeddings) or tests whether the student architecture can represent the same invariances; this directly tests the weakest assumption that simple matching transfers general audio knowledge.
- [§4.2] §4.2 (Results): The reported retention numbers for the three students lack per-task breakdowns and comparisons to supervised distillation or feature-based methods, undermining the assertion that avoiding logits and layer alignment is broadly effective rather than task-specific.
minor comments (2)
- [Introduction] The acronym expansion in the title and abstract is clear but could be repeated once in the introduction for readers who skip the abstract.
- [§3.2 and Figure 1] Figure 1 caption and §3.2 should explicitly state the embedding dimensionality and distance metric used in the loss to aid replication.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of experimental rigor and clarity that we will address in the revision. We respond to each major comment below, indicating where we agree and what changes will be made.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that output-embedding matching alone suffices to retain up to 96% teacher performance (and the 61x compression) is load-bearing but unsupported by reported baselines, downstream task definitions, data splits, error bars, or statistical tests; without these, it is impossible to confirm that the embedding space encodes recoverable task structure independent of the teacher's original training objective.
Authors: We agree that additional explicit documentation would strengthen the claims. Section 4 of the manuscript defines the downstream tasks (AudioSet, ESC-50, VGGSound, and others), follows standard public data splits, and reports average relative performance (the 96% figure) across tasks for the distilled students. However, we did not include a dedicated baseline table for embedding-only distillation or formal statistical tests. In the revised manuscript we will add: (i) a comparison table with students trained from random initialization and with supervised distillation where logits are available, (ii) standard error bars from multiple runs, and (iii) paired statistical significance tests. These additions will directly demonstrate that the teacher embedding space encodes recoverable task structure. revision: yes
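A sketch of how the promised paired significance test could be run over matched per-task (or per-seed) scores for the distilled student versus a baseline. The choice of a paired t-test here is an assumption about how the comparison might be implemented, not necessarily what the revision will use.

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_comparison(distilled_scores, baseline_scores):
    """Paired t-test over matched scores (same tasks/splits/seeds, same order).

    The specific test is an assumed choice; the revision may use another
    paired procedure (e.g., Wilcoxon signed-rank or a bootstrap).
    """
    d = np.asarray(distilled_scores, dtype=float)
    b = np.asarray(baseline_scores, dtype=float)
    stat, p_value = ttest_rel(d, b)
    return stat, p_value, float(d.mean() - b.mean())
```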
-
Referee: [§3.1] §3.1 (Method): The loss formulation for embedding alignment is presented as sufficient, yet no ablation compares it against alternatives (e.g., contrastive or reconstruction losses on the same embeddings) or tests whether the student architecture can represent the same invariances; this directly tests the weakest assumption that simple matching transfers general audio knowledge.
Authors: The MSE embedding-alignment loss was selected for its architecture-agnostic simplicity and direct matching of output spaces. We acknowledge that an ablation study would better validate this choice. In the revision we will add a new subsection with ablations comparing MSE against contrastive (InfoNCE) and reconstruction losses applied to the same teacher embeddings. We will also analyze embedding similarity metrics and invariance properties to confirm that the student architecture can represent the transferred knowledge. These results will be reported with the same downstream evaluation protocol. revision: yes
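For reference, minimal sketches of two of the loss variants named in this response (MSE and an InfoNCE-style contrastive loss), plus a cosine variant as another common choice; a reconstruction loss would additionally need a small decoder and is omitted. All operate on the same pair of student and teacher embedding batches; the normalization and temperature values are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def mse_loss(student_emb, teacher_emb):
    # Direct embedding matching, the loss reported in the current manuscript.
    return F.mse_loss(student_emb, teacher_emb)

def cosine_loss(student_emb, teacher_emb):
    # Direction-only alternative that ignores embedding magnitudes.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

def infonce_loss(student_emb, teacher_emb, temperature=0.07):
    # Contrastive alternative: each student embedding should be closest to its
    # own teacher embedding within the batch. The temperature is an assumed value.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```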
-
Referee: [§4.2] §4.2 (Results): The reported retention numbers for the three students lack per-task breakdowns and comparisons to supervised distillation or feature-based methods, undermining the assertion that avoiding logits and layer alignment is broadly effective rather than task-specific.
Authors: Section 4.2 currently summarizes aggregate retention; we will expand it to include full per-task performance tables for each student-teacher pair, explicitly showing retention percentages on every downstream task. Regarding comparisons, S-SONDO is designed for teachers that provide only embeddings (common in self-supervised audio models). We will nevertheless add, where feasible, comparisons against feature-based distillation methods that use intermediate layers and against supervised logit-based distillation on tasks where labels are available. This will clarify both the advantages in embedding-only settings and the performance gap relative to methods that require more teacher information. revision: partial
Circularity Check
No circularity: empirical distillation method with independent validation
full rationale
The paper proposes S-SONDO as a new framework for distilling audio foundation models using only output embeddings, without logits or layer alignment. Claims of effectiveness (up to 61x compression, 96% retention) rest on experimental results distilling two independent teacher models into three students, plus practical insights on loss and sampling. No derivations, equations, or predictions reduce to fitted inputs by construction; no self-citations form load-bearing chains; the method is presented as tested on external benchmarks. This is a standard empirical contribution with self-contained validation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Self-Supervised Learning (SSL) has recently emerged as a dominant paradigm in representation learning, with remarkable success across speech and general audio. By learning directly from raw data without labels, SSL models can leverage vast amounts of unlabeled audio and achieve performance competitive with their supervised counterparts [1...
2024
-
[2]
S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
RELATED WORK Knowledge distillation (KD) is a widely used method for transferring knowledge from a large pre-trained teacher model to a smaller student model. Hinton et al. in [4] introduced KD to improve the student’s performance by leveraging the teacher’s learned knowledge, matching their output class logits ...
2026
-
[3]
Our KD strategy follows the formulation introduced in the original KD framework [4]
METHOD Figure 1 illustrates the overall pipeline of our method. Our KD strategy follows the formulation introduced in the original KD framework [4]. We adopt a response-based approach in which the student is trained to align its outputs with those of the teacher. Unlike the standard KD setting that matches model logits, our self-supervised KD setup instea...
-
[4]
Each clip is transformed into a log-Mel spectrogram at 32 kHz with a 32 ms window, 16 ms hop, and 128 Mel bins covering 50–16,000 Hz, which serves as input to the student models
EXPERIMENTAL SETUP Pre-Training setup We use AudioSet [18] as the training dataset, restricted to 10-second clips, yielding 1.8M samples. Each clip is transformed into a log-Mel spectrogram at 32 kHz with a 32 ms window, 16 ms hop, and 128 Mel bins covering 50–16,000 Hz, which serves as input to the student models. For pre-training, we follow the hyperpara...
-
[5]
RESULTS AND DISCUSSION Main results Table 1 reports the results of S-SONDO across all teacher-student combinations. Since this is, to the best of our knowledge, the first SSL KD framework that relies exclusively on embeddings as the training signal, we compare the performance of f_θ trained using SSL KD of g_γ to the ones obtained by training directly f_θ i...
-
[6]
We evaluated our method using three different students and two teachers
CONCLUSION In this work, we introduced S-SONDO, the first self-supervised knowledge distillation method that aligns student and teacher embeddings without relying on class logits or the model’s architecture. We evaluated our method using three different students and two teachers. We showed that in 4/6 cases, the performance of our distillations outperform...
-
[7]
Masked modeling duo: Learning representations by encouraging both networks to model the input,
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino, “Masked modeling duo: Learning representations by encouraging both networks to model the input,” in Proc. ICASSP, June 2023, pp. 1–5
2023
-
[8]
Matpac++: Enhanced masked latent prediction for self-supervised audio representation learning,
Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, and Slim Essid, “Matpac++: Enhanced masked latent prediction for self-supervised audio representation learning,” arXiv preprint arXiv:2508.12709, 2025
-
[9]
Asit: Local-global audio spectrogram vision transformer for event classification,
Sara Atito Ali Ahmed, Muhammad Awais, Wenwu Wang, Mark D. Plumbley, and Josef Kittler, “Asit: Local-global audio spectrogram vision transformer for event classification,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 3684–3693, July 2024
2024
-
[10]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015
2015
-
[11]
Fitnets: Hints for thin deep nets,
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio, “Fitnets: Hints for thin deep nets,” in Proc. ICLR, Yoshua Bengio and Yann LeCun, Eds., May 2015
2015
-
[12]
Knowledge distillation on graphs: A survey,
Yijun Tian, Shichao Pei, Xiangliang Zhang, Chuxu Zhang, and Nitesh V. Chawla, “Knowledge distillation on graphs: A survey,” ACM Computing Surveys, vol. 57, 2025
2025
-
[13]
Sssd: Self-supervised self distillation,
Wei-Chi Chen and Wei-Ta Chu, “Sssd: Self-supervised self distillation,” in Proc. WACV, January 2023, pp. 2770–2777
2023
-
[14]
Clusterfit: Improving generalization of visual representations,
Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan, “Clusterfit: Improving generalization of visual representations,” in Proc. CVPR, June 2020, pp. 6508–6517
2020
-
[15]
Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert,
Heng-Jui Chang, Shu-wen Yang, and Hung-yi Lee, “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert,” in Proc. ICASSP, May 2022, pp. 7087–7091
2022
-
[16]
DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective,
Hyung gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, and Ahmed Hussen Abdelaziz, “DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective,” in Proc. Interspeech, June 2025, pp. 1–5
2025
-
[17]
Feature-informed embedding space regularization for audio classification,
Yun-Ning Hung and Alexander Lerch, “Feature-informed embedding space regularization for audio classification,” in Proc. Eusipco, June 2022, pp. 1–5
2022
-
[18]
A comprehensive overhaul of feature distillation,
Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi, “A comprehensive overhaul of feature distillation,” in Proc. ICCV, November 2019, pp. 1921–1930
2019
-
[19]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units,
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3451–3460, 2021
2021
-
[20]
Dynamic convolutional neural networks as efficient pre-trained audio models,
Florian Schmid, Khaled Koutini, and Gerhard Widmer, “Dynamic convolutional neural networks as efficient pre-trained audio models,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 2227–2241, 2024
2024
-
[21]
Audio embeddings as teachers for music classification,
Yiwei Ding and Alexander Lerch, “Audio embeddings as teachers for music classification,” in Proc. ISMIR, November 2023, pp. 579–587
2023
-
[22]
Efficient large-scale audio tagging via transformer-to-cnn knowledge distillation,
Florian Schmid, Khaled Koutini, and Gerhard Widmer, “Efficient large-scale audio tagging via transformer-to-cnn knowledge distillation,” in Proc. ICASSP, June 2023, pp. 1–5
2023
-
[23]
CLAP learning audio concepts from natural language supervision,
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “CLAP learning audio concepts from natural language supervision,” in Proc. ICASSP, June 2023, pp. 1–5
2023
-
[24]
Audio Set: An ontology and human-labeled dataset for audio events,
J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, March 2017, pp. 776–780
2017
-
[25]
Searching for mobilenetv3,
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al., “Searching for mobilenetv3,” in Proc. ICCV, November 2019, pp. 1314–1324
2019
-
[26]
An enhanced res2net with local and global feature fusion for speaker verification,
Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, and Jiajun Qi, “An enhanced res2net with local and global feature fusion for speaker verification,” in Proc. Interspeech, August 2023, pp. 2228–2232
2023
-
[27]
Openmic-2018: An open data-set for multiple instrument recognition,
Eric Humphrey, Simon Durand, and Brian McFee, “Openmic-2018: An open data-set for multiple instrument recognition,” in Proc. ISMIR, September 2018, pp. 438–444
2018
-
[28]
Musical genre classification of audio signals,
George Tzanetakis and Perry R. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002
2002
-
[29]
Neural audio synthesis of musical notes with wavenet autoencoders,
Jesse H. Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of the 34th International Conference on Machine Learning (ICML), August 2017, vol. 70, pp. 1068–1077
2017
-
[30]
Evaluation of algorithms using games: The case of music tagging.,
Edith Law, Kris West, Michael Mandel, Mert Bay, and J. Stephen Downie, “Evaluation of algorithms using games: The case of music tagging,” in Proc. ISMIR, October 2009, pp. 387–392
2009
-
[31]
Esc: Dataset for environmental sound classification,
Karol J. Piczak, “Esc: Dataset for environmental sound classification,” in Proc. MM, Oct. 2015, pp. 1015–1018
2015
-
[32]
A dataset and taxonomy for urban sound research,
Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, “A dataset and taxonomy for urban sound research,” in Proc. MM, Nov. 2014, pp. 1041–1044
2014
-
[33]
Fsd50k: An open dataset of human-labeled sound events,
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra, “Fsd50k: An open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2022
2022