Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

Bernhard Sick; Christoph Scholz; Houtan Ghaffari; Ilyass Moummad; Lukas Miklautz; Lukas Rauch; René Heinrich

arxiv: 2509.24901 · v4 · pith:TNERJ53Snew · submitted 2025-09-29 · 💻 cs.SD · cs.LG

Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz

keywords: audio, probing, information, pooling, bottleneck, fine-tuning, global, linear

Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning when pursuing state-of-the-art results on AudioSet. A key reason is that global pooling creates an information bottleneck that causes linear probes to misrepresent embedding quality: the $\texttt{cls}$-token discards crucial token-level information about dispersed, localized events in audio. This weakness is rooted in the mismatch between the pretraining objective (global) and the downstream task (localized). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we investigate the global pooling bottleneck. We introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
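
The pooling bottleneck described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses mean pooling as a stand-in for a global summary such as the cls-token, fixed hand-written prototypes instead of learned ones, and it omits the binarization step, whose details the abstract does not give. The names `mean_pool` and `prototype_pool` and the three-dimensional tokens are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(tokens):
    """Global pooling: collapse all patch tokens into one clip-level vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def prototype_pool(tokens, prototypes):
    """Class-wise pooling: for each prototype, keep the best-matching token."""
    return [max(cosine(t, p) for t in tokens) for p in prototypes]


# Toy clip (assumed data): nine background tokens, one short localized event.
background = [1.0, 0.0, 0.0]
event = [0.0, 1.0, 0.0]
tokens = [background] * 9 + [event]

# One hypothetical prototype per class; a real probe would learn these.
prototypes = [background, event]

# Score each class from the single pooled vector (global bottleneck).
pooled = mean_pool(tokens)
global_scores = [cosine(pooled, p) for p in prototypes]

# Score each class directly from the patch tokens.
local_scores = prototype_pool(tokens, prototypes)

print("global pooling scores:   ", global_scores)
print("prototype pooling scores:", local_scores)
```

In this toy clip the event occupies one of ten tokens, so its score under mean pooling falls to about 0.11, while the per-prototype maximum over tokens keeps it at 1.0. Both classes are present in the clip, which is the multi-label case where a single pooled vector tends to favor the dominant content.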

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AudioMosaic: Contrastive Masked Audio Representation Learning
cs.LG 2026-05 unverdicted novelty 6.0

AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.