arxiv: 2606.17160 · v1 · pith:GHFBJ72Hnew · submitted 2026-06-15 · 💻 cs.SD

Transductive Zero-Shot Audio Classification with Audio-Language Models

Pith reviewed 2026-06-27 02:31 UTC · model grok-4.3

classification 💻 cs.SD
keywords transductive inference · zero-shot audio classification · CLAP · Gaussian mixture EM · posterior refinement · unlabeled test batch · audio-language models

The pith

A text-anchored spherical Gaussian-mixture EM refines zero-shot audio posteriors from unlabeled test-batch statistics and raises top-1 accuracy by 4.6 to 9.2 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that zero-shot audio classification with contrastive language-audio models processes each clip in isolation and therefore ignores statistical structure present across the unlabeled test set. It replaces independent inference with an expectation-maximization procedure that treats the batch embeddings as a spherical Gaussian mixture whose means are anchored to the corresponding text embeddings. The resulting posterior updates require no labels, no gradients, and only milliseconds of CPU time (about 15 ms on one CPU core for 2,000 clips), yet deliver consistent accuracy gains on ESC-50, UrbanSound8K, and VocalSound. The gains grow with the number of test samples per class, remain positive under moderate imbalance, and add to the gains from entropy-guided prompt weighting, but vanish when the initial zero-shot baseline is already near chance.

Core claim

The central claim is that a text-anchored spherical Gaussian-mixture expectation-maximization procedure, run once on the audio embeddings of an unlabeled test batch, refines the zero-shot posteriors of a contrastive language-audio pretraining model and thereby improves top-1 accuracy by 4.6 to 9.2 percentage points across standard audio classification benchmarks, with the improvement governed by a simple sample-per-class threshold and remaining positive under long-tailed batch distributions.

What carries the argument

text-anchored spherical Gaussian-mixture EM that updates zero-shot posteriors from test-batch audio-embedding statistics
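
The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, the blending weight `anchor` that pulls each class mean toward its text embedding, and the fixed iteration count are all hypothetical, and uniform class priors are assumed. Inputs are taken to be L2-normalized embeddings.

```python
import numpy as np

def zero_shot_posteriors(A, T, tau=100.0):
    """Softmax over temperature-scaled dot products (rows: clips, columns: classes)."""
    logits = tau * A @ T.T
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def text_anchored_em(A, T, tau=100.0, anchor=0.5, n_iter=10):
    """Hypothetical sketch of spherical-GMM EM with means pulled toward text embeddings.

    A: (N, d) unit-norm audio embeddings of one unlabeled test batch.
    T: (C, d) unit-norm text embeddings, one per class.
    anchor: assumed blending weight; 1.0 keeps the means at the text embeddings.
    """
    Z = zero_shot_posteriors(A, T, tau)  # initial responsibilities
    for _ in range(n_iter):
        # M-step: posterior-weighted mean of the batch embeddings for each class,
        # blended with the text embedding and projected back to the unit sphere.
        M = Z.T @ A / (Z.sum(axis=0)[:, None] + 1e-12)
        M = anchor * T + (1.0 - anchor) * M
        M /= np.linalg.norm(M, axis=1, keepdims=True)
        # E-step: with unit vectors, a shared variance, and uniform priors, the
        # Gaussian posterior is a softmax over scaled dot products with the means.
        Z = zero_shot_posteriors(A, M, tau)
    return Z
```

In this sketch, setting `anchor=1.0` keeps the means at the text embeddings, so the output reduces to the zero-shot posteriors; smaller values let the batch statistics move the class means.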

If this is right

  • Roughly 2.5 test samples per class per batch are required for the gain to appear, with diminishing returns beyond approximately 5 samples per class.
  • The EM refinement is complementary to entropy-guided prompt weighting and their combination reaches 96.2 percent on ESC-50.
  • Accuracy improvement attenuates but stays positive under long-tailed batch distributions, for example dropping from +4.9 to +3.1 points at 20:1 imbalance.
  • No improvement occurs on datasets such as TUT Urban Acoustic Scenes 2018 where the zero-shot baseline already performs near chance.
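
To make the operating boundary concrete, the sketch below converts a samples-per-class target into a minimum batch size. It assumes class-balanced batches and treats the paper's approximate thresholds (about 2.5 and about 5 samples per class) as exact; `min_batch_size` is a hypothetical helper, and the class counts are those given in the paper's experimental setup for ESC-50 and UrbanSound8K.

```python
import math

def min_batch_size(num_classes, samples_per_class=2.5):
    """Smallest class-balanced batch that meets a samples-per-class target."""
    return math.ceil(num_classes * samples_per_class)

print(min_batch_size(50))       # ESC-50, 50 classes -> 125
print(min_batch_size(10))       # UrbanSound8K, 10 classes -> 25
print(min_batch_size(50, 5.0))  # ESC-50 at about 5 per class -> 250
```

Under long-tailed batches the rare classes fall below the average, so a batch sized this way would likely need to be larger in practice.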

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same batch-level posterior refinement could be tested on other multimodal zero-shot tasks whenever multiple unlabeled items are presented together and their embeddings form visible clusters.
  • If the spherical-mixture assumption proves too restrictive, replacing it with a more flexible density model while keeping the text-anchor constraint would be a direct next step.
  • The method suggests that transductive post-processing may become a lightweight default step whenever frozen audio-language models are deployed on batches rather than single clips.

Load-bearing premise

The audio embeddings in each test batch are well described by a spherical Gaussian mixture whose component means coincide with the text embeddings of the target classes.
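
Written out, the premise and its consequence for the E-step look as follows. The symbols $\pi_c$ (mixing weights), $\mu_c$ (component means), and $\sigma^2$ (shared variance) are notation assumed here for illustration, and the anchoring of $\mu_c$ to the text embedding $t_c$ is stated only schematically.

```latex
p(a_i) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}\!\left(a_i \mid \mu_c, \sigma^2 I\right),
\qquad \mu_c \text{ anchored at } t_c .

% For unit-norm a_i and \mu_c: \|a_i - \mu_c\|^2 = 2 - 2\, a_i^\top \mu_c,
% so the constant term cancels between numerator and denominator:
p(c \mid a_i)
  = \frac{\pi_c \exp\!\left(-\|a_i - \mu_c\|^2 / 2\sigma^2\right)}
         {\sum_{c'} \pi_{c'} \exp\!\left(-\|a_i - \mu_{c'}\|^2 / 2\sigma^2\right)}
  = \frac{\pi_c \exp\!\left(a_i^\top \mu_c / \sigma^2\right)}
         {\sum_{c'} \pi_{c'} \exp\!\left(a_i^\top \mu_{c'} / \sigma^2\right)} .
```

With uniform $\pi_c$, this is a temperature-scaled softmax over cosine similarities, with $1/\sigma^2$ playing the role of the temperature. That is why the premise matters: if the clusters are not roughly isotropic around these means, the posterior is no longer a function of cosine similarity alone.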

What would settle it

Observing no accuracy gain after applying the EM procedure on a new audio dataset whose test embeddings visibly deviate from spherical clusters centered on the text embeddings would falsify the claim that the mixture model is what produces the observed improvement.

Original abstract

Contrastive language-audio pretraining (CLAP) enables zero-shot audio classification, but standard inference classifies each clip in isolation and ignores the structure of the unlabeled test set. We present the first systematic study of TransCLIP-style transductive inference for CLAP: a text-anchored spherical Gaussian-mixture EM that refines zero-shot posteriors using the audio-embedding statistics of the test batch, with no labels, no gradients, and negligible compute (about 15 ms on one CPU core for 2,000 clips). Across ESC-50, UrbanSound8K, and VocalSound, this consistently improves top-1 accuracy by +4.6 to +9.2 points over the zero-shot baseline (e.g., 89.1 -> 94.8% on ESC-50, 73.8 -> 81.8% on UrbanSound8K). We further show that the gain (i) is governed by a simple operating boundary -- roughly 2.5 test samples per class per batch are required, with diminishing returns beyond ~5; (ii) is complementary to entropy-guided prompt weighting, with the combination reaching 96.2% on ESC-50; and (iii) attenuates but remains positive under long-tailed batches (+4.9 -> +3.1 points at a 20:1 imbalance), which we report as an explicit limitation. We also document a negative result: on TUT Urban Acoustic Scenes 2018, where zero-shot CLAP is near chance, transduction has no signal to amplify.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a transductive inference procedure for CLAP-based zero-shot audio classification. It models test-batch audio embeddings as a spherical Gaussian mixture whose component means are fixed at the corresponding text embeddings, then applies EM to refine the initial zero-shot posteriors. The method uses no labels, no gradients, and negligible compute. Experiments report top-1 accuracy gains of +4.6 to +9.2 points on ESC-50 (89.1% → 94.8%), UrbanSound8K, and VocalSound; the gains require roughly 2.5 samples per class, remain complementary to entropy-guided prompt weighting, attenuate under long-tailed batches, and vanish on TUT where zero-shot performance is near chance.

Significance. If the distributional premise holds, the work supplies a simple, training-free, batch-level post-processing step that measurably improves zero-shot audio classification across multiple datasets while explicitly documenting its operating regime and a clear negative result. The parameter-free character and negligible runtime make the contribution immediately usable; the explicit limitation analysis on long-tailed data and the negative TUT result increase credibility.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (results): the reported accuracy deltas (e.g., 89.1 → 94.8% on ESC-50) are given as single point estimates without error bars, standard deviations across random batch orderings, or statistical significance tests. Because the EM procedure depends on the empirical distribution of each test batch, variability across batches must be quantified to support the claim of consistent improvement.
  2. [§3] §3 (method): the central modeling assumption—that audio embeddings in each test batch are well-described by a spherical Gaussian mixture centered at the text embeddings—is load-bearing for the EM updates, yet the manuscript provides no diagnostic (likelihood ratio test, embedding visualization, or comparison to alternative covariances) that this assumption holds on the reported datasets. The negative result on TUT is consistent with violation of the assumption, but the positive results on ESC-50/UrbanSound8K/VocalSound rest on an unverified premise.
minor comments (2)
  1. [§3] The notation for the E-step posterior and M-step update equations could be made more explicit by labeling each quantity with its dependence on the current batch.
  2. [§3, §4] Implementation details (exact CLAP model checkpoint, embedding dimensionality, convergence criterion for EM) are referenced only in passing; a short reproducibility paragraph or pseudocode block would help readers replicate the 15 ms timing claim.
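
The variability analysis requested in major comment 1 could be sketched as follows. This is an assumption-laden illustration: `predict` is a hypothetical callable that takes an array of test-set indices forming one batch and returns predicted labels for them, and the defaults for `n_shuffles` and `batch_size` are placeholders. Calling `accuracy_over_shuffles` with the same `seed` for two methods yields accuracies on identical shuffles, which is what a paired test needs.

```python
import numpy as np

def accuracy_over_shuffles(predict, labels, n_shuffles=10, batch_size=250, seed=0):
    """Top-1 accuracy over independently shuffled orderings of the test set.

    Returns (mean, sample std, per-shuffle accuracies).
    """
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_shuffles):
        order = rng.permutation(len(labels))
        correct = 0
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]  # one test batch
            correct += int((predict(idx) == labels[idx]).sum())
        accs.append(correct / len(labels))
    accs = np.array(accs)
    return accs.mean(), accs.std(ddof=1), accs

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic for accuracies measured on the same shuffles.

    Undefined when all paired differences are identical (zero variance).
    """
    diff = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))
```

The statistic has `n_shuffles - 1` degrees of freedom; converting it to a p-value is left to a statistics library.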

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on variability and modeling assumptions. We address both points below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (results): the reported accuracy deltas (e.g., 89.1 → 94.8% on ESC-50) are given as single point estimates without error bars, standard deviations across random batch orderings, or statistical significance tests. Because the EM procedure depends on the empirical distribution of each test batch, variability across batches must be quantified to support the claim of consistent improvement.

    Authors: We agree that batch-dependent variability should be quantified. In the revision we will add mean accuracy and standard deviation computed over 10 independent random shuffles of each test set (with fixed batch size), plus a paired t-test against the zero-shot baseline. This directly addresses the dependence on empirical batch statistics. revision: yes

  2. Referee: [§3] §3 (method): the central modeling assumption—that audio embeddings in each test batch are well-described by a spherical Gaussian mixture centered at the text embeddings—is load-bearing for the EM updates, yet the manuscript provides no diagnostic (likelihood ratio test, embedding visualization, or comparison to alternative covariances) that this assumption holds on the reported datasets. The negative result on TUT is consistent with violation of the assumption, but the positive results on ESC-50/UrbanSound8K/VocalSound rest on an unverified premise.

    Authors: The negative TUT result is already presented as evidence that the spherical-GMM premise fails when zero-shot performance is near chance. For the positive datasets we will add, in the revision, (i) 2-D t-SNE visualizations of audio embeddings colored by the final EM assignments and (ii) a comparison of held-out log-likelihood under the spherical model versus a diagonal-covariance alternative. These diagnostics will make the operating regime of the method more transparent. revision: yes
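
The covariance comparison promised in the second response could take roughly the following shape. This sketch simplifies in ways the paper may not: it fits one Gaussian per class from hard assignments (`assign`, for example the argmax of the refined posteriors) with a freely estimated mean, rather than evaluating the full text-anchored mixture, and every name in it is hypothetical.

```python
import numpy as np

def gaussian_loglik(X, mean, var):
    """Mean per-sample log-density under N(mean, diag(var)).

    `var` may be a scalar (spherical) or a length-d vector (diagonal).
    """
    var = np.broadcast_to(np.asarray(var, dtype=float), mean.shape)
    sq = (X - mean) ** 2 / var
    return float(np.mean(-0.5 * (sq.sum(axis=1) + np.log(2 * np.pi * var).sum())))

def heldout_spherical_vs_diagonal(X, assign, n_classes, train_frac=0.5, seed=0):
    """Held-out mean log-likelihood of per-class Gaussians: (spherical, diagonal)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    tr, te = order[:n_train], order[n_train:]
    ll_sph, ll_diag, count = 0.0, 0.0, 0
    for c in range(n_classes):
        Xtr, Xte = X[tr][assign[tr] == c], X[te][assign[te] == c]
        if len(Xtr) < 2 or len(Xte) == 0:
            continue  # too few points to fit or to evaluate this class
        mean = Xtr.mean(axis=0)
        var_diag = Xtr.var(axis=0) + 1e-6  # one variance per dimension
        var_sph = var_diag.mean()          # one shared variance for all dimensions
        ll_sph += gaussian_loglik(Xte, mean, var_sph) * len(Xte)
        ll_diag += gaussian_loglik(Xte, mean, var_diag) * len(Xte)
        count += len(Xte)
    return ll_sph / count, ll_diag / count
```

A diagonal score that is clearly higher on held-out data would suggest the spherical assumption is leaving structure unused; near-equal scores would support it.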

Circularity Check

0 steps flagged

No circularity: standard EM on fixed embeddings yields empirical gains

full rationale

The paper presents a text-anchored spherical GMM EM procedure applied to pre-computed CLAP audio embeddings from an unlabeled test batch to refine zero-shot posteriors. No equations are given that reduce the reported accuracy deltas (+4.6 to +9.2 points) to any fitted parameter or self-defined quantity inside the paper; the operating boundary (2.5 samples per class) and long-tail attenuation are stated as empirical observations. The method is described as a direct, label-free application of standard EM with means anchored at text embeddings, with no load-bearing self-citation chain, uniqueness theorem, or ansatz imported from prior author work. The derivation is therefore self-contained against external benchmarks and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that test-batch audio embeddings form spherical Gaussians whose means are given by text embeddings; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Audio embeddings in a test batch follow a spherical Gaussian mixture model with means anchored by text embeddings
    The method is defined as text-anchored spherical Gaussian-mixture EM

pith-pipeline@v0.9.1-grok · 5809 in / 1239 out tokens · 42754 ms · 2026-06-27T02:31:59.009277+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 1 linked inside Pith

  1. [1]

    the sound of a dog

    INTRODUCTION. Audio–language models such as CLAP [1, 2, 3] transfer the contrastive recipe of CLIP [4] to audio: a dual encoder (typically a transformer audio tower [5] pretrained at AudioSet scale [6] paired with a language model) aligns audio clips with natural-language captions. Any label set can then be classified zero-shot by embedding prompts such ...

  2. [2]

    Zero-shot CLAP inference. Let $f_a$ and $f_t$ denote the CLAP audio and text encoders

    METHOD. 2.1. Zero-shot CLAP inference. Let $f_a$ and $f_t$ denote the CLAP audio and text encoders. For a test batch $\{x_i\}_{i=1}^{N}$ and $C$ classes, we compute $\ell_2$-normalized audio embeddings $a_i$ and text embeddings $t_c$, obtained by encoding one prompt (single) or averaging several templates before re-normalization (ensemble). Zero-shot posteriors are $z^{0}_{ic} = \exp \tau a_i^\top t_c$...

  3. [3]

    the sound of a {}

    EXPERIMENTS. Setup. We use the public laion/clap-htsat-unfused checkpoint [1] (153M parameters) frozen, audio resampled to 48 kHz. Prompts: single = “the sound of a {}”; ensemble = mean of 4 templates. Datasets: ESC-50 [27] (2,000 clips, 50 classes, official 5-fold), UrbanSound8K [28] (8,732 clips, 10 classes, official 10-fold), VocalSound [29] (test split, 3,...

  4. [4]

    DISCUSSION. Why parametric EM succeeds where graph propagation fails. Label propagation and text-anchored EM consume the same unlabeled evidence but aggregate it at different granularities: propagation moves probability mass along local kNN edges, whereas the M-step (2) pools all N posterior-weighted embeddings into C global class means. To quantify the differe...

  5. [5]

    3.4, four limitations qualify our claims

    LIMITATIONS AND FUTURE WORK. Beyond the uniform-prior sensitivity of Sec. 3.4, four limitations qualify our claims. (i) All results use one public checkpoint (laion/clap-htsat-unfused); the operating boundary’s location (∼2.5 samples per class) may shift for stronger ALMs [3, 10], although the N/C mechanism itself is checkpoint-agnostic. (ii) The mixture mode...

  6. [6]

    CONCLUSION. We presented the first systematic transfer of TransCLIP-style transductive inference to audio–language models. A minimal text-anchored GMM-EM over the unlabeled test batch (no labels, no gradients, milliseconds of compute) lifts CLAP zero-shot accuracy by +4.6 to +9.2 points across ESC-50, UrbanSound8K, and VocalSound, composes with prompt-side...

  7. [7]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5

  8. [8]

    CLAP: Learning audio concepts from natural language supervision,

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “CLAP: Learning audio concepts from natural language supervision,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5

  9. [9]

    Natural language supervision for general-purpose audio representations,

    Benjamin Elizalde, Soham Deshmukh, and Huaming Wang, “Natural language supervision for general-purpose audio representations,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2024

  10. [10]

    Learning transferable visual models from natural language supervision,

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

  11. [11]

    HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,

    Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2022

  12. [12]

    Audio Set: An ontology and human-labeled dataset for audio events,

    Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017, pp. 776–780

  13. [13]

    AudioCLIP: Extending CLIP to image, text and audio,

    Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel, “AudioCLIP: Extending CLIP to image, text and audio,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2022

  14. [14]

    Wav2CLIP: Learning robust audio representations from CLIP,

    Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello, “Wav2CLIP: Learning robust audio representations from CLIP,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2022

  15. [15]

    Zero-shot audio classification via semantic embeddings,

    Huang Xie and Tuomas Virtanen, “Zero-shot audio classification via semantic embeddings,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, 2021

  16. [16]

    ReCLAP: Improving zero shot audio classification by describing sounds,

    Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha, “ReCLAP: Improving zero shot audio classification by describing sounds,” arXiv preprint arXiv:2409.09213, 2024

  17. [17]

    PALM: Few-shot prompt learning for audio language models,

    Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, and Hanan Aldarmaki, “PALM: Few-shot prompt learning for audio language models,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2024

  18. [18]

    Learning to prompt for vision-language models,

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu, “Learning to prompt for vision-language models,” Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, 2022

  19. [19]

    Conditional prompt learning for vision-language models,

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu, “Conditional prompt learning for vision-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 16816–16825

  20. [20]

    Vapnik, Statistical Learning Theory, Wiley, 1998

    Vladimir N. Vapnik, Statistical Learning Theory, Wiley, 1998

  21. [21]

    Learning to propagate labels: Transductive propagation network for few-shot learning,

    Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019

  22. [22]

    Laplacian regularized few-shot learning,

    Imtiaz Masud Ziko, Jose Dolz, Eric Granger, and Ismail Ben Ayed, “Laplacian regularized few-shot learning,” in Proc. Int. Conf. Mach. Learn. (ICML), 2020

  23. [23]

    Transductive zero-shot and few-shot CLIP,

    Ségolène Martin, Yunshi Huang, Fereshteh Shakeri, Jean-Christophe Pesquet, and Ismail Ben Ayed, “Transductive zero-shot and few-shot CLIP,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 28816–28826

  24. [24]

    Boosting vision-language models with transduction,

    Maxime Zanella, Benoît Gérin, and Ismail Ben Ayed, “Boosting vision-language models with transduction,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2024, vol. 37, pp. 62223–62256

  25. [25]

    A comprehensive survey on test-time adaptation under distribution shifts,

    Jian Liang, Ran He, and Tieniu Tan, “A comprehensive survey on test-time adaptation under distribution shifts,” Int. J. Comput. Vis., vol. 133, 2025

  26. [26]

    Tent: Fully test-time adaptation by entropy minimization,

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021

  27. [27]

    Test-time prompt tuning for zero-shot generalization in vision-language models,

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao, “Test-time prompt tuning for zero-shot generalization in vision-language models,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2022, vol. 35, pp. 14274–14285

  28. [28]

    Leveraging prediction entropy for automatic prompt weighting in zero-shot audio-language classification,

    Karim El Khoury, Maxime Zanella, Tiffanie Godelaine, Christophe De Vleeschouwer, and Benoît Macq, “Leveraging prediction entropy for automatic prompt weighting in zero-shot audio-language classification,” arXiv preprint arXiv:2601.05011, 2026

  29. [29]

    A simple zero-shot prompt weighting technique to improve prompt ensembling in text-image models,

    James Urquhart Allingham, Jie Ren, Michael W. Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, and Balaji Lakshminarayanan, “A simple zero-shot prompt weighting technique to improve prompt ensembling in text-image models,” in Proc. Int. Conf. Mach. Learn. (ICML), 2023, pp. 547–568

  30. [30]

    Clustering on the unit hypersphere using von Mises-Fisher distributions,

    Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra, “Clustering on the unit hypersphere using von Mises-Fisher distributions,” J. Mach. Learn. Res., vol. 6, pp. 1345–1382, 2005

  31. [31]

    Maximum likelihood from incomplete data via the EM algorithm,

    Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Stat. Soc. Ser. B, vol. 39, no. 1, pp. 1–22, 1977

  32. [32]

    Learning with local and global consistency,

    Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf, “Learning with local and global consistency,” in Adv. Neural Inf. Process. Syst. (NIPS), 2004, vol. 16, pp. 321–328

  33. [33]

    ESC: Dataset for environmental sound classification,

    Karol J. Piczak, “ESC: Dataset for environmental sound classification,” in Proc. 23rd ACM Int. Conf. Multimedia, 2015, pp. 1015–1018

  34. [34]

    A dataset and taxonomy for urban sound research,

    Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, “A dataset and taxonomy for urban sound research,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 1041–1044

  35. [35]

    VocalSound: A dataset for improving human vocal sounds recognition,

    Yuan Gong, Jin Yu, and James Glass, “VocalSound: A dataset for improving human vocal sounds recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2022, pp. 151–155

  36. [36]

    A multi-device dataset for urban acoustic scene classification,

    Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “A multi-device dataset for urban acoustic scene classification,” in Proc. Detection Classification Acoust. Scenes Events Workshop (DCASE), 2018, pp. 9–13

  37. [37]

    FSD50K: An open dataset of human-labeled sound events,

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra, “FSD50K: An open dataset of human-labeled sound events,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022

  38. [38]

    Sinkhorn distances: Lightspeed computation of optimal transport,

    Marco Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in Adv. Neural Inf. Process. Syst. (NIPS), 2013, vol. 26