ZEBRA: Zero-Shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization in Audio-Language Models

Asif Hanif; Mohammad Yaqub

arxiv: 2606.31587 · v1 · pith:7E2XWT23new · submitted 2026-06-30 · 💻 cs.SD · cs.AI


Pith reviewed 2026-07-01 02:58 UTC · model grok-4.3


keywords prompt learning · audio-language models · base-to-novel generalization · entropy regularization · zero-shot learning · audio classification · logit fusion


The pith

ZEBRA fuses zero-shot logits with prompt-learning logits and applies self-entropy regularization to close the base-to-novel gap in audio-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prompt learning adapts audio-language models to base classes yet often lowers accuracy on unseen novel classes, widening a generalization gap. ZEBRA counters the drop by combining the model's original zero-shot predictions with the adapted prompt outputs and adding entropy regularization on the prompt side. This keeps base-class performance intact while raising novel-class scores. Tests on multiple audio classification datasets show the combined steps shrink the base-to-novel difference relative to standard prompt tuning.

Core claim

ZEBRA addresses the base-to-novel generalization gap in prompt learning for audio-language models by fusing zero-shot logits with prompt-learning logits and employing self-entropy regularization to reduce overfitting to base classes, resulting in improved novel-class performance while maintaining base accuracy.

What carries the argument

Logit fusion of zero-shot and prompt outputs together with self-entropy regularization on the prompt-learning branch, which limits overfitting during adaptation.
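The two ingredients can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not state the fusion rule or the loss, so the convex mixing weight `alpha`, the regularization strength `lam`, the choice to reward higher entropy on the prompt branch, and every function name below are assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a probability matrix."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)


def fuse_logits(zero_shot_logits, prompt_logits, alpha=0.5):
    """ASSUMPTION: convex combination; the paper's exact fusion rule may differ."""
    return alpha * zero_shot_logits + (1.0 - alpha) * prompt_logits


def regularized_loss(zero_shot_logits, prompt_logits, labels, alpha=0.5, lam=0.1):
    """Cross-entropy on fused logits minus an entropy bonus on the prompt branch.

    ASSUMPTION: subtracting the entropy term (so that peaked prompt-branch
    predictions are penalized) is one plausible reading of "self-entropy
    regularization"; the sign and weighting are hypothetical.
    """
    fused = fuse_logits(zero_shot_logits, prompt_logits, alpha)
    probs = softmax(fused)
    rows = np.arange(len(labels))
    cross_entropy = -np.log(probs[rows, labels] + 1e-12).mean()
    entropy = self_entropy(softmax(prompt_logits)).mean()
    return cross_entropy - lam * entropy
```

Under these assumptions, setting `lam = 0` reduces the loss to plain cross-entropy on the fused logits, and setting `alpha = 1` makes the fused prediction identical to zero-shot inference, which is the sense in which the original model's knowledge is retained.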

If this is right

Novel-class accuracy rises while base-class accuracy remains high across datasets.
The base-to-novel performance difference narrows compared with plain prompt learning.
The method functions as a plug-and-play addition to existing prompt-learning pipelines.
Overfitting to base classes decreases through direct entropy control on the adapted outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Entropy regularization may serve as a lightweight control for balancing supervised adaptation against retained zero-shot knowledge in other multimodal settings.
The fusion step implies that explicit retention of the original model predictions can stabilize generalization when prompts are tuned on limited data.
Similar logit-level mixing could be tested on vision-language or text-only prompt tuning tasks to check transfer of the gap-reduction effect.

Load-bearing premise

The base-to-novel gap stems mainly from overfitting to base classes and can be fixed by entropy regularization plus logit fusion without creating offsetting losses elsewhere.

What would settle it

An audio dataset on which ZEBRA either widens the base-to-novel gap or fails to raise novel-class accuracy would show the fusion and regularization steps do not produce the claimed benefit.
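The check described above can be written down as a small evaluation helper. This is a hedged sketch only: the function names, the dictionary layout, and the definition of the gap as base accuracy minus novel accuracy are assumptions for illustration, and the paper's own evaluation protocol is not visible here.

```python
import numpy as np


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float((predictions == labels).mean())


def base_novel_report(base_preds, base_labels, novel_preds, novel_labels):
    """Base accuracy, novel accuracy, and their difference (ASSUMED gap definition)."""
    base = accuracy(base_preds, base_labels)
    novel = accuracy(novel_preds, novel_labels)
    return {"base": base, "novel": novel, "gap": base - novel}


def contradicts_claim(baseline, with_zebra):
    """True if the gap widens or novel accuracy fails to rise on this dataset."""
    widened_gap = with_zebra["gap"] > baseline["gap"]
    no_novel_gain = with_zebra["novel"] <= baseline["novel"]
    return widened_gap or no_novel_gain
```

Running `base_novel_report` once for a prompt-learning baseline and once for the same baseline with ZEBRA, then passing both reports to `contradicts_claim`, gives a per-dataset yes/no answer to the question posed above.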

Figures

Figures reproduced from arXiv: 2606.31587 by Asif Hanif, Mohammad Yaqub.

**Figure 1.** Comparison of Base and Novel Performance. Existing prompt-learning methods improve accuracy on base classes but generalize poorly to novel classes, often performing even worse than zero-shot inference. In contrast, incorporating ZEBRA with these baselines consistently boosts novel-class accuracy while maintaining strong performance on base classes.

**Figure 2.** ZEBRA approach operates on top of existing prompt learning methods to bridge the base-to-novel generalization gap, preserving zero-shot transferability while benefiting from supervised adaptation through few-shot prompt learning. ZEBRA introduces no additional learnable parameters to existing prompt learning methods and incurs negligible computational overhead.

Original abstract

Audio-Language Models (ALMs) achieve strong zero-shot performance by aligning audio with textual class descriptions. Although prompt learning improves accuracy on base classes through few-shot supervised adaptation, we observe a critical trade-off: it often degrades performance on novel classes, sometimes falling below zero-shot accuracy. This exposes a base-to-novel generalization gap in prompt learning for ALMs. To address this issue, we propose ZEBRA (Zero-shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization), a plug-and-play framework that fuses zero-shot logits with prompt-learning logits, and employs self-entropy regularization to reduce overfitting to base classes. Experiments across multiple audio classification datasets show that ZEBRA consistently improves novel-class performance while maintaining strong base accuracy, significantly reducing the base-to-novel gap compared to standard prompt learning. The code is available at: https://github.com/asif-hanif/zebra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ZEBRA adds logit fusion and entropy regularization to prompt learning for ALMs to cut the base-to-novel drop, with the abstract claiming steady gains on multiple datasets.


ZEBRA targets the drop in novel-class accuracy that shows up when you apply prompt learning to audio-language models. The fix is to blend the original zero-shot logits back into the learned ones and add a self-entropy term that discourages the model from becoming too peaked on the base classes.

The combination looks new for the audio setting. Prompt learning and entropy regularization exist already, but the paper applies both together as a plug-and-play module on top of existing ALMs and tests the result on several audio classification tasks. Releasing the code helps.

The work is straightforward and addresses a practical issue that prompt tuning often creates. The abstract states that base accuracy stays high while novel accuracy improves, which would be useful if it holds.

The main limitation is that only the abstract is visible here, so there are no numbers, ablation tables, or split details to check. It is not clear how large the gains actually are, whether both the fusion and the entropy term are required, or how sensitive the method is to the choice of base-novel partition. The premise that entropy regularization directly counters overfitting to base classes is plausible but would need the experiments to confirm it does not create other problems.

This is for people working on prompt adaptation in audio or multimodal models who already run into the base-novel trade-off. A reader doing similar work in vision-language models might borrow the fusion idea.

It is worth sending to review. The problem is real, the proposed patch is simple, and the claims are testable once the full results and code are examined.

Referee Report

1 major / 0 minor

Summary. The paper proposes ZEBRA, a plug-and-play framework for prompt learning in Audio-Language Models (ALMs) that fuses zero-shot logits with prompt-learning logits and applies self-entropy regularization to mitigate overfitting to base classes. It claims this addresses the observed base-to-novel generalization gap, where standard prompt learning improves base-class accuracy but often degrades novel-class performance below zero-shot levels. Experiments across multiple audio classification datasets are reported to show consistent novel-class gains while preserving base accuracy, thereby reducing the gap relative to standard prompt learning. Code is made available.

Significance. If the experimental claims hold with proper validation, the work could offer moderate significance for the audio and multimodal learning community by providing a simple, integrable method to improve generalization in few-shot prompt adaptation for ALMs. The plug-and-play design and public code release are explicit strengths that support reproducibility.

major comments (1)

Abstract: The central claim that ZEBRA 'consistently improves novel-class performance while maintaining strong base accuracy' across multiple datasets is load-bearing for the contribution, yet the manuscript provides no quantitative metrics, dataset names, baseline comparisons, ablation studies, or experimental protocol details. This prevents assessment of whether the data supports the claim or whether the entropy regularization and logit fusion avoid new trade-offs as assumed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below.


Referee: Abstract: The central claim that ZEBRA 'consistently improves novel-class performance while maintaining strong base accuracy' across multiple datasets is load-bearing for the contribution, yet the manuscript provides no quantitative metrics, dataset names, baseline comparisons, ablation studies, or experimental protocol details. This prevents assessment of whether the data supports the claim or whether the entropy regularization and logit fusion avoid new trade-offs as assumed.

Authors: The abstract is a concise summary and therefore omits the specific quantitative results, dataset names, and protocol details; these appear in full in Section 4 of the manuscript, which reports experiments on multiple audio classification datasets, direct comparisons to zero-shot and standard prompt-learning baselines, component ablations, and the complete experimental protocol. We agree that the abstract would benefit from a brief indication of the scale of the observed gains. We will therefore revise the abstract to include representative quantitative improvements (e.g., average novel-class accuracy deltas relative to the baselines) while preserving its length constraints.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and claims are empirically grounded


The paper introduces ZEBRA as a plug-and-play framework combining logit fusion and self-entropy regularization to mitigate base-to-novel gaps in audio prompt learning. No derivation chain, equations, or self-citations reduce the central claims to inputs by construction. The abstract and description frame the contribution as an empirical intervention validated across datasets, with no fitted-parameter-as-prediction, self-definitional loops, or load-bearing self-citations. This is a standard non-circular proposal of a new technique.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Limited details from abstract; the method introduces a fusion mechanism and regularization, but no specific free parameters or invented entities are mentioned. The central claim depends on the effectiveness of these techniques in reducing overfitting.

axioms (1)

domain assumption: Audio-language models achieve strong zero-shot performance by aligning audio with textual class descriptions.
Stated in the abstract as the basis for ALMs.

pith-pipeline@v0.9.1-grok · 5696 in / 1154 out tokens · 64228 ms · 2026-07-01T02:58:20.401264+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 3 canonical work pages · 1 internal anchor

[1]

ZEBRA: Zero-Shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization in Audio-Language Models

Introduction. Recent advances in Vision-Language Models (VLMs) have inspired the development of Audio-Language Models (ALMs), which achieve strong performance on zero-shot audio recognition tasks [1, 2, 3, 4]. In the zero-shot setting, audio features are aligned with textual descriptions of class labels, enabling recognition without task-specific trainin...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

CLAP [2] and AudioCLIP [3], for example, extend this paradigm to align audio and textual representations, enabling robust audio classification and cross-modal retrieval

Related Work. Inspired by the success of CLIP [7] in the image–language domain, many audio–language models have adopted a similar contrastive learning framework. CLAP [2] and AudioCLIP [3], for example, extend this paradigm to align audio and textual representations, enabling robust audio classification and cross-modal retrieval. Like CLIP, these models ar...
[3]

Let x denote an input audio waveform, and let t = {t_1, t_2,

Methodology. Zero-Shot Classification in ALM. In CLIP-style audio-language models (ALMs) [7], zero-shot classification is performed by measuring the similarity between the audio representation and a set of class-specific text descriptions. Let x denote an input audio waveform, and let t = {t_1, t_2, ..., t_c} represent the set of textual class descriptions ...
[4]

Experiments and Results. Models and Datasets. For the CLIP-style audio-language backbone, we adopt Pengi [1], a generative audio-language model consisting of audio and text encoders followed by an LLM decoder. Following the setup of PALM [12], we discard the decoder and utilize only the pretrained audio and text encoders, effectively employing PENGI in a ...

[5]

Conclusion. We introduced ZEBRA, a lightweight, plug-and-play framework for improving base-to-novel generalization in audio–language models. While existing prompt-learning methods substantially boost base-class performance, they often suffer from degraded generalization to novel classes, frequently underperforming the zero-shot baseline. ZEBRA effe...
[6]

All ideas, analyses, and conclusions are the authors’ own

Generative AI Use Disclosure. We confirm that an LLM was used solely for writing refinement (grammar, wording, and clarity). All ideas, analyses, and conclusions are the authors’ own.
[7]

Pengi: An audio language model for audio tasks,

S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” Advances in Neural Information Processing Systems, vol. 36, pp. 18090–18108, 2023

2023
[8]

Clap learning audio concepts from natural language supervision,

B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[9]

Audioclip: Extending clip to image, text and audio,

A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 976–980

2022
[10]

Vision-language models for vision tasks: A survey,

J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[11]

Learning to prompt for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision (IJCV), 2022

2022
[12]

Conditional prompt learning for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16816–16825

2022
[13]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

2021
[14]

A systematic survey of prompt engineering on vision-language foundation models,

J. Gu, Z. Han, S. Chen, A. Beirami, B. He, G. Zhang, R. Liao, Y. Qin, V. Tresp, and P. Torr, “A systematic survey of prompt engineering on vision-language foundation models,” arXiv preprint arXiv:2307.12980, 2023

work page arXiv 2023
[15]

Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM computing surveys, vol. 55, no. 9, pp. 1–35, 2023

2023
[16]

Baple: Backdoor attacks on medical foundational models using prompt learning,

A. Hanif, F. Shamshad, M. Awais, M. Naseer, F. S. Khan, K. Nandakumar, S. Khan, and R. M. Anwer, “Baple: Backdoor attacks on medical foundational models using prompt learning,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 443–453

2024
[17]

Noise is an efficient learner for zero-shot vision-language models,

R. Imam, A. Hanif, J. Zhang, K. W. Dawoud, Y. Kementchedjhieva, and M. Yaqub, “Noise is an efficient learner for zero-shot vision-language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 5820–5829

2025
[18]

Palm: Few-shot prompt learning for audio language models,

A. Hanif, M. T. Agro, M. A. Qazi, and H. Aldarmaki, “Palm: Few-shot prompt learning for audio language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 18527–18536

2024
[19]

Pat: Parameter-free audio-text aligner to boost zero-shot audio classification,

A. Seth, R. Selvakumar, S. Kumar, S. Ghosh, and D. Manocha, “Pat: Parameter-free audio-text aligner to boost zero-shot audio classification,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 12376–12394

2025
[20]

Trojanwave: Exploiting prompt learning for stealthy backdoor attacks on large audio-language models,

A. Hanif, M. T. Agro, F. Shamshad, and K. Nandakumar, “Trojanwave: Exploiting prompt learning for stealthy backdoor attacks on large audio-language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 18628–18644

2025
[21]

A study of instrument-wise onset detection in beijing opera percussion ensembles,

M. Tian, A. Srinivasamurthy, M. Sandler, and X. Serra, “A study of instrument-wise onset detection in beijing opera percussion ensembles,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 2159–2163

2014
[22]

Neural audio synthesis of musical notes with wavenet autoencoders,

J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, “Neural audio synthesis of musical notes with wavenet autoencoders,” 2017

2017
[23]

ESC: Dataset for Environmental Sound Classification,

K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM Press, pp. 1015–1018. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2733373.2806390

work page arXiv
[24]

A dataset and taxonomy for urban sound research,

J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 1041–1044

2014
[25]

Crema-d: Crowd-sourced emotional multimodal actors dataset,

H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014

2014
[26]

The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,

S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS one, vol. 13, no. 5, p. e0196391, 2018

2018
[27]

Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation,

Y. Gong, Y.-A. Chung, and J. Glass, “Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021

2021
[28]

Sound events for surveillance applications,

T. Spadini, “Sound events for surveillance applications,” 2019

2019
[29]

TUT Acoustic Scenes 2017, Development dataset,

T. Heittola, A. Mesaros, and T. Virtanen, “TUT Acoustic Scenes 2017, Development dataset,” Department of Signal Processing, Tampere University of Technology, Tech. Rep., 2017. [Online]. Available: https://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification

2017
[30]

An analysis of the gtzan music genre dataset,

B. L. Sturm, “An analysis of the gtzan music genre dataset,” in Proceedings of the second international ACM workshop on Music information retrieval with user-centered and multimodal strategies, 2012, pp. 7–12

2012
