How Class Ontology and Data Scale Affect Audio Transfer Learning

Alexander Gebhard; Andreas Triantafyllopoulos; Björn W. Schuller; Manuel Milling; Simon Rampp

arxiv: 2603.25476 · v3 · pith:UHVLSUVJnew · submitted 2026-03-26 · 💻 cs.LG

Pith reviewed 2026-05-21 10:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords transfer learning, audio classification, AudioSet, data scale, class ontology, task similarity, fine-tuning, acoustic features

The pith

Similarity between pre-training audio data and the downstream task improves transfer learning more than increasing the number of samples or classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the impact of class ontology and data scale on audio transfer learning by pre-training models on different subsets of AudioSet defined by their ontologies. These subsets allow variation in the number of classes and total samples while keeping other factors as controlled as possible. The pre-trained models are evaluated after fine-tuning on three tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. Results show that both larger sample counts and more classes help transfer performance, but matching the pre-training content to the target task has a stronger positive effect because it leads to learning similar features.

Core claim

The central claim is that while both the scale of the pre-training data (more samples) and the breadth of its class ontology (more classes) positively influence transfer learning to new audio tasks, the similarity of the pre-training ontology to the downstream task generally provides greater benefits by allowing the model to acquire comparable features.

What carries the argument

Ontology-based subsets of AudioSet used to create pre-training datasets with controlled variations in class count and sample size.
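
As a rough illustration of how an ontology-based subset could be assembled, the sketch below collects every class under a chosen ontology node and keeps the clips labelled with any of them. It is not the authors' pipeline. The node shape (`id`, `name`, `child_ids`) is assumed to follow AudioSet's published ontology file, while the ids, clip records, and the helper names `subtree_ids` and `select_clips` are hypothetical.

```python
from collections import deque

# Toy ontology; the ids and names are hypothetical placeholders,
# not real AudioSet identifiers.
TOY_ONTOLOGY = [
    {"id": "root_animal", "name": "Animal", "child_ids": ["bird", "dog"]},
    {"id": "bird", "name": "Bird", "child_ids": ["bird_song"]},
    {"id": "bird_song", "name": "Bird song", "child_ids": []},
    {"id": "dog", "name": "Dog", "child_ids": []},
    {"id": "root_speech", "name": "Speech", "child_ids": ["whisper"]},
    {"id": "whisper", "name": "Whispering", "child_ids": []},
]


def subtree_ids(ontology, root_id):
    """Return the set of class ids in the subtree rooted at root_id."""
    by_id = {node["id"]: node for node in ontology}
    seen = set()
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(by_id[current]["child_ids"])
    return seen


def select_clips(clips, allowed_ids, max_samples=None):
    """Keep clips with at least one label in allowed_ids, optionally capped."""
    kept = [c for c in clips if allowed_ids & set(c["labels"])]
    return kept if max_samples is None else kept[:max_samples]


# Hypothetical clip records: one id plus a list of class labels each.
clips = [
    {"clip": "a", "labels": ["bird_song"]},
    {"clip": "b", "labels": ["dog", "whisper"]},
    {"clip": "c", "labels": ["whisper"]},
]
animal_ids = subtree_ids(TOY_ONTOLOGY, "root_animal")
animal_clips = select_clips(clips, animal_ids)
```

Choosing a different root node changes the class count, and `max_samples` caps the sample count, which are the two quantities the subsets are meant to vary.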

If this is right

More samples in pre-training lead to improved accuracy after fine-tuning on the three audio tasks.
Pre-training with more classes also enhances transfer learning performance.
High similarity between pre-training and target task enables learning of more relevant features than scale alone.
Task mismatch reduces the effectiveness of large-scale pre-training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers of audio models may benefit from curating smaller but highly relevant pre-training sets instead of relying on the largest available datasets.
Similar effects could be tested in other sensory domains such as image or video transfer learning.
Future studies might explore combinations of scale and similarity to find optimal pre-training strategies.
This finding highlights the value of domain-specific data selection for efficient transfer learning.

Load-bearing premise

The different ontology subsets of AudioSet are comparable in terms of acoustic quality, label accuracy, and recording conditions, so that observed differences in transfer stem from class count, sample size, and similarity.

What would settle it

Finding that a large dissimilar pre-training set yields better transfer performance than a smaller similar one on the downstream tasks would contradict the main result.

Figures

Figures reproduced from arXiv: 2603.25476 by Alexander Gebhard, Andreas Triantafyllopoulos, Björn W. Schuller, Manuel Milling, Simon Rampp.

**Figure 2.** Performance of fine-tuning experiments on the three …

**Figure 3.** Performance of fine-tuning experiments on the three …

**Figure 4.** Pair-wise cosine distance between the first convolu…
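
The Figure 4 caption is truncated here, so exactly which quantities are compared cannot be recovered from this page. The metric itself is standard, and the sketch below shows a pair-wise cosine distance over flattened vectors. The vectors and subset names are made up for illustration; treating each model as a single flattened weight vector is an assumption, not the paper's stated procedure.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_cosine_distance(vectors):
    """Map every ordered pair of names to the distance between their vectors."""
    names = list(vectors)
    return {
        (a, b): cosine_distance(vectors[a], vectors[b])
        for a in names
        for b in names
    }


# Hypothetical flattened weights for three pre-trained model states.
toy_weights = {
    "speech_subset": [1.0, 0.0],
    "animal_subset": [0.0, 1.0],
    "full": [1.0, 1.0],
}
distances = pairwise_cosine_distance(toy_weights)
```

A distance near 0 means the two vectors point in nearly the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.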

Original abstract

Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that extent, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an experimental study on audio transfer learning, pre-training models on ontology-derived subsets of AudioSet varying in class count and sample size, then fine-tuning on three downstream tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. The central claim is that both larger sample counts and class counts improve transfer performance, but this effect is generally outweighed by the similarity between the pre-training data domain and the downstream task.

Significance. If the experimental controls and attribution hold, the findings would provide useful empirical guidance on data selection for audio pre-training, emphasizing domain similarity over raw scale. The ontology-based subset approach offers a systematic way to vary class count and sample size, which is a methodological strength for isolating factors in transfer learning.

major comments (2)

[Section 3] Section 3 (Methodology, subset construction): The ontology-based subsets of AudioSet are presented as cleanly varying class count and sample size, but the paper does not report controls or measurements for systematic differences in acoustic quality, SNR, label noise, or recording conditions across ontology branches (e.g., speech vs. environmental). These factors can alter learned features independently of scale, directly threatening the claim that performance gains are attributable to class count or sample size rather than incidental domain alignment with the three downstream tasks.
[Section 4] Section 4 (Results and analysis): The directional positive effects and comparative ranking of factors are reported without statistical significance tests, confidence intervals, or explicit controls for confounding variables such as total compute, model capacity, or data splits. This leaves the central attribution of effects to scale versus similarity resting on unverified experimental details.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly define or quantify 'similarity' between pre-training and downstream tasks (e.g., via feature overlap metrics or branch overlap in the ontology).
[Figures/Tables] Figure captions and table legends should include details on the number of runs or seeds used to generate the reported metrics for reproducibility.
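
One simple way to make the "branch overlap" suggestion concrete is a Jaccard index between the class set used for pre-training and a class set judged relevant to the downstream task. This is only a possible proxy, not a measure the paper is reported to use. The class names and the helper name `jaccard_overlap` are hypothetical.

```python
def jaccard_overlap(classes_a, classes_b):
    """Jaccard index |A & B| / |A | B| between two collections of class names."""
    a, b = set(classes_a), set(classes_b)
    union = a | b
    if not union:
        # Two empty sets share nothing to compare; report zero overlap.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical class sets for illustration only.
pretrain_classes = {"speech", "whispering", "shout"}
target_related = {"speech", "shout", "laughter"}
overlap = jaccard_overlap(pretrain_classes, target_related)
```

A score like this captures only label overlap. It says nothing about acoustic similarity between recordings, which would need a feature-based measure.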

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating planned revisions where appropriate.

Referee: [Section 3] Section 3 (Methodology, subset construction): The ontology-based subsets of AudioSet are presented as cleanly varying class count and sample size, but the paper does not report controls or measurements for systematic differences in acoustic quality, SNR, label noise, or recording conditions across ontology branches (e.g., speech vs. environmental). These factors can alter learned features independently of scale, directly threatening the claim that performance gains are attributable to class count or sample size rather than incidental domain alignment with the three downstream tasks.

Authors: We agree this is a valid concern. While the AudioSet ontology enables systematic class selection, we did not quantify SNR or label noise differences across branches in the original submission. In the revised manuscript we will add empirical measurements of average SNR (using available metadata) and estimated label noise rates per ontology branch, along with a discussion of how these may interact with domain similarity. We maintain that the shared source dataset reduces some variability compared to cross-dataset comparisons, but we will explicitly acknowledge the limitation.

revision: yes

Referee: [Section 4] Section 4 (Results and analysis): The directional positive effects and comparative ranking of factors are reported without statistical significance tests, confidence intervals, or explicit controls for confounding variables such as total compute, model capacity, or data splits. This leaves the central attribution of effects to scale versus similarity resting on unverified experimental details.

Authors: We accept the need for greater statistical rigor. The revised version will include bootstrap-derived 95% confidence intervals and paired significance tests for key performance differences. All experiments used the identical model architecture and training protocol, with fixed data splits and equal compute budgets per run; we will make these controls more prominent in the text. Full isolation of every possible confounder remains difficult at this scale, but the reported design already holds model capacity and compute constant.

revision: partial
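
The rebuttal does not say how the bootstrap intervals would be computed. As one plausible reading, the sketch below gives a paired percentile bootstrap for the accuracy difference between two models evaluated on the same test items. The function name `paired_bootstrap_ci`, its parameters, and the toy correctness vectors are assumptions for illustration.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on shared test items.

    correct_a and correct_b are 0/1 lists marking whether each model got the
    same test item right. Resampling item indices keeps the pairing intact.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper


# Hypothetical per-item correctness for two fine-tuned models.
similar_pretrain = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
large_pretrain = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
observed, lower, upper = paired_bootstrap_ci(similar_pretrain, large_pretrain)
```

If the resulting interval excludes zero, the observed difference is unlikely to be an artefact of which test items happened to be sampled. It does not account for variation across training seeds, which would need repeated runs.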

Circularity Check

0 steps flagged

No circularity: empirical measurements on AudioSet subsets with no derivations or self-referential equations

The paper is a purely experimental study that pre-trains models on ontology-derived AudioSet subsets of varying class count and sample size, then measures transfer performance on three fixed downstream tasks. No equations, fitted parameters, or theoretical derivations are present that could reduce to the inputs by construction. Results are reported as observed accuracy deltas rather than predictions forced by any model or ansatz. Self-citations, if present, are not load-bearing for any central claim and do not substitute for external verification. The work is self-contained against external benchmarks (standard AudioSet splits and downstream datasets) with no uniqueness theorems or renamings of known results invoked to force conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on standard supervised transfer learning assumptions and the representativeness of AudioSet subsets; no new entities or free parameters are introduced beyond typical hyperparameter choices in neural network training.

axioms (1)

domain assumption: Ontology-based subsets of AudioSet vary class count and sample size independently of other acoustic properties.
Invoked when constructing pre-training data to isolate the effects of scale and ontology.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "ontology-based subsets of AudioSet"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

