How Class Ontology and Data Scale Affect Audio Transfer Learning
Pith reviewed 2026-05-21 10:51 UTC · model grok-4.3
The pith
Similarity between pre-training audio data and the downstream task improves transfer learning more than increasing the number of samples or classes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that both the scale of the pre-training data (more samples) and the breadth of its class ontology (more classes) improve transfer learning to new audio tasks. Similarity between the pre-training ontology and the downstream task generally provides a greater benefit, however, because it can lead the model to learn comparable features.
What carries the argument
Ontology-based subsets of AudioSet used to create pre-training datasets with controlled variations in class count and sample size.
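A minimal sketch of how such an ontology-based subset might be built, assuming a tree of class nodes with `id`, `name`, and `child_ids` fields in the style of the AudioSet ontology file. The toy ontology, the clip records, and the helper names `descendants` and `build_subset` are all hypothetical; this is not the paper's pipeline.

```python
import random

# Toy ontology and clips, made up for illustration only.
ONTOLOGY = [
    {"id": "root_speech", "name": "Human voice", "child_ids": ["speech", "shout"]},
    {"id": "speech", "name": "Speech", "child_ids": []},
    {"id": "shout", "name": "Shout", "child_ids": []},
    {"id": "root_animal", "name": "Animal", "child_ids": ["bird", "dog"]},
    {"id": "bird", "name": "Bird", "child_ids": ["chirp"]},
    {"id": "chirp", "name": "Chirp", "child_ids": []},
    {"id": "dog", "name": "Dog", "child_ids": []},
]

CLIPS = [
    {"clip_id": "a", "labels": ["speech"]},
    {"clip_id": "b", "labels": ["chirp"]},
    {"clip_id": "c", "labels": ["dog", "speech"]},
    {"clip_id": "d", "labels": ["bird"]},
    {"clip_id": "e", "labels": ["shout"]},
]


def descendants(ontology, root_id):
    """Return the ids of root_id and every class below it in the tree."""
    by_id = {node["id"]: node for node in ontology}
    seen, stack = set(), [root_id]
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        stack.extend(by_id[node_id]["child_ids"])
    return seen


def build_subset(clips, ontology, root_id, n_samples, seed=0):
    """Pick up to n_samples clips labelled with any class under root_id."""
    classes = descendants(ontology, root_id)
    pool = [clip for clip in clips if classes & set(clip["labels"])]
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return classes, rng.sample(pool, min(n_samples, len(pool)))


classes, subset = build_subset(CLIPS, ONTOLOGY, "root_animal", n_samples=2)
print(sorted(classes), [clip["clip_id"] for clip in subset])
```

Choosing a higher or lower root changes the class count, and `n_samples` changes the sample size, which is the kind of controlled variation described above.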
If this is right
- More samples in pre-training lead to improved accuracy after fine-tuning on the three audio tasks.
- Pre-training with more classes also enhances transfer learning performance.
- High similarity between pre-training and target task enables learning of more relevant features than scale alone.
- Task mismatch reduces the effectiveness of large-scale pre-training data.
Where Pith is reading between the lines
- Designers of audio models may benefit from curating smaller but highly relevant pre-training sets instead of relying on the largest available datasets.
- Similar effects could be tested in other sensory domains such as image or video transfer learning.
- Future studies might explore combinations of scale and similarity to find optimal pre-training strategies.
- This finding highlights the value of domain-specific data selection for efficient transfer learning.
Load-bearing premise
The different ontology subsets of AudioSet are comparable in terms of acoustic quality, label accuracy, and recording conditions, so that observed differences in transfer stem from class count, sample size, and similarity.
What would settle it
Finding that a large dissimilar pre-training set yields better transfer performance than a smaller similar one on the downstream tasks would contradict the main result.
Original abstract
Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that extent, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an experimental study on audio transfer learning, pre-training models on ontology-derived subsets of AudioSet varying in class count and sample size, then fine-tuning on three downstream tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. The central claim is that both larger sample counts and class counts improve transfer performance, but this effect is generally outweighed by the similarity between the pre-training data domain and the downstream task.
Significance. If the experimental controls and attribution hold, the findings would provide useful empirical guidance on data selection for audio pre-training, emphasizing domain similarity over raw scale. The ontology-based subset approach offers a systematic way to vary class count and sample size, which is a methodological strength for isolating factors in transfer learning.
major comments (2)
- [Section 3] Section 3 (Methodology, subset construction): The ontology-based subsets of AudioSet are presented as cleanly varying class count and sample size, but the paper does not report controls or measurements for systematic differences in acoustic quality, SNR, label noise, or recording conditions across ontology branches (e.g., speech vs. environmental). These factors can alter learned features independently of scale, directly threatening the claim that performance gains are attributable to class count or sample size rather than incidental domain alignment with the three downstream tasks.
- [Section 4] Section 4 (Results and analysis): The directional positive effects and comparative ranking of factors are reported without statistical significance tests, confidence intervals, or explicit controls for confounding variables such as total compute, model capacity, or data splits. As a result, the central attribution of effects to scale versus similarity rests on unverified experimental details.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly define or quantify 'similarity' between pre-training and downstream tasks (e.g., via feature overlap metrics or branch overlap in the ontology).
- [Figures/Tables] Figure captions and table legends should include details on the number of runs or seeds used to generate the reported metrics for reproducibility.
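One way the "branch overlap in the ontology" mentioned in the first minor comment could be quantified is a Jaccard index between the class set used for pre-training and the classes judged relevant to the downstream task. The sketch below is only an illustration under that assumption; the class names are made up, and the paper does not state that it uses this metric.

```python
def jaccard(classes_a, classes_b):
    """Jaccard index |A intersect B| / |A union B| of two class-id collections."""
    a, b = set(classes_a), set(classes_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical class sets: one pre-training subset, two downstream tasks.
pretrain_classes = {"bird", "chirp", "dog"}
bird_task_classes = {"bird", "chirp"}
speech_task_classes = {"speech"}

print(jaccard(pretrain_classes, bird_task_classes))
print(jaccard(pretrain_classes, speech_task_classes))
```

A set-overlap score like this ignores acoustic similarity, so it would complement rather than replace a feature-based metric.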
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating planned revisions where appropriate.
- Referee: [Section 3] Section 3 (Methodology, subset construction): The ontology-based subsets of AudioSet are presented as cleanly varying class count and sample size, but the paper does not report controls or measurements for systematic differences in acoustic quality, SNR, label noise, or recording conditions across ontology branches (e.g., speech vs. environmental). These factors can alter learned features independently of scale, directly threatening the claim that performance gains are attributable to class count or sample size rather than incidental domain alignment with the three downstream tasks.
  Authors: We agree this is a valid concern. While the AudioSet ontology enables systematic class selection, we did not quantify SNR or label noise differences across branches in the original submission. In the revised manuscript we will add empirical measurements of average SNR (using available metadata) and estimated label noise rates per ontology branch, along with a discussion of how these may interact with domain similarity. We maintain that the shared source dataset reduces some variability compared to cross-dataset comparisons, but we will explicitly acknowledge the limitation.
  revision: yes
- Referee: [Section 4] Section 4 (Results and analysis): The directional positive effects and comparative ranking of factors are reported without statistical significance tests, confidence intervals, or explicit controls for confounding variables such as total compute, model capacity, or data splits. As a result, the central attribution of effects to scale versus similarity rests on unverified experimental details.
  Authors: We accept the need for greater statistical rigor. The revised version will include bootstrap-derived 95% confidence intervals and paired significance tests for key performance differences. All experiments used the identical model architecture and training protocol, with fixed data splits and equal compute budgets per run; we will make these controls more prominent in the text. Full isolation of every possible confounder remains difficult at this scale, but the reported design already holds model capacity and compute constant.
  revision: partial
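The bootstrap-derived 95% confidence intervals mentioned in the response could look roughly like the following percentile bootstrap on paired per-example correctness. This is a sketch under stated assumptions: the function name `paired_bootstrap_ci`, the 0/1 correctness vectors, and the resample count are hypothetical, and the rebuttal does not specify which bootstrap variant the authors intend.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the accuracy difference (A minus B).

    correct_a and correct_b hold 0/1 correctness of two models on the
    same test examples, in the same order, so each resample keeps pairs.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample example indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Made-up correctness vectors for two pre-training conditions.
similar = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
large = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
print(paired_bootstrap_ci(similar, large))
```

If the resulting interval excludes zero, the difference between the two pre-training conditions is unlikely to be a resampling artefact of the test set; variation across training seeds would still need separate runs.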
Circularity Check
No circularity: empirical measurements on AudioSet subsets with no derivations or self-referential equations
The paper is a purely experimental study that pre-trains models on ontology-derived AudioSet subsets of varying class count and sample size, then measures transfer performance on three fixed downstream tasks. No equations, fitted parameters, or theoretical derivations are present that could reduce to the inputs by construction. Results are reported as observed accuracy deltas rather than predictions forced by any model or ansatz. Self-citations, if present, are not load-bearing for any central claim and do not substitute for external verification. The work is self-contained against external benchmarks (standard AudioSet splits and downstream datasets) with no uniqueness theorems or renamings of known results invoked to force conclusions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ontology-based subsets of AudioSet vary class count and sample size independently of other acoustic properties.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: ontology-based subsets of AudioSet
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.