
arxiv: 2603.25476 · v3 · pith:UHVLSUVJnew · submitted 2026-03-26 · 💻 cs.LG

How Class Ontology and Data Scale Affect Audio Transfer Learning

Pith reviewed 2026-05-21 10:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords transfer learning · audio classification · AudioSet · data scale · class ontology · task similarity · fine-tuning · acoustic features

The pith

Similarity between pre-training audio data and the downstream task improves transfer learning more than increasing the number of samples or classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the impact of class ontology and data scale on audio transfer learning by pre-training models on different subsets of AudioSet defined by their ontologies. These subsets allow variation in the number of classes and total samples while keeping other factors as controlled as possible. The pre-trained models are evaluated after fine-tuning on three tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. Results show that both larger sample counts and more classes help transfer performance, but matching the pre-training content to the target task has a stronger positive effect because it leads to learning similar features.

Core claim

The central claim is that while both the scale of the pre-training data (more samples) and the breadth of its class ontology (more classes) positively influence transfer learning to new audio tasks, the similarity of the pre-training ontology to the downstream task generally provides greater benefits by allowing the model to acquire comparable features.

What carries the argument

Ontology-based subsets of AudioSet used to create pre-training datasets with controlled variations in class count and sample size.
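
As a rough illustration of how an ontology-based subset might be derived, the sketch below walks a class subtree and keeps the clips labelled with any class in it. It assumes the ontology is a list of records with `id`, `name`, and `child_ids` fields, as in the public AudioSet `ontology.json`. The toy ontology, the IDs, and the helper names `collect_subtree` and `filter_clips` are hypothetical, and the paper's actual subset definitions are not reproduced here.

```python
from collections import deque

# Toy stand-in for the AudioSet ontology; IDs and names are illustrative only.
TOY_ONTOLOGY = [
    {"id": "/t/animal", "name": "Animal", "child_ids": ["/t/bird", "/t/dog"]},
    {"id": "/t/bird", "name": "Bird", "child_ids": ["/t/birdsong"]},
    {"id": "/t/birdsong", "name": "Bird vocalization", "child_ids": []},
    {"id": "/t/dog", "name": "Dog", "child_ids": []},
    {"id": "/t/speech", "name": "Speech", "child_ids": []},
]


def collect_subtree(ontology, root_id):
    """Return the set of class IDs in the subtree rooted at root_id (inclusive)."""
    by_id = {node["id"]: node for node in ontology}
    seen = set()
    queue = deque([root_id])
    while queue:
        node_id = queue.popleft()
        if node_id in seen:
            continue  # skip nodes already visited via another parent
        seen.add(node_id)
        queue.extend(by_id[node_id]["child_ids"])
    return seen


def filter_clips(clips, class_ids):
    """Keep clips that carry at least one label inside class_ids."""
    return [clip for clip in clips if class_ids & set(clip["labels"])]


animal_classes = collect_subtree(TOY_ONTOLOGY, "/t/animal")
clips = [
    {"clip": "a", "labels": ["/t/birdsong"]},
    {"clip": "b", "labels": ["/t/speech"]},
    {"clip": "c", "labels": ["/t/dog", "/t/speech"]},
]
animal_clips = filter_clips(clips, animal_classes)
```

Choosing a root higher or lower in the tree changes the class count, and subsampling the filtered clips changes the sample count, which is the kind of variation the review describes.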

If this is right

  • More samples in pre-training lead to improved accuracy after fine-tuning on the three audio tasks.
  • Pre-training with more classes also enhances transfer learning performance.
  • High similarity between pre-training and target task enables learning of more relevant features than scale alone.
  • Task mismatch reduces the effectiveness of large-scale pre-training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of audio models may benefit from curating smaller but highly relevant pre-training sets instead of relying on the largest available datasets.
  • Similar effects could be tested in other sensory domains such as image or video transfer learning.
  • Future studies might explore combinations of scale and similarity to find optimal pre-training strategies.
  • This finding highlights the value of domain-specific data selection for efficient transfer learning.

Load-bearing premise

The different ontology subsets of AudioSet are comparable in terms of acoustic quality, label accuracy, and recording conditions, so that observed differences in transfer stem from class count, sample size, and similarity.

What would settle it

Finding that a large dissimilar pre-training set yields better transfer performance than a smaller similar one on the downstream tasks would contradict the main result.

Figures

Figures reproduced from arXiv: 2603.25476 by Alexander Gebhard, Andreas Triantafyllopoulos, Björn W. Schuller, Manuel Milling, Simon Rampp.

Figure 1: Excerpt of the AudioSet ontology, including specific […]
Figure 2: Performance of fine-tuning experiments on the three […]
Figure 3: Performance of fine-tuning experiments on the three […]
Figure 4: Pair-wise cosine distance between the first convolu[…]
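
The Figure 4 caption refers to pair-wise cosine distances. As a minimal sketch of that kind of comparison, the code below builds a pair-wise cosine-distance table between flattened weight vectors. Reading the vectors as first-layer filters from differently pre-trained models is an assumption based on the truncated caption. The model names and numbers are invented for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_cosine_distance(vectors):
    """Map every ordered pair of names to the cosine distance of their vectors."""
    names = list(vectors)
    return {(a, b): cosine_distance(vectors[a], vectors[b]) for a in names for b in names}


# Hypothetical flattened weights; real filters would be far longer.
weights = {
    "model_a": [1.0, 0.0, 0.0],
    "model_b": [0.0, 1.0, 0.0],
    "model_c": [2.0, 0.0, 0.0],
}
distances = pairwise_cosine_distance(weights)
```

Cosine distance ignores vector magnitude, so `model_a` and `model_c` come out at distance zero even though their weights differ in scale, while the orthogonal pair `model_a` and `model_b` sits at distance one.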
Original abstract

Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that extent, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an experimental study on audio transfer learning, pre-training models on ontology-derived subsets of AudioSet varying in class count and sample size, then fine-tuning on three downstream tasks: acoustic scene recognition, bird activity recognition, and speech command recognition. The central claim is that both larger sample counts and class counts improve transfer performance, but this effect is generally outweighed by the similarity between the pre-training data domain and the downstream task.

Significance. If the experimental controls and attribution hold, the findings would provide useful empirical guidance on data selection for audio pre-training, emphasizing domain similarity over raw scale. The ontology-based subset approach offers a systematic way to vary class count and sample size, which is a methodological strength for isolating factors in transfer learning.

major comments (2)
  1. [Section 3] Section 3 (Methodology, subset construction): The ontology-based subsets of AudioSet are presented as cleanly varying class count and sample size, but the paper does not report controls or measurements for systematic differences in acoustic quality, SNR, label noise, or recording conditions across ontology branches (e.g., speech vs. environmental). These factors can alter learned features independently of scale, directly threatening the claim that performance gains are attributable to class count or sample size rather than incidental domain alignment with the three downstream tasks.
  2. [Section 4] Section 4 (Results and analysis): The directional positive effects and comparative ranking of factors are reported without statistical significance tests, confidence intervals, or explicit controls for confounding variables such as total compute, model capacity, or data splits. This leaves the central attribution of effects to scale versus similarity resting on unverified experimental details.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly define or quantify 'similarity' between pre-training and downstream tasks (e.g., via feature overlap metrics or branch overlap in the ontology).
  2. [Figures/Tables] Figure captions and table legends should include details on the number of runs or seeds used to generate the reported metrics for reproducibility.
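
One simple way to make "branch overlap in the ontology" concrete is a Jaccard index over class sets. The sketch below assumes the downstream task's labels have already been mapped by hand onto ontology classes, a step the paper is not reported to perform. The class names and the function name `jaccard_overlap` are hypothetical.

```python
def jaccard_overlap(classes_a, classes_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of class labels."""
    a, b = set(classes_a), set(classes_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical class sets for one pre-training subset and one downstream task.
pretrain_classes = {"Bird", "Bird vocalization", "Dog"}
task_classes = {"Bird", "Bird vocalization"}
overlap = jaccard_overlap(pretrain_classes, task_classes)
```

A score near one would indicate a pre-training subset that covers little beyond the target task's classes, and a score of zero would indicate disjoint class sets. This is a coarse proxy and says nothing about acoustic similarity.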

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating planned revisions where appropriate.

  1. Referee: [Section 3] Section 3 (Methodology, subset construction): The ontology-based subsets of AudioSet are presented as cleanly varying class count and sample size, but the paper does not report controls or measurements for systematic differences in acoustic quality, SNR, label noise, or recording conditions across ontology branches (e.g., speech vs. environmental). These factors can alter learned features independently of scale, directly threatening the claim that performance gains are attributable to class count or sample size rather than incidental domain alignment with the three downstream tasks.

    Authors: We agree this is a valid concern. While the AudioSet ontology enables systematic class selection, we did not quantify SNR or label noise differences across branches in the original submission. In the revised manuscript we will add empirical measurements of average SNR (using available metadata) and estimated label noise rates per ontology branch, along with a discussion of how these may interact with domain similarity. We maintain that the shared source dataset reduces some variability compared to cross-dataset comparisons, but we will explicitly acknowledge the limitation.

    Revision: yes

  2. Referee: [Section 4] Section 4 (Results and analysis): The directional positive effects and comparative ranking of factors are reported without statistical significance tests, confidence intervals, or explicit controls for confounding variables such as total compute, model capacity, or data splits. This leaves the central attribution of effects to scale versus similarity resting on unverified experimental details.

    Authors: We accept the need for greater statistical rigor. The revised version will include bootstrap-derived 95% confidence intervals and paired significance tests for key performance differences. All experiments used the identical model architecture and training protocol, with fixed data splits and equal compute budgets per run; we will make these controls more prominent in the text. Full isolation of every possible confounder remains difficult at this scale, but the reported design already holds model capacity and compute constant.

    Revision: partial
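
A bootstrap-derived 95% confidence interval for a paired comparison can be sketched as follows. This is a generic percentile bootstrap over per-item correctness for two systems scored on the same test items, not the procedure the authors commit to. The function name, the resample count, and the example data are assumptions made for illustration.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on shared test items."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping A/B pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-item correctness (1 = correct): A scores 80/100, B scores 60/100.
correct_a = [1] * 80 + [0] * 20
correct_b = [1] * 60 + [0] * 40
lower, upper = paired_bootstrap_ci(correct_a, correct_b)
```

If the resulting interval excludes zero, the accuracy difference is unlikely to be explained by test-set sampling alone. Variation across training seeds would need a separate analysis over repeated runs.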

Circularity Check

0 steps flagged

No circularity: empirical measurements on AudioSet subsets with no derivations or self-referential equations

The paper is a purely experimental study that pre-trains models on ontology-derived AudioSet subsets of varying class count and sample size, then measures transfer performance on three fixed downstream tasks. No equations, fitted parameters, or theoretical derivations are present that could reduce to the inputs by construction. Results are reported as observed accuracy deltas rather than predictions forced by any model or ansatz. Self-citations, if present, are not load-bearing for any central claim and do not substitute for external verification. The work is self-contained against external benchmarks (standard AudioSet splits and downstream datasets) with no uniqueness theorems or renamings of known results invoked to force conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on standard supervised transfer learning assumptions and the representativeness of AudioSet subsets; no new entities or free parameters are introduced beyond typical hyperparameter choices in neural network training.

axioms (1)
  • Domain assumption: Ontology-based subsets of AudioSet vary class count and sample size independently of other acoustic properties.
    Invoked when constructing pre-training data to isolate the effects of scale and ontology.

pith-pipeline@v0.9.0 · 5693 in / 1219 out tokens · 39367 ms · 2026-05-21T10:51:14.820959+00:00 · methodology


