Data Augmentation for Instrument Classification Robust to Audio Effects
Pith reviewed 2026-05-24 18:54 UTC · model grok-4.3
The pith
Training instrument classifiers with audio-effect augmentation improves accuracy on processed one-shot sounds used in electronic music production.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A model trained on a large set of clean one-shot instrumental sounds loses accuracy when the test sounds receive audio effects typical of electronic music production; retraining the same model with those effects included as data augmentation restores most of the lost accuracy, and the paper reports the per-effect contribution to the recovery.
What carries the argument
Data augmentation that applies audio effects (reverb, delay, distortion, compression, EQ, etc.) to the training examples while keeping the original instrument label.
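A minimal sketch of label-preserving effect augmentation, assuming waveforms are plain lists of floats. The three effects, their parameter ranges, and every function name here are illustrative assumptions, not the paper's actual processing chain.

```python
import math
import random
from typing import Callable, List, Tuple

Signal = List[float]
Example = Tuple[Signal, str]  # (waveform samples, instrument label)


def distortion(x: Signal, gain: float) -> Signal:
    """Soft-clip the signal with tanh after applying a drive gain."""
    return [math.tanh(gain * s) for s in x]


def delay(x: Signal, delay_samples: int, mix: float) -> Signal:
    """Add one delayed copy of the signal (a single feed-forward echo)."""
    return [
        s + mix * (x[n - delay_samples] if n >= delay_samples else 0.0)
        for n, s in enumerate(x)
    ]


def low_pass(x: Signal, alpha: float) -> Signal:
    """One-pole low-pass filter, a crude stand-in for an EQ cut."""
    out, prev = [], 0.0
    for s in x:
        prev = alpha * s + (1.0 - alpha) * prev
        out.append(prev)
    return out


def augment(dataset: List[Example], rng: random.Random) -> List[Example]:
    """Return the clean examples plus one processed copy of each.

    Each processed copy keeps the original instrument label.
    The parameter ranges below are hypothetical.
    """
    effects: List[Callable[[Signal], Signal]] = [
        lambda x: distortion(x, gain=rng.uniform(1.0, 10.0)),
        lambda x: delay(x, delay_samples=rng.randint(1, 50), mix=rng.uniform(0.1, 0.7)),
        lambda x: low_pass(x, alpha=rng.uniform(0.05, 0.5)),
    ]
    augmented = list(dataset)
    for signal, label in dataset:
        effect = rng.choice(effects)
        augmented.append((effect(signal), label))
    return augmented


def tone(freq: float) -> Signal:
    """Short sine tone standing in for a one-shot sound (toy data)."""
    return [math.sin(2 * math.pi * freq * n / 1000) for n in range(200)]


clean = [(tone(50), "bass"), (tone(200), "flute")]
train_set = augment(clean, random.Random(0))
```

The clean examples stay in the training set alongside their processed copies, which is one way to keep performance on unprocessed sounds from degrading.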
If this is right
- Classifiers trained this way can label large sample-pack libraries without manual correction after common production processing.
- The per-effect accuracy tables identify which processing steps (for example heavy distortion) still require additional techniques.
- The same augmentation pipeline can be reused for other audio classification tasks that encounter production effects.
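A per-effect accuracy table of the kind mentioned above could be computed roughly as follows. This is a sketch under assumptions: the record layout, the function names, and the toy predictions are hypothetical and do not reproduce any result from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, str, str]  # (effect name, true label, predicted label)


def per_effect_accuracy(records: Iterable[Record]) -> Dict[str, float]:
    """Group predictions by the effect applied and compute accuracy per group."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for effect, truth, predicted in records:
        total[effect] += 1
        correct[effect] += int(truth == predicted)
    return {effect: correct[effect] / total[effect] for effect in total}


def recovered_fraction(clean_acc: float, baseline_acc: float,
                       augmented_acc: float) -> Optional[float]:
    """Share of the accuracy lost to an effect that augmentation wins back.

    clean_acc:     baseline model on unprocessed test sounds
    baseline_acc:  baseline model on processed test sounds
    augmented_acc: augmented model on processed test sounds
    Returns None when the effect caused no loss to recover.
    """
    gap = clean_acc - baseline_acc
    if gap <= 0:
        return None
    return (augmented_acc - baseline_acc) / gap


# Made-up predictions, only to show the call pattern.
toy_records = [
    ("reverb", "bass", "bass"),
    ("reverb", "bass", "flute"),
    ("distortion", "flute", "flute"),
]
table = per_effect_accuracy(toy_records)
```

An effect whose recovered fraction stays low after augmentation would be a candidate for the additional techniques the second bullet refers to.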
Where Pith is reading between the lines
- Extending the approach to multi-effect chains or to full music loops would test whether the robustness generalises beyond isolated one-shots.
- If the method works, automatic tagging services could offer users the option to train custom models on their own effect chains.
Load-bearing premise
The chosen audio effects and their parameter ranges are representative of real production processing and never change an instrument's identity enough to require a new label.
What would settle it
Measure classification accuracy on a held-out set of one-shot sounds that have been processed with the same effects but at parameter values never seen during augmentation; if accuracy remains high only when augmentation was used in training, the claim holds.
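One way to set up such a held-out test is to reserve part of each effect's parameter range for evaluation only. The sketch below assumes this split-by-range design; the parameter names, ranges, and held-out fraction are hypothetical.

```python
import random
from typing import Dict, Tuple

Range = Tuple[float, float]


def split_range(low: float, high: float,
                held_out_fraction: float = 0.25) -> Tuple[Range, Range]:
    """Reserve the top part of [low, high] for testing only."""
    cut = high - held_out_fraction * (high - low)
    return (low, cut), (cut, high)


def sample_params(ranges: Dict[str, Range], rng: random.Random) -> Dict[str, float]:
    """Draw one value per effect parameter from its allowed range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


# Hypothetical effect parameters and ranges.
full_ranges = {"distortion_gain": (1.0, 10.0), "delay_mix": (0.1, 0.7)}

train_ranges: Dict[str, Range] = {}
test_ranges: Dict[str, Range] = {}
for name, (low, high) in full_ranges.items():
    train_ranges[name], test_ranges[name] = split_range(low, high)

rng = random.Random(0)
train_params = sample_params(train_ranges, rng)  # used for augmentation
test_params = sample_params(test_ranges, rng)    # used only at evaluation time
```

Holding out the upper end of each range probes extrapolation to heavier processing; holding out a middle band instead would probe interpolation, which is a weaker test.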
Original abstract
Reusing recorded sounds (sampling) is a key component in Electronic Music Production (EMP), which has been present since its early days and is at the core of genres like hip-hop or jungle. Commercial and non-commercial services allow users to obtain collections of sounds (sample packs) to reuse in their compositions. Automatic classification of one-shot instrumental sounds allows automatically categorising the sounds contained in these collections, allowing easier navigation and better characterisation. Automatic instrument classification has mostly targeted the classification of unprocessed isolated instrumental sounds or detecting predominant instruments in mixed music tracks. For this classification to be useful in audio databases for EMP, it has to be robust to the audio effects applied to unprocessed sounds. In this paper we evaluate how a state of the art model trained with a large dataset of one-shot instrumental sounds performs when classifying instruments processed with audio effects. In order to evaluate the robustness of the model, we use data augmentation with audio effects and evaluate how each effect influences the classification accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the robustness of a state-of-the-art instrument classification model to audio effects commonly used in electronic music production. It trains the model on a large dataset of one-shot instrumental sounds and uses data augmentation with audio effects to assess how each effect influences classification accuracy on processed sounds.
Significance. If the empirical results demonstrate that augmentation with representative effects measurably improves accuracy on processed sounds while preserving performance on clean inputs, the work would offer a practical technique for deploying classifiers on real sample packs. The focus on EMP workflows addresses a gap between standard instrument classification benchmarks and production use cases.
minor comments (2)
- [Abstract] The evaluation plan is described, but no quantitative results, dataset sizes, model details, or statistical tests are provided, so the strength of the robustness claims cannot be assessed without the full experimental section.
- The weakest assumption (effects preserve instrument identity) is stated but receives no explicit validation or discussion of edge cases where heavy processing might alter perceived timbre enough to warrant relabeling.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were listed in the provided report.
Circularity Check
No significant circularity
The paper describes an empirical robustness evaluation: a state-of-the-art model is trained on one-shot instrumental sounds and tested after applying audio effects via data augmentation, with accuracy measured per effect. No equations, parameter fits, uniqueness theorems, or self-citation chains are invoked to derive or predict any quantity; the reported accuracies are direct experimental outcomes rather than quantities forced by construction from the training procedure itself. The evaluation design is therefore self-contained against external benchmarks and contains no load-bearing steps that reduce to the inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: audio effects applied to one-shot sounds preserve the instrument label for classification purposes.
Reference graph
Works this paper leans on
- [1] INTRODUCTION The repurposing of audio material, also known as sampling, has been a key component in Electronic Music Production (EMP) since its early days and became a practice which had a major influence in a large variety of musical genres. The availability of software such as Digital Audio Workstations, together with the audio sharing possibilities ...
- [2] RELATED WORK Automatic instrument classification can be split into two related tasks with a similar goal. The first is the identification of instruments in single instrument recordings (which can be isolated or overlapping notes) while the second is the recognition of the predominant instrument in a mixture of sounds. A thorough description of this task...
- [3] METHODOLOGY In our study we will conduct two experiments. First, we will try to understand how augmenting a dataset with specific effects can improve instrument classification and secondly, we will see if this augmentation can improve the robustness of a model to the selected effect. To investigate this, we process the training, validation and test sets o...
- [4] RESULTS Two experiments were conducted in our study. We firstly evaluated how augmenting the training set of NSynth [14] by applying audio effects to the sounds can improve the automatic classification on the instruments of the unmodified test set. In the second experiment we evaluated how robust a state of the art model for instrument classification is w...
- [5] CONCLUSIONS In this paper we evaluated how a state of the art algorithm for automatic instrument classification performs when classifying the NSynth dataset and how augmenting this dataset with audio effects commonly used in electronic music production influences its accuracy on both the original and processed versions of the audio. We identify that the a...
- [6] ACKNOWLEDGMENTS This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765068, MIP-Frontiers. We thank Matthew Davies for reviewing a draft of this paper and providing helpful feedback.
- [7] Frederic Font, Gerard Roma, and Xavier Serra, “Freesound technical demo,” in ACM International Conference on Multimedia (MM’13), Barcelona, Spain, 2013, ACM, pp. 411–412
- [8] Perfecto Herrera-Boyer, Geoffroy Peeters, and Shlomo Dubnov, “Automatic classification of musical instrument sounds,” Journal of New Music Research, vol. 32, pp. 3–21, 2003
- [9] Masataka Goto, “RWC music database: Popular, classical, and jazz music databases,” in 3rd International Society for Music Information Retrieval Conference (ISMIR), 2002, pp. 287–288
- [10] Juan J Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera, “A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals,” in 13th International Society for Music Information Retrieval Conference (ISMIR), 2012, pp. 559–564
- [11] Oriol Romani Picas, Hector Parra Rodriguez, Dara Dabiri, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, and Xavier Serra, “A real-time system for measuring sound goodness in instrumental sounds,” in Audio Engineering Society Convention 138, Warsaw, Poland, 2015, p. 9350
- [12] Venkatesh Shenoy Kadandale, “Musical Instrument Recognition in Multi-Instrument Audio Contexts,” MSc thesis, Universitat Pompeu Fabra, Oct. 2018
- [13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436, 2015
- [14] Luis Perez and Jason Wang, “The effectiveness of data augmentation in image classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017
- [15] Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, and Xavier Serra, “Timbre analysis of music audio signals with convolutional neural networks,” in 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 2744–2748
- [16] Yoonchang Han, Jaehun Kim, and Kyogu Lee, “Deep convolutional neural networks for predominant instrument recognition in polyphonic music,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 25, no. 1, pp. 208–221, Jan. 2017
- [17] Peter Li, Jiyuan Qian, and Tian Wang, “Automatic instrument recognition in polyphonic music using convolutional neural networks,” arXiv preprint arXiv:1511.05520, 2015.
Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019
- [18] Taejin Park and Taejin Lee, “Musical instrument sound classification with deep convolutional neural network using feature fusion approach,” arXiv preprint arXiv:1512.07370, 2015
- [19] Juhan Nam, Keunwoo Choi, Jongpil Lee, Szu-Yu Chou, and Yi-Hsuan Yang, “Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 41–51, Jan 2019
- [20] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 1068–1077
- [21] Brian McFee, Eric J Humphrey, and Juan Pablo Bello, “A software framework for musical data augmentation,” in 16th International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 248–254
- [22] Justin Salamon and Juan Pablo Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017
- [23] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224
- [24] Udo Zölzer, DAFX: Digital Audio Effects, John Wiley & Sons, 2011
- [25] Joshua D Reiss and Andrew McPherson, Audio effects: theory, implementation and application, CRC Press, 2014
- [26] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015
- [27] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016
- [28] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015
- [29]
- [30] Robert Ratcliffe, “A proposed typology of sampled material within electronic dance music,” Dancecult: Journal of Electronic Dance Music Culture, vol. 6, no. 1, pp. 97–122, 2014
- [31] Shruti Sarika Chakraborty and Ranjan Parekh, Improved Musical Instrument Classification Using Cepstral Coefficients and Neural Networks, pp. 123–138, Springer Singapore, Singapore, 2018
- [32] Rachel M. Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR Research,” in the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.