arxiv: 1907.08520 · v1 · pith:BT4CAD3Enew · submitted 2019-07-19 · 💻 cs.SD · eess.AS

Data Augmentation for Instrument Classification Robust to Audio Effects

Pith reviewed 2026-05-24 18:54 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords instrument classification · data augmentation · audio effects · electronic music production · one-shot sounds · robustness · sample packs

The pith

Training instrument classifiers with audio-effect augmentation improves accuracy on processed one-shot sounds used in electronic music production.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates a state-of-the-art instrument classifier on one-shot sounds that have been processed with common audio effects. It applies data augmentation during training by adding the same effects to the original dataset and measures resulting changes in classification accuracy for each effect. A sympathetic reader would care because automatic classification of sample packs is only practical if the labels remain reliable after producers apply reverb, compression, distortion and similar processing. The work shows that augmentation narrows the performance gap between clean and effected sounds without changing the underlying instrument labels. This directly addresses the mismatch between laboratory training data and real electronic-music workflows.

Core claim

A model trained on a large set of clean one-shot instrumental sounds loses accuracy when the test sounds receive audio effects typical of electronic music production; retraining the same model with those effects included as data augmentation restores most of the lost accuracy, and the paper reports the per-effect contribution to the recovery.

What carries the argument

Data augmentation that applies audio effects (reverb, delay, distortion, compression, EQ, etc.) to the training examples while keeping the original instrument label.
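
The idea can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' pipeline: the two toy effects (`distortion`, `delay`), the `EFFECTS` registry, the parameter ranges, and the choice to keep every clean example alongside one processed copy are all assumptions made for the example. The point it shows is that the waveform changes while the instrument label is carried over untouched.

```python
import math
import random


def distortion(samples, gain):
    """Soft-clipping distortion: a tanh waveshaper, output stays in [-1, 1]."""
    return [math.tanh(gain * s) for s in samples]


def delay(samples, delay_samples, mix):
    """Single-tap feed-forward delay: adds a scaled, delayed copy of the input."""
    out = list(samples)
    for i in range(delay_samples, len(samples)):
        out[i] += mix * samples[i - delay_samples]
    return out


# Hypothetical registry: effect name -> (function, random parameter sampler).
# The ranges below are illustrative, not taken from the paper.
EFFECTS = {
    "distortion": (distortion, lambda rng: {"gain": rng.uniform(1.0, 10.0)}),
    "delay": (
        delay,
        lambda rng: {"delay_samples": rng.randint(1, 8), "mix": rng.uniform(0.1, 0.5)},
    ),
}


def augment(dataset, effect_name, rng):
    """Return the clean examples plus one processed copy of each.

    dataset is a list of (samples, label) pairs; labels are never modified.
    """
    effect, sample_params = EFFECTS[effect_name]
    augmented = list(dataset)
    for samples, label in dataset:
        augmented.append((effect(samples, **sample_params(rng)), label))
    return augmented
```

Calling `augment(data, "distortion", random.Random(0))` on a two-example dataset would yield four examples, the last two being distorted copies with the same labels as the first two.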

If this is right

  • Classifiers trained this way can label large sample-pack libraries without manual correction after common production processing.
  • The per-effect accuracy tables identify which processing steps (for example heavy distortion) still require additional techniques.
  • The same augmentation pipeline can be reused for other audio classification tasks that encounter production effects.
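
A per-effect accuracy table of the kind mentioned in the second bullet could be assembled as follows. This is a hedged sketch: the record layout `(effect, true_label, predicted_label)`, the function names, and the `max_drop` tolerance are hypothetical and only illustrate how weak effects might be flagged.

```python
from collections import defaultdict


def per_effect_accuracy(records):
    """Group predictions by effect and return {effect: accuracy}.

    records is an iterable of (effect_name, true_label, predicted_label).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for effect, true_label, predicted_label in records:
        totals[effect] += 1
        hits[effect] += int(true_label == predicted_label)
    return {effect: hits[effect] / totals[effect] for effect in totals}


def effects_needing_attention(table, clean_accuracy, max_drop):
    """List effects whose accuracy falls more than max_drop below the clean score."""
    return sorted(
        effect for effect, acc in table.items() if clean_accuracy - acc > max_drop
    )
```

With a clean accuracy of 0.9 and a tolerance of 0.2, an effect scoring 0.5 would be flagged while one scoring 1.0 would not.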

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the approach to multi-effect chains or to full music loops would test whether the robustness generalises beyond isolated one-shots.
  • If the method works, automatic tagging services could offer users the option to train custom models on their own effect chains.

Load-bearing premise

The chosen audio effects and their parameter ranges are representative of real production processing and never change an instrument's identity enough to require a new label.

What would settle it

Measure classification accuracy on a held-out set of one-shot sounds that have been processed with the same effects but at parameter values never seen during augmentation; if accuracy remains high only when augmentation was used in training, the claim holds.
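
The test described above amounts to a parameter-range split plus a simple decision rule. The sketch below assumes a single scalar effect parameter, reserves the upper end of its range for testing, and uses a hypothetical `threshold` for what counts as "high" accuracy; none of these choices come from the paper.

```python
def split_parameter_range(low, high, held_out_fraction):
    """Split [low, high] into a training range and an unseen test range.

    The top held_out_fraction of the range is reserved for evaluation only.
    """
    if not 0.0 < held_out_fraction < 1.0:
        raise ValueError("held_out_fraction must be strictly between 0 and 1")
    cut = high - held_out_fraction * (high - low)
    return (low, cut), (cut, high)


def claim_supported(baseline_accuracy, augmented_accuracy, threshold):
    """True when only the augmented model stays at or above the threshold.

    Both accuracies are measured on sounds processed with held-out parameters.
    """
    return augmented_accuracy >= threshold and baseline_accuracy < threshold
```

For example, `split_parameter_range(0.0, 10.0, 0.2)` reserves gains between 8.0 and 10.0 for the held-out test, and `claim_supported(0.55, 0.82, 0.75)` would count as support for the claim under that assumed threshold.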

Figures

Figures reproduced from arXiv: 1907.08520 by António Ramires, Xavier Serra.

Figure 1: Single-layer CNN architecture proposed in [9].
Original abstract

Reusing recorded sounds (sampling) is a key component in Electronic Music Production (EMP), which has been present since its early days and is at the core of genres like hip-hop or jungle. Commercial and non-commercial services allow users to obtain collections of sounds (sample packs) to reuse in their compositions. Automatic classification of one-shot instrumental sounds allows automatically categorising the sounds contained in these collections, allowing easier navigation and better characterisation. Automatic instrument classification has mostly targeted the classification of unprocessed isolated instrumental sounds or detecting predominant instruments in mixed music tracks. For this classification to be useful in audio databases for EMP, it has to be robust to the audio effects applied to unprocessed sounds. In this paper we evaluate how a state of the art model trained with a large dataset of one-shot instrumental sounds performs when classifying instruments processed with audio effects. In order to evaluate the robustness of the model, we use data augmentation with audio effects and evaluate how each effect influences the classification accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript evaluates the robustness of a state-of-the-art instrument classification model to audio effects commonly used in electronic music production. It trains the model on a large dataset of one-shot instrumental sounds and uses data augmentation with audio effects to assess how each effect influences classification accuracy on processed sounds.

Significance. If the empirical results demonstrate that augmentation with representative effects measurably improves accuracy on processed sounds while preserving performance on clean inputs, the work would offer a practical technique for deploying classifiers on real sample packs. The focus on EMP workflows addresses a gap between standard instrument classification benchmarks and production use cases.

minor comments (2)
  1. [Abstract] The evaluation plan is described, but no quantitative results, dataset sizes, model details, or statistical tests are provided, so the strength of the robustness claims cannot be assessed without the full experimental section.
  2. The weakest assumption (effects preserve instrument identity) is stated but receives no explicit validation or discussion of edge cases where heavy processing might alter perceived timbre enough to warrant relabeling.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were listed in the provided report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical robustness evaluation: a state-of-the-art model is trained on one-shot instrumental sounds and tested after applying audio effects via data augmentation, with accuracy measured per effect. No equations, parameter fits, uniqueness theorems, or self-citation chains are invoked to derive or predict any quantity; the reported accuracies are direct experimental outcomes rather than quantities forced by construction from the training procedure itself. The evaluation design is therefore self-contained against external benchmarks and contains no load-bearing steps that reduce to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is inferred from the stated approach. The work relies on an existing large dataset and state-of-the-art model whose training details are not described here.

axioms (1)
  • domain assumption: Audio effects applied to one-shot sounds preserve the instrument label for classification purposes.
    Implicit in the plan to measure classification accuracy on processed versions of the same sounds.

pith-pipeline@v0.9.0 · 5694 in / 1205 out tokens · 24512 ms · 2026-05-24T18:54:58.727635+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 5 internal anchors

  1. [1]

    production-ready

    INTRODUCTION The repurposing of audio material, also known as sampling, has been a key component in Electronic Music Production (EMP) since its early days and became a practice which had a major influence in a large variety of musical genres. The availability of software such as Digital Audio Workstations, together with the audio sharing possibilities ...

  2. [2]

    RELATED WORK Automatic instrument classification can be split into two related tasks with a similar goal. The first is the identification of instruments in single instrument recordings (which can be isolated or overlapping notes) while the second is the recognition of the predominant instrument in a mixture of sounds. A thorough description of this task...

  3. [3]

    METHODOLOGY In our study we will conduct two experiments. First, we will try to understand how augmenting a dataset with specific effects can improve instrument classification and secondly, we will see if this augmentation can improve the robustness of a model to the selected effect. To investigate this, we process the training, validation and test sets o...

  4. [4]

    RESULTS Two experiments were conducted in our study. We firstly evaluated how augmenting the training set of NSynth [14] by applying audio effects to the sounds can improve the automatic classification on the instruments of the unmodified test set. In the second experiment we evaluated how robust a state of the art model for instrument classification is w...

  5. [5]

    CONCLUSIONS In this paper we evaluated how a state of the art algorithm for automatic instrument classification performs when classifying the NSynth dataset and how augmenting this dataset with audio effects commonly used in electronic music production influences its accuracy on both the original and processed versions of the audio. We identify that the a...

  6. [6]

    We thank Matthew Davies for reviewing a draft of this paper and providing helpful feedback

    ACKNOWLEDGMENTS This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765068, MIP-Frontiers. We thank Matthew Davies for reviewing a draft of this paper and providing helpful feedback

  7. [7]

    Freesound technical demo,

    Frederic Font, Gerard Roma, and Xavier Serra, “Freesound technical demo,” in ACM International Conference on Multimedia (MM’13), Barcelona, Spain, 2013, ACM, pp. 411–412

  8. [8]

    Automatic classification of musical instrument sounds,

    Perfecto Herrera-Boyer, Geoffroy Peeters, and Shlomo Dubnov, “Automatic classification of musical instrument sounds,” Journal of New Music Research, vol. 32, pp. 3–21, 2003

  9. [9]

    RWC music database: Popular, classical, and jazz music databases,

    Masataka Goto, “RWC music database: Popular, classical, and jazz music databases,” in 3rd International Society for Music Information Retrieval Conference (ISMIR), 2002, pp. 287–288

  10. [10]

    A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals,

    Juan J Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera, “A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals,” in 13th International Society for Music Information Retrieval Conference (ISMIR), 2012, pp. 559–564

  11. [11]

    A real-time system for measuring sound goodness in instrumental sounds,

    Oriol Romani Picas, Hector Parra Rodriguez, Dara Dabiri, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, and Xavier Serra, “A real-time system for measuring sound goodness in instrumental sounds,” in Audio Engineering Society Convention 138, Warsaw, Poland, 2015, p. 9350

  12. [12]

    Musical Instrument Recognition in Multi-Instrument Audio Contexts,

    Venkatesh Shenoy Kadandale, “Musical Instrument Recognition in Multi-Instrument Audio Contexts,” MSc thesis, Universitat Pompeu Fabra, Oct. 2018

  13. [13]

    Deep learning,

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436, 2015

  14. [14]

    The Effectiveness of Data Augmentation in Image Classification using Deep Learning

    Luis Perez and Jason Wang, “The effectiveness of data augmentation in image classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017

  15. [15]

    Timbre analysis of music audio signals with convolutional neural networks,

    Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, and Xavier Serra, “Timbre analysis of music audio signals with convolutional neural networks,” in 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 2744–2748

  16. [16]

    Deep convolutional neural networks for predominant instrument recognition in polyphonic music,

    Yoonchang Han, Jaehun Kim, and Kyogu Lee, “Deep convolutional neural networks for predominant instrument recognition in polyphonic music,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 25, no. 1, pp. 208–221, Jan. 2017

  17. [17]

    Automatic Instrument Recognition in Polyphonic Music Using Convolutional Neural Networks

    Peter Li, Jiyuan Qian, and Tian Wang, “Automatic instrument recognition in polyphonic music using convolutional neural networks,” arXiv preprint arXiv:1511.05520, 2015. Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019

  18. [18]

    Musical instrument sound classification with deep convolutional neural network using feature fusion approach

    Taejin Park and Taejin Lee, “Musical instrument sound classification with deep convolutional neural network using feature fusion approach,” arXiv preprint arXiv:1512.07370, 2015

  19. [19]

    Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach,

    Juhan Nam, Keunwoo Choi, Jongpil Lee, Szu-Yu Chou, and Yi-Hsuan Yang, “Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 41–51, Jan 2019

  20. [20]

    Neural audio synthesis of musical notes with wavenet autoencoders,

    Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 1068–1077

  21. [21]

    A software framework for musical data augmentation,

    Brian McFee, Eric J Humphrey, and Juan Pablo Bello, “A software framework for musical data augmentation,” in 16th International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 248–254

  22. [22]

    Deep convolutional neural networks and data augmentation for environmental sound classification,

    Justin Salamon and Juan Pablo Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017

  23. [23]

    A study on data augmentation of reverberant speech for robust speech recognition,

    Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224

  24. [24]

    Udo Zölzer, DAFX: Digital Audio Effects, John Wiley & Sons, 2011

  25. [25]

    Joshua D Reiss and Andrew McPherson, Audio effects: theory, implementation and application, CRC Press, 2014

  26. [26]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015

  27. [27]

    Fast and accurate deep network learning by exponential linear units (ELUs),

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016

  28. [28]

    Adam: A method for stochastic optimization,

    Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015

  29. [29]

    librosa/librosa: 0.6.3,

    Brian McFee et al., “librosa/librosa: 0.6.3,” Feb. 2019

  30. [30]

    A proposed typology of sampled material within electronic dance music,

    Robert Ratcliffe, “A proposed typology of sampled material within electronic dance music,” Dancecult: Journal of Electronic Dance Music Culture, vol. 6, no. 1, pp. 97–122, 2014

  31. [31]

    Improved Musical Instrument Classification Using Cepstral Coefficients and Neural Networks,

    Shruti Sarika Chakraborty and Ranjan Parekh, Improved Musical Instrument Classification Using Cepstral Coefficients and Neural Networks, pp. 123–138, Springer Singapore, Singapore, 2018

  32. [32]

    MedleyDB: A multitrack dataset for annotation-intensive MIR Research,

    Rachel M. Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR Research,” in the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014