
arXiv: 2105.00933 · v3 · pith:MTEZSEZNnew · submitted 2021-05-03 · 💻 cs.SD · cs.AI · cs.LG · eess.AS

Deep Neural Network for Musical Instrument Recognition using MFCCs

Pith reviewed 2026-05-24 13:04 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.LG · eess.AS
keywords musical instrument recognition · MFCC · artificial neural network · audio classification · London Philharmonic Orchestra dataset · sound classification

The pith

An artificial neural network classifies twenty musical instruments at state-of-the-art accuracy using only MFCC audio features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains an artificial neural network on mel-frequency cepstral coefficients drawn from the full London Philharmonic Orchestra dataset to distinguish twenty instrument classes across woodwind, brass, percussion, and string families. The central effort is to show that this minimal feature set and standard network architecture reach the highest reported accuracy on that collection. If the result holds, it indicates that instrument recognition can proceed without hand-crafted extra features, data augmentation, or custom network designs. This would matter for building lighter audio classifiers in music applications where only cepstral data is available.

Core claim

The proposed ANN model, trained on MFCCs from the full twenty-class London Philharmonic Orchestra dataset spanning woodwinds, brass, percussion, and strings, achieves state-of-the-art accuracy in musical instrument recognition.

What carries the argument

An artificial neural network that takes mel-frequency cepstral coefficients as input for classifying audio into instrument classes.
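
To make the shape of that pipeline concrete, here is a minimal sketch of a feed-forward classifier that maps one MFCC vector to probabilities over twenty classes. The abstract does not give the architecture, so the input size (13 coefficients), the single hidden layer of 32 units, the ReLU activation, and the random untrained weights are all assumptions chosen for illustration; only the twenty output classes come from the paper.

```python
import math
import random

def dense(x, weights, biases):
    """One fully connected layer: y_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random small weights and zero biases (untrained, illustration only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

def predict(mfcc_vector, layers):
    """Forward pass: ReLU on hidden layers, softmax on the output layer."""
    h = mfcc_vector
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(h, weights, biases))

rng = random.Random(0)
# 13 features and 32 hidden units are assumed; 20 classes is from the paper.
N_FEATURES, N_HIDDEN, N_CLASSES = 13, 32, 20
layers = [init_layer(N_FEATURES, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_CLASSES, rng)]

fake_mfcc = [rng.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]  # stand-in input
probs = predict(fake_mfcc, layers)
predicted_class = max(range(N_CLASSES), key=probs.__getitem__)
```

With untrained weights the prediction is meaningless; the sketch only shows how little machinery sits between the coefficients and the class decision.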

If this is right

  • The model distinguishes instruments across all four families using only the chosen coefficients.
  • No supplementary audio descriptors or augmentation steps are needed for the reported accuracy.
  • A standard feed-forward network suffices where the dataset supplies clean, balanced examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same MFCC-plus-ANN pipeline could be applied to other instrument collections to test whether the accuracy transfers.
  • Real-time instrument detection on embedded devices might become feasible given the low input dimensionality.
  • If the result generalizes, music information retrieval pipelines could drop more elaborate front-ends without loss of performance.

Load-bearing premise

That MFCC features alone fed to an ANN are sufficient to reach state-of-the-art performance on the twenty-class dataset without additional features, data augmentation, or specialized architectures.

What would settle it

A replication experiment on the identical London Philharmonic Orchestra twenty-class splits that reports accuracy below the claimed state-of-the-art level under the same evaluation protocol.
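
A replication of this kind depends on a split that others can regenerate exactly. The sketch below shows one way to build a reproducible, class-stratified train/test split from a list of labels. The 80/20 ratio, the seed, and the helper name `stratified_split` are hypothetical; the abstract does not state the split the authors used.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately.

    Every class contributes at least one test example, so a class with a
    single recording ends up with no training example in this sketch.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for label in sorted(by_class):  # sorted so the order is deterministic
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy label list: 10 clarinet clips and 5 viola clips.
labels = ["clarinet"] * 10 + ["viola"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
```

Publishing the index lists (or the seed and the file ordering) alongside the accuracy figure is what would let a later experiment claim to use the identical splits.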

Figures

Figures reproduced from arXiv: 2105.00933 by Abdullah Faiz Ur Rahman Khilji, Partha Pakray, Saranga Kingkor Mahanta.

Figure 1. Data distribution among the 20 classes.
Excerpt from the surrounding text: "… classes, hence they are consolidated into a single class, i.e. 'percussion'. 4 Pre-processing. The data was already noise-free and consisted of single instrument tones per example corresponding to the respective class, thus relieving us from performing complex processing procedures. The various steps of pre-processing that were performed have been described in detail i…"
Figure 2. Durations of all examples.
Recovered plot labels: panel titles "Clarinet (piano) - 1 min 17 seconds" and "Clarinet (piano) trimmed to 3 seconds"; x-axis "Time".
Figure 3. Longest audio clip trimmed to 3 seconds.
Recovered plot labels: panel titles "Viola (pianissimo) - 0.07 seconds" and "Viola (pianissimo) padded until 3 seconds"; x-axis "Time".
Figure 4. Shortest audio clip padded until 3 seconds.
Excerpt from the surrounding text: "… defining features of the audio clip mostly occur before 3 seconds into the signal and most of the examples are almost periodic with minuscule periods. Another reason for not choosing a fixed length of more than 3 seconds is to limit the number of sparse values that result from padding and to reduce dimensional size. 4.2 Extracting Mel-Frequency Cepstral Coefficien…"
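
The trim-or-pad step described in these captions can be sketched as follows. The 3-second target comes from the paper; padding with zeros at the end is an assumption (the excerpt mentions "sparse values that result from padding" but does not say where they are placed), and the tiny sample rate of 100 Hz is used only to keep the example small.

```python
def fix_length(samples, sample_rate, target_seconds=3.0):
    """Trim or zero-pad a mono signal to exactly target_seconds.

    Clips longer than the target keep only their first target_seconds;
    shorter clips are padded with trailing zeros (assumed padding scheme).
    """
    target = int(round(sample_rate * target_seconds))
    if len(samples) >= target:
        return list(samples[:target])
    return list(samples) + [0.0] * (target - len(samples))

SAMPLE_RATE = 100  # illustrative only; real audio uses a far higher rate

# Longest clip in the captions: 1 min 17 s. Shortest: 0.07 s.
long_clip = [0.01] * (77 * SAMPLE_RATE)
short_clip = [0.004] * 7

trimmed = fix_length(long_clip, SAMPLE_RATE)
padded = fix_length(short_clip, SAMPLE_RATE)
```

After this step every example has the same number of samples, so the MFCC matrices fed to the network share one shape.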
Figure 5. Steps to extract Cepstral Coefficients from an audio signal.
Excerpt from the surrounding text: "Peaks are observed at periodic elements of the original signal while computing the log of the magnitude of the Fourier transform of the audio signal followed by taking its spectrum by a cosine transformation. The resulting spectrum lies in the quefrency domain [9]. Humans perceive amplitude logarithmically, hence conversion to the Log-Amplitude Sp…"
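
The chain named in this caption (Fourier transform, log of the magnitude, cosine transform) can be sketched in plain Python. This is a simplified illustration and not the authors' extraction code: it omits the mel filterbank, windowing, and framing, so it yields plain cepstral coefficients for one frame instead of full MFCCs. The frame length of 64 samples and the choice of 13 coefficients are assumptions.

```python
import cmath
import math

def dft_magnitude(frame):
    """Magnitude of the DFT for the non-negative frequency bins (O(n^2))."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def dct2(values):
    """Unnormalised DCT-II: X_k = sum_i x_i * cos(pi * k * (2i + 1) / (2n))."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]

def cepstral_coefficients(frame, n_coeffs=13, eps=1e-10):
    """Log-magnitude spectrum followed by a cosine transform.

    eps avoids log(0) on empty bins. No mel filterbank is applied here.
    """
    log_spectrum = [math.log(m + eps) for m in dft_magnitude(frame)]
    return dct2(log_spectrum)[:n_coeffs]

# Toy frame: two sinusoids, 64 samples.
frame = [math.sin(2 * math.pi * 4 * t / 64) + 0.5 * math.sin(2 * math.pi * 9 * t / 64)
         for t in range(64)]
coeffs = cepstral_coefficients(frame)
flat = dct2([1.0, 1.0, 1.0, 1.0])  # a constant input has only a k=0 term
```

In practice an audio library would handle framing, windowing, and the mel filterbank; the sketch only traces the order of operations shown in the figure.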
Figure 6. Proposed model architecture.
Figure 7. Accuracy and Loss on the Train and Test sets.
Excerpt from the surrounding text: "The experimental setup, Section 6, was finalized after a commendable number of iterations of hyperparameter tuning. Although a different set of number of layers and neurons had resulted in a slightly better validation accuracy, this particular model resulted in a better and more uniform F1 score over the classes in addition to a more stable fluctuation of the ac…"
Figure 8. Confusion Matrix.
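
For readers who want to check a confusion matrix like this one against per-class scores, the sketch below builds the matrix from true and predicted labels and derives an F1 score for each class. The three-class toy labels are invented for illustration and are not data from the paper.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts examples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_f1(matrix):
    """F1 = 2*TP / (2*TP + FP + FN) for each class; 0.0 if undefined."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # column total minus TP
        fn = sum(matrix[c]) - tp                       # row total minus TP
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy example with three classes (illustrative labels only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
f1 = per_class_f1(cm)
macro_f1 = sum(f1) / len(f1)  # unweighted mean across classes
```

A uniform spread of per-class F1 values, which the Figure 7 excerpt says guided model selection, would show up here as a small gap between the lowest and highest entries of `f1`.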
Figure 9. Precision-Recall and AUC-ROC curves.
Excerpt from the surrounding text: "Acknowledgements. We would like to thank the Department of Computer Science and Engineering and Center for Natural Language Processing (CNLP) at National Institute of Technology Silchar for providing the requisite support and infrastructure to execute this work."
Original abstract

The task of efficient automatic music classification is of vital importance and forms the basis for various advanced applications of AI in the musical domain. Musical instrument recognition is the task of instrument identification by virtue of its audio. This audio, also termed as the sound vibrations are leveraged by the model to match with the instrument classes. In this paper, we use an artificial neural network (ANN) model that was trained to perform classification on twenty different classes of musical instruments. Here we use use only the mel-frequency cepstral coefficients (MFCCs) of the audio data. Our proposed model trains on the full London philharmonic orchestra dataset which contains twenty classes of instruments belonging to the four families viz. woodwinds, brass, percussion, and strings. Based on experimental results our model achieves state-of-the-art accuracy on the same.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes training an artificial neural network (ANN) on mel-frequency cepstral coefficients (MFCCs) extracted from audio to classify 20 musical instrument classes drawn from the four families in the London Philharmonic Orchestra dataset, and asserts that experimental results demonstrate state-of-the-art accuracy.

Significance. If the experimental protocol, accuracy figure, and direct comparisons to prior published results on the identical 20-class LPO task were supplied and shown to be independent of modeling choices, the work would provide evidence that a simple MFCC+ANN pipeline can match or exceed more elaborate approaches; this would be a useful negative result on the necessity of additional features or architectures for this dataset.

major comments (2)
  1. [Abstract] Abstract: the central claim that the model 'achieves state-of-the-art accuracy on the same' is unsupported because the manuscript supplies neither the achieved test accuracy, the train/test split, the class balance, nor any cited prior accuracy on the twenty-class London Philharmonic Orchestra dataset.
  2. [Abstract] Abstract: the assertion that MFCC features alone fed to an ANN suffice for SOTA performance rests on an unevidenced experimental result; without the model architecture, hyper-parameters, training details, or baseline comparisons, the claim cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: repeated word 'use use only'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We agree that the abstract requires additional details to support the claims made and will revise the manuscript to address these points.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the model 'achieves state-of-the-art accuracy on the same' is unsupported because the manuscript supplies neither the achieved test accuracy, the train/test split, the class balance, nor any cited prior accuracy on the twenty-class London Philharmonic Orchestra dataset.

    Authors: We acknowledge this point. The experimental results section of the full manuscript reports the test accuracy achieved by the ANN on MFCC features, along with the train/test split used and class distribution in the London Philharmonic Orchestra dataset. However, these specifics are not summarized in the abstract. In the revised version, we will update the abstract to explicitly state the achieved accuracy, describe the split and balance, and add citations to prior published results on the identical 20-class task to allow direct comparison. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that MFCC features alone fed to an ANN suffice for SOTA performance rests on an unevidenced experimental result; without the model architecture, hyper-parameters, training details, or baseline comparisons, the claim cannot be evaluated.

    Authors: We agree that the abstract as written does not provide these supporting details. The full manuscript includes the ANN architecture, hyperparameter settings, training procedure, and comparisons to baselines. To make the SOTA claim evaluable from the abstract alone, we will revise it to include a concise summary of the model, key hyperparameters, training details, and baseline results. This will also clarify that the result is based on the reported experimental protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracy claim is independent of modeling inputs

full rationale

The paper reports an experimental result from training an ANN on MFCC features extracted from the London Philharmonic Orchestra dataset and states that this yields state-of-the-art accuracy. This is a direct performance measurement on held-out data rather than a derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-definitional steps, uniqueness theorems, ansatzes smuggled via citation, or renamings of known results appear. The central claim rests on empirical evaluation, which is self-contained against external benchmarks when the accuracy number and protocol are reported.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the unverified experimental result and the domain assumption that MFCCs capture the necessary distinctions among the twenty classes. No invented entities are described. The one free parameter listed below, the set of ANN hyperparameters, cannot be pinned down because the abstract provides no implementation details.

free parameters (1)
  • ANN architecture hyperparameters
    Number of layers, neurons, and training settings are chosen to produce the reported accuracy but are not specified.
axioms (2)
  • domain assumption MFCCs alone contain sufficient information to distinguish the twenty instrument classes
    The model is trained using only these coefficients as stated in the abstract.
  • domain assumption The London Philharmonic Orchestra dataset constitutes a fair benchmark for claiming state-of-the-art performance
    The paper trains on the full dataset and asserts SOTA without detailing prior baselines.



Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  [1] Chakraborty, S. S. & Parekh, R. (2018). Improved musical instrument classification using cepstral coefficients and neural networks. In Methodologies and Application Issues of Contemporary Computing Framework. Springer, pp. 123–138

  [2] Deng, J. D., Simmermacher, C., & Cranefield, S. (2008). A study on feature analysis for musical instrument classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 38, No. 2, pp. 429–438

  [3] Eichner, M., Wolff, M., & Hoffmann, R. (2006). Instrument classification using hidden markov models. system, Vol. 1, No. 2, pp. 3

  [4] Eronen, A. & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients and temporal features. 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), volume 2, IEEE, pp. II753–II756

  [5] Essid, S., Richard, G., & David, B. (2005). Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, pp. 68–80

  [6] Gulhane, S. R., Suresh, D. S., & Sanjay, S. B. (2018). Identification of musical instruments using mfcc features. International Conference On Computational Vision and Bio Inspired Computing, Springer, pp. 957–968

  [7] Haidar-Ahmad, L. (2019). Music and instrument classification using deep learning technics. Recall, Vol. 67, No. 37.00, pp. 80–00

  [8] Marques, J. & Moreno, P. J. (1999). A study of musical instrument classification using gaussian mixture models and support vector machines. Cambridge Research Laboratory Technical Report Series CRL, Vol. 4, pp. 143

  [9] Oppenheim, A. V. & Schafer, R. W. (2004). From frequency to quefrency: A history of the cepstrum. IEEE Signal Processing Magazine, Vol. 21, No. 5, pp. 95–106

  [10] Siebert, X., Mélot, H., & Hulshof, C. Study of the robustness of descriptors for musical instruments classification

  [11] Singh, P., Bachhav, D., Joshi, O., & Patil, N. (2019). Implementing musical instrument recognition using cnn and svm. International Research Journal of Engineering and Technology, pp. 1487–1493

  [12] Solanki, A. & Pandey, S. (2019). Music instrument recognition using deep convolutional neural networks. International Journal of Information Technology, pp. 1–10

  [13] Toghiani-Rizi, B. & Windmark, M. (2017). Musical instrument recognition using their distinctive characteristics in artificial neural networks. arXiv preprint arXiv:1705.04971

  [14] Valverde-Albacete, F. J. & Peláez-Moreno, C. (2014). 100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox. PloS one, Vol. 9, No. 1, pp. e84217