pith. machine review for the scientific record

arXiv: 2604.05011 · v1 · submitted 2026-04-06 · 💻 cs.SD · cs.AI

Recognition: 2 Lean theorem links

YMIR: A New Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI
keywords Yemeni music · genre classification · convolutional neural networks · Mel-spectrograms · music information retrieval · benchmark dataset · Arabic music

The pith

A new dataset of Yemeni music enables a convolutional neural network to classify five traditional genres at 98.8 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the YMIR dataset of 1,475 audio clips spanning five Yemeni genres, each clip labeled by five experts with strong agreement. It develops the YMCM convolutional neural network and runs thirty experiments comparing it to standard models across several audio feature types. The best result occurs when Mel-spectrograms feed into YMCM. This matters because most existing music classification tools focus on Western styles and leave regional traditions without shared resources for training and testing. The work also shows how the choice of audio representation interacts with model design for this specific cultural material.

Core claim

We introduce the YMIR dataset containing 1,475 carefully selected audio clips across five traditional Yemeni genres: Sanaani, Hadhrami, Lahji, Tihami, and Adeni. The clips were annotated by five Yemeni music experts with a Fleiss kappa of 0.85. We propose YMCM, a CNN model for genre classification from time-frequency features. In experiments comparing six groups and five architectures, YMCM with Mel-spectrogram features achieves the highest accuracy of 98.8 percent and outperforms AlexNet, VGG16, MobileNet, and a baseline CNN.
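
The annotation-quality figure above is a standard chance-corrected agreement statistic. As a concrete reference, here is a minimal NumPy sketch of Fleiss' kappa on a toy rating matrix; the counts are illustrative and not drawn from the YMIR labeling protocol.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of rater counts.

    counts[i, j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters.
    """
    n_items = counts.shape[0]
    n_raters = counts[0].sum()

    # Per-item agreement: fraction of rater pairs that agree on that item.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Expected chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)

    return float((p_bar - p_e) / (1.0 - p_e))

# Toy example: 4 clips, 5 raters, 3 hypothetical genre labels.
toy = np.array([
    [5, 0, 0],
    [4, 1, 0],
    [0, 5, 0],
    [1, 0, 4],
])
print(round(fleiss_kappa(toy), 3))  # higher values mean stronger agreement
```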

What carries the argument

The Yemeni Music Classification Model (YMCM), a convolutional neural network that processes time-frequency representations such as Mel-spectrograms to assign clips to one of five Yemeni music genres.
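
The layer-by-layer YMCM configuration is not reproduced on this page, so the PyTorch sketch below is only a generic small CNN over Mel-spectrogram inputs; the layer sizes, input resolution, and pooling choices are assumptions meant to show the shape of the approach (a single-channel time-frequency image in, five genre logits out), not the authors' architecture.

```python
import torch
import torch.nn as nn

class GenreCNN(nn.Module):
    """Generic CNN over log-Mel spectrograms for five genre classes.

    Placeholder layer sizes; not the published YMCM configuration.
    """
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse frequency and time to one 64-d vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames), one log-Mel spectrogram per clip
        return self.classifier(self.features(x).flatten(1))  # genre logits

model = GenreCNN()
dummy = torch.randn(2, 1, 128, 431)  # ~10 s at sr 22050, hop 512 (assumed, not from the paper)
print(model(dummy).shape)            # torch.Size([2, 5])
```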

If this is right

  • The YMIR dataset serves as a benchmark for future models that classify Yemeni music genres.
  • Mel-spectrograms yield higher accuracy than Chroma, FilterBank, or MFCC features when used with convolutional networks on this material (a feature-extraction sketch follows this list).
  • A custom CNN architecture can exceed the performance of standard models such as AlexNet and VGG16 under matched experimental conditions.
  • The systematic comparison of features and architectures supplies guidance for applying similar methods to other culturally specific music traditions.
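
For readers who want the feature families above in concrete form, this librosa sketch extracts the Mel-spectrogram, Chroma, and MFCC (13, 20, or 40 coefficients) representations for one clip; the sampling rate, FFT size, and hop length are illustrative defaults, not the paper's documented preprocessing, and the log-Mel output stands in for what speech pipelines usually call FilterBank energies.

```python
import librosa

def extract_features(path: str, sr: int = 22050, n_fft: int = 2048, hop: int = 512) -> dict:
    """Feature families compared in the paper, for a single audio clip.

    sr / n_fft / hop are assumed defaults, not YMIR's reported pipeline settings.
    """
    y, sr = librosa.load(path, sr=sr, mono=True)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=128)
    log_mel = librosa.power_to_db(mel)  # Mel-spectrogram in dB (≈ log filterbank energies)

    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)

    mfccs = {f"mfcc{k}": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=k, n_fft=n_fft, hop_length=hop)
             for k in (13, 20, 40)}  # the three MFCC sizes evaluated in the paper

    return {"mel": log_mel, "chroma": chroma, **mfccs}
```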

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Acoustic patterns that separate the five genres appear sufficiently distinct for convolutional networks to exploit once suitable features are chosen.
  • The same dataset and modeling approach could be extended to additional Yemeni or regional genres to test whether the reported accuracy generalizes.
  • Trained models from this benchmark might later support comparative studies between Yemeni music and other Arabic or non-Western traditions.

Load-bearing premise

The 1,475 selected clips and the expert annotations form a representative sample of the five genres with no significant overlap or labeling artifacts.

What would settle it

Running the trained YMCM on a fresh collection of Yemeni music recordings gathered separately from the YMIR clips and checking whether accuracy remains near 98.8 percent.

Figures

Figures reproduced from arXiv: 2604.05011 by Abdulrahman A. AlKannad, Eiad Almekhlafi, Moeen AL-Makhlafi, Nawaf Q. Othman Ahmed Mohammed, Saher Qaid.

Figure 1: Waveform samples of the five YMIR genre classification dataset.
Figure 2: Data labeling structure.
Figure 3: Workflow diagram of the proposed framework.
Figure 4: WAV-format audio data Short-Time Fourier Transform extraction.
Figure 5: Workflow diagram of the proposed framework.
Figure 6: Summary of the experimental design and configurations conducted in this study.
Figure 7: Validation accuracy of the proposed YMCM framework using several …
Figure 8: Validation loss of the proposed YMCM framework using several …
Figure 9: Confusion matrices for the proposed YMCM model with different audio feature extraction methods: (a) Filterbanks, (b) Chroma features, (c) Mel …
Original abstract

Automatic music genre classification is a major task in music information retrieval; however, most current benchmarks and models have been developed primarily for Western music, leaving culturally specific traditions underrepresented. In this paper, we introduce the Yemeni Music Information Retrieval (YMIR) dataset, which contains 1,475 carefully selected audio clips covering five traditional Yemeni genres: Sanaani, Hadhrami, Lahji, Tihami, and Adeni. The dataset was labeled by five Yemeni music experts following a clear and structured protocol, resulting in strong inter-annotator agreement (Fleiss kappa = 0.85). We also propose the Yemeni Music Classification Model (YMCM), a convolutional neural network (CNN)-based system designed to classify music genres from time-frequency features. Using a consistent preprocessing pipeline, we perform a systematic comparison across six experimental groups and five different architectures, resulting in a total of 30 experiments. Specifically, we evaluate several feature representations, including Mel-spectrograms, Chroma, FilterBank, and MFCCs with 13, 20, and 40 coefficients, and benchmark YMCM against standard models (AlexNet, VGG16, MobileNet, and a baseline CNN) under the same experimental conditions. The experimental findings reveal that YMCM is the most effective, achieving the highest accuracy of 98.8% with Mel-spectrogram features. The results also provide practical insights into the relationship between feature representation and model capacity. The findings establish YMIR as a useful benchmark and YMCM as a strong baseline for classifying Yemeni music genres.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the YMIR dataset of 1,475 audio clips spanning five Yemeni genres (Sanaani, Hadhrami, Lahji, Tihami, Adeni), annotated by five experts with Fleiss' kappa = 0.85. It proposes the YMCM CNN and reports results from 30 experiments comparing feature representations (Mel-spectrograms, Chroma, FilterBank, MFCCs) and architectures (YMCM vs. AlexNet, VGG16, MobileNet, baseline CNN), with YMCM achieving the highest accuracy of 98.8% on Mel-spectrograms.

Significance. If the reported accuracies are shown to reflect generalization rather than leakage, the work provides a valuable new benchmark for underrepresented non-Western music in MIR, supported by strong expert annotation agreement and a systematic multi-feature, multi-model comparison that yields practical insights on feature-model interactions.

major comments (1)
  1. [Abstract and experimental results] Abstract and experimental protocol: the headline claim that YMCM achieves 98.8% accuracy and is superior to AlexNet/VGG16/MobileNet under identical conditions is load-bearing, yet the manuscript supplies no train/test split ratios, source-blocking procedure for clips from the same performance or artist, cross-validation scheme, or variance across repeated runs. On a 1,475-clip corpus this omission prevents verification that the 30 experiments used a truly held-out test set, directly undermining the generalization and comparison results.
minor comments (2)
  1. [Dataset] Dataset section: per-genre clip counts and balance statistics are not reported, making it difficult to assess whether the five-class benchmark is balanced or whether class imbalance could affect the reported accuracies.
  2. [Abstract] The abstract refers to 'six experimental groups' without defining them; a brief enumeration in the methods would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the experimental protocol description in the original manuscript was incomplete, which is a significant omission given the importance of verifying generalization and the absence of data leakage. We will revise the manuscript to address this fully.

Point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and experimental protocol: the headline claim that YMCM achieves 98.8% accuracy and is superior to AlexNet/VGG16/MobileNet under identical conditions is load-bearing, yet the manuscript supplies no train/test split ratios, source-blocking procedure for clips from the same performance or artist, cross-validation scheme, or variance across repeated runs. On a 1,475-clip corpus this omission prevents verification that the 30 experiments used a truly held-out test set, directly undermining the generalization and comparison results.

    Authors: We agree that this is a valid and important criticism. The manuscript did not include the required details on data partitioning and evaluation, which prevents independent assessment of whether the reported accuracies reflect true generalization. In the revised version, we will add a dedicated subsection on the experimental protocol that specifies the train/test split ratios, the source-blocking procedure used to keep clips from the same performance or artist within the same split, the cross-validation scheme, and the mean accuracy with standard deviation across repeated runs. These additions will allow verification that a truly held-out test set was used and will strengthen the comparative claims. revision: yes
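
A minimal way to implement the source-blocking the authors promise is a group-aware split, sketched here with scikit-learn's GroupShuffleSplit; the per-clip `source_id` (artist or performance identifier) is a hypothetical field, not metadata documented for YMIR.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-clip metadata: a genre label and a source (artist/performance) id.
rng = np.random.default_rng(0)
n_clips = 1475
labels = rng.integers(0, 5, size=n_clips)        # five genres
source_id = rng.integers(0, 300, size=n_clips)   # clips sharing a source must stay together

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.zeros(n_clips), labels, groups=source_id))

# No source appears on both sides of the split, so test accuracy cannot be inflated
# by near-duplicate clips drawn from the same recording or artist.
assert set(source_id[train_idx]).isdisjoint(source_id[test_idx])
print(len(train_idx), len(test_idx))
```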

Circularity Check

0 steps flagged

No circularity: empirical accuracy from held-out evaluation on new dataset

full rationale

The paper's central result (98.8% accuracy for YMCM on Mel-spectrograms) is produced by standard supervised training of CNNs on time-frequency features extracted from the 1,475-clip YMIR dataset. No equations, derivations, or parameter-fitting steps are present that would reduce the reported metric to a self-referential definition, a renamed input, or a load-bearing self-citation. Dataset labeling (Fleiss κ = 0.85) and model comparisons occur under explicitly described experimental conditions without any uniqueness theorem or ansatz imported from prior author work. The evaluation chain is therefore self-contained, without reliance on external audio benchmarks, and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that expert-labeled audio clips constitute reliable ground truth and that standard time-frequency features suffice for genre discrimination; no free parameters are explicitly fitted beyond normal neural-network training.

axioms (2)
  • domain assumption: Expert annotations with Fleiss kappa 0.85 provide reliable ground-truth labels for the five genres.
    Invoked to justify the training targets and reported accuracy.
  • domain assumption: Mel-spectrograms and related features capture genre-discriminative information in Yemeni music.
    Central to the choice of input representations and the claim that YMCM outperforms other models.

pith-pipeline@v0.9.0 · 5631 in / 1261 out tokens · 45871 ms · 2026-05-10T19:46:19.371379+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

38 extracted references · 2 canonical work pages · 1 internal anchor
