
arxiv: 2601.18339 · v2 · submitted 2026-01-26 · 💻 cs.SD · cs.LG

A Dataset for Automatic Vocal Mode Classification

Pith reviewed 2026-05-16 11:11 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords: vocal mode classification, Complete Vocal Technique, singing dataset, automatic classification, ResNet18, CVT vocal modes, singing teaching, machine learning

The pith

A new dataset of over 13,000 vocal samples enables automatic classification of four singing modes with 81.3% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a dataset for classifying vocal modes in singing according to the Complete Vocal Technique. The authors recorded sustained vowels across the full vocal range from four singers using multiple microphones and had them annotated by three experienced experts. They provide baseline results showing that a ResNet18 model achieves 81.3% balanced accuracy in 5-fold cross-validation. This resource addresses the previous lack of data for developing technology to assist singing students in learning specific vocal modes. If successful, such classification could support real-time feedback in singing apps or teaching tools.

Core claim

The paper presents a novel dataset of 3,752 unique sustained vowel samples from four singers, expanded to over 13,000 samples by recording with four microphones, with annotations for the CVT vocal modes Neutral, Curbing, Overdrive, and Edge. Baseline classification with deep learning models such as ResNet18 yields a best balanced accuracy of 81.3% across 5-fold cross-validation, establishing a performance benchmark for future work on automatic vocal mode classification.
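Balanced accuracy, the metric behind the 81.3% figure, is the mean of the per-class recalls, so a classifier cannot score well just by favoring the most frequent mode. The following is a minimal sketch of that definition; the toy labels are invented for illustration and are not drawn from the dataset.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, deliberately imbalanced labels (made up, not from the dataset).
y_true = ["Neutral"] * 6 + ["Curbing"] * 2 + ["Overdrive"] + ["Edge"]
y_pred = ["Neutral"] * 6 + ["Curbing", "Neutral"] + ["Overdrive"] + ["Neutral"]

# Plain accuracy counts every sample equally.
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Balanced accuracy counts every class equally.
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain={plain:.3f} balanced={balanced:.3f}")
```

In this toy case the plain accuracy is 0.8 while the balanced accuracy is 0.625, because the single missed Edge sample costs a full class worth of recall.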

What carries the argument

The merged annotations from three CVT-experienced annotators on the multi-microphone recordings of sustained vowels serve as the labeled data for training classifiers.

If this is right

  • Automatic classification of vocal modes can support technology-assisted singing teaching.
  • The dataset enables development of models for identifying Neutral, Curbing, Overdrive, and Edge modes.
  • Multi-microphone setup provides natural data augmentation for improved model robustness.
  • Baseline results set a standard for comparing future classification approaches.
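One practical consequence of treating the four microphone signals as augmentation is that copies of the same take are near-duplicates, so an evaluation split would normally keep them in the same fold. The sketch below shows one way to do that with a group-aware k-fold assignment. The record layout (a take index plus a microphone index) and the function name `grouped_k_fold` are assumptions for illustration; the summary does not describe how the released files are organized or how the authors built their folds.

```python
import random


def grouped_k_fold(group_ids, k=5, seed=0):
    """Assign whole groups to folds; return k lists of item indices."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    # Round-robin over the shuffled groups keeps fold sizes close to equal.
    fold_of_group = {g: i % k for i, g in enumerate(groups)}
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(group_ids):
        folds[fold_of_group[g]].append(idx)
    return folds


# Hypothetical metadata: 10 takes, each captured by 4 microphones.
records = [(take, mic) for take in range(10) for mic in range(4)]
take_ids = [take for take, _ in records]

folds = grouped_k_fold(take_ids, k=5)
for i, test_idx in enumerate(folds):
    takes_in_fold = sorted({take_ids(j) if False else take_ids[j] for j in test_idx})
    print(f"fold {i}: {len(test_idx)} recordings from takes {takes_in_fold}")
```

Grouping by take only prevents leakage between microphone copies; it does not separate singers, which is a distinct question.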

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This dataset could be extended to include full songs or dynamic transitions between modes for more realistic applications.
  • Integration with mobile apps might allow singers to receive instant feedback on their vocal technique.
  • The annotation process highlights the subjectivity in vocal mode identification, suggesting potential for consensus-based or probabilistic labeling in future datasets.
  • Cross-singer generalization might be tested by training on some singers and evaluating on others.

Load-bearing premise

The merged annotations from the three experienced annotators accurately and consistently identify the intended vocal modes in the samples without significant disagreement or influence from recording conditions.

What would settle it

A study where independent CVT experts re-annotate a subset of the samples and find substantial disagreement with the provided merged labels, or where classifiers trained on the dataset fail to generalize to new singers or recording setups.

Figures

Figures reproduced from arXiv: 2601.18339 by Alexander Lange, Jörn Ostermann, Reemt Hinrichs, Sonja Stephan.

Figure 1: Number of samples per note for the entire dataset. The highest note, a …
Figure 2: Number of samples per subject and empirical cumulative distribution of …
Figure 3: Balanced accuracies on the test set across the 5-fold cross validation of all …
Figure 4: Balanced accuracy across half-octaves on the test set for the best iterations …
Figure 5: Fleiss’ kappa score across cut-off note threshold. The computation of the …
Original abstract

The Complete Vocal Technique (CVT) is a school of singing developed in the past decades by Cathrin Sadolin et al. CVT groups the use of the voice into so-called vocal modes, namely Neutral, Curbing, Overdrive, and Edge. Knowledge of the desired vocal mode can be helpful for singing students. Automatic classification of vocal modes can thus be important for technology-assisted singing teaching. Previously, automatic classification of vocal modes has been attempted without major success, potentially due to a lack of data. Therefore, we recorded a novel vocal mode dataset consisting of sustained vowels recorded from four singers, three of whom are professional singers with more than five years of CVT experience. The dataset covers the entire vocal range of the subjects, totaling 3,752 unique samples. By using four microphones, thereby offering a natural data augmentation, the dataset consists of more than 13,000 samples combined. An annotation was created using three CVT-experienced annotators, each providing an individual annotation. The merged annotation as well as the three individual annotations come with the published dataset. Additionally, we provide some baseline classification results. The best balanced accuracy across a 5-fold cross-validation, 81.3%, was achieved with a ResNet18. The dataset can be downloaded from https://zenodo.org/records/14276415.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a new dataset of sustained vowels from four singers (three of them professional singers with CVT experience) for classifying Complete Vocal Technique modes (Neutral, Curbing, Overdrive, Edge). Recordings use four microphones for natural augmentation (>13k samples total), with annotations from three CVT-experienced annotators (individual and merged versions released). Baseline supervised classification reports 81.3% balanced accuracy via 5-fold CV on a ResNet18.

Significance. A publicly released, multi-microphone vocal-mode dataset with expert annotations would address the acknowledged data scarcity in this subfield and support development of tools for singing pedagogy. The multi-annotator design and release of raw annotations are strengths that enable future work on label uncertainty. However, the baseline's evidential value for dataset utility depends on whether the reported accuracy reflects mode discrimination rather than singer identification.

major comments (2)
  1. [Baseline results] Baseline results paragraph: The 5-fold cross-validation protocol is not described as stratified by singer (or using leave-one-singer-out). With only four singers total, folds almost certainly mix samples from the same singer across train and test sets, allowing a ResNet18 to exploit stable singer-specific timbral cues (formant structure, glottal source) rather than CVT mode distinctions. This directly weakens support for the 81.3% figure as evidence of the dataset's utility for general automatic classification.
  2. [Dataset annotation] Dataset annotation subsection: No description is given of how the merged annotation was constructed from the three individual annotations (e.g., majority vote threshold, tie-breaking rule, or exclusion of high-disagreement samples), and no inter-annotator agreement statistics (Cohen's kappa, percentage agreement) are reported. This leaves the ground-truth quality unverifiable and is load-bearing for any downstream classification claims.
minor comments (2)
  1. [Abstract] Abstract: The total sample count after augmentation is stated as 'more than 13,000' but the exact breakdown by mode, singer, and microphone is not summarized, making it harder to assess class balance or coverage of the vocal range.
  2. [Dataset release] The Zenodo link is provided but the manuscript does not list the exact file structure or README contents that accompany the released annotations and raw recordings.
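The singer-leakage concern in the first major comment can be made concrete with a leave-one-singer-out split, in which every sample from one singer is held out in turn. The sketch below only builds the index splits; the singer IDs and the helper name `leave_one_group_out` are hypothetical, and nothing here reflects the authors' actual protocol.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical singer IDs for 12 samples from four singers.
singers = ["s1", "s2", "s3", "s4"] * 3

splits = list(leave_one_group_out(singers))
for held_out, train, test in splits:
    print(f"test on {held_out}: {len(train)} train / {len(test)} test samples")
```

With four singers this produces four folds, and a model evaluated this way never sees the test singer's timbre during training.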

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the manuscript to strengthen the presentation of the baseline results and annotation process.

Point-by-point responses
  1. Referee: [Baseline results] Baseline results paragraph: The 5-fold cross-validation protocol is not described as stratified by singer (or using leave-one-singer-out). With only four singers total, folds almost certainly mix samples from the same singer across train and test sets, allowing a ResNet18 to exploit stable singer-specific timbral cues (formant structure, glottal source) rather than CVT mode distinctions. This directly weakens support for the 81.3% figure as evidence of the dataset's utility for general automatic classification.

    Authors: We agree that the 5-fold CV protocol as described does not isolate singer identity and that, with only four singers, the model could exploit singer-specific cues. To provide stronger evidence of the dataset's utility for mode classification, we will add leave-one-singer-out (LOSO) cross-validation results to the revised manuscript. These will be reported alongside the existing 5-fold results for direct comparison, using the same ResNet18 architecture and balanced accuracy metric. revision: yes

  2. Referee: [Dataset annotation] Dataset annotation subsection: No description is given of how the merged annotation was constructed from the three individual annotations (e.g., majority vote threshold, tie-breaking rule, or exclusion of high-disagreement samples), and no inter-annotator agreement statistics (Cohen's kappa, percentage agreement) are reported. This leaves the ground-truth quality unverifiable and is load-bearing for any downstream classification claims.

    Authors: We will expand the Dataset annotation subsection to explicitly describe the merging procedure: a majority-vote rule across the three annotators, with ties resolved by selecting the label from the annotator with the most CVT teaching experience. We will also compute and report inter-annotator agreement using both percentage agreement and Cohen's kappa on the individual annotations. The already-released individual annotations enable users to perform additional uncertainty analyses. revision: yes
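The merging rule and agreement statistic named in the second response can be sketched as follows, assuming three label sequences of equal length. Which annotator acts as tie-breaker is an assumption here (index 0 stands in for "the annotator with the most CVT teaching experience"), the toy labels are invented, and the rule itself comes from the simulated rebuttal rather than from the paper.

```python
from collections import Counter


def merge_labels(labels, tie_breaker=0):
    """Majority vote over one sample's labels; on a tie use labels[tie_breaker]."""
    counts = Counter(labels).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    return labels[tie_breaker]


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (made up for illustration).
ann_a = ["Neutral", "Curbing", "Overdrive", "Edge"]
ann_b = ["Neutral", "Curbing", "Edge", "Edge"]
ann_c = ["Neutral", "Overdrive", "Curbing", "Edge"]

merged = [merge_labels(triple) for triple in zip(ann_a, ann_b, ann_c)]
kappa_ab = cohen_kappa(ann_a, ann_b)
print(merged, round(kappa_ab, 3))
```

Cohen's kappa is defined for pairs of annotators, so with three annotators it would be reported per pair; a multi-rater statistic such as Fleiss' kappa is the usual alternative.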

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents an empirical dataset of recorded sustained vowels from four singers, with annotations from three CVT-experienced annotators, followed by standard supervised classification baselines (ResNet18 achieving 81.3% balanced accuracy via 5-fold CV). No equations, fitted parameters, or predictions are defined in terms of themselves; the reported accuracy is a direct empirical result on the collected data rather than a quantity forced by construction or self-citation. The work contains no self-citation load-bearing steps, uniqueness theorems, or ansatzes that reduce the central claims to prior author work. The derivation chain is self-contained against external benchmarks (new recordings and standard ML evaluation).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is empirical data collection plus standard ML baselines; no new theoretical entities, fitted constants, or ad-hoc axioms are introduced beyond ordinary assumptions of audio classification.

axioms (1)
  • [standard math] Standard assumptions of supervised audio classification (i.i.d. samples, consistent labeling, convolutional networks suitable for spectrogram inputs).
    Invoked implicitly when applying ResNet18 to the audio data.

pith-pipeline@v0.9.0 · 5543 in / 1220 out tokens · 24519 ms · 2026-05-16T11:11:18.797637+00:00 · methodology


