A Dataset for Automatic Vocal Mode Classification
Pith reviewed 2026-05-16 11:11 UTC · model grok-4.3
The pith
A new dataset of over 13,000 vocal samples enables automatic classification of four singing modes with 81.3% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a novel dataset consisting of 3,752 unique sustained vowel samples from four singers, augmented to over 13,000 samples via four microphones, with annotations for the CVT vocal modes Neutral, Curbing, Overdrive, and Edge. Baseline classification using deep learning models like ResNet18 yields a best balanced accuracy of 81.3% across 5-fold cross validation, establishing a performance benchmark for future work on automatic vocal mode classification.
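The headline figure is a balanced accuracy rather than a plain accuracy. A minimal sketch of the difference follows, assuming the standard definition (the mean of per-class recall); the labels are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Toy, deliberately imbalanced labels (hypothetical, not from the dataset).
y_true = ["Neutral"] * 6 + ["Curbing"] * 2 + ["Overdrive"] + ["Edge"]
y_pred = ["Neutral"] * 6 + ["Curbing", "Neutral"] + ["Overdrive"] + ["Neutral"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy:    {plain:.3f}")
print(f"balanced accuracy: {balanced:.3f}")
```

In this toy case the majority class inflates plain accuracy, while the balanced score is pulled down by the missed Edge sample and the half-missed Curbing class.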
What carries the argument
The merged annotations from three CVT-experienced annotators on the multi-microphone recordings of sustained vowels, which serve as the labeled data for training classifiers.
If this is right
- Automatic classification of vocal modes can support technology-assisted singing teaching.
- The dataset enables development of models for identifying Neutral, Curbing, Overdrive, and Edge modes.
- Multi-microphone setup provides natural data augmentation for improved model robustness.
- Baseline results set a standard for comparing future classification approaches.
Where Pith is reading between the lines
- This dataset could be extended to include full songs or dynamic transitions between modes for more realistic applications.
- Integration with mobile apps might allow singers to receive instant feedback on their vocal technique.
- The annotation process highlights the subjectivity in vocal mode identification, suggesting potential for consensus-based or probabilistic labeling in future datasets.
- Cross-singer generalization might be tested by training on some singers and evaluating on others.
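The cross-singer test in the last point can be sketched as a leave-one-singer-out split. The singer IDs below are hypothetical placeholders; how the released metadata encodes the singer is not described on this page.

```python
def leave_one_singer_out(singer_ids):
    """Yield (held_out_singer, train_indices, test_indices), one fold per singer."""
    for held_out in sorted(set(singer_ids)):
        train = [i for i, s in enumerate(singer_ids) if s != held_out]
        test = [i for i, s in enumerate(singer_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical singer IDs for eight samples (two per singer).
singer_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]

folds = list(leave_one_singer_out(singer_ids))
for held_out, train, test in folds:
    # No singer may appear on both sides of the split.
    assert {singer_ids[i] for i in train}.isdisjoint({singer_ids[i] for i in test})
    print(held_out, "train:", train, "test:", test)
```

With four singers this gives four folds, each evaluated on a voice the model never saw during training.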
Load-bearing premise
The merged annotations from the three experienced annotators accurately and consistently identify the intended vocal modes in the samples without significant disagreement or influence from recording conditions.
What would settle it
A study where independent CVT experts re-annotate a subset of the samples and find substantial disagreement with the provided merged labels, or where classifiers trained on the dataset fail to generalize to new singers or recording setups.
Original abstract
The Complete Vocal Technique (CVT) is a school of singing developed in the past decades by Cathrin Sadolin et al. CVT groups the use of the voice into so-called vocal modes, namely Neutral, Curbing, Overdrive and Edge. Knowledge of the desired vocal mode can be helpful for singing students. Automatic classification of vocal modes can thus be important for technology-assisted singing teaching. Previously, automatic classification of vocal modes has been attempted without major success, potentially due to a lack of data. Therefore, we recorded a novel vocal mode dataset consisting of sustained vowels recorded from four singers, three of whom are professional singers with more than five years of CVT experience. The dataset covers the entire vocal range of the subjects, totaling 3,752 unique samples. By using four microphones, thereby offering a natural data augmentation, the dataset consists of more than 13,000 samples combined. An annotation was created using three CVT-experienced annotators, each providing an individual annotation. The merged annotation as well as the three individual annotations come with the published dataset. Additionally, we provide some baseline classification results. The best balanced accuracy across a 5-fold cross validation of 81.3% was achieved with a ResNet18. The dataset can be downloaded from https://zenodo.org/records/14276415.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a new dataset of sustained vowels from four singers (three professional with CVT experience) for classifying Complete Vocal Technique modes (Neutral, Curbing, Overdrive, Edge). Recordings use four microphones for natural augmentation (>13k samples total), with annotations from three CVT-experienced annotators (individual and merged versions released). Baseline supervised classification reports 81.3% balanced accuracy via 5-fold CV on a ResNet18.
Significance. A publicly released, multi-microphone vocal-mode dataset with expert annotations would address the acknowledged data scarcity in this subfield and support development of tools for singing pedagogy. The multi-annotator design and release of raw annotations are strengths that enable future work on label uncertainty. However, the baseline's evidential value for dataset utility depends on whether the reported accuracy reflects mode discrimination rather than singer identification.
major comments (2)
- [Baseline results] Baseline results paragraph: The 5-fold cross-validation protocol is not described as grouped by singer (or as using leave-one-singer-out). With only four singers total, folds almost certainly mix samples from the same singer across train and test sets, allowing a ResNet18 to exploit stable singer-specific timbral cues (formant structure, glottal source) rather than CVT mode distinctions. This directly weakens support for the 81.3% figure as evidence of the dataset's utility for general automatic classification.
- [Dataset annotation] Dataset annotation subsection: No description is given of how the merged annotation was constructed from the three individual annotations (e.g., majority vote threshold, tie-breaking rule, or exclusion of high-disagreement samples), and no inter-annotator agreement statistics (Cohen's kappa, percentage agreement) are reported. This leaves the ground-truth quality unverifiable and is load-bearing for any downstream classification claims.
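The agreement statistics requested in the second comment can be sketched as follows, assuming pairwise comparison of the three annotators (Cohen's kappa is defined for two raters at a time). The annotator names and labels are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def percent_agreement(a, b):
    """Fraction of samples on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined (division by zero) if both annotators use one identical label throughout.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for six samples by annotators "A", "B" and "C".
annotations = {
    "A": ["Neutral", "Curbing", "Overdrive", "Edge", "Neutral", "Curbing"],
    "B": ["Neutral", "Curbing", "Overdrive", "Edge", "Curbing", "Curbing"],
    "C": ["Neutral", "Overdrive", "Overdrive", "Edge", "Neutral", "Edge"],
}

results = {}
for r1, r2 in combinations(sorted(annotations), 2):
    a, b = annotations[r1], annotations[r2]
    results[(r1, r2)] = (percent_agreement(a, b), cohens_kappa(a, b))
    print(f"{r1} vs {r2}: agreement={results[(r1, r2)][0]:.3f}, kappa={results[(r1, r2)][1]:.3f}")
```

A single multi-rater statistic such as Fleiss' kappa would be an alternative to the three pairwise values; which one the authors would report is not stated.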
minor comments (2)
- [Abstract] Abstract: The total sample count after augmentation is stated as 'more than 13,000' but the exact breakdown by mode, singer, and microphone is not summarized, making it harder to assess class balance or coverage of the vocal range.
- [Dataset release] The Zenodo link is provided but the manuscript does not list the exact file structure or README contents that accompany the released annotations and raw recordings.
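The breakdown asked for in the first minor comment amounts to a few grouped counts. A minimal sketch, assuming each sample carries a mode, a singer and a microphone field; the field layout and all values are hypothetical, since the released file structure is not listed here.

```python
from collections import Counter

# Hypothetical metadata rows: (mode, singer, microphone).
samples = [
    ("Neutral", "s1", "mic1"), ("Neutral", "s1", "mic2"),
    ("Curbing", "s1", "mic1"), ("Edge", "s2", "mic1"),
    ("Edge", "s2", "mic2"), ("Overdrive", "s2", "mic1"),
]

by_mode = Counter(mode for mode, _, _ in samples)
by_mode_singer = Counter((mode, singer) for mode, singer, _ in samples)
by_mic = Counter(mic for _, _, mic in samples)

for mode, n in sorted(by_mode.items()):
    print(f"{mode:<10} {n}")
for (mode, singer), n in sorted(by_mode_singer.items()):
    print(f"{mode:<10} {singer} {n}")
```

Counts of this kind would show directly whether any mode, singer or microphone is under-represented.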
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the manuscript to strengthen the presentation of the baseline results and annotation process.
-
Referee: [Baseline results] Baseline results paragraph: The 5-fold cross-validation protocol is not described as grouped by singer (or as using leave-one-singer-out). With only four singers total, folds almost certainly mix samples from the same singer across train and test sets, allowing a ResNet18 to exploit stable singer-specific timbral cues (formant structure, glottal source) rather than CVT mode distinctions. This directly weakens support for the 81.3% figure as evidence of the dataset's utility for general automatic classification.
Authors: We agree that the 5-fold CV protocol as described does not isolate singer identity and that, with only four singers, the model could exploit singer-specific cues. To provide stronger evidence of the dataset's utility for mode classification, we will add leave-one-singer-out (LOSO) cross-validation results to the revised manuscript. These will be reported alongside the existing 5-fold results for direct comparison, using the same ResNet18 architecture and balanced accuracy metric.
Revision: yes
-
Referee: [Dataset annotation] Dataset annotation subsection: No description is given of how the merged annotation was constructed from the three individual annotations (e.g., majority vote threshold, tie-breaking rule, or exclusion of high-disagreement samples), and no inter-annotator agreement statistics (Cohen's kappa, percentage agreement) are reported. This leaves the ground-truth quality unverifiable and is load-bearing for any downstream classification claims.
Authors: We will expand the Dataset annotation subsection to explicitly describe the merging procedure: a majority-vote rule across the three annotators, with ties resolved by selecting the label from the annotator with the most CVT teaching experience. We will also compute and report inter-annotator agreement using both percentage agreement and Cohen's kappa on the individual annotations. The already-released individual annotations enable users to perform additional uncertainty analyses.
Revision: yes
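The merging rule described in the second response can be sketched as below. This follows the simulated rebuttal only and is not confirmed by the paper; the function name, the `senior_index` parameter and the example labels are hypothetical.

```python
from collections import Counter

def merge_labels(labels, senior_index=0):
    """Majority vote over one sample's three labels.

    On a tie (all three annotators disagree), fall back to the label of the
    annotator at senior_index, assumed to be the most experienced one.
    """
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tied = [label for label, c in counts if c == top_count]
    if len(tied) == 1:
        return top_label
    return labels[senior_index]

# Hypothetical per-sample labels, ordered (annotator 0, annotator 1, annotator 2).
rows = [
    ("Neutral", "Neutral", "Curbing"),  # 2:1 majority
    ("Edge", "Overdrive", "Curbing"),   # three-way tie, senior annotator decides
    ("Curbing", "Edge", "Edge"),        # majority overrules the senior annotator
]
merged = [merge_labels(r) for r in rows]
print(merged)
```

With three annotators a tie can only occur when all three labels differ, so the senior annotator's label is always one of the tied candidates.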
Circularity Check
No circularity in derivation chain
The paper presents an empirical dataset of recorded sustained vowels from four singers, with annotations from three CVT-experienced annotators, followed by standard supervised classification baselines (ResNet18 achieving 81.3% balanced accuracy via 5-fold CV). No equations, fitted parameters, or predictions are defined in terms of themselves; the reported accuracy is a direct empirical result on the collected data rather than a quantity forced by construction or self-citation. The work contains no self-citation load-bearing steps, uniqueness theorems, or ansatzes that reduce the central claims to prior author work. The derivation chain is self-contained against external benchmarks (new recordings and standard ML evaluation).
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions of supervised audio classification (i.i.d. samples, consistent labeling, convolutional networks suitable for spectrogram inputs).
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "The best balanced accuracy across a 5-fold cross validation of 81.3% was achieved with a ResNet18."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "CVT groups the use of the voice into so called vocal modes, namely Neutral, Curbing, Overdrive and Edge."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.