DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm
Pith reviewed 2026-05-25 15:39 UTC · model grok-4.3
The pith
A teacher-student paradigm iteratively improves singing voice detection to align 5358 audio tracks with lyrics and notes from karaoke data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that starting with non-expert annotations of lyrics and notes without audio, retrieving web audio candidates, using a teacher singing-voice detection model to generate probability curves, matching them to refine alignments, and then training student models on the new data allows progressive improvement in detection performance and produces a large dataset of aligned multimodal tracks.
What carries the argument
The teacher-student machine learning paradigm in which improved singing-voice detection models are used to refine audio-lyrics alignments and the refined data is used to train better models.
If this is right
- The singing voice detection system improves with each iteration.
- Audio matching and time-alignment become more accurate.
- The resulting DALI dataset provides synchronized data at multiple granularities for 5358 tracks.
- The process can be repeated to further enhance performance.
Where Pith is reading between the lines
- This approach could generalize to creating datasets for other music elements like chords or structures.
- Researchers might use the dataset to train models for automatic singing transcription or lyric synchronization.
- The method reduces reliance on expensive expert annotations by leveraging web data and iterative refinement.
Load-bearing premise
The initial non-expert karaoke annotations contain enough accurate timing information to be recovered and improved by matching against singing voice probability curves from candidate audio.
What would settle it
If expert listeners find that a significant portion of the aligned lyrics and notes in DALI do not match the audio timing within a small tolerance, the claim of reliable refinement would be false.
read the original abstract
The goal of this paper is twofold. First, we introduce DALI, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. The second goal is to explain our methodology where dataset creation and learning models interact using a teacher-student machine learning paradigm that benefits each other. We start with a set of manual annotations of draft time-aligned lyrics and notes made by non-expert users of Karaoke games. This set comes without audio. Therefore, we need to find the corresponding audio and adapt the annotations to it. To that end, we retrieve audio candidates from the Web. Each candidate is then turned into a singing-voice probability over time using a teacher, a deep convolutional neural network singing-voice detection system (SVD), trained on cleaned data. Comparing the time-aligned lyrics and the singing-voice probability, we detect matches and update the time-alignment lyrics accordingly. From this, we obtain new audio sets. They are then used to train new SVD students used to perform again the above comparison. The process could be repeated iteratively. We show that this allows to progressively improve the performances of our SVD and get better audio-matching and alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the DALI dataset comprising 5358 audio tracks with time-aligned vocal melody notes and lyrics at four granularity levels. The dataset is constructed via an iterative teacher-student paradigm: non-expert karaoke annotations (lacking audio) are matched to web-retrieved audio candidates by comparing their timestamps against singing-voice probability curves produced by a teacher SVD (deep CNN) model; matched data then trains student SVD models whose outputs are used for further matching and alignment refinement, with the loop asserted to progressively improve both SVD performance and alignment quality.
Significance. If the iterative matching and refinement process can be shown to produce accurate alignments and improved SVD models, the resulting dataset would constitute a substantial resource for music information retrieval tasks involving singing voice detection, lyric alignment, and melody transcription. The teacher-student interaction offers a potentially scalable annotation refinement strategy that reduces reliance on expert labeling.
major comments (2)
- [Abstract / matching step] Abstract (matching step description): The alignment update is performed solely by comparing draft karaoke timestamps to the scalar singing-voice probability curve output by the SVD teacher. This curve encodes only presence/absence of vocal activity and supplies no pitch, phoneme, or lyric-text information; consequently the procedure can at best perform coarse block-level shifts but cannot recover or correct intra-block note- or word-level timing offsets when the initial non-expert annotations deviate by more than a typical note duration.
- [Abstract] Abstract: The central claim that the iterative teacher-student loop 'progressively improve[s] the performances of our SVD and get[s] better audio-matching and alignment' is asserted without any quantitative metrics, error rates, alignment accuracy figures, or validation against ground-truth annotations. No tables, figures, or sections report these measurements, rendering the improvement claim unverifiable from the manuscript.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript introducing the DALI dataset. Below we respond point-by-point to the major comments, indicating where revisions will be incorporated.
read point-by-point responses
-
Referee: [Abstract / matching step] Abstract (matching step description): The alignment update is performed solely by comparing draft karaoke timestamps to the scalar singing-voice probability curve output by the SVD teacher. This curve encodes only presence/absence of vocal activity and supplies no pitch, phoneme, or lyric-text information; consequently the procedure can at best perform coarse block-level shifts but cannot recover or correct intra-block note- or word-level timing offsets when the initial non-expert annotations deviate by more than a typical note duration.
Authors: We agree that the singing-voice probability provides only binary vocal activity information and therefore supports only coarse, block-level alignment shifts rather than intra-block corrections of note- or word-level timing. The approach is intentionally designed around the granularity of the available non-expert karaoke annotations, which are themselves phrase- or block-level. We will revise the abstract and method description to explicitly characterize the alignment as coarse block-level matching and to note this as a current limitation of the scalable pipeline. revision: yes
-
Referee: [Abstract] Abstract: The central claim that the iterative teacher-student loop 'progressively improve[s] the performances of our SVD and get[s] better audio-matching and alignment' is asserted without any quantitative metrics, error rates, alignment accuracy figures, or validation against ground-truth annotations. No tables, figures, or sections report these measurements, rendering the improvement claim unverifiable from the manuscript.
Authors: The manuscript asserts improvement from the iterative process, yet we acknowledge that no quantitative metrics, tables, or validation results against ground truth are presented to support the claim of progressive gains in SVD performance or alignment quality. We will add a dedicated evaluation section (or subsection) reporting SVD metrics across iterations and any available alignment quality indicators to make the improvement claim verifiable. revision: yes
Circularity Check
No circularity; derivation relies on external anchors
full rationale
The paper's methodology begins with external non-expert karaoke annotations and web-retrieved audio candidates. An initial SVD teacher is trained on cleaned data (external to the loop). Alignment updates are performed by comparing draft timestamps to the teacher's singing-voice probability curve, after which new data trains student models for iteration. No equations are present, no parameters are fitted to a subset and then called predictions, and no self-citations are invoked as load-bearing uniqueness theorems. The process is a standard teacher-student bootstrap with independent initial inputs; the central construction does not reduce to its own outputs by definition.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Singing voice is one of the most important elements in pop- ular music. It combines its two main dimensions: melody and lyrics. Together, they tell stories and convey emo- tions improving our listening experience. Singing voice is usually the central element around which songs are com- posed. It adds a linguistic dimension that complements th...
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[2]
RELA TED WORKS We review previous works related to our work: singing voice detection methods and the teacher-student paradigm. Singing V oice detection. Most approaches share a common architecture. Short-time observations are used to train a classifier that discriminates observations (per frame) in vocal or non-vocal classes. The final stream of predic- tio...
-
[3]
trained then to obtain ideal binary masks. Teacher-student paradigm. Teacher-student learning paradigm [2,28] has appeared as a solution to overcome the problem of insufficient labeled training data in MIR. Since manual labeling is a time-consuming tasks, the teacher- student paradigm explores the use of unlabeled data for su- pervised problems. The two ma...
-
[4]
One of these sources is Karaoke video games that fit exactly our requirements
SINGING VOICE DA TASET: CREA TION 3.1 Karaoke resources Outside the MIR community there are rich sources of in- formation that can be explored. One of these sources is Karaoke video games that fit exactly our requirements. In these games, users have to sing along with the music to win points according to their singing accuracy. To mea- sure their accuracy,...
-
[5]
find the value of o and fr that provides the best alignment between ˆp(t) andavs(t),
-
[6]
Stu- dent (Teacher J train) (2673)
based on this best alignment, deciding if ˆp(t) and avs(t) actually match each other and establishing if the match is good enough to be kept. Since we are interested in a global matching between ˆp(t)∈ [0, 1] andavs(t)∈{ 0, 1} we use the normalized cross-correlation (NCC) as distance 2 : NCC (o,fr ) = ∑ tavsfr (t−o)ˆp(t)√∑ tavsfr (t)2√∑ t ˆp(t)2 The NCC p...
work page 2000
-
[7]
SINGING VOICE DA TASET: ACCESS The DALI dataset can be downloaded athttps://github. com/gabolsgabs/DALI. There, we provide the detailed de- sciption of the dataset as well as all the necessary informa- tion for using it. DALI is presented under the recommen- dation made by [19] for the description of MIR corpora. The current version of DALI is 1.0. Future...
-
[8]
CONCLUSION AND FUTURE WORKS In this paper we introduced DALI, a large and rich mul- timodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. We explained our methodology where dataset cre- ation and learning models interact using a teacher-student paradigm benefiting one-another. From...
-
[9]
”in machine learning, is more data always better than better algorithms?”
Xavier Amatriain. ”in machine learning, is more data always better than better algorithms?”. https:// bit.ly/2seQzj9
- [10]
-
[11]
FMA: A Dataset For Music Analysis
K. Benzi, M. Defferrard, P. Vandergheynst, and X. Bresson. FMA: A dataset for music analysis.CoRR, abs/1612.01840, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[12]
A. Berenzweig, D. P. W. Ellis, and S. Lawrence. Us- ing voice segments to improve artist classification of music. In AES 22, 2002
work page 2002
-
[13]
A. L. Berenzweig and D. P. W. Ellis. Locating singing voice segments within music signals. In WASPAA, pages 119–122, 2001
work page 2001
-
[14]
R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Can- nam, and J. Bello. Medleydb: A multitrack dataset for annotation-intensive mir research. In ISMIR, 2014
work page 2014
-
[15]
A. Cont, D. Schwarz, N. Schnell, and C. Raphael. Eval- uation of Real-Time Audio-to-Score Alignment. In IS- MIR, Vienna, Austria, 2007
work page 2007
-
[16]
J. Cui, B. Kingsbury, B. Ramabhadran, G. Saon, T. Sercu, K. Audhkhasi, A. Sethy, M. Nussbaum- Thom, and A. Rosenberg. Knowledge distillation across ensembles of multilingual models for low- resource languages. In ICASSP, 2017
work page 2017
-
[17]
E. Fonseca, J. Pons, X. Favory, F. Font, D. Bog- danov, A. Ferraro, S. Oramas, A. Porter, and X. Serra. Freesound datasets: A platform for the creation of open audio datasets. In ISMIR, Suzhou, China, 2017
work page 2017
-
[18]
H. Fujihara and M. Goto. Lyrics-to-Audio Alignment and its Application. In Multimodal Music Process- ing, volume 3 of Dagstuhl Follow-Ups, pages 23–36. Dagstuhl, Germany, 2012
work page 2012
-
[19]
H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno. Lyricsynchronizer: Automatic synchronization system between musical audio signals and lyrics. 5(6):1252– 1261, 2011
work page 2011
-
[20]
M. Goto. Singing information processing. In ICSP, pages 2431–2438, 2014
work page 2014
-
[21]
E. J. Humphrey, N. Montecchio, R. Bittner, A. Jansson, and T. Jehan. Mining labeled data from web-scale col- lections for vocal activity detection in music. InISMIR, 2017
work page 2017
-
[22]
S. Leglaive, R. Hennequin, and R. Badeau. Singing voice detection with deep recurrent neural networks. In IEEE, editor, ICASSP, pages 121–125, Brisbane, Aus- tralia, 2015
work page 2015
- [23]
- [24]
-
[25]
A. Mesaros. Singing voice identification and lyrics transcription for music information retrieval invited pa- per. In 7th Conference on Speech Technology and Hu- man - Computer Dialogue (SpeD), pages 1–10, 2013
work page 2013
-
[26]
G. Meseguer-Brocal, G. Peeters, G. Pellerin, M. Buffa, E. Cabrio, C. Faron Zucker, A. Giboin, I. Mirbel, R. Hennequin, M. Moussallam, F. Piccoli, and T. Fil- lon. W ASABI: a Two Million Song Database Project with Audio and Cultural Metadata plus WebAudio en- hanced Client Applications. In Web Audio Conf., Lon- don, U.K., 2017. Queen Mary University of London
work page 2017
-
[27]
G. Peeters and K. Fort. Towards a (better) Definition of Annotated MIR Corpora. In ISMIR, Porto, Portugal, 2012
work page 2012
- [28]
-
[29]
L. Regnier and G. Peeters. Singing V oice Detection in Music Tracks using Direct V oice Vibrato Detection. In ICASSP, page 1, taipei, Taiwan, 2009
work page 2009
-
[30]
J. Salamon and E. G ´omez. Melody extraction from polyphonic music signals using pitch contour char- acteristics. IEEE Transactions on Audio, Speech and Language Processing, 20:1759–1770, 2012
work page 2012
-
[31]
J. Schl ¨uter. Learning to pinpoint singing voice from weakly labeled examples. In ISMIR, New York City, USA, 2016. ISMIR
work page 2016
-
[32]
J. Schl ¨uter and T. Grill. Exploring Data Augmenta- tion for Improved Singing V oice Detection with Neural Networks. In ISMIR 2015, Malaga, Spain, 2015
work page 2015
-
[33]
A. J. R. Simpson, G. Roma, and M. D. Plumb- ley. Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network. abs/1504.04658, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
- [34]
-
[35]
S. Watanabe, T. Hori, J. Le Roux, and J. Hershey. Student-teacher network learning with enhanced fea- tures. In ICASSP, pages 5275–5279, 2017
work page 2017
- [36]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.