Acoustic Modeling for Automatic Lyrics-to-Audio Alignment

Chitralekha Gupta; Emre Yılmaz; Haizhou Li

arXiv: 1906.10369 · v1 · pith:BCZIRSS3new · submitted 2019-06-25 · eess.AS · cs.CL · cs.SD

Pith reviewed 2026-05-25 16:21 UTC · model grok-4.3

classification: eess.AS · cs.CL · cs.SD

keywords: lyrics alignment, acoustic modeling, polyphonic audio, model adaptation, singing vocals, word boundary detection, automatic alignment

The pith

Adapting solo-singing acoustic models with extra voicing and auditory features plus small polyphonic data cuts word-boundary alignment errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that lyrics-to-audio alignment in real music can be made more accurate by first adding speech and music-informed features to standard acoustic inputs and then shifting models trained on clean solo vocals toward polyphonic recordings using only a modest amount of target-domain audio. A sympathetic reader would care because background music and the scarcity of annotated polyphonic corpora have long limited reliable automatic alignment, so a workable adaptation route could make systems practical without needing massive new labeled collections. The central mechanism is domain adaptation that narrows the mismatch between training and test conditions while the added features increase robustness to the larger spectro-temporal swings found in accompanied singing. Experiments demonstrate lower alignment error rates than prior systems, with the largest gains appearing on the hardest cases that contain extended musical interludes.

Core claim

Incorporating voicing and auditory features together with conventional acoustic features brings robustness against increased spectro-temporal variations in singing vocals; adapting the acoustic model using a small amount of polyphonic audio data reduces the domain mismatch between training and testing data; the combined strategy yields a significant reduction in word-boundary alignment error over comparable existing systems, especially on challenging polyphonic data with long-duration musical interludes.

What carries the argument

Acoustic-model adaptation from large solo-singing corpora to polyphonic audio using limited in-domain data, augmented by speech- and music-informed features such as voicing and auditory descriptors.
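
As a rough illustration of the feature-augmentation half of this mechanism, the sketch below stacks frame-level voicing and auditory descriptors onto a conventional acoustic feature matrix. The function name `stack_features`, the feature dimensions, and the random placeholder values are all assumptions made for the example; this page does not state the paper's exact feature sets or extraction settings.

```python
import numpy as np

def stack_features(acoustic, voicing, auditory):
    """Concatenate per-frame feature matrices along the feature axis.

    Each input has shape (num_frames, dim). The streams must describe
    the same frames, so their frame counts have to match.
    """
    n = acoustic.shape[0]
    if voicing.shape[0] != n or auditory.shape[0] != n:
        raise ValueError("all feature streams must have the same number of frames")
    return np.concatenate([acoustic, voicing, auditory], axis=1)

# Placeholder data: dimensions are hypothetical, not taken from the paper.
rng = np.random.default_rng(0)
num_frames = 200
acoustic = rng.normal(size=(num_frames, 40))   # e.g. MFCC-like features
voicing = rng.normal(size=(num_frames, 3))     # e.g. voicing-related descriptors
auditory = rng.normal(size=(num_frames, 26))   # e.g. auditory-band descriptors

combined = stack_features(acoustic, voicing, auditory)
print(combined.shape)
```

The combined matrix would then be fed to the acoustic model in place of the conventional features alone; the adaptation half of the mechanism is a separate training step and is not shown here.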

If this is right

Alignment systems become usable on real-world polyphonic recordings without requiring large annotated polyphonic corpora.
Error rates drop most on segments containing extended instrumental passages.
The same adaptation recipe can be applied to other singing-related tasks that suffer from domain shift between solo and accompanied audio.
Word-boundary precision improves enough to support downstream applications that rely on accurate timing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other music genres or languages to check whether the adaptation gain generalizes beyond the evaluated data.
If the small-data adaptation works reliably, similar transfer methods might reduce labeling costs for related audio tasks such as singing transcription.
Systems built this way could feed more accurate timing into music-search or karaoke tools that currently struggle with accompanied vocals.

Load-bearing premise

A small amount of in-domain polyphonic audio data suffices to adapt solo-singing models without introducing overfitting.

What would settle it

A controlled test in which the same small polyphonic adaptation set produces no error reduction or an increase in word-boundary error on held-out polyphonic tracks.
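
A minimal sketch of how such a test could be scored, assuming word-boundary error is measured as the absolute difference between reference and predicted word onset times. The function names and the toy timestamps are invented for illustration and are not results from the paper.

```python
def mean_abs_boundary_error(reference, predicted):
    """Mean absolute difference (in seconds) between paired word boundaries."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty boundary lists of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)

def fraction_within(reference, predicted, tolerance=1.0):
    """Share of boundaries whose absolute error is at most `tolerance` seconds."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty boundary lists of equal length")
    hits = sum(1 for r, p in zip(reference, predicted) if abs(r - p) <= tolerance)
    return hits / len(reference)

# Toy word onsets in seconds (hypothetical values).
reference = [0.50, 1.20, 2.00, 3.10]
baseline = [0.90, 1.90, 2.80, 4.20]   # model before adaptation
adapted = [0.55, 1.30, 2.10, 3.40]    # model after adaptation

err_before = mean_abs_boundary_error(reference, baseline)
err_after = mean_abs_boundary_error(reference, adapted)
print(f"before: {err_before:.3f} s, after: {err_after:.3f} s")
print("adaptation helped on held-out data:", err_after < err_before)
```

The premise would be in trouble if, on held-out polyphonic tracks, the error after adaptation were consistently equal to or larger than the error before it.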

Figures

Figures reproduced from arXiv: 1906.10369 by Chitralekha Gupta, Emre Yılmaz, Haizhou Li.

**Figure 1.** Comparison of word boundary alignment error distribution between C1 on extracted vocals test data and C6 on polyphonic test data on (a) Hansen’s and (b) Mauch’s datasets.

Original abstract

Automatic lyrics to polyphonic audio alignment is a challenging task not only because the vocals are corrupted by background music, but also there is a lack of annotated polyphonic corpus for effective acoustic modeling. In this work, we propose (1) using additional speech and music-informed features and (2) adapting the acoustic models trained on a large amount of solo singing vocals towards polyphonic music using a small amount of in-domain data. Incorporating additional information such as voicing and auditory features together with conventional acoustic features aims to bring robustness against the increased spectro-temporal variations in singing vocals. By adapting the acoustic model using a small amount of polyphonic audio data, we reduce the domain mismatch between training and testing data. We perform several alignment experiments and present an in-depth alignment error analysis on acoustic features, and model adaptation techniques. The results demonstrate that the proposed strategy provides a significant error reduction of word boundary alignment over comparable existing systems, especially on more challenging polyphonic data with long-duration musical interludes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies voicing/auditory features plus solo-to-polyphonic adaptation for lyrics alignment and claims error drops, but the abstract supplies no numbers, data sizes, or adaptation controls, leaving the main result hard to evaluate.

The core move is straightforward: take acoustic models trained on solo singing, add speech and music-informed features like voicing and auditory cues, then adapt them to polyphonic audio with a small in-domain set. The abstract says this cuts word-boundary errors, especially on tracks with long instrumental sections, over prior systems. That addresses a real practical issue in music audio where background music messes up alignment.

The feature addition and domain adaptation step are reasonable extensions of existing work in singing voice processing rather than a new framework. The paper does a service by focusing on the domain mismatch and testing on more challenging polyphonic cases. The in-depth error analysis mentioned sounds useful if it breaks down where the gains come from.

The soft spot is exactly the one the stress-test note flags. The abstract gives no count of how much polyphonic data was used for adaptation, no description of the adaptation procedure, and no sign of validation splits or regularization. Without those, it is impossible to know whether the reported improvement reflects better handling of music interference or just fitting to a handful of examples. The claim of significant reduction therefore rests on unshown evidence. If the full paper includes tables with absolute error rates, baseline comparisons, dataset durations, and adaptation details, that would change the picture; the abstract alone does not.

This work is aimed at researchers in music information retrieval who already work on lyrics alignment or singing voice tasks. A reader who needs a practical tweak for polyphonic conditions might pick up the feature set or adaptation idea. It is not a foundational advance, so it would not change how most labs approach acoustic modeling.

The paper shows clear thinking on the problem setup and honest engagement with the domain shift, so it deserves a serious referee who can ask for the missing experimental controls and check whether the adaptation holds up. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that automatic lyrics-to-polyphonic audio alignment can be improved by (1) augmenting conventional acoustic features with speech- and music-informed features such as voicing and auditory features to increase robustness to spectro-temporal variation, and (2) adapting acoustic models pretrained on large solo-singing corpora to the polyphonic domain using only a small amount of in-domain polyphonic data. Experiments and error analysis are said to demonstrate significant reductions in word-boundary alignment error relative to existing systems, with the largest gains on challenging polyphonic excerpts containing long musical interludes.

Significance. If the adaptation procedure can be shown to succeed without overfitting, the work would offer a practical route to mitigating the domain mismatch between solo-singing training data and polyphonic test conditions without requiring large annotated polyphonic corpora, which is a recognized bottleneck in the field.

major comments (2)

[Methods / Adaptation procedure] The central claim rests on successful domain adaptation with a small in-domain polyphonic set, yet the manuscript supplies neither the exact duration or number of polyphonic clips used for adaptation, the adaptation algorithm (full fine-tuning, layer-wise, MAP, etc.), nor any description of held-out validation, early stopping, or regularization. Without these controls it is impossible to rule out that observed gains arise from memorization rather than improved robustness to background music.
[Abstract and Results] The abstract asserts that the results 'demonstrate that the proposed strategy provides a significant error reduction of word boundary alignment over comparable existing systems, especially on more challenging polyphonic data,' but reports no numerical error rates, baseline comparisons, dataset sizes, or statistical tests. The absence of these quantities prevents assessment of whether the reported gains are substantial or merely suggestive.
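
The controls requested in the first major comment (a held-out validation set and early stopping) could look roughly like the sketch below. This is one common recipe, not the authors' procedure, which this page does not describe; `split_by_song`, `best_epoch`, the song identifiers, and the loss values are all hypothetical.

```python
import random

def split_by_song(song_ids, val_fraction=0.2, seed=0):
    """Split songs into adaptation and validation sets with no song in both.

    Splitting at the song level (rather than the clip level) keeps clips
    from one recording out of both sets at once.
    """
    unique = sorted(set(song_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_val = max(1, int(round(len(unique) * val_fraction)))
    return set(unique[n_val:]), set(unique[:n_val])

def best_epoch(val_losses, patience=2):
    """Return the epoch index that early stopping would keep.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is kept.
    """
    best, best_idx, waited = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx

# Hypothetical clip-to-song labels and validation losses.
clips = ["song_a", "song_a", "song_b", "song_c", "song_d", "song_e", "song_e"]
adapt_songs, val_songs = split_by_song(clips)
print(sorted(adapt_songs), sorted(val_songs))
print(best_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.6]))
```

Reporting the size of both sets and the epoch at which adaptation stopped would let a reader judge whether the gains reflect robustness or memorization.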

minor comments (1)

[Abstract] The abstract would be strengthened by the inclusion of at least one key quantitative result (e.g., absolute or relative word-boundary error reduction) to substantiate the headline claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested details.

Referee: [Methods / Adaptation procedure] The central claim rests on successful domain adaptation with a small in-domain polyphonic set, yet the manuscript supplies neither the exact duration or number of polyphonic clips used for adaptation, the adaptation algorithm (full fine-tuning, layer-wise, MAP, etc.), nor any description of held-out validation, early stopping, or regularization. Without these controls it is impossible to rule out that observed gains arise from memorization rather than improved robustness to background music.

Authors: We agree that the manuscript would benefit from greater specificity on the adaptation procedure. In the revised version we will report the exact duration and number of polyphonic clips used for adaptation, the precise adaptation algorithm (including whether full fine-tuning or another method was applied), and any held-out validation, early-stopping, or regularization steps that were performed. These additions will allow readers to evaluate the risk of overfitting more rigorously. revision: yes
Referee: [Abstract and Results] The abstract asserts that the results 'demonstrate that the proposed strategy provides a significant error reduction of word boundary alignment over comparable existing systems, especially on more challenging polyphonic data,' but reports no numerical error rates, baseline comparisons, dataset sizes, or statistical tests. The absence of these quantities prevents assessment of whether the reported gains are substantial or merely suggestive.

Authors: The Experiments section already contains the numerical error rates, baseline comparisons, dataset sizes, and statistical tests. To improve accessibility we will revise the abstract to include the key quantitative results and an explicit reference to the statistical significance of the observed improvements. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical adaptation results are externally grounded

The paper describes a standard pipeline of incorporating additional speech/music features and adapting solo-singing acoustic models on a small polyphonic corpus, then reports measured word-boundary error reductions on held-out test data. No equations, uniqueness theorems, or self-citations are invoked to derive the performance gains; the gains are presented as outcomes of experiments whose validity rests on external acoustic-modeling literature and reproducible evaluation protocols rather than any definitional or fitted-input reduction internal to the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that voicing and auditory features add robustness beyond conventional acoustic features, and that limited in-domain data suffices for effective adaptation. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Additional voicing and auditory features increase robustness against spectro-temporal variations in polyphonic singing vocals.
Invoked when proposing the use of speech and music-informed features to handle background music interference.

domain assumption: Acoustic models trained on solo singing can be effectively adapted to polyphonic conditions with only a small amount of in-domain data.
Invoked in the model adaptation step to reduce domain mismatch.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1] Acoustic Modeling for Automatic Lyrics-to-Audio Alignment

Introduction. The goal of an automatic lyrics-to-audio alignment algorithm is the time synchronization between the lyrics and the singing vocals with or without background music. It potentially enables various applications such as generating karaoke scrolling lyrics, music video subtitling, and music retrieval. The task of lyrics-to-audio alignment is ofte...

[2] musical surface

Speech and music-informed features. Speech and singing have many similarities because they share the underlying physiological mechanisms for production, such as articulatory movements in vocal production [14, 15]. Modern ASR systems use conventional acoustic features such as mel-scaled cepstral coefficients (MFCC) to capture the phonetic aspects in conj...

[3] Model adaptation for domain mismatch. Our goal is to build a framework to automatically align lyrics to the polyphonic music audio. With an acoustic model trained on solo-singing data, we can adapt the model towards the test data in two ways: (a) by making the test data closer to the trained solo-singing acoustic models by applying vocal separation on poly...

[4] It was created with a set of initial manual annotations of time-aligned lyrics made by non-expert users of Karaoke games, where the audio was not available

was introduced, that consists of 5,000+ polyphonic songs with note annotations and weak word-level, line-level, and paragraph-level lyrics annotations. It was created with a set of initial manual annotations of time-aligned lyrics made by non-expert users of Karaoke games, where the audio was not available. The corresponding audio candidates were then ...

[5] Experimental setup. We conduct two sets of experiments to study the impact of our proposed acoustic modeling strategies for lyrics alignment: (1) we first assess the effect of the speech and music informed features on lyrics alignment in solo-singing, and (2) then we investigate the effects of these features in polyphonic music lyrics alignment, along w...

[6] We use the DALI ground-truth data for domain adaptation of the acoustic models to the poly- Table 2: Dataset description

The training data for solo-singing acoustic modeling is approximately 50 hours of the DAMP dataset [5, 9] that has weak line-level lyrics transcription. We use the DALI ground-truth data for domain adaptation of the acoustic models to the poly- Table 2: Dataset description. (solo: solo-singing; poly: singing mixed with music) Name Audio type Content Lyr...

[7] Performance on solo-singing In the first set of experiments, we explore the effect of each of the speech and music informed feature groups combined with MFCCs and i-vectors

Results and discussion 5.1. Performance on solo-singing In the first set of experiments, we explore the effect of each of the speech and music informed feature groups combined with MFCCs and i-vectors. The alignment results provided by different feature configurations on the Hansen’s solo-singing dataset is shown in Table 4. Training the solo-singing acou...

[8] Conclusions. In this study, we discuss two strategies to obtain improved acoustic modeling for the task of lyrics-to-audio alignment. Particularly, we propose to (1) employ additional features with speech- and music-related information together with conventional MFCCs, and (2) adapt solo-singing acoustic model using small amount of in-domain polyphonic...

[9] Acknowledgments. This research is supported by Ministry of Education, Singapore AcRF Tier 1 NUS Start-up Grant FY2016, Non-parametric approach to voice morphing
[10] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, “Lyricsynchronizer: Automatic synchronization system between musical audio signals and lyrics,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1252–1261, 2011.

[11] M. Mauch, H. Fujihara, and M. Goto, “Integrating additional chord information into HMM-based lyrics-to-audio alignment,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 1, pp. 200–210, 2012.

[12] M. McVicar, D. P. Ellis, and M. Goto, “Leveraging repetition for improved automatic lyric transcription in popular music,” in Proc. ICASSP, 2014, pp. 3117–3121.

[13] A. Mesaros and T. Virtanen, “Automatic recognition of lyrics in singing,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, p. 4, 2010.

[14] C. Gupta, R. Tong, H. Li, and Y. Wang, “Semi-supervised lyrics and solo-singing alignment,” in Proc. ISMIR, 2018.

[15] M. Ramona, G. Richard, and B. David, “Vocal detection in music with support vector machines,” in Proc. ICASSP. IEEE, 2008, pp. 1885–1888.

[16] A. M. Kruspe, “Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing,” in Proc. ISMIR, 2016, pp. 358–364.

[17] G. B. Dzhambazov and X. Serra, “Modeling of phoneme durations for alignment between polyphonic audio and lyrics,” in 12th Sound and Music Computing Conference, 2015, pp. 281–286.

[18] S. Sing!, “Smule.digital archive mobile performances (damp),” https://ccrma.stanford.edu/damp/, 2010 (accessed March 15, 2018).

[19] C. Gupta, H. Li, and Y. Wang, “Automatic pronunciation evaluation of singing,” Proc. INTERSPEECH, pp. 1507–1511, 2018.

[20] B. Sharma, C. Gupta, H. Li, and Y. Wang, “Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models,” in Proc. ICASSP. IEEE, 2019.

[21] C.-C. Wang, “Mirex2018: Lyrics-to-audio alignment for instrument accompanied singings,” in MIREX 2018, 2018.

[22] D. Stoller, S. Durand, and S. Ewert, “End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model,” in Proc. ICASSP. IEEE, 2019.

[23] R. J. Zatorre and S. R. Baum, “Musical melody and speech intonation: Singing a different tune,” PLoS Biology, vol. 10, no. 7, p. e1001372, 2012.

[24] S. Zhang, R. C. Repetto, and X. Serra, “Study of the similarity between linguistic tones and melodic pitch contours in Beijing opera singing,” in Proc. ISMIR, 2014, pp. 343–348.

[25] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.

[26] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

[27] J. Park, D. Kim, J. Lee, S. Kum, and J. Nam, “A hybrid of deep audio feature and i-vector for artist recognition,” arXiv preprint arXiv:1807.09208, 2018.

[28] M. Mandel and D. Ellis, “Song-level features and support vector machines for music classification,” in Proc. ISMIR, 2005.

[29] H. Fujihara and M. Goto, “Lyrics-to-audio alignment and its application,” in Dagstuhl Follow-Ups, vol. 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2012.

[30] A. Loscos, P. Cano, and J. Bonada, “Low-delay singing voice alignment to text,” in Proc. ICMC, 1999.

[31] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the Munich versatile and fast open-source audio feature extractor,” in Proc. ACM Multimedia. ACM, 2010, pp. 1459–1462.

[32] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., “The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism,” in Proc. INTERSPEECH, 2013.

[33] F. A. Raposo, D. M. de Matos, and R. Ribeiro, “An information-theoretic approach to machine-oriented music summarization,” Pattern Recognition Letters, 2019.

[34] A. Alajanki, Y.-H. Yang, and M. Soleymani, “Benchmarking music emotion recognition systems,” PLOS ONE, pp. 835–838, 2016.

[35] J. Böhm, F. Eyben, M. Schmitt, H. Kosch, and B. Schuller, “Seeking the superstar: Automatic assessment of perceived singing quality,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1560–1569.

[36] M. J. Carey, E. S. Parris, and H. Lloyd-Thomas, “A comparison of features for speech, music discrimination,” in Proc. ICASSP, vol. 1. IEEE, 1999, pp. 149–152.

[37] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero-crossings,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155–166, 2005.

[38] H. Hermansky and N. Morgan, “Rasta processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.

[39] T. George, E. Georg, and C. Perry, “Automatic musical genre classification of audio signals,” in Proc. ISMIR, 2001.

[40] D. Ellis, “Classifying music audio with timbral and chroma features,” in Proc. ISMIR, 2007.

[41] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, “Dali: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm,” in Proc. ISMIR, 2018.

[42] J. K. Hansen, “Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients,” in 9th Sound and Music Computing Conference (SMC), 2012, pp. 494–499.

[43] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.

[44] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. INTERSPEECH, 2018, pp. 3743–3747.

[45] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH, 2015, pp. 3586–3589.

[46] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, Dec 2013, pp. 55–59.

[47] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. INTERSPEECH, 2016, pp. 2751–2755.

[48] D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” in Proc. ISMIR, 2018.

[49] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, April 2014.

[50] G. Dzhambazov, “Knowledge-based probabilistic modeling for tracking lyrics in music audio signals,” Ph.D. dissertation, Universitat Pompeu Fabra, 2017.

[1] [1]

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment

Introduction The goal of an automatic lyrics-to-audio alignment algorithm is the time synchronization between the lyrics and the singing vocals with or without background music. It potentially enables various applications such as generating karaoke scrolling lyrics, music video subtitling, and music retrieval. The task of lyrics-to-audio alignment is ofte...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

musical surface

Speech and music-informed features Speech and singing have many similarities because they share the underlying physiological mechanisms for production, such as articulatory movements in vocal production [14, 15]. Mod- ern ASR systems use conventional acoustic features such as mel-scaled cepstral coefﬁcients (MFCC) to capture the pho- netic aspects in conj...

work page

[3] [3]

Model adaptation for domain mismatch Our goal is to build a framework to automatically align lyrics to the polyphonic music audio. With an acoustic model trained on solo-singing data, we can adapt the model towards the test data in two ways: (a) by making the test data closer to the trained solo-singing acoustic models by applying vocal separation on poly...

work page

[4] [4]

It was created with a set of initial manual annotations of time-aligned lyrics made by non- expert users of Karaoke games, where the audio was not avail- able

was introduced, that consists of 5,000+ polyphonic songs with note annotations and weak word-level, line-level, and paragraph-level lyrics annotations. It was created with a set of initial manual annotations of time-aligned lyrics made by non- expert users of Karaoke games, where the audio was not avail- able. The corresponding audio candidates were then ...

work page

[5] [5]

Experimental setup We conduct two sets of experiments to study the impact of our proposed acoustic modeling strategies for lyrics alignment: (1) we ﬁrst assess the effect of the speech and music informed fea- tures on lyrics alignment in solo-singing, and (2) then we inves- tigate the effects of these features in polyphonic music lyrics alignment, along w...

work page

[6] [6]

We use the DALI ground-truth data for domain adaptation of the acoustic models to the poly- Table 2: Dataset description

The training data for solo-singing acoustic modeling is ap- proximately 50 hours of the DAMP dataset [5, 9] that has weak line-level lyrics transcription. We use the DALI ground-truth data for domain adaptation of the acoustic models to the poly- Table 2: Dataset description. (solo: solo-singing; poly: singing mixed with music) Name Audio type Content Lyr...

work page 2017

[7] [7]

Performance on solo-singing In the ﬁrst set of experiments, we explore the effect of each of the speech and music informed feature groups combined with MFCCs and i-vectors

Results and discussion 5.1. Performance on solo-singing In the ﬁrst set of experiments, we explore the effect of each of the speech and music informed feature groups combined with MFCCs and i-vectors. The alignment results provided by differ- ent feature conﬁgurations on the Hansen’s solo-singing dataset is shown in Table 4. Training the solo-singing acou...

work page

[8] [8]

Conclusions In this study, we discuss two strategies to obtain improved acoustic modeling for the task of lyrics-to-audio alignment. Par- ticularly, we propose to (1) employ additional features with speech- and music-related information together with conven- tional MFCCs, and (2) adapt solo-singing acoustic model using small amount of in-domain polyphonic...


Acknowledgments

This research is supported by the Ministry of Education, Singapore AcRF Tier 1 NUS Start-up Grant FY2016, Non-parametric approach to voice morphing.


[10] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, “Lyricsynchronizer: Automatic synchronization system between musical audio signals and lyrics,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1252–1261, 2011

[11] M. Mauch, H. Fujihara, and M. Goto, “Integrating additional chord information into HMM-based lyrics-to-audio alignment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 200–210, 2012

[12] M. McVicar, D. P. Ellis, and M. Goto, “Leveraging repetition for improved automatic lyric transcription in popular music,” in Proc. ICASSP, 2014, pp. 3117–3121

[13] A. Mesaros and T. Virtanen, “Automatic recognition of lyrics in singing,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, p. 4, 2010

[14] C. Gupta, R. Tong, H. Li, and Y. Wang, “Semi-supervised lyrics and solo-singing alignment,” in Proc. ISMIR, 2018

[15] M. Ramona, G. Richard, and B. David, “Vocal detection in music with support vector machines,” in Proc. ICASSP. IEEE, 2008, pp. 1885–1888

[16] A. M. Kruspe, “Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing,” in Proc. ISMIR, 2016, pp. 358–364

[17] G. B. Dzhambazov and X. Serra, “Modeling of phoneme durations for alignment between polyphonic audio and lyrics,” in 12th Sound and Music Computing Conference, 2015, pp. 281–286

[18] S. Sing!, “Smule. Digital archive mobile performances (DAMP),” https://ccrma.stanford.edu/damp/, 2010 (accessed March 15, 2018)

[19] C. Gupta, H. Li, and Y. Wang, “Automatic pronunciation evaluation of singing,” in Proc. INTERSPEECH, 2018, pp. 1507–1511

[20] B. Sharma, C. Gupta, H. Li, and Y. Wang, “Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models,” in Proc. ICASSP. IEEE, 2019

[21] C.-C. Wang, “Mirex2018: Lyrics-to-audio alignment for instrument accompanied singings,” in MIREX 2018, 2018

[22] D. Stoller, S. Durand, and S. Ewert, “End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model,” in Proc. ICASSP. IEEE, 2019

[23] R. J. Zatorre and S. R. Baum, “Musical melody and speech intonation: Singing a different tune,” PLoS Biology, vol. 10, no. 7, p. e1001372, 2012

[24] S. Zhang, R. C. Repetto, and X. Serra, “Study of the similarity between linguistic tones and melodic pitch contours in Beijing opera singing,” in Proc. ISMIR, 2014, pp. 343–348

[25] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011

[26] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002

[27] J. Park, D. Kim, J. Lee, S. Kum, and J. Nam, “A hybrid of deep audio feature and i-vector for artist recognition,” arXiv preprint arXiv:1807.09208, 2018

[28] M. Mandel and D. Ellis, “Song-level features and support vector machines for music classification,” in Proc. ISMIR, 2005

[29] H. Fujihara and M. Goto, “Lyrics-to-audio alignment and its application,” in Dagstuhl Follow-Ups, vol. 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2012

[30] A. Loscos, P. Cano, and J. Bonada, “Low-delay singing voice alignment to text,” in Proc. ICMC, 1999

[31] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the Munich versatile and fast open-source audio feature extractor,” in Proc. ACM Multimedia. ACM, 2010, pp. 1459–1462

[32] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., “The Interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism,” in Proc. INTERSPEECH, 2013

[33] F. A. Raposo, D. M. de Matos, and R. Ribeiro, “An information-theoretic approach to machine-oriented music summarization,” Pattern Recognition Letters, 2019

[34] A. Alajanki, Y.-H. Yang, and M. Soleymani, “Benchmarking music emotion recognition systems,” PLOS ONE, pp. 835–838, 2016

[35] J. Böhm, F. Eyben, M. Schmitt, H. Kosch, and B. Schuller, “Seeking the superstar: Automatic assessment of perceived singing quality,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1560–1569

[36] M. J. Carey, E. S. Parris, and H. Lloyd-Thomas, “A comparison of features for speech, music discrimination,” in Proc. ICASSP, vol. 1. IEEE, 1999, pp. 149–152

[37] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero-crossings,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155–166, 2005

[38] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994

[39] G. Tzanetakis, G. Essl, and P. Cook, “Automatic musical genre classification of audio signals,” in Proc. ISMIR, 2001

[40] D. Ellis, “Classifying music audio with timbral and chroma features,” in Proc. ISMIR, 2007

[41] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, “DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm,” in Proc. ISMIR, 2018

[42] J. K. Hansen, “Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients,” in 9th Sound and Music Computing Conference (SMC), 2012, pp. 494–499

[43] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011

[44] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. INTERSPEECH, 2018, pp. 3743–3747

[45] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH, 2015, pp. 3586–3589

[46] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, Dec 2013, pp. 55–59

[47] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. INTERSPEECH, 2016, pp. 2751–2755

[48] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” in Proc. ISMIR, 2018

[49] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, April 2014

[50] G. Dzhambazov, “Knowledge-based probabilistic modeling for tracking lyrics in music audio signals,” Ph.D. dissertation, Universitat Pompeu Fabra, 2017