Cover Detection using Dominant Melody Embeddings
Pith reviewed 2026-05-25 09:40 UTC · model grok-4.3
The pith
A neural network creates single embedding vectors from dominant melody to detect covers via simple distance calculations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A neural network architecture trained to represent each track as a single embedding vector extracted from its dominant melody representation improves state-of-the-art accuracy on small and large datasets and scales to query databases of thousands of tracks in a few seconds, with the computation burden shifted to offline embedding extraction.
What carries the argument
Neural network that maps a track's dominant melody representation to a single embedding vector whose Euclidean distances identify covers.
If this is right
- Embeddings can be extracted and stored offline, leaving only fast distance computations at query time.
- The approach raises accuracy on both small and large datasets compared with earlier methods.
- Databases of thousands of tracks can be queried in seconds, because each pairwise comparison reduces to a Euclidean distance between stored embeddings.
- Dominant melody alone supplies enough information for reliable cover identification.
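The offline/online split described in the points above can be illustrated with a minimal, dependency-free sketch. The names `build_index`, `query` and `toy_embed` are hypothetical, and `toy_embed` is only a stand-in for the trained network, which this page does not specify.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_index(tracks, embed):
    """Offline step: run the (expensive) embedding function once per track and store it."""
    return {track_id: embed(features) for track_id, features in tracks.items()}

def query(index, query_embedding, k=10):
    """Online step: rank stored tracks by distance to the query embedding."""
    scored = [(track_id, euclidean(query_embedding, emb)) for track_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

def toy_embed(features):
    """Assumption: placeholder for the trained embedding network (identity mapping)."""
    return list(features)

# Made-up 2-D "embeddings" purely for illustration.
tracks = {"original": [0.0, 0.0], "cover": [0.1, 0.0], "unrelated": [3.0, 4.0]}
index = build_index(tracks, toy_embed)
results = query(index, [0.0, 0.0], k=2)
```

In this toy setup the query cost is one distance per stored track; the expensive part (here `toy_embed`) runs once per track at indexing time.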
Where Pith is reading between the lines
- The same embedding approach might transfer to other retrieval tasks where melody carries the main identity signal.
- If dominant-melody embeddings prove robust, full-spectrum audio features could be dropped from some similarity pipelines.
- Real-time cover detection on streaming platforms becomes feasible once embeddings are precomputed.
Load-bearing premise
Dominant melody representations contain the essential information needed to identify covers, and embeddings learned from them generalize across datasets and cover styles.
What would settle it
Running the embedding method on a large cover-song dataset and finding no accuracy gain over previous state-of-the-art algorithms would falsify the central claim.
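A comparison of this kind needs a ranking metric. The sketch below assumes two conventional definitions, the rank of the first correct cover and precision at k; the paper excerpts further down this page mention MR1 and P@10 but do not define them, so these functions are an assumption rather than the authors' exact metrics. All data are made up.

```python
def first_hit_rank(ranked_ids, relevant_ids):
    """1-based rank of the first true cover in a ranked list, or None if none is present."""
    for rank, track_id in enumerate(ranked_ids, start=1):
        if track_id in relevant_ids:
            return rank
    return None

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k results that are true covers of the query."""
    return sum(1 for track_id in ranked_ids[:k] if track_id in relevant_ids) / k

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy ground truth and ranked outputs for two queries (illustrative only).
relevant = {"q1": {"a", "b"}, "q2": {"c"}}
baseline = {"q1": ["x", "a", "y", "b"], "q2": ["x", "y", "c"]}
candidate = {"q1": ["a", "b", "x", "y"], "q2": ["c", "x", "y"]}

def summarize(system, k=2):
    """Return (mean first-hit rank, mean precision at k); assumes every query has a hit."""
    ranks = [first_hit_rank(system[q], relevant[q]) for q in relevant]
    precisions = [precision_at_k(system[q], relevant[q], k=k) for q in relevant]
    return mean(ranks), mean(precisions)

baseline_summary = summarize(baseline)
candidate_summary = summarize(candidate)
```

Under these definitions a lower mean first-hit rank and a higher precision at k both indicate better retrieval, so the claim would be in trouble if the candidate failed to improve on either for a large dataset.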
Original abstract
Automatic cover detection -- the task of finding in an audio database all the covers of one or several query tracks -- has long been seen as a challenging theoretical problem in the MIR community and as an acute practical problem for authors and composers societies. Original algorithms proposed for this task have proven their accuracy on small datasets, but are unable to scale up to modern real-life audio corpora. On the other hand, faster approaches designed to process thousands of pairwise comparisons resulted in lower accuracy, making them unsuitable for practical use. In this work, we propose a neural network architecture that is trained to represent each track as a single embedding vector. The computation burden is therefore left to the embedding extraction -- that can be conducted offline and stored, while the pairwise comparison task reduces to a simple Euclidean distance computation. We further propose to extract each track's embedding out of its dominant melody representation, obtained by another neural network trained for this task. We then show that this architecture improves state-of-the-art accuracy both on small and large datasets, and is able to scale to query databases of thousands of tracks in a few seconds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a neural network architecture for automatic cover detection. A first NN extracts the dominant melody from audio tracks; a second NN then maps these melody representations to fixed embedding vectors. Cover detection reduces to Euclidean distance computation on the precomputed embeddings. The authors claim this yields higher accuracy than prior state-of-the-art methods on both small and large datasets while scaling to query databases of thousands of tracks in seconds.
Significance. If the empirical claims hold, the work supplies a practical, scalable solution to cover detection by moving heavy computation offline. The dominant-melody embedding choice is a distinctive design decision that could lower both storage and query cost; explicit credit is due for framing the problem around offline embedding extraction rather than repeated pairwise comparisons.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the central claim that the architecture 'improves state-of-the-art accuracy both on small and large datasets' is stated without any dataset names, sizes, baseline methods, quantitative metrics, or error analysis. This absence prevents verification that the data actually support the accuracy claim.
- [§3 and §4] §3 (Architecture) and §4: the two-stage pipeline assumes that the dominant-melody representation retains all information needed to separate covers from non-covers. No ablation comparing melody-only embeddings against embeddings learned from full spectrograms (or other representations) is reported, leaving the information-bottleneck assumption untested and load-bearing for the accuracy claim.
minor comments (1)
- [§3] Notation for the embedding dimension and distance metric is introduced without an explicit equation or table; adding a short table of hyper-parameters would improve clarity.
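One way the requested notation could be written is shown below. The symbols are assumptions introduced here for illustration, not the paper's own: $m_a$ is the dominant melody representation of track $a$, $f_\theta$ the embedding network, and $D$ the embedding dimension.

```latex
\[
  e_a = f_\theta(m_a) \in \mathbb{R}^{D},
  \qquad
  d(a, b) = \lVert e_a - e_b \rVert_2
          = \sqrt{\sum_{i=1}^{D} \left( e_{a,i} - e_{b,i} \right)^{2}}
\]
```

Tracks $a$ and $b$ would then be treated as likely covers when $d(a, b)$ is small relative to distances between unrelated tracks.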
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that the architecture 'improves state-of-the-art accuracy both on small and large datasets' is stated without any dataset names, sizes, baseline methods, quantitative metrics, or error analysis. This absence prevents verification that the data actually support the accuracy claim.
Authors: The referee is correct that the abstract itself contains no dataset names, sizes, baselines, or metrics. While the full Section 4 supplies these experimental details, we agree that the central claim would be easier to evaluate if the abstract were more informative. We will revise the abstract to name the datasets, state their sizes, identify the main baselines, and report the key quantitative improvements. revision: yes
-
Referee: [§3 and §4] §3 (Architecture) and §4: the two-stage pipeline assumes that the dominant-melody representation retains all information needed to separate covers from non-covers. No ablation comparing melody-only embeddings against embeddings learned from full spectrograms (or other representations) is reported, leaving the information-bottleneck assumption untested and load-bearing for the accuracy claim.
Authors: We agree that the absence of an ablation study leaves the core design assumption untested. The manuscript reports no direct comparison between melody-only embeddings and embeddings derived from full spectrograms. We will add an ablation experiment in the revised version that trains an otherwise identical embedding network on full spectrograms and compares its cover-detection performance to the melody-based model. revision: yes
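The promised ablation amounts to a paired comparison: both models rank the same queries, and the per-query results are compared. The sketch below uses an exact two-sided sign test, which is a technique swapped in here for illustration; the page does not say which statistical test the authors use. The function name `sign_test` and all rank values are hypothetical.

```python
from math import comb

def sign_test(ranks_a, ranks_b):
    """Exact two-sided sign test on paired per-query ranks (lower rank is better).

    Returns (wins_a, wins_b, p_value). Tied queries are dropped.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both systems must be evaluated on the same queries")
    wins_a = sum(1 for a, b in zip(ranks_a, ranks_b) if a < b)
    wins_b = sum(1 for a, b in zip(ranks_a, ranks_b) if b < a)
    n = wins_a + wins_b
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)

# Made-up first-hit ranks for eight queries under each input representation.
melody_ranks = [1, 1, 2, 1, 3, 1, 1, 2]
spectrogram_ranks = [2, 3, 2, 4, 5, 2, 6, 3]
wins_melody, wins_spectrogram, p_value = sign_test(melody_ranks, spectrogram_ranks)
```

A paired test is used because both models see the same queries, so per-query difficulty cancels out of the comparison.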
Circularity Check
No circularity; empirical NN architecture evaluated on independent datasets
Full rationale
The paper proposes a two-stage neural architecture (dominant melody extractor followed by embedding network) and reports accuracy improvements via training and testing on small/large cover datasets. No equations, predictions, or first-principles derivations are present that reduce to fitted inputs by construction. Claims rest on experimental results rather than self-definitional loops, renamed patterns, or load-bearing self-citations. The derivation chain is self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
to which work this track belongs to ?
INTRODUCTION Covers are different interpretations of the same original musical work. They usually share a similar melodic line, but typically differ greatly in one or several other dimensions, such as their structure, tempo, key, instrumentation, genre, etc. Automatic cover detection – the task of finding in an audio database all the covers of one or sev...
-
[2]
RELATED WORK We review here the main concepts used in this study. 2.1 Cover detection Successful approaches in cover detection used an input representation preserving common musical facets between different versions, in particular dominant melody [19, 27, 40], tonal progression – typically a sequence of chromas [10, 12, 33, 39] or chords [2], or a fusion ...
-
[3]
PROPOSED METHOD We present here the input data used to train our network, the network architecture itself and its training loss. 3.1 Input data We have used as input data the dominant melody 2D representation (F0-CQT) obtained by the network we proposed in [9]. The frequency and time resolutions required for melody extraction (60 bins per octave and 11 ...
-
[4]
PRELIMINARY EXPERIMENTS We present here some experiments conducted to develop the system. The 7460 works were split into disjoint train and evaluation sets, with respectively 6216 and 1244 works and five covers per work. The evaluation set represents ~20% of the training set, which we considered fair enough given the total amount of covers. The same split ...
-
[5]
LARGE SCALE LOOKUP EXPERIMENTS We now present experiments investigating the realistic use case, i.e. large audio collections lookup. When querying an audio collection, each query track can be of three kinds: a) it is already present in the database, b) it is a cover of some other track(s) already in the database, or c) it is a track that has no cover in...
-
[6]
This would yield to MT@10=2.0, and MR1=80.2. This kind of discrepancy between MR1 and MT@10 reflects the fact that some works in our dataset have similar covers that are easily clustered, while other are much more difficult to discriminate. This can be observed on the positive pairs distribution pc(d) on Figure 7 (left), which is spread over a large ran...
-
[7]
COMPARISON WITH OTHER METHODS 6.1 Comparison on small dataset We first compared with two recent methods [31, 34], who reported results for a small dataset of 50 works with 7 covers each. The query set includes five covers of each work (250 tracks), while the reference set includes each work’s remaining two covers (100 tracks). As this dataset is not publi...
-
[8]
As there are only two covers per work in the reference set, P@10 maximum value is 0.2)
0.648 0.145 8.270 Proposeda) 0.675 0.165 3.439 (0.04), p=.29 (0.005), p<.001(1.062), p<.001 Proposedb) 0.782 0.179 2.618 (0.104), p<.01(0.014), p<.001(1.351), p<.001 Table 2: Comparison between recent method [31, 34] and our proposed method on a small dataset (precision at 10 P@10 is reported instead of MT@10. As there are only two covers per work in the ...
-
[9]
0.285 1844 - - Proposed 0.936 78 2.010 33 (0.001), p<.001(6), p<.001 (<0.001) (3) b)
-
[10]
0.134 173117 - - Proposed 0.220 3865 1.622 430 (0.007), p<.001(81), p<.001 (0.003) (19) Table 3: Comparison between method [16] and our proposed method on a large dataset (MR=Mean rank). For b), the MR percentile should be compared, as our reference set does not have 1M tracks (6th vs. 17th percentile for [16]). Our method significantly improve previous re...
-
[11]
CONCLUSION In this work, we presented a method for cover detection, using a convolutional network which encodes each track as a single vector, and is trained to minimize cover pairs Euclidean distance in the embeddings space, while maximizing it for non-covers. We show that extracting embeddings out of the dominant melody 2D representation drastically...
-
[12]
Pierre Baldi and Yves Chauvin. Neural networks for fingerprint recognition. Neural Computation, 5(3):402–418, 1993.
-
[13]
Juan Pablo Bello. Audio-based cover song retrieval using approximate chord sequences: Testing shifts, gaps, swaps and beats. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2007.
-
[14]
Thierry Bertin-Mahieux and Daniel PW Ellis. Large-scale cover song recognition using hashed chroma landmarks. In Proceedings of IEEE WASPAA (Workshop on Applications of Signal Processing to Audio and Acoustics), pages 117–120. IEEE, 2011.
-
[15]
Thierry Bertin-Mahieux and Daniel PW Ellis. Large-scale cover song recognition using the 2d fourier transform magnitude. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2012.
-
[16]
Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. Proceedings of ISMIR (International Society of Music Information Retrieval), 2011.
-
[17]
Anil Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–109, 1943.
-
[18]
Rachel M Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P Bello. Deep salience representations for f0 estimation in polyphonic music. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2017.
-
[19]
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
-
[20]
Guillaume Doras, Philippe Esling, and Geoffroy Peeters. On the use of u-net for dominant melody estimation in polyphonic music. In International Workshop on Multilayer Music Representation and Processing (MMRP), pages 66–70. IEEE, 2019.
-
[21]
Daniel PW Ellis and Graham E Poliner. Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2007.
-
[22]
Rémi Foucard, Jean-Louis Durrieu, Mathieu Lagrange, and Gaël Richard. Multimodal similarity between musical streams for cover version detection. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2010.
-
[23]
Emilia Gómez and Perfecto Herrera. The song remains the same: identifying versions of the same piece using tonal descriptors. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2006.
-
[24]
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), volume 2, pages 1735–1742. IEEE, 2006.
-
[25]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–
-
[26]
Eric J Humphrey, Juan Pablo Bello, and Yann LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2012.
-
[27]
Eric J Humphrey, Oriol Nieto, and Juan Pablo Bello. Data driven and discriminative projections for large-scale cover song identification. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2013.
-
[28]
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-
[29]
Anssi Klapuri. Multiple fundamental frequency estimation by summing harmonic amplitudes. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2006.
-
[30]
Matija Marolt. A mid-level representation for melody-based retrieval in audio collections. IEEE Transactions on Multimedia, 10(8):1617–1625, 2008.
-
[31]
Benjamin Martin, Daniel G Brown, Pierre Hanna, and Pascal Ferraro. Blast for audio sequences alignment: a fast scalable cover identification. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2012.
-
[32]
Brian McFee, Luke Barrington, and Gert Lanckriet. Learning content similarity for music recommendation. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2207–2218, 2012.
-
[33]
Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, pages 18–25, 2015.
-
[34]
Xiaoyu Qi, Deshun Yang, and Xiaoou Chen. Triplet convolutional network for music version identification. In International Conference on Multimedia Modeling, pages 544–555. Springer, 2018.
-
[35]
Colin Raffel and Daniel PW Ellis. Pruning subsequence search with attention-based embedding. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2016.
-
[36]
Suman Ravuri and Daniel PW Ellis. Cover song detection: from high scores to general classification. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2010.
-
[37]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-
[38]
Christian Sailer and Karin Dressler. Finding cover songs by melodic similarity. MIREX extended abstract, 2006.
-
[39]
Justin Salamon and Emilia Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, 2012.
-
[40]
Justin Salamon, Joan Serra, and Emilia Gómez. Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1):45–58, 2013.
-
[41]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), pages 815–823, 2015.
-
[42]
Prem Seetharaman and Zafar Rafii. Cover song identification with 2d fourier transform sequences. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2017.
-
[43]
Joan Serrà. Music similarity based on sequences of descriptors tonal features applied to audio cover song identification. PhD thesis, Universitat Pompeu Fabra, Spain, 2007.
-
[44]
Xavier Serra, Ralph G Andrzejak, et al. Cross recurrence quantification for cover song identification. New Journal of Physics, 11(9):093017, 2009.
-
[45]
Diego F Silva, Chin-Chin M Yeh, Gustavo Enrique de Almeida Prado Alves Batista, Eamonn Keogh, et al. Simple: assessing music similarity using subsequences joins. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2016.
-
[46]
Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 118–126, 2015.
-
[47]
Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
-
[48]
Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), pages 4004–4012. IEEE, 2016.
-
[49]
TJ Tsai, Thomas Prätzlich, and Meinard Müller. Known artist live song id: A hashprint approach. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2016.
-
[50]
Wei-Ho Tsai, Hung-Ming Yu, Hsin-Min Wang, et al. Query-by-example technique for retrieving cover versions of popular songs with similar melodies. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2005.
-
[51]
Wei-Ho Tsai, Hung-Ming Yu, Hsin-Min Wang, and Jorng-Tzong Horng. Using the similarity of main melodies to identify cover versions of popular songs for music document retrieval. Journal of Information Science & Engineering, 24(6), 2008.
-
[52]
Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems, pages 2643–2651, 2013.
-
[53]
Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):528–537, 2010.
-
[54]
Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 521–528, 2003.