pith. sign in

arxiv: 1907.01824 · v1 · pith:VZ5VSZPQnew · submitted 2019-07-03 · 💻 cs.SD · cs.LG· stat.ML

Cover Detection using Dominant Melody Embeddings

Pith reviewed 2026-05-25 09:40 UTC · model grok-4.3

classification 💻 cs.SD cs.LGstat.ML
keywords cover detectiondominant melodyneural embeddingsmusic information retrievalaudio similaritycover song identification
0
0 comments X

The pith

A neural network creates single embedding vectors from dominant melody to detect covers via simple distance calculations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve cover detection by training one neural network to turn each track's dominant melody into a fixed embedding vector. Pairwise comparisons then reduce to Euclidean distance on these precomputed vectors instead of heavy audio processing at query time. This setup is shown to raise accuracy above prior methods on both small and large datasets while handling thousands of tracks in seconds. A reader would care because earlier accurate algorithms could not scale and earlier scalable algorithms lost accuracy, leaving a gap for real-world music databases.

Core claim

A neural network architecture trained to represent each track as a single embedding vector extracted from its dominant melody representation improves state-of-the-art accuracy on small and large datasets and scales to query databases of thousands of tracks in a few seconds, with the computation burden shifted to offline embedding extraction.

What carries the argument

Neural network that maps a track's dominant melody representation to a single embedding vector whose Euclidean distances identify covers.

If this is right

  • Embeddings can be extracted and stored offline, leaving only fast distance computations at query time.
  • The approach raises accuracy on both small and large datasets compared with earlier methods.
  • Databases of thousands of tracks can be queried in seconds rather than requiring exhaustive pairwise processing.
  • Dominant melody alone supplies enough information for reliable cover identification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding approach might transfer to other retrieval tasks where melody carries the main identity signal.
  • If dominant-melody embeddings prove robust, full-spectrum audio features could be dropped from some similarity pipelines.
  • Real-time cover detection on streaming platforms becomes feasible once embeddings are precomputed.

Load-bearing premise

Dominant melody representations contain the essential information needed to identify covers, and embeddings learned from them generalize across datasets and cover styles.

What would settle it

Running the embedding method on a large cover-song dataset and finding no accuracy gain over previous state-of-the-art algorithms would falsify the central claim.

read the original abstract

Automatic cover detection -- the task of finding in an audio database all the covers of one or several query tracks -- has long been seen as a challenging theoretical problem in the MIR community and as an acute practical problem for authors and composers societies. Original algorithms proposed for this task have proven their accuracy on small datasets, but are unable to scale up to modern real-life audio corpora. On the other hand, faster approaches designed to process thousands of pairwise comparisons resulted in lower accuracy, making them unsuitable for practical use. In this work, we propose a neural network architecture that is trained to represent each track as a single embedding vector. The computation burden is therefore left to the embedding extraction -- that can be conducted offline and stored, while the pairwise comparison task reduces to a simple Euclidean distance computation. We further propose to extract each track's embedding out of its dominant melody representation, obtained by another neural network trained for this task. We then show that this architecture improves state-of-the-art accuracy both on small and large datasets, and is able to scale to query databases of thousands of tracks in a few seconds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a neural network architecture for automatic cover detection. A first NN extracts the dominant melody from audio tracks; a second NN then maps these melody representations to fixed embedding vectors. Cover detection reduces to Euclidean distance computation on the precomputed embeddings. The authors claim this yields higher accuracy than prior state-of-the-art methods on both small and large datasets while scaling to query databases of thousands of tracks in seconds.

Significance. If the empirical claims hold, the work supplies a practical, scalable solution to cover detection by moving heavy computation offline. The dominant-melody embedding choice is a distinctive design decision that could lower both storage and query cost; explicit credit is due for framing the problem around offline embedding extraction rather than repeated pairwise comparisons.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim that the architecture 'improves state-of-the-art accuracy both on small and large datasets' is stated without any dataset names, sizes, baseline methods, quantitative metrics, or error analysis. This absence prevents verification that the data actually support the accuracy claim.
  2. [§3 and §4] §3 (Architecture) and §4: the two-stage pipeline assumes that the dominant-melody representation retains all information needed to separate covers from non-covers. No ablation comparing melody-only embeddings against embeddings learned from full spectrograms (or other representations) is reported, leaving the information-bottleneck assumption untested and load-bearing for the accuracy claim.
minor comments (1)
  1. [§3] Notation for the embedding dimension and distance metric is introduced without an explicit equation or table; adding a short table of hyper-parameters would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that the architecture 'improves state-of-the-art accuracy both on small and large datasets' is stated without any dataset names, sizes, baseline methods, quantitative metrics, or error analysis. This absence prevents verification that the data actually support the accuracy claim.

    Authors: The referee is correct that the abstract itself contains no dataset names, sizes, baselines, or metrics. While the full Section 4 supplies these experimental details, we agree that the central claim would be easier to evaluate if the abstract were more informative. We will revise the abstract to name the datasets, state their sizes, identify the main baselines, and report the key quantitative improvements. revision: yes

  2. Referee: [§3 and §4] §3 (Architecture) and §4: the two-stage pipeline assumes that the dominant-melody representation retains all information needed to separate covers from non-covers. No ablation comparing melody-only embeddings against embeddings learned from full spectrograms (or other representations) is reported, leaving the information-bottleneck assumption untested and load-bearing for the accuracy claim.

    Authors: We agree that the absence of an ablation study leaves the core design assumption untested. The manuscript reports no direct comparison between melody-only embeddings and embeddings derived from full spectrograms. We will add an ablation experiment in the revised version that trains an otherwise identical embedding network on full spectrograms and compares its cover-detection performance to the melody-based model. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical NN architecture evaluated on independent datasets

full rationale

The paper proposes a two-stage neural architecture (dominant melody extractor followed by embedding network) and reports accuracy improvements via training and testing on small/large cover datasets. No equations, predictions, or first-principles derivations are present that reduce to fitted inputs by construction. Claims rest on experimental results rather than self-definitional loops, renamed patterns, or load-bearing self-citations. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5718 in / 952 out tokens · 31009 ms · 2026-05-25T09:40:25.223223+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 2 internal anchors

  1. [1]

    to which work this track belongs to ?

    INTRODUCTION Covers are different interpretations of the same original musical work. They usually share a similar melodic line, but typically differ greatly in one or several other dimen- sions, such as their structure, tempo, key, instrumentation, genre, etc. Automatic cover detection – the task of finding in an audio database all the covers of one or sev...

  2. [2]

    RELATED WORK We review here the main concepts used in this study. 2.1 Cover detection Successful approaches in cover detection used an input representation preserving common musical facets between different versions, in particular dominant melody [19, 27, 40], tonal progression – typically a sequence of chromas [10, 12, 33, 39] or chords [2], or a fusion ...

  3. [3]

    3.1 Input data We have used as input data the dominant melody 2D repre- sentation (F0-CQT) obtained by the network we proposed in [9]

    PROPOSED METHOD We present here the input data used to train our network, the network architecture itself and its training loss. 3.1 Input data We have used as input data the dominant melody 2D repre- sentation (F0-CQT) obtained by the network we proposed in [9]. The frequency and time resolutions required for melody extraction (60 bins per octave and 11 ...

  4. [4]

    The 7460 works were split into disjoint train and evaluation sets, with respectively 6216 and 1244 works and five covers per work

    PRELIMINARY EXPERIMENTS We present here some experiments conducted to develop the system. The 7460 works were split into disjoint train and evaluation sets, with respectively 6216 and 1244 works and five covers per work. The evaluation set represents ~20% of the training set, which we considered fair enough given the total amount of covers. The same split ...

  5. [5]

    large audio collections lookup

    LARGE SCALE LOOKUP EXPERIMENTS We now present experiments investigating the realistic use case, i.e. large audio collections lookup. When query- ing an audio collection, each query track can be of three kinds: a) it is already present in the database, b) it is a cover of some other track(s) already in the database, or c) it is a track that has no cover in...

  6. [6]

    This would yield to MT@10=2.0, and MR1=80.2. This kind of discrepancy between MR1 and MT@10 re- flects the fact that some works in our dataset have similar covers that are easily clustered, while other are much more difficult to discriminate. This can be observed on the pos- itive pairs distribution pc(d) on Figure 7 (left), which is spread over a large ran...

  7. [7]

    The query set includes five covers of each work (250 tracks), while the reference set includes each work’s remaining two covers (100 tracks)

    COMPARISON WITH OTHER METHODS 6.1 Comparison on small dataset We first compared with two recent methods [31, 34], who reported results for a small dataset of 50 works with 7 cov- ers each. The query set includes five covers of each work (250 tracks), while the reference set includes each work’s remaining two covers (100 tracks). As this dataset is not publi...

  8. [8]

    As there are only two covers per work in the reference set, P@10 maximum value is 0.2)

    0.648 0.145 8.270 Proposeda) 0.675 0.165 3.439 (0.04), p=.29 (0.005), p<.001(1.062), p<.001 Proposedb) 0.782 0.179 2.618 (0.104), p<.01(0.014), p<.001(1.351), p<.001 Table 2: Comparison between recent method [31, 34] and our proposed method on a small dataset (precision at 10 P@10 is reported instead of MT@10. As there are only two covers per work in the ...

  9. [9]

    0.285 1844 - - Proposed 0.936 78 2.010 33 (0.001), p<.001(6), p<.001 (<0.001) (3) b)

  10. [10]

    For b), the MR percentile should be compared, as our reference set does not have 1M tracks (6th vs

    0.134 173117 - - Proposed 0.220 3865 1.622 430 (0.007), p<.001(81), p<.001 (0.003) (19) Table 3: Comparison between method [16] and our proposed method on a large dataset (MR=Mean rank). For b), the MR percentile should be compared, as our reference set does not have 1M tracks (6th vs. 17th percentile for [16]). Our method significantly improve previous re...

  11. [11]

    CONCLUSION In this work, we presented a method for cover detection, using a convolutional network which encodes each track as a single vector, and is trained to minimize cover pairs Euclidean distance in the embeddings space, while max- imizing it for non-covers. We show that extracting em- beddings out of the dominant melody 2D representation drastically...

  12. [12]

    Neural networks for fingerprint recognition

    Pierre Baldi and Yves Chauvin. Neural networks for fingerprint recognition. Neural Computation , 5(3):402–418, 1993

  13. [13]

    Audio-based cover song retrieval us- ing approximate chord sequences: Testing shifts, gaps, swaps and beats

    Juan Pablo Bello. Audio-based cover song retrieval us- ing approximate chord sequences: Testing shifts, gaps, swaps and beats. In Proceedings of ISMIR (Interna- tional Society of Music Information Retrieval), 2007

  14. [14]

    Large- scale cover song recognition using hashed chroma landmarks

    Thierry Bertin-Mahieux and Daniel PW Ellis. Large- scale cover song recognition using hashed chroma landmarks. In Proceedings of IEEE WASPAA (Work- shop on Applications of Signal Processing to Audio and Acoustics), pages 117–120. IEEE, 2011

  15. [15]

    Large- scale cover song recognition using the 2d fourier trans- form magnitude

    Thierry Bertin-Mahieux and Daniel PW Ellis. Large- scale cover song recognition using the 2d fourier trans- form magnitude. In Proceedings of ISMIR (Interna- tional Society of Music Information Retrieval), 2012

  16. [16]

    The million song dataset

    Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whit- man, and Paul Lamere. The million song dataset. Pro- ceedings of ISMIR (International Society of Music In- formation Retrieval), 2011

  17. [17]

    On a measure of divergence be- tween two statistical populations defined by their prob- ability distributions

    Anil Bhattacharyya. On a measure of divergence be- tween two statistical populations defined by their prob- ability distributions. Bull. Calcutta Math. Soc., 35:99– 109, 1943

  18. [18]

    Deep salience representations for f0 estimation in polyphonic music

    Rachel M Bittner, Brian McFee, Justin Salamon, Pe- ter Li, and Juan P Bello. Deep salience representations for f0 estimation in polyphonic music. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2017

  19. [19]

    Signature verifica- tion using a "siamese" time delay neural network

    Jane Bromley, Isabelle Guyon, Yann LeCun, Ed- uard Säckinger, and Roopak Shah. Signature verifica- tion using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems , pages 737–744, 1994

  20. [20]

    On the use of u-net for dominant melody es- timation in polyphonic music

    Guillaume Doras, Philippe Esling, and Geoffroy Peeters. On the use of u-net for dominant melody es- timation in polyphonic music. In International Work- shop on Multilayer Music Representation and Process- ing (MMRP), pages 66–70. IEEE, 2019

  21. [21]

    Identifying- cover songs with chroma features and dynamic pro- gramming beat tracking

    Daniel PW Ellis and Graham E Poliner. Identifying- cover songs with chroma features and dynamic pro- gramming beat tracking. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2007

  22. [22]

    Multimodal similarity between mu- sical streams for cover version detection

    Rémi Foucard, Jean-Louis Durrieu, Mathieu Lagrange, and Gäel Richard. Multimodal similarity between mu- sical streams for cover version detection. In Proceed- ings of ICASSP (International Conference on Acous- tics, Speech and Signal Processing). IEEE, 2010

  23. [23]

    The song remains the same: identifying versions of the same piece using tonal descriptors

    Emilia Gómez and Perfecto Herrera. The song remains the same: identifying versions of the same piece using tonal descriptors. In Proceedings of ISMIR (Interna- tional Society of Music Information Retrieval), 2006

  24. [24]

    Dimen- sionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimen- sionality reduction by learning an invariant mapping. In Proceedings of IEEE CVPR (Conference on Com- puter Vision and Pattern Recognition), volume 2, pages 1735–1742. IEEE, 2006

  25. [25]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision , pages 630–

  26. [26]

    Moving beyond feature design: Deep architec- tures and automatic feature learning in music informat- ics

    Eric J Humphrey, Juan Pablo Bello, and Yann Le- Cun. Moving beyond feature design: Deep architec- tures and automatic feature learning in music informat- ics. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2012

  27. [27]

    Data driven and discriminative projections for large- scale cover song identification

    Eric J Humphrey, Oriol Nieto, and Juan Pablo Bello. Data driven and discriminative projections for large- scale cover song identification. In Proceedings of IS- MIR (International Society of Music Information Re- trieval), 2013

  28. [28]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  29. [29]

    Multiple fundamental frequency esti- mation by summing harmonic amplitudes

    Anssi Klapuri. Multiple fundamental frequency esti- mation by summing harmonic amplitudes. In Proceed- ings of ISMIR (International Society of Music Informa- tion Retrieval), 2006

  30. [30]

    A mid-level representation for melody- based retrieval in audio collections.IEEE Transactions on Multimedia, 10(8):1617–1625, 2008

    Matija Marolt. A mid-level representation for melody- based retrieval in audio collections.IEEE Transactions on Multimedia, 10(8):1617–1625, 2008

  31. [31]

    Blast for audio sequences alignment: a fast scalable cover identification

    Benjamin Martin, Daniel G Brown, Pierre Hanna, and Pascal Ferraro. Blast for audio sequences alignment: a fast scalable cover identification. In Proceedings of ISMIR (International Society of Music Information Re- trieval), 2012

  32. [32]

    Learning content similarity for music recommenda- tion

    Brian McFee, Luke Barrington, and Gert Lanckriet. Learning content similarity for music recommenda- tion. IEEE Transactions on Audio, Speech, and Lan- guage Processing, 20(8):2207–2218, 2012

  33. [33]

    librosa: Audio and music signal analysis in python

    Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference , pages 18–25, 2015

  34. [34]

    Triplet convolutional network for music version identification

    Xiaoyu Qi, Deshun Yang, and Xiaoou Chen. Triplet convolutional network for music version identification. In International Conference on Multimedia Modeling , pages 544–555. Springer, 2018

  35. [35]

    Pruning subse- quence search with attention-based embedding

    Colin Raffel and Daniel PW Ellis. Pruning subse- quence search with attention-based embedding. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2016

  36. [36]

    Cover song de- tection: from high scores to general classification

    Suman Ravuri and Daniel PW Ellis. Cover song de- tection: from high scores to general classification. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2010

  37. [37]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention , pages 234–241. Springer, 2015

  38. [38]

    Finding cover songs by melodic similarity.MIREX extended abstract, 2006

    Christian Sailer and Karin Dressler. Finding cover songs by melodic similarity.MIREX extended abstract, 2006

  39. [39]

    Melody extrac- tion from polyphonic music signals using pitch contour characteristics

    Justin Salamon and Emilia Gómez. Melody extrac- tion from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, 2012

  40. [40]

    Tonal representations for music retrieval: from version iden- tification to query-by-humming

    Justin Salamon, Joan Serra, and Emilia Gómez. Tonal representations for music retrieval: from version iden- tification to query-by-humming. International Jour- nal of Multimedia Information Retrieval , 2(1):45–58, 2013

  41. [41]

    Facenet: A unified embedding for face recog- nition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recog- nition and clustering. In Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recogni- tion), pages 815–823, 2015

  42. [42]

    Cover song iden- tification with 2d fourier transform sequences

    Prem Seetharaman and Zafar Rafii. Cover song iden- tification with 2d fourier transform sequences. In Proceedings of ICASSP (International Conference on Acoustics, Speech and Signal Processing). IEEE, 2017

  43. [43]

    Music similarity based on sequences of de- scriptors tonal features applied to audio cover song identification

    Joan Serrà. Music similarity based on sequences of de- scriptors tonal features applied to audio cover song identification. PhD thesis, Universitat Pompeu Fabra, Spain, 2007

  44. [44]

    Cross recur- rence quantification for cover song identification.New Journal of Physics, 11(9):093017, 2009

    Xavier Serra, Ralph G Andrzejak, et al. Cross recur- rence quantification for cover song identification.New Journal of Physics, 11(9):093017, 2009

  45. [45]

    Simple: assessing music similarity using subsequences joins

    Diego F Silva, Chin-Chin M Yeh, Gustavo Enrique de Almeida Prado Alves Batista, Eamonn Keogh, et al. Simple: assessing music similarity using subsequences joins. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2016

  46. [46]

    Discriminative learning of deep convolutional feature point descriptors

    Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE Interna- tional Conference on Computer Vision, pages 118–126, 2015

  47. [47]

    Improved deep metric learning with multi-class n-pair loss objective

    Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neu- ral Information Processing Systems, pages 1857–1865, 2016

  48. [48]

    Deep metric learning via lifted structured feature embedding

    Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Sil- vio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recog- nition), pages 4004–4012. IEEE, 2016

  49. [49]

    Known artist live song id: A hashprint approach

    TJ Tsai, Thomas Prätzlich, and Meinard Müller. Known artist live song id: A hashprint approach. In Proceedings of ISMIR (International Society of Music Information Retrieval), 2016

  50. [50]

    Query-by-example technique for retrieving cover ver- sions of popular songs with similar melodies

    Wei-Ho Tsai, Hung-Ming Yu, Hsin-Min Wang, et al. Query-by-example technique for retrieving cover ver- sions of popular songs with similar melodies. In Pro- ceedings of ISMIR (International Society of Music In- formation Retrieval), 2005

  51. [51]

    Using the similarity of main melodies to identify cover versions of popular songs for music document retrieval

    Wei-Ho Tsai, Hung-Ming Yu, Hsin-Min Wang, and Jorng-Tzong Horng. Using the similarity of main melodies to identify cover versions of popular songs for music document retrieval. Journal of Information Science & Engineering, 24(6), 2008

  52. [52]

    Deep content-based music recommenda- tion

    Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommenda- tion. In Advances in neural information processing sys- tems, pages 2643–2651, 2013

  53. [53]

    Adaptive harmonic spectral decomposition for mul- tiple pitch estimation

    Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for mul- tiple pitch estimation. IEEE Transactions on Au- dio, Speech and Language Processing, 18(3):528–537, 2010

  54. [54]

    Distance metric learning with applica- tion to clustering with side-information

    Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with applica- tion to clustering with side-information. In Advances in Neural Information Processing Systems, pages 521– 528, 2003