MERIT: Learning Disentangled Music Representations for Audio Similarity

Abhinaba Roy; Dorien Herremans; Junyi Liang

arxiv: 2605.27346 · v1 · pith:TXLB2H3Snew · submitted 2026-05-26 · 💻 cs.SD

Pith reviewed 2026-06-29 15:37 UTC · model grok-4.3

classification 💻 cs.SD

keywords music similarity · disentangled representations · melody · rhythm · timbre · audio embeddings · factor isolation · machine learning

The pith

MERIT learns separate heads for melody, rhythm, and timbre by training on data with isolated factor variations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current music similarity models output one entangled score that mixes distinct dimensions. MERIT introduces three heads, each dedicated to melody, rhythm, or timbre. A training strategy generates synthetic data through conditional audio generation and source-separated stems so that only one factor changes per example. Evaluations show each head activates strongly for its target dimension and stays near chance on the others. The same selective response appears when the heads are tested on independent real-world audio.

Core claim

MERIT is a framework for learning disentangled, factor-specific music representations tailored to melody, rhythm, and timbre. To overcome the lack of isolated musical variations in real-world audio, the method uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Evaluations demonstrate strong factor-wise disentanglement where each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a property that holds across both the synthetic training domain and independent real-world audio.

What carries the argument

Factor-specific heads in a multi-head architecture, trained on single-factor variation data produced by conditional audio generation and stem separation.
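As a rough illustration of what factor-specific heads mean in practice, the sketch below applies three separate projections to one shared embedding and returns one cosine similarity per factor instead of a single mixed score. Everything in it is an assumption made for illustration: the names `make_head`, `project`, and `factor_similarities`, the random linear weights standing in for trained heads, and the 8-dimensional toy embeddings are all hypothetical, and the paper's actual backbone, head design, and dimensions are not specified in this summary.

```python
import math
import random

FACTORS = ("melody", "rhythm", "timbre")

def make_head(in_dim, out_dim, rng):
    """Random linear projection standing in for a trained head (hypothetical)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def project(head, x):
    """Apply the linear head to a shared embedding x."""
    return [sum(w * v for w, v in zip(row, x)) for row in head]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def factor_similarities(heads, emb_a, emb_b):
    """Return one similarity score per factor rather than one entangled score."""
    return {
        f: cosine(project(heads[f], emb_a), project(heads[f], emb_b))
        for f in FACTORS
    }

rng = random.Random(0)
heads = {f: make_head(8, 4, rng) for f in FACTORS}
emb_a = [rng.gauss(0.0, 1.0) for _ in range(8)]  # toy embedding, not real audio
emb_b = [rng.gauss(0.0, 1.0) for _ in range(8)]
scores = factor_similarities(heads, emb_a, emb_b)
```

A query that targets a single dimension would then rank candidates by `scores["melody"]`, `scores["rhythm"]`, or `scores["timbre"]` alone.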

If this is right

Users can issue similarity queries that target only one musical dimension at a time.
Similarity scores become interpretable as separate contributions from melody, rhythm, or timbre.
The disentanglement property transfers from synthetic training data to real-world recordings.
Applications gain the ability to control which musical aspects drive recommendations or search results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-factor training strategy could be adapted to disentangle other audio domains such as speech or environmental sounds.
The heads might serve as controllable conditioning signals inside music generation systems.
Adding further factors such as harmony would test whether the isolation effect scales beyond the three dimensions examined.

Load-bearing premise

The conditional audio generation and stem separation process produces training data with strongly isolated single-factor variations that induce the claimed disentanglement.

What would settle it

Test the heads on real audio examples in which only one factor, such as tempo, is systematically altered while melody and timbre are held constant, and check whether activation remains selective to the rhythm head.
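One hedged way to operationalize this check is to measure, for each head, how much similarity drops between an original clip and its tempo-altered version. The sketch below is a toy under stated assumptions: items are hypothetical `(tempo_bpm, midi_pitch)` tuples rather than audio, and `rhythm_sim` and `melody_sim` are made-up stand-ins for the trained heads. A selective outcome is one where the rhythm head shows the largest drop and the other heads stay close to zero.

```python
def mean_drop(similarity, pairs):
    """Average of 1 - similarity(original, altered) over all pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(1.0 - similarity(orig, alt) for orig, alt in pairs) / len(pairs)

def most_affected_head(heads, pairs):
    """Return the head with the largest mean drop, plus all per-head drops."""
    drops = {name: mean_drop(fn, pairs) for name, fn in heads.items()}
    return max(drops, key=drops.get), drops

# Hypothetical stand-ins for trained heads; items are (tempo_bpm, midi_pitch).
def rhythm_sim(x, y):
    return 1.0 / (1.0 + abs(x[0] - y[0]) / 10.0)

def melody_sim(x, y):
    return 1.0 / (1.0 + abs(x[1] - y[1]))

# Only tempo changes inside each pair; pitch is held constant.
tempo_only_pairs = [((120, 60), (90, 60)), ((100, 64), (140, 64))]

winner, drops = most_affected_head(
    {"rhythm": rhythm_sim, "melody": melody_sim}, tempo_only_pairs
)
```

With real recordings the same loop would run over the model's actual per-head similarity functions, and a non-trivial drop on the melody or timbre head would count as evidence against selectivity.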

Original abstract

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MERIT proposes training disentangled music heads via conditional generation plus stem separation to isolate melody, rhythm, and timbre, but the approach hinges on data that may still carry cross-factor leakage.

The main thing to know is that MERIT creates training examples where only one musical factor changes by combining conditional audio generation with source-separated stems, then learns separate heads for melody, rhythm, and timbre.

This training strategy is the actual new element. The problem of monolithic similarity scores is well known, and the paper gives a concrete way to manufacture the single-factor data that real recordings rarely provide.

The work is clear on the motivation and on the intended use case of more controllable queries. If the heads really stay near chance on the wrong dimensions while responding to the right ones, even on held-out real audio, that would be useful for music retrieval tasks.

The soft spot is exactly the one the stress-test flags. Stem separation is imperfect and generators can pass along shared artifacts or correlations, so the heads might exploit those instead of learning the perceptual dimensions. Without seeing the architecture, the loss terms, the quantitative metrics, or any ablation on separation quality, it is impossible to tell whether the reported disentanglement follows from the intended mechanism or from residual signals in the data.

This paper is for audio ML groups working on music similarity or controllable generation. A reader already thinking about factorized representations could pick up the training trick and test it themselves.

It deserves a serious referee. The idea is testable and the application is narrow enough that a review can focus on whether the data isolation actually holds.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MERIT, a framework for learning disentangled factor-specific music representations along the dimensions of melody, rhythm, and timbre. To address the lack of isolated variations in real audio, it proposes a training strategy that combines conditional audio generation with source-separated stems to produce training data with strongly single-factor variations. Evaluations are reported to show strong factor-wise disentanglement, with each head responding primarily to its target dimension and near chance on others, and this property is claimed to hold on both the synthetic training domain and independent real-world audio.

Significance. If the reported disentanglement is robust and attributable to the intended mechanism rather than data artifacts, the work would meaningfully advance interpretable and controllable music similarity models in MIR. The use of generative models and stem separation to synthesize isolated-factor training data is a creative response to data scarcity and could influence future representation learning in audio if the isolation is validated.

major comments (2)

[method section] Training strategy description: the central claim that conditional generation plus source-separated stems produces data with 'strongly isolated single-factor variations' is load-bearing for attributing the observed head specialization to true perceptual disentanglement rather than residual correlations. No quantitative validation of isolation (e.g., pairwise factor correlation, mutual information, or leakage metrics on the generated stems) is provided, leaving open the possibility that imperfect separation or generative-model entanglements allow heads to exploit spurious cues.
[evaluation section] The claim that 'each head responds strongly to its intended perceptual dimension while remaining near chance on the others', and that this holds on real-world audio, requires the reader to accept that the test probes are themselves factor-isolated. Without a report of how the real-world test set was constructed or controlled for cross-factor correlations, the generalization result cannot be distinguished from the model learning dataset-specific artifacts that happen to align with the synthetic training distribution.
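The pairwise factor correlation the referee asks for could be computed along the following lines. This is only a sketch under assumptions: the `changes` dictionary holds invented per-pair change magnitudes for each factor (how those magnitudes would be measured on real stems is not specified here), and `pairwise_factor_correlation` is a hypothetical helper name. A correlation near zero between two factors is consistent with isolation; a high value flags possible leakage.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def pairwise_factor_correlation(changes):
    """Correlate change magnitudes for every unordered pair of factors."""
    names = sorted(changes)
    return {
        (a, b): pearson(changes[a], changes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Invented change magnitudes for four training pairs (illustration only).
changes = {
    "melody": [0.9, 0.1, 0.8, 0.2],
    "rhythm": [0.0, 0.0, 0.0, 0.0],  # rhythm held fixed in every pair
    "timbre": [0.8, 0.2, 0.9, 0.1],  # moves together with melody: leakage
}
report = pairwise_factor_correlation(changes)
```

In this toy data the melody and timbre changes are strongly correlated, which is the kind of pattern that would let a timbre head succeed on melody pairs without learning timbre.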

minor comments (2)

Notation for the three heads (melody, rhythm, timbre) should be introduced with explicit symbols early in the method section to improve readability when discussing per-head losses or activations.
The abstract states the property 'holds across both the synthetic training domain and independent real-world audio' but the manuscript should clarify whether the real-world evaluation uses the same probe tasks or a different protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. The two major points raise valid concerns about the strength of evidence for factor isolation in both training data and real-world evaluation. We address each below and will revise the manuscript to incorporate additional validation and documentation.

Referee: [method section] Training strategy description: the central claim that conditional generation plus source-separated stems produces data with 'strongly isolated single-factor variations' is load-bearing for attributing the observed head specialization to true perceptual disentanglement rather than residual correlations. No quantitative validation of isolation (e.g., pairwise factor correlation, mutual information, or leakage metrics on the generated stems) is provided, leaving open the possibility that imperfect separation or generative-model entanglements allow heads to exploit spurious cues.

Authors: We agree that the absence of explicit quantitative isolation metrics on the generated training stems leaves the attribution of head specialization open to alternative explanations. In the revised manuscript we will add a dedicated subsection reporting pairwise factor correlations, mutual information estimates, and leakage metrics computed on the synthetic stems used for training. These analyses will be performed both before and after the conditional generation and stem-separation pipeline to quantify the degree of isolation achieved.
Revision: yes

Referee: [evaluation section] The claim that 'each head responds strongly to its intended perceptual dimension while remaining near chance on the others', and that this holds on real-world audio, requires the reader to accept that the test probes are themselves factor-isolated. Without a report of how the real-world test set was constructed or controlled for cross-factor correlations, the generalization result cannot be distinguished from the model learning dataset-specific artifacts that happen to align with the synthetic training distribution.

Authors: We concur that the real-world generalization claim requires transparent documentation of the test-set construction and any controls for cross-factor correlations. The revised evaluation section will include a detailed description of how the independent real-world audio was selected and annotated, together with summary statistics on observed cross-factor correlations within that set. Where feasible we will also report performance on a controlled subset that minimizes such correlations.
Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical evaluations of training data

The paper introduces MERIT for factor-specific music representations and relies on a training strategy using conditional audio generation plus source-separated stems to create isolated single-factor variations. The central claim of strong factor-wise disentanglement is presented as an outcome of evaluations on both synthetic and real-world audio, with each head responding to its intended dimension. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The result is not forced by definition or prior author work; it is an empirical observation whose validity hinges on the quality of the generated training data rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no explicit free parameters, axioms, or invented entities; all such elements remain unidentified.

Reference graph

Works this paper leans on

31 extracted references · 10 canonical work pages · 2 internal anchors

[1]

MERIT: Learning Disentangled Music Representations for Audio Similarity

Introduction. Music similarity is inherently multi-dimensional. A solo piano cover of a rock anthem preserves the melody and harmonic identity of the original while replacing every instrument and reshaping the groove. Two recordings by the same artist often share a timbral signature with no constraint at all on melody. Within a dance genre, different t...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

A scalable data pipeline for constructing factor-controlled music triplets via generative conditioning and source separation, along with our constructed dataset
[3]

MERIT, a representational architecture that demonstrates high functional selectivity by decoupling entangled musical dimensions into independent, addressable scoring channels
[4]

Code and pre-trained models are available at https://github.com/AMAAI-Lab/MERIT

An evaluation protocol that quantifies factor-wise selectivity, alongside zero-shot probes confirming that this selectivity generalizes to independent, real-world audio collections. Code and pre-trained models are available at https://github.com/AMAAI-Lab/MERIT
[5]

Related Work. General audio and music embeddings. Large-scale contrastive audio–language pre-training, as in CLAP [1] and MuLan [2], produces rich audio representations by aligning audio with free-form text descriptions. Self-supervised music encoders such as MERT [3] extend masked language modelling to audio with auxiliary pitch, chroma, and beat objec...
[6]

Folk song with accordion and acoustic guitar

Method. 3.1 Factor-Specific Triplet Construction. A training triplet for factor f is a tuple (A, P_f, N) where anchor A and positive P_f are similar on factor f and differ in other respects, while negative N differs from A on factor f. We construct three separate triplet datasets, one per factor, using different conditioning strategies. Given k positives pe...
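To make the (A, P_f, N) triplet concrete, the sketch below scores one triplet with a standard triplet margin loss, max(0, d(A, P_f) - d(A, N) + margin). This is an assumption: the excerpt above is truncated before any loss is named, so the choice of loss, the Euclidean distance, the margin of 0.2, and the 2-dimensional toy embeddings are illustrative only and may not match the paper's objective.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(A, P) - d(A, N) + margin)."""
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )

# Toy head outputs for one factor (hypothetical values, not real embeddings).
anchor = [0.0, 0.0]
# Positive coincides with the anchor and the negative is far away: zero loss.
easy = triplet_margin_loss(anchor, [0.0, 0.0], [3.0, 4.0])
# Roles swapped: the positive is far and the negative is close, so loss is large.
hard = triplet_margin_loss(anchor, [3.0, 4.0], [0.0, 0.0])
```

In a per-factor setup, each head would be trained only on the triplet dataset built for its own factor, so that the other two factors act as nuisance variation between anchor and positive.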
[7]

Let It Be

Experiments and Results. 4.1 Datasets. All training triplets are derived from MoisesDB [14], a multitrack source-separation corpus that provides per-song stems with instrument labels. Melody and rhythm anchors

Pair type   Melody      Rhythm      Timbre
Melody      60.0±30.3   53.4±28.8   26.3±27.5
Rhythm      34.0±27.3   65.8±25.6   37.5±26.3
Timbre      34.2±27.7   37.4±29.8   57.3±31.7

Table...
[8]

Discussion. The diagonal scores near 100% in Table 2 reflect supervision aligned with what a shallow MLP on multi-layer MERT can extract; the held-out test split is folder-disjoint from training, so this is not overfitting in the conventional sense. A residual concern is that the within-pipeline test set could inherit JASCO-borne correlations that a head...
[9]

Conclusion. We presented MERIT, a representational framework that exposes melodic, rhythmic, and timbral similarity as three separable scores. On three zero-shot probes, the intended head is the strongest factor head on instrument-class identity (MUSDB18-HQ) and on dance-style rhythmic signatures (Ballroom), and the cross-factor profile recovered on cov...
[10]

SUTD SKI 2021_04_06 and from MOE grant no

Acknowledgments. This work has received funding from grant no. SUTD SKI 2021_04_06 and from MOE grant no. MOE-T2EP20124-0014
[11]

AI Usage Statement. We acknowledge the use of Gemini and ChatGPT for paraphrasing and grammar improvements
[12]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[13]

Mulan: A joint embedding of music audio and natural language

Q. Huang, A. Jansen, J. Lee, R. Ganti, J. Y. Li, and D. P. Ellis, “Mulan: A joint embedding of music audio and natural language,” arXiv preprint arXiv:2208.12415, 2022

work page arXiv 2022
[14]

Mert: Acoustic music understanding model with large-scale self-supervised training

Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos et al., “Mert: Acoustic music understanding model with large-scale self-supervised training,” arXiv preprint arXiv:2306.00107, 2023

work page arXiv 2023
[15]

Learning a representation for cover song identification using convolutional neural network

Z. Yu, X. Xu, X. Chen, and D. Yang, “Learning a representation for cover song identification using convolutional neural network,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 541–545

2020
[16]

Melodysim: measuring melody-aware music similarity for plagiarism detection

T. Lu, C.-M. Geist, J. Melechovsky, A. Roy, and D. Herremans, “Melodysim: measuring melody-aware music similarity for plagiarism detection,” arXiv preprint arXiv:2505.20979, 2025

work page arXiv 2025
[17]

Sonicverse: Multi-task learning for music feature-informed captioning

A. Chopra, A. Roy, and D. Herremans, “Sonicverse: Multi-task learning for music feature-informed captioning,” arXiv preprint arXiv:2506.15154, 2025

work page arXiv 2025
[18]

In Defense of the Triplet Loss for Person Re-Identification

A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Circle loss: A unified perspective of pair similarity optimization

Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei, “Circle loss: A unified perspective of pair similarity optimization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6398–6407

2020
[20]

Neural audio fingerprint for high-specific audio retrieval based on contrastive learning

S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, “Neural audio fingerprint for high-specific audio retrieval based on contrastive learning,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3025–3029

2021
[21]

Contrastive self-supervised learning for text-independent speaker verification

H. Zhang, Y. Zou, and H. Wang, “Contrastive self-supervised learning for text-independent speaker verification,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6713–6717

2021
[22]

Contrastive learning of musical representations

J. Spijkervet and J. A. Burgoyne, “Contrastive learning of musical representations,” arXiv preprint arXiv:2103.09410, 2021

work page arXiv 2021
[23]

An experimental comparison of multi-view self-supervised methods for music tagging

G. Meseguer-Brocal, D. Desblancs, and R. Hennequin, “An experimental comparison of multi-view self-supervised methods for music tagging,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1141–1145

2024
[24]

Learning disentangled representations of timbre and pitch for musical instrument sounds using gaussian mixture variational autoencoders

Y.-J. Luo, K. Agres, and D. Herremans, “Learning disentangled representations of timbre and pitch for musical instrument sounds using gaussian mixture variational autoencoders,” in 20th Conference of the International Society for Music Information Retrieval (ISMIR). ISMIR, 2019

2019
[25]

Moisesdb: A dataset for source separation beyond 4-stems

I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, “Moisesdb: A dataset for source separation beyond 4-stems,” arXiv preprint arXiv:2307.15913, 2023

work page arXiv 2023
[26]

Joint audio and symbolic conditioning for temporally controlled text-to-music generation

O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi, “Joint audio and symbolic conditioning for temporally controlled text-to-music generation,” arXiv preprint arXiv:2406.10970, 2024

work page arXiv 2024
[27]

Leveraging llm embeddings for cross dataset label alignment and zero shot music emotion prediction

R. Liu, A. Roy, and D. Herremans, “Leveraging llm embeddings for cross dataset label alignment and zero shot music emotion prediction,” arXiv preprint arXiv:2410.11522, 2024

work page arXiv 2024
[28]

The faiss library

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou, “The faiss library,” IEEE Transactions on Big Data, 2025

2025
[29]

Musdb18-hq-an uncompressed version of musdb18

Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “Musdb18-hq-an uncompressed version of musdb18,” (No Title), 2019

2019
[30]

Rhythmic pattern modeling for beat and downbeat tracking in musical audio

F. Krebs, S. Böck, and G. Widmer, “Rhythmic pattern modeling for beat and downbeat tracking in musical audio,” in ISMIR, 2013, pp. 227–232

2013
[31]

The 2007 labrosa cover song detection system

D. P. Ellis and C. V. Cotton, “The 2007 labrosa cover song detection system,” 2007

2007
