MERIT: Learning Disentangled Music Representations for Audio Similarity

2026 · cs.SD · arXiv 2605.27346

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
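
As a rough illustration of what factor-specific representations could enable at query time, the sketch below compares two tracks per factor and then blends the scores with user-chosen weights. It is not the paper's implementation: the per-factor embeddings are tiny hand-written vectors standing in for the output of the melody, rhythm, and timbre heads, and the names `cosine`, `factor_similarities`, and `weighted_similarity` are hypothetical, as is the choice of cosine similarity as the comparison metric.

```python
import math

# Assumption: each track is represented by one embedding per factor,
# as if produced by separate melody, rhythm, and timbre heads.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def factor_similarities(track_a, track_b):
    """Return one similarity score per factor instead of a single monolithic score."""
    return {f: cosine(track_a[f], track_b[f]) for f in FACTORS}


def weighted_similarity(track_a, track_b, weights):
    """Blend per-factor scores with user-chosen weights (missing factors weigh 0)."""
    total = sum(weights.get(f, 0.0) for f in FACTORS)
    if total == 0:
        raise ValueError("at least one factor needs a non-zero weight")
    sims = factor_similarities(track_a, track_b)
    return sum(weights.get(f, 0.0) * sims[f] for f in FACTORS) / total


# Toy embeddings (hypothetical values): same melody, different rhythm and timbre.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
cover = {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, -1.0]}

per_factor = factor_similarities(query, cover)
melody_only = weighted_similarity(query, cover, {"melody": 1.0})
all_equal = weighted_similarity(query, cover, {f: 1.0 for f in FACTORS})
```

With these toy vectors, a melody-only query treats the two tracks as a match, while an equal-weight blend scores them much lower. That is the kind of nuanced query a single entangled score cannot express.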

representative citing papers

MERIT: Learning Disentangled Music Representations for Audio Similarity

cs.SD · 2026-05-26 · unverdicted · novelty 6.0

MERIT trains disentangled heads for melody, rhythm, and timbre via conditional audio generation and stem separation, with evaluations showing each head responds strongly to its target dimension and near chance on others across synthetic and real audio.

citing papers explorer

Showing 1 of 1 citing paper after filters.

MERIT: Learning Disentangled Music Representations for Audio Similarity

cs.SD · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
