MERIT trains disentangled heads for melody, rhythm, and timbre via conditional audio generation and stem separation, with evaluations showing each head responds strongly to its target dimension and near chance on others across synthetic and real audio.
MERIT: Learning Disentangled Music Representations for Audio Similarity
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
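The idea of factor-specific heads can be illustrated with a minimal sketch. It is not MERIT's implementation: the linear heads, their random (untrained) weights, the dimensions, and the names `make_heads`, `factor_similarity`, and `weighted_query` are all hypothetical, chosen only to show how one cosine score per factor differs from a single monolithic score.

```python
import numpy as np

# The three dimensions named in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def l2_normalize(x):
    """Scale a vector to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_heads(embed_dim, head_dim, seed=0):
    """Hypothetical stand-in for trained heads: one random linear map per factor."""
    rng = np.random.default_rng(seed)
    return {name: rng.standard_normal((embed_dim, head_dim)) for name in FACTORS}


def factor_similarity(emb_a, emb_b, heads):
    """Return one cosine similarity per factor instead of a single score."""
    scores = {}
    for name, weight in heads.items():
        za = l2_normalize(emb_a @ weight)
        zb = l2_normalize(emb_b @ weight)
        scores[name] = float(za @ zb)
    return scores


def weighted_query(scores, weights):
    """Combine per-factor scores with user-chosen weights (assumed query style)."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    heads = make_heads(embed_dim=16, head_dim=4)
    track_a = rng.standard_normal(16)  # placeholder for a shared audio embedding
    track_b = rng.standard_normal(16)
    per_factor = factor_similarity(track_a, track_b, heads)
    print(per_factor)
    # Emphasize rhythm over the other two factors.
    print(weighted_query(per_factor, {"rhythm": 2.0, "melody": 1.0, "timbre": 1.0}))
```

With trained heads, a query such as "similar rhythm, any timbre" would amount to weighting the rhythm score and ignoring the timbre score; with the random weights used here the numbers carry no musical meaning.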
fields: cs.SD (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)