arxiv: 2605.27346 · v1 · pith:TXLB2H3Snew · submitted 2026-05-26 · 💻 cs.SD

MERIT: Learning Disentangled Music Representations for Audio Similarity

classification 💻 cs.SD
keywords audio · music · training · dimensions · disentangled · learning · merit · musical

Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and rules out nuanced, factor-specific queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we propose a training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
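
As a rough illustration of what factor-specific similarity could look like in practice, the sketch below scores two tracks separately on melody, rhythm, and timbre and then combines the scores with user-chosen weights. It is not the MERIT implementation: the function names, the use of cosine similarity, the weighted-average combination, and the toy two-dimensional embeddings are all assumptions made for illustration, standing in for whatever the model's three heads actually produce.

```python
import numpy as np

# Hypothetical factor names, mirroring the three dimensions named in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def factor_similarity(query, candidate, weights):
    """Return per-factor similarities and their weighted average.

    `query` and `candidate` map each factor name to an embedding vector
    (assumed here to come from one head per factor). `weights` lets the
    caller emphasise or ignore individual factors.
    """
    total = sum(weights[f] for f in FACTORS)
    if total <= 0:
        raise ValueError("at least one factor weight must be positive")
    per_factor = {f: cosine(query[f], candidate[f]) for f in FACTORS}
    combined = sum(weights[f] * per_factor[f] for f in FACTORS) / total
    return per_factor, combined


# Toy embeddings (made up): same melody, unrelated rhythm, opposite timbre.
query = {
    "melody": np.array([1.0, 0.0]),
    "rhythm": np.array([0.0, 1.0]),
    "timbre": np.array([1.0, 1.0]),
}
candidate = {
    "melody": np.array([1.0, 0.0]),
    "rhythm": np.array([1.0, 0.0]),
    "timbre": np.array([-1.0, -1.0]),
}

# A "same melody, ignore everything else" query.
melody_only = {"melody": 1.0, "rhythm": 0.0, "timbre": 0.0}
per_factor, score = factor_similarity(query, candidate, melody_only)
```

With the melody-only weights, the combined score depends solely on the melody embeddings, whereas a single monolithic score would have to average in the mismatched rhythm and timbre.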

