Controllable Embedding Transformation for Mood-Guided Music Retrieval

Jaehun Kim; Juan Pablo Bello; Julia Wilkins; Matthew C. McCallum; Matthew E. P. Davies

arxiv: 2510.20759 · v2 · submitted 2025-10-23 · 💻 cs.SD

Julia Wilkins, Jaehun Kim, Matthew E. P. Davies, Juan Pablo Bello, Matthew C. McCallum

Pith reviewed 2026-05-18 04:44 UTC · model grok-4.3

keywords: music embeddings · mood transformation · controllable retrieval · audio embeddings · music recommendation · embedding mapping · proxy sampling · attribute preservation

The pith

A learned mapping can adjust mood in music embeddings while keeping genre and instrumentation intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to make music recommendation systems more flexible by letting users change just one quality, such as mood, in a song's mathematical representation without altering its genre or instruments. It does this by training a small model to map an original song embedding toward a new one that matches a desired mood, using similar but varied example tracks as guides during learning and a combined training goal that pushes both the mood shift and the retention of other traits. A reader would care because most current song embeddings offer no such targeted control, which limits how personal recommendations can feel. The approach is tested on two music collections and shown to hold onto non-mood features more reliably than methods that skip training.

Core claim

The authors claim that controllable embedding transformation for mood-guided retrieval can be achieved by learning a direct mapping from a seed audio embedding to a mood-conditioned target embedding. Two components support the mapping: a proxy sampling step that selects reference tracks that are diverse yet similar to the seed, and a joint objective that drives the mood change while preserving other musical attributes. On two datasets, the reported result is stronger mood alignment and better retention of genre and instrumentation than training-free baselines.

What carries the argument

The lightweight translation model trained via proxy target sampling and a joint objective that balances mood transformation against preservation of remaining attributes.
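As a rough illustration of what a joint objective of this kind could look like, the sketch below adds a transformation term (pull the mapped embedding toward a proxy target with the desired mood) to a weighted preservation term (keep the mapped embedding close to the seed). The cosine-distance form, the weight `lam`, and the names `cosine_distance` and `joint_loss` are assumptions made for illustration; this summary does not state the paper's actual loss.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def joint_loss(mapped, proxy_target, seed, lam=0.5):
    """Hypothetical joint objective: transformation term plus weighted preservation term."""
    transform_term = cosine_distance(mapped, proxy_target)  # move toward the target mood
    preserve_term = cosine_distance(mapped, seed)           # stay close to the seed
    return transform_term + lam * preserve_term

# Toy 3-d embeddings (assumed values, not data from the paper).
seed = [1.0, 0.0, 0.0]
proxy_target = [0.0, 1.0, 0.0]
halfway = [1.0, 1.0, 0.0]

loss_stay = joint_loss(seed, proxy_target, seed)          # output ignores the mood request
loss_jump = joint_loss(proxy_target, proxy_target, seed)  # output copies the proxy target
loss_half = joint_loss(halfway, proxy_target, seed)       # output compromises between both
```

In this toy setting the compromise output scores lowest, which is the qualitative behavior a combined transformation and preservation loss is meant to encourage. The balance depends entirely on the assumed weight `lam`.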

If this is right

Music retrieval systems could let users request tracks with a chosen mood while the original genre and instrumentation stay close to the seed track.
Playlist creation tools could apply the same transformation repeatedly to generate sets that vary along one chosen dimension at a time.
Embedding spaces used for similarity search would support fine-grained personalization without requiring new audio processing for each adjustment.
Training-free methods would be replaced in practice because the learned mapping demonstrably improves attribute retention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sampling and joint-objective structure might be reused to control other single attributes such as energy or tempo once suitable labels are available.
Real-time user mood preferences could be fed into the mapping at query time to produce on-the-fly adjusted retrieval results.
The framework could be combined with existing large-scale recommendation pipelines to add controllable sliders without retraining the base embeddings.

Load-bearing premise

Mood can be isolated from other musical properties inside the embedding space and shifted on its own by the learned mapping and proxy sampling without side effects on genre or instrumentation.

What would settle it

Measuring that genre labels or instrumentation features of the output embeddings change at roughly the same rate as the intended mood shift, or that retention scores fall below those of the training-free baselines, would show the central claim does not hold.
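One way such a check could be run is with precision-at-1 retrieval scores, computed once against the requested mood and once against the seed's genre. The sketch below is a minimal version on toy data; the catalog layout, the field names, and the helper `precision_at_1` are illustrative assumptions rather than the paper's evaluation code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def precision_at_1(queries, wanted_labels, catalog, label_key):
    """Fraction of queries whose nearest catalog item carries the wanted label."""
    hits = 0
    for query, wanted in zip(queries, wanted_labels):
        nearest = max(catalog, key=lambda item: cosine_similarity(query, item["emb"]))
        hits += nearest[label_key] == wanted
    return hits / len(queries)

# Toy catalog (assumed): dimension 0 loosely tracks genre, dimension 1 loosely tracks mood.
catalog = [
    {"emb": [1.0, 1.0], "mood": "happy", "genre": "rock"},
    {"emb": [1.0, -1.0], "mood": "sad", "genre": "rock"},
    {"emb": [-1.0, 1.0], "mood": "happy", "genre": "jazz"},
    {"emb": [-1.0, -1.0], "mood": "sad", "genre": "jazz"},
]

# Two transformed query embeddings; both requested "happy", seeds were rock and jazz.
transformed = [[0.9, 0.8], [0.2, 0.9]]
target_moods = ["happy", "happy"]
seed_genres = ["rock", "jazz"]

mood_p1 = precision_at_1(transformed, target_moods, catalog, "mood")
genre_p1 = precision_at_1(transformed, seed_genres, catalog, "genre")
```

Here both queries land on a happy track, but the second one drifts from jazz to rock, so the mood score is perfect while the genre score drops. A pattern like that, at scale and relative to the baselines, is the kind of evidence the paragraph above describes.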

Original abstract

Music representations are the backbone of modern recommendation systems, powering playlist generation, similarity search, and personalized discovery. Yet most embeddings offer little control for adjusting a single musical attribute, e.g., changing only the mood of a track while preserving its genre or instrumentation. In this work, we address the problem of controllable music retrieval through embedding-based transformation, where the objective is to retrieve songs that remain similar to a seed track but are modified along one chosen dimension. We propose a novel framework for mood-guided music embedding transformation, which learns a mapping from a seed audio embedding to a target embedding guided by mood labels, while preserving other musical attributes. Because mood cannot be directly altered in the seed audio, we introduce a sampling mechanism that retrieves proxy targets to balance diversity with similarity to the seed. We train a lightweight translation model using this sampling strategy and introduce a novel joint objective that encourages transformation and information preservation. Extensive experiments on two datasets show strong mood transformation performance while retaining genre and instrumentation far better than training-free baselines, establishing controllable embedding transformation as a promising paradigm for personalized music retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Proxy sampling plus a joint objective gives a workable route to mood shifts in music embeddings, but correlations with genre and instrumentation likely limit true isolation.

The core contribution here is a lightweight translation model that takes a seed embedding and maps it toward a mood target using proxy samples, backed by a joint loss that pushes both the mood change and retention of other attributes. The proxy mechanism samples targets that match the seed somewhat while hitting the desired mood label, which sidesteps the problem of directly editing raw audio. That setup plus the combined objective is the main new piece, and it feels like a reasonable engineering response to the controllability gap in existing music embeddings.

The reported experiments on two datasets show clearer mood movement and stronger preservation of genre and instrumentation than the training-free baselines, which is a positive signal if the numbers are measured cleanly. The model staying small is also a practical plus for anyone plugging this into a recommendation pipeline.

The soft spot is the isolation claim. Mood labels in music data routinely correlate with genre and instrumentation, so the proxies could carry those attributes along even when the preservation term is active. Nothing in the abstract points to adversarial disentanglement, orthogonal penalties, or explicit correlation checks, which leaves open the chance that the retention gains come more from weak baselines or dataset quirks than from genuine attribute separation. Without those details or ablations showing how much unintended drift actually occurs, the controllability story stays provisional.

This is worth a look for people working on attribute-controlled music retrieval or embedding edits in recsys. A reader who needs a concrete sampling trick and joint objective to try on their own data will find usable ideas. It deserves peer review because the framing is distinct enough and the directional results are there to justify referee time, even if revisions will probably be needed around validation of independence.

Referee Report

2 major / 2 minor

Summary. The paper introduces a framework for controllable embedding transformation in music retrieval, where a lightweight translation model learns to map seed audio embeddings to target embeddings conditioned on mood labels. Proxy sampling retrieves diverse yet similar targets to enable mood adjustment without direct audio modification, and a joint objective balances transformation accuracy with preservation of other attributes like genre and instrumentation. Experiments on two datasets reportedly outperform training-free baselines in mood control while better retaining genre and instrumentation.

Significance. If the central claim of isolated mood control holds, the work offers a practical paradigm for attribute-specific editing in music embeddings, with potential applications in personalized recommendation systems. The use of proxy sampling and joint objectives provides a concrete implementation that could be extended to other attributes, though its advantage over standard supervised approaches requires further validation against dataset correlations.

major comments (2)

[§3.2] §3.2 (Proxy Sampling Mechanism): The sampling retrieves proxies balancing diversity and seed similarity using mood labels, but the description does not include explicit mechanisms such as adversarial disentanglement or orthogonal constraints to prevent mood from entangling with correlated attributes like genre or instrumentation. If mood labels in the datasets correlate with these attributes, the learned mapping may shift them despite the preservation term in the joint objective.
[§4] §4 (Experiments): The reported retention of genre and instrumentation 'far better than training-free baselines' is central to the controllability claim, yet the manuscript provides limited details on data splits, exact quantitative metrics (e.g., specific similarity scores or classification accuracies), and ablations isolating the contribution of proxy sampling versus the joint objective. This makes it difficult to rule out dataset biases as the source of observed preservation.
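The correlation concern raised in these comments can be probed directly on the label metadata before any model is trained. The sketch below computes Cramér's V between mood and genre labels; the toy label pairs and the helper name `cramers_v` are assumptions for illustration, and the statistic is a standard association measure rather than anything the paper reports.

```python
import math
from collections import Counter

def cramers_v(pairs):
    """Cramér's V for a list of (label_a, label_b) tuples.

    0 means the two labels look independent, 1 means one fully determines the other.
    Assumes each label takes at least two distinct values.
    """
    n = len(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    count_ab = Counter(pairs)
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n
            observed = count_ab[(a, b)]  # Counter returns 0 for unseen pairs
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_a), len(count_b))
    return math.sqrt(chi2 / (n * (k - 1)))

# Toy label sets (assumed): one balanced, one where mood and genre are confounded.
balanced = [("happy", "rock"), ("happy", "jazz"), ("sad", "rock"), ("sad", "jazz")]
confounded = [("happy", "rock")] * 2 + [("sad", "jazz")] * 2

v_balanced = cramers_v(balanced)
v_confounded = cramers_v(confounded)
```

A value near 1 on a real dataset would suggest that any mapping trained on mood labels can shift genre as a side effect, which is the scenario the first major comment asks the authors to rule out.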

minor comments (2)

[§3.1] The notation for the translation model and embedding spaces could be clarified with an explicit equation defining the mapping function f and the role of proxy targets.
[Figure 1] Figure 1 (framework diagram) would benefit from labeling the proxy sampling step and the components of the joint loss to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate the changes we will make in the revised version.

Referee: [§3.2] §3.2 (Proxy Sampling Mechanism): The sampling retrieves proxies balancing diversity and seed similarity using mood labels, but the description does not include explicit mechanisms such as adversarial disentanglement or orthogonal constraints to prevent mood from entangling with correlated attributes like genre or instrumentation. If mood labels in the datasets correlate with these attributes, the learned mapping may shift them despite the preservation term in the joint objective.

Authors: We thank the referee for highlighting this potential issue. Our framework does not employ adversarial disentanglement or orthogonal constraints; it instead relies on the proxy sampling mechanism to select targets that are similar to the seed in non-mood attributes and on the joint objective to enforce preservation during training. We acknowledge that correlations between mood labels and attributes such as genre or instrumentation in the datasets could influence the learned mapping. To address this, we will revise §3.2 to include a discussion of dataset correlations and add supporting analysis or experiments quantifying the degree of preservation achieved by the joint objective.
revision: partial
Referee: [§4] §4 (Experiments): The reported retention of genre and instrumentation 'far better than training-free baselines' is central to the controllability claim, yet the manuscript provides limited details on data splits, exact quantitative metrics (e.g., specific similarity scores or classification accuracies), and ablations isolating the contribution of proxy sampling versus the joint objective. This makes it difficult to rule out dataset biases as the source of observed preservation.

Authors: We agree that additional experimental details are required for reproducibility and to strengthen the controllability claims. In the revised manuscript we will expand §4 with explicit descriptions of the data splits, report the precise quantitative metrics (including similarity scores and classification accuracies for genre and instrumentation), and present ablation studies that isolate the individual contributions of proxy sampling and the joint objective. These additions will help demonstrate that the observed preservation is attributable to the proposed components rather than dataset biases alone.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper presents a standard supervised learning setup: a lightweight translation model is trained on audio embeddings using mood labels, proxy sampling to select targets, and a joint objective balancing transformation with attribute preservation. No equations, derivations, or first-principles results are described that reduce outputs to inputs by construction, nor are there self-citations, uniqueness theorems, or ansätze that carry the weight of the central claim. The method relies on external datasets and empirical validation rather than tautological redefinitions or fitted parameters renamed as predictions. This is the most common honest outcome for a training-based retrieval paper, and the work qualifies as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard assumptions from representation learning and contrastive or translation models in audio, with no explicitly invented physical entities.

free parameters (1)

sampling parameters for proxy targets
Hyperparameters controlling diversity versus similarity in proxy target selection are chosen to balance the transformation.
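To make the diversity-versus-similarity trade-off concrete, the sketch below draws a proxy target at random from the `k` nearest neighbors of the seed among tracks carrying the requested mood. The neighbor count `k`, the catalog layout, and the function `sample_proxy_target` are assumptions for illustration; the paper's conclusion mentions a nearest-neighbor sampling scheme, but its exact procedure is not given in this summary.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sample_proxy_target(seed_emb, target_mood, catalog, k=2, rng=random):
    """Pick one of the k target-mood tracks most similar to the seed (hypothetical scheme).

    Small k favors similarity to the seed; large k favors diversity.
    Raises IndexError if no track carries the target mood.
    """
    candidates = [t for t in catalog if t["mood"] == target_mood]
    candidates.sort(key=lambda t: cosine_similarity(seed_emb, t["emb"]), reverse=True)
    return rng.choice(candidates[:k])

# Toy catalog (assumed values, not data from the paper).
catalog = [
    {"id": "a", "emb": [1.0, 0.9], "mood": "happy"},
    {"id": "b", "emb": [0.8, 1.0], "mood": "happy"},
    {"id": "c", "emb": [-1.0, 0.5], "mood": "happy"},
    {"id": "d", "emb": [1.0, -1.0], "mood": "sad"},
]
seed = [1.0, -0.2]  # embedding of a sad seed track

proxy = sample_proxy_target(seed, "happy", catalog, k=2, rng=random.Random(0))
```

With `k=1` the sampler always returns track `a`, the closest happy track; with `k=2` it alternates between `a` and `b` and never reaches the dissimilar track `c`. Raising `k` widens the pool, which is the knob this ledger entry refers to.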

axioms (1)

domain assumption: Music embeddings encode separable attributes such as mood, genre, and instrumentation.
Invoked when assuming the mapping can alter mood while preserving other attributes.

pith-pipeline@v0.9.0 · 5727 in / 1107 out tokens · 24972 ms · 2026-05-18T04:44:27.972070+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We train a lightweight translation model using this sampling strategy and introduce a novel joint objective that encourages transformation and information preservation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

[1]

similar, but happier,

INTRODUCTION Music consumption behavior on streaming platforms can range from passive background listening to active playlist creation and explicit recommendation feedback [1]. A promising direction within this continuum targets the discovery of music which shares many underlying musical properties of some seed track(s), but differs in one or two target...

work page
[2]

There is little prior work on manipulating music embeddings in an audio-only latent space for semantically guided retrieval tasks

leverages an audio-text embedding space to manipulate audio effects using natural language prompts, and [12] uses diffusion to generate audio queries conditioned on text for text-music retrieval. There is little prior work on manipulating music embeddings in an audio-only latent space for semantically guided retrieval tasks. Disentanglement-based approach...

work page
[3]

Controllable Embedding Transformation for Mood-Guided Music Retrieval

METHOD We propose a novel framework for controllable music embedding transformation. The goal of our system is to learn a transformation purely in the embedding space that shifts a single, controllable attribute of an input audio track, while preserving other musical attributes. We use mood as the transformation attribute and genre and instrumentation as m...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

energetic

EXPERIMENTAL DESIGN 3.1. Datasets We use a large-scale proprietary music dataset for our study that contains 1.3M songs with high-quality mood and genre annotations. This dataset contains songs from a set of four moods pertaining to high and low-energy and positive and negative sentiment, which approximately align with the main dimensions of Russell’s ...

work page
[5]

RESULTS AND DISCUSSION 4.1. Core Results Our key results, shown in Table 1, demonstrate that our method consistently outperforms random baselines by a wide margin, achieving high mood transformation accuracy while simultaneously preserving genre and instrumentation. On the large-scale dataset, our approach reaches Mood P@1 of 0.96 and Genre P@1 of 0.32, f...

work page
[6]

CONCLUSION In this work, we introduce a framework for controllable music embedding transformation, enabling retrieval of tracks of a different mood but similar in other musical dimensions such as genre and instrumentation. We utilize a novel nearest-neighbor data sampling scheme to create seed-target embedding pairs to train our transformation model...

work page
[7]

Music recommendation systems: Techniques, use cases, and challenges,

M. Schedl, P. Knees, B. McFee, and D. Bogdanov, “Music recommendation systems: Techniques, use cases, and challenges,” in Recommender systems handbook, pp. 927–971. Springer, 2021

work page 2021
[8]

Current challenges and visions in music recommender systems research,

M. Schedl, H. Zamani, C.-W. Chen, Y. Deldjoo, and M. Elahi, “Current challenges and visions in music recommender systems research,” International Journal of Multimedia Information Retrieval, vol. 7, no. 2, pp. 95–116, 2018

work page 2018
[9]

Beyond the trends: Evolution and future directions in music recommender systems research,

B. Amiri, N. Shahverdi, A. Haddadi, and Y. Ghahremani, “Beyond the trends: Evolution and future directions in music recommender systems research,” IEEE Access, vol. 12, pp. 51500–51522, 2024

work page 2024
[10]

Content-driven music recommendation: Evolution, state of the art, and challenges,

Y. Deldjoo, M. Schedl, and P. Knees, “Content-driven music recommendation: Evolution, state of the art, and challenges,” Computer Science Review, vol. 51, pp. 100618, 2024

work page 2024
[11]

Music Style Transfer: A Position Paper

S. Dai, Z. Zhang, and G. G. Xia, “Music style transfer: A position paper,” arXiv preprint arXiv:1803.06841, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Music style transfer with time-varying inversion of diffusion models,

S. Li, Y. Zhang, F. Tang, C. Ma, W. Dong, and C. Xu, “Music style transfer with time-varying inversion of diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 547–555

work page 2024
[13]

Make your favorite music curative: Music style transfer for anxiety reduction,

Z. Hu, Y. Liu, G. Chen, S.-h. Zhong, and A. Zhang, “Make your favorite music curative: Music style transfer for anxiety reduction,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1189–1197

work page 2020
[14]

Simple and controllable music generation,

J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 47704–47720, 2023

work page 2023
[15]

Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,

J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and S. Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in Proceedings of the 25th International Society for Music Information Retrieval Conference. Nov. 2024, pp. 272–280, ISMIR

work page 2024
[16]

Groove2groove: One-shot music style transfer with supervision from synthetic data,

O. Cífka, U. Şimşekli, and G. Richard, “Groove2groove: One-shot music style transfer with supervision from synthetic data,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2638–2650, 2020

work page 2020
[17]

Text2fx: Harnessing clap embeddings for text-guided audio effects,

A. Chu, P. O’Reilly, J. Barnett, and B. Pardo, “Text2fx: Harnessing clap embeddings for text-guided audio effects,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

work page 2025
[18]

GD-Retriever: Controllable generative text-music retrieval with diffusion models,

J. Guinot, E. Quinton, and G. Fazekas, “GD-Retriever: Controllable generative text-music retrieval with diffusion models,” in Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), 2025

work page 2025
[19]

Leave-one-equivariant: Alleviating invariance-related information loss in contrastive music representations,

J. Guinot, E. Quinton, and G. Fazekas, “Leave-one-equivariant: Alleviating invariance-related information loss in contrastive music representations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

work page 2025
[20]

Disentangled multidimensional metric learning for music similarity,

J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, “Disentangled multidimensional metric learning for music similarity,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6–10

work page 2020
[21]

Balancing information preservation and disentanglement in self-supervised music representation learning,

J. Wilkins, S. Ding, M. Fuentes, and J. P. Bello, “Balancing information preservation and disentanglement in self-supervised music representation learning,” arXiv preprint arXiv:2507.22995, 2025

work page arXiv 2025
[22]

Unsupervised pitch-timbre-variation disentanglement of monophonic music signals based on random perturbation and re-entry training,

K. Tanaka, K. Yoshii, S. Dixon, and S. Morishima, “Unsupervised pitch-timbre-variation disentanglement of monophonic music signals based on random perturbation and re-entry training,” APSIPA Transactions on Signal and Information Processing, 2025

work page 2025
[23]

Similar but faster: Manipulation of tempo in music audio embeddings for tempo prediction and search,

M. C. McCallum, F. Henkel, J. Kim, S. E. Sandberg, and M. E. P. Davies, “Similar but faster: Manipulation of tempo in music audio embeddings for tempo prediction and search,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 686–690

work page 2024
[24]

The MTG-Jamendo dataset for automatic music tagging,

D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The MTG-Jamendo dataset for automatic music tagging,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States, 2019

work page 2019
[25]

Supervised and unsupervised learning of audio representations for music understanding,

M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, and A. Ehmann, “Supervised and unsupervised learning of audio representations for music understanding,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference. Dec. 2022, pp. 256–263, ISMIR

work page 2022
[26]

Facenet: A unified embedding for face recognition and clustering,

F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823

work page 2015
[27]

A simple framework for contrastive learning of visual representations,

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607

work page 2020
[28]

A circumplex model of affect,

J. A. Russell, “A circumplex model of affect,” Journal of personality and social psychology, vol. 39, no. 6, pp. 1161, 1980

work page 1980
