Learning a Joint Embedding Space of Monophonic and Mixed Music Signals for Singing Voice
Pith reviewed 2026-05-25 14:59 UTC · model grok-4.3
The pith
Metric learning maps monophonic vocal tracks and mixed music tracks of the same singer into a shared embedding space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a metric learning system that produces a joint embedding space for monophonic and mixed tracks such that tracks from the same singer are closer together than tracks from different singers, trained on synthetic mashup data to enable cross-domain singer identification and query-by-singer without vocal enhancement.
What carries the argument
Metric learning objective that minimizes distance between same-singer monophonic and mixed pairs while maximizing distance to different-singer pairs.
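To make the objective concrete, here is a minimal sketch of one common form of metric learning loss, a hinge-style triplet loss. It is an assumption for illustration: this summary does not state the exact loss the paper uses, and the names `euclidean`, `triplet_loss`, the `margin` value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss (assumed form, not confirmed by the paper).

    The loss is zero once the same-singer pair is closer than the
    different-singer pair by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings: a monophonic track of singer A, a mixed track of singer A,
# and a mixed track of singer B.
mono_a = [1.0, 0.0]
mixed_a = [0.9, 0.1]
mixed_b = [0.0, 1.0]

loss_good = triplet_loss(mono_a, mixed_a, mixed_b)  # same singer used as positive
loss_bad = triplet_loss(mono_a, mixed_b, mixed_a)   # wrong singer used as positive
```

With the anchor taken from one domain and the positive from the other, minimizing this quantity is one way to pull the two domains into a shared space; the paper's reference list also points at n-pair and angular losses, which would differ in detail.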
If this is right
- Cross-domain retrieval becomes possible: a monophonic query retrieves mixed tracks by the same singer.
- Singer identification works across the monophonic and mixed domains.
- No source separation is required to process mixed tracks.
- A system trained only on synthetic data generalizes to real recordings.
Where Pith is reading between the lines
- The approach could extend to other audio domains where clean and noisy versions need alignment, such as speech in noise.
- Performance might improve with larger real-world mixed datasets for fine-tuning.
- Similar joint spaces could support style transfer or singer conversion between domains.
Load-bearing premise
Embeddings trained exclusively on synthetic mashups of monophonic vocals with random accompaniments transfer directly to genuine commercial mixed recordings.
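As a rough illustration of what a synthetic training pair might look like, the sketch below overlays a vocal signal on an accompaniment at a chosen level ratio. It is a simplified assumption, not the paper's mashup pipeline (which, per the excerpts further down, also involves steps such as key detection); `rms`, `mix_at_ratio`, and `vocal_to_acc_db` are hypothetical names.

```python
import math


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_ratio(vocal, accompaniment, vocal_to_acc_db=0.0):
    """Overlay accompaniment on a vocal at a target vocal-to-accompaniment ratio.

    Illustrative only: both inputs are plain lists of samples at the same
    sample rate, trimmed to the shorter of the two.
    """
    n = min(len(vocal), len(accompaniment))
    vocal, accompaniment = vocal[:n], accompaniment[:n]
    # Scale the accompaniment so its RMS sits vocal_to_acc_db below the vocal.
    target_acc_rms = rms(vocal) / (10 ** (vocal_to_acc_db / 20))
    gain = target_acc_rms / max(rms(accompaniment), 1e-12)
    return [v + gain * a for v, a in zip(vocal, accompaniment)]


vocal = [1.0, -1.0, 1.0, -1.0]
accompaniment = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_ratio(vocal, accompaniment, vocal_to_acc_db=0.0)
```

Each (vocal, mixed) pair built this way shares a singer label by construction, which is what makes the data usable for cross-domain training; whether such pairs resemble commercial mixes is exactly the premise in question.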
What would settle it
Measure the precision of retrieving the correct mixed track when querying with a monophonic vocal from the same singer on a held-out set of real commercial recordings; if accuracy matches or exceeds same-domain baselines, the claim holds.
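The measurement described above can be sketched as precision at k over a ranked database. This is a minimal illustration under stated assumptions: cosine similarity as the ranking score, a database of `(singer_id, embedding)` tuples, and the function names are all hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def precision_at_k(query_emb, query_singer, database, k):
    """Fraction of the top-k retrieved tracks sung by the query's singer.

    `database` is a list of (singer_id, embedding) tuples, e.g. embeddings
    of mixed tracks when the query is a monophonic track.
    """
    ranked = sorted(
        database,
        key=lambda item: cosine_similarity(query_emb, item[1]),
        reverse=True,
    )
    hits = sum(1 for singer, _ in ranked[:k] if singer == query_singer)
    return hits / k


# Toy database of mixed-track embeddings for two singers.
database = [
    ("singer_a", [0.9, 0.1]),
    ("singer_a", [0.8, 0.3]),
    ("singer_b", [0.1, 0.9]),
    ("singer_b", [0.2, 0.8]),
]
p_at_2 = precision_at_k([1.0, 0.0], "singer_a", database, k=2)
```

Running the same function with real commercial recordings in `database` and comparing against a same-domain run would give the comparison the paragraph above asks for.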
Original abstract
Previous approaches in singer identification have used one of monophonic vocal tracks or mixed tracks containing multiple instruments, leaving a semantic gap between these two domains of audio. In this paper, we present a system to learn a joint embedding space of monophonic and mixed tracks for singing voice. We use a metric learning method, which ensures that tracks from both domains of the same singer are mapped closer to each other than those of different singers. We train the system on a large synthetic dataset generated by music mashup to reflect real-world music recordings. Our approach opens up new possibilities for cross-domain tasks, e.g., given a monophonic track of a singer as a query, retrieving mixed tracks sung by the same singer from the database. Also, it requires no additional vocal enhancement steps such as source separation. We show the effectiveness of our system for singer identification and query-by-singer in both the same-domain and cross-domain tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a metric learning system to learn a joint embedding space for monophonic vocal tracks and mixed music tracks containing multiple instruments. Trained exclusively on synthetic mashups, the approach is claimed to support same-domain and cross-domain singer identification and query-by-singer retrieval without source separation.
Significance. If the cross-domain transfer holds, the work would enable new retrieval applications in music information retrieval by bridging monophonic and mixed domains. No machine-checked proofs, reproducible code, parameter-free derivations, or falsifiable predictions are present to strengthen the assessment.
major comments (2)
- [Abstract] The claim that 'We show the effectiveness of our system for singer identification and query-by-singer in both the same-domain and cross-domain tasks' is unsupported because the manuscript supplies no quantitative results, ablation studies, real-data validation, or error analysis.
- [Evaluation / Experiments] The central cross-domain claim requires that embeddings learned on synthetic mashups (monophonic vocals + random instrument overlays) generalize to genuine commercial mixed recordings. No held-out evaluation on real commercial recordings distinct from the mashup construction process is reported, leaving domain-shift concerns unaddressed.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our manuscript. We address each major comment below and outline the revisions we plan to make.
-
Referee: [Abstract] The claim that 'We show the effectiveness of our system for singer identification and query-by-singer in both the same-domain and cross-domain tasks' is unsupported because the manuscript supplies no quantitative results, ablation studies, real-data validation, or error analysis.
Authors: While the abstract makes a general claim, the manuscript includes quantitative evaluations in the experimental section demonstrating the performance of the metric learning approach on synthetic mashup data for singer identification and query-by-singer retrieval in same-domain and cross-domain settings. We agree that the abstract should be revised to more precisely indicate the scope of the experiments (i.e., on synthetic data) and that including ablation studies and error analysis would improve the paper. We will update the abstract and expand the relevant sections in the revision.
Revision: partial
-
Referee: [Evaluation / Experiments] The central cross-domain claim requires that embeddings learned on synthetic mashups (monophonic vocals + random instrument overlays) generalize to genuine commercial mixed recordings. No held-out evaluation on real commercial recordings distinct from the mashup construction process is reported, leaving domain-shift concerns unaddressed.
Authors: We recognize the importance of validating the approach on real commercial recordings to address potential domain shifts. Our work focuses on synthetic mashups as a controlled way to generate paired monophonic and mixed data for training and evaluation, allowing us to study the joint embedding without the need for source separation. We will revise the manuscript to include a more explicit discussion of this limitation and the assumptions underlying the mashup-based training data.
Revision: partial
Circularity Check
No significant circularity; standard metric learning on external synthetic data
The paper applies a conventional metric-learning objective (triplet or contrastive loss) to embeddings of monophonic vocals and synthetic mashups; the claimed cross-domain retrieval performance is an empirical outcome of that training rather than a quantity defined by the fitted parameters themselves. No equations reduce the reported metrics to self-referential quantities, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via self-citation. The derivation chain is therefore self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
Learning a joint embedding space of monophonic and mixed music signals for singing voice
INTRODUCTION Singing voice is often at the center of attention in popular music. We can easily observe large public interest in singing voice and singers through the popularity of karaoke industry and singing-oriented television shows. A recent study also showed that some of the most salient components of music are singers (vocals, voice) and lyrics [...
-
[2]
RELATED WORK Cross-domain systems have not yet been examined regarding singing voice analysis. Nonetheless, a common challenge in singer information processing systems is to extract singing voice characteristics from music signals in the presence of background accompaniment music. The most direct way to obtain vocal information is to use monophoni...
-
[3]
METHODS In this section, we describe the data generation pipeline, model configuration and training strategy for learning a joint representation of monophonic and mixed tracks for singing voice. 3.1 Data generation For training cross-domain singer-ID and retrieval systems, a sufficiently large number of monophonic and mixed track pairs per singer is needed....
-
[4]
EXPERIMENTS & EVALUATION 4.1 Test scenarios Two main tasks for evaluation are singer identification and query-by-singer. In both tasks, a music signal to be analyzed (source) is queried to a collection of data (target) to retrieve desired information. Depending on the domain of source and target data, we design three test scenarios: • Mono2Mono: both so...
-
[5]
EMBEDDING SPACE VISUALIZATION We visualize the embedding space learned by the MIXED and CROSS models to understand how they each process monophonic and mixed tracks. From DAMP-Voice and DAMP-Mash dataset, we select 25 singers unseen from the training stage and highlight 10 with colors for better visualization. 20 tracks are plotted for each singer: 10 ...
-
[6]
MOTIVATION FOR FUTURE WORK Improvement on music mashup: Our mashup pipeline has a large room for improvement. Besides errors produced from existing algorithms, such as key detection, more efforts can be put towards mixing two tracks with a good balance as in real-world recordings. A good automatic mashup system can benefit many areas of research in MIR...
-
[7]
CONCLUSION In this paper, we introduced a new problem of cross-domain singer identification and singer-based music retrieval to allow information transfer between monophonic and mixed tracks. Through data generation using music mashup, we were able to train an embedding model to output a joint representation for singing voice from tracks regardless ...
-
[8]
ACKNOWLEDGEMENTS We thank Keunwoo Choi for valuable comments and reviews. This work was supported by Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning (2015R1C1A1A02036962), and by NAVER Corp
-
[9]
Kara1k: a karaoke dataset for cover song identification and singing voice analysis
Yann Bayle, Ladislav Maršík, Martin Rusek, Matthias Robine, Pierre Hanna, Katerina Slaninová, Jan Martinovic, and Jaroslav Pokorný. Kara1k: a karaoke dataset for cover song identification and singing voice analysis. In IEEE International Symposium on Multimedia (ISM), pages 177–184, 2017
-
[10]
Automatic singer identification based on auditory features
Wei Cai, Qiang Li, and Xin Guan. Automatic singer identification based on auditory features. In 2011 Seventh International Conference on Natural Computation, volume 3, pages 1624–1628, 2011
-
[11]
Vocal activity informed singing voice separation with the iKALA dataset
Tak-Shing Chan, Tzu-Chun Yeh, Zhe-Cheng Fan, Hung-Wei Chen, Li Su, Yi-Hsuan Yang, and Roger Jang. Vocal activity informed singing voice separation with the iKALA dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 718–722. IEEE, 2015
-
[12]
Automashupper: Automatic creation of multi-song music mashups
Matthew EP Davies, Philippe Hamel, Kazuyoshi Yoshii, and Masataka Goto. Automashupper: Automatic creation of multi-song music mashups. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1726–1737, 2014
-
[13]
Vocals in music matter: The relevance of vocals in the minds of listeners
Andrew Demetriou, Andreas Jansson, Aparna Kumar, and R Bittner. Vocals in music matter: The relevance of vocals in the minds of listeners. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 514–520, 2018
-
[14]
A multi-view deep learning approach for cross domain user modeling in recommendation systems
Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pages 278–288, 2015
-
[15]
Devise: A deep visual-semantic embedding model
Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013
-
[16]
Singer identification based on accompaniment sound reduction and reliable frame selection
Hiromasa Fujihara, Tetsuro Kitahara, Masataka Goto, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G Okuno. Singer identification based on accompaniment sound reduction and reliable frame selection. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), pages 329–336, 2005
-
[17]
Singing information processing
Masataka Goto. Singing information processing. In Proceedings of the 12th IEEE International Conference on Signal Processing (ICSP), volume 10, pages 2431–2438, 2014
-
[18]
On the improvement of singing voice separation for monaural recordings using the mir-1k dataset
Chao-Ling Hsu and Jyh-Shing Roger Jang. On the improvement of singing voice separation for monaural recordings using the mir-1k dataset. IEEE Transactions on Audio, Speech, and Language Processing, 18(2):310–319, 2010
-
[19]
Eric J Humphrey, Sravana Reddy, Prem Seetharaman, Aparna Kumar, Rachel M Bittner, Andrew Demetriou, Sankalp Gulati, Andreas Jansson, Tristan Jehan, Bernhard Lehner, et al. An introduction to signal processing for singing-voice analysis: High notes in the effort to automate the understanding of vocals in music. IEEE Signal Processing Magazine, 36(1):82–94, 2019
-
[20]
Singer identification in popular music recordings using voice coding features
Youngmu Kim and Brian Whitman. Singer identification in popular music recordings using voice coding features. In Proceedings of the 3rd International Conference on Music Information Retrieval, 2002
- [21]
-
[22]
Sangeun Kum and Juhan Nam. Joint detection and classification of singing voice melody using convolutional recurrent neural networks. Applied Sciences, 9(7), 2019
-
[23]
Mathieu Lagrange, Alexey Ozerov, and Emmanuel Vincent. Robust singer identification in polyphonic music using melody enhancement and uncertainty-based learning. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), 2012
-
[24]
Revisiting singing voice detection: a quantitative review and the future outlook
Kyungyun Lee, Keunwoo Choi, and Juhan Nam. Revisiting singing voice detection: a quantitative review and the future outlook. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018
-
[25]
Rectifier nonlinearities improve neural network acoustic models
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013
-
[26]
Visualizing data using t-sne
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008
-
[27]
Singer identification based on vocal and instrumental models
Namunu Chinthaka Maddage, Changsheng Xu, and Ye Wang. Singer identification based on vocal and instrumental models. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), volume 2, pages 375–378, 2004
-
[28]
librosa/librosa: 0.6.2, August 2018
Brian McFee, Matt McVicar, Stefan Balke, Carl Thomé, Vincent Lostanlen, Colin Raffel, Dana Lee, Oriol Nieto, Eric Battenberg, Dan Ellis, Ryuichi Yamamoto, Josh Moore, WZY, Rachel Bittner, Keunwoo Choi, Pius Friesch, Fabian-Robert Stöter, Matt Vollrath, Siddhartha Kumar, nehz, Simon Waloschek, Seth, Rimvydas Naktinis, Douglas Repetto, Curtis "Fjord" ...
-
[29]
Singer identification in polyphonic music using vocal separation and pattern recognition methods
Annamaria Mesaros, Tuomas Virtanen, and Anssi Klapuri. Singer identification in polyphonic music using vocal separation and pattern recognition methods. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), pages 375–378, 2007
-
[30]
Vocal timbre analysis using latent dirichlet allocation and cross-gender vocal timbre similarity
Tomoyasu Nakano, Kazuyoshi Yoshii, and Masataka Goto. Vocal timbre analysis using latent dirichlet allocation and cross-gender vocal timbre similarity. In Acoustics, Speech and Signal Processing, 2014. ICASSP 2014. IEEE International Conference on, pages 5202–5206, 2014
-
[31]
A hybrid of deep audio feature and i-vector for artist recognition
Jiyoung Park, Donghyun Kim, Jongpil Lee, Sangeun Kum, and Juhan Nam. A hybrid of deep audio feature and i-vector for artist recognition. In Joint Workshop on Machine Learning for Music, International Conference on Machine Learning, 2018
-
[32]
Representation learning of music using artist labels
Jiyoung Park, Jongpil Lee, Jangyeon Park, Jung-Woo Ha, and Juhan Nam. Representation learning of music using artist labels. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018
-
[33]
The MUSDB18 corpus for music separation, December 2017
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017
-
[34]
Speaker verification using adapted gaussian mixture models
Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. Speaker verification using adapted gaussian mixture models. Digital signal processing, 10(1-3):19–41, 2000
-
[35]
Disambiguating music artists at scale with audio metric learning
Jimena Royo-Letelier, Romain Hennequin, Viet-Anh Tran, and Manuel Moussallam. Disambiguating music artists at scale with audio metric learning. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018
-
[36]
Melody extraction from polyphonic music signals: Approaches, applications, and challenges
Justin Salamon, Emilia Gómez, Daniel PW Ellis, and Gaël Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2):118–134, 2014
-
[37]
Data-driven visual similarity for cross-domain image matching
Abhinav Shrivastava, Tomasz Malisiewicz, Abhinav Gupta, and Alexei A Efros. Data-driven visual similarity for cross-domain image matching. In ACM Transactions on Graphics (ToG), volume 30, page 154, 2011
-
[38]
Correlation analyses of encoded music performance
Jeffrey C Smith. Correlation analyses of encoded music performance. 2013
-
[39]
Improved deep metric learning with multi-class n-pair loss objective
Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016
-
[40]
Wave-u-net: A multi-scale neural network for end-to-end audio source separation
Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-u-net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018
-
[41]
Learning from between-class examples for deep sound recognition
Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. Learning from between-class examples for deep sound recognition. In International Conference on Learning Representations (ICLR), 2018
-
[42]
Singing style investigation by residual siamese convolutional neural networks
Cheng-i Wang and George Tzanetakis. Singing style investigation by residual siamese convolutional neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 116–120, 2018
-
[43]
Deep metric learning with angular loss
Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pages 2593–2601, 2017
-
[44]
Embedding label structures for fine-grained feature representation
Xiaofan Zhang, Feng Zhou, Yuanqing Lin, and Shaoting Zhang. Embedding label structures for fine-grained feature representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1114–1123, 2016