Masked Contrastive Pre-Training Improves Music Audio Key Detection
Pith reviewed 2026-05-10 16:20 UTC · model grok-4.3
The pith
Masked contrastive pre-training on Mel spectrograms produces embeddings that reach state-of-the-art music key detection with simple classifiers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked contrastive pre-training on Mel spectrograms yields embeddings with high pitch sensitivity. These embeddings support competitive key detection accuracy under linear evaluation and enable state-of-the-art supervised performance when paired with shallow but wide multi-layer perceptrons. The same representations exhibit built-in robustness to common audio augmentations, establishing masked contrastive pre-training as an effective route for pitch-sensitive music information retrieval tasks.
What carries the argument
Masked contrastive pre-training applied to Mel spectrograms, which produces pitch-sensitive embeddings by contrasting independently masked views of the same audio; random token masking stands in for the usual augmentation pipeline.
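For concreteness, here is a minimal sketch of the masking-plus-contrastive idea in PyTorch: two independently masked views of the same batch of Mel-spectrogram patch tokens are encoded and pulled together with an InfoNCE loss. The toy encoder, patch shape, keep ratio, and temperature are illustrative assumptions, not the authors' Myna configuration.

```python
import torch
import torch.nn.functional as F

def random_token_mask(tokens, keep_ratio=0.5):
    """Keep a random subset of patch tokens; masking is the sole 'augmentation'."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]      # random subset per example
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE between two views; matching rows are the positive pairs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

# Toy stand-in for a ViT encoder: per-token MLP, then mean-pool to one embedding.
encoder = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                              torch.nn.Linear(256, 128))

patches = torch.randn(8, 64, 128)   # (batch, patch tokens, patch dim) from Mel spectrograms
v1 = encoder(random_token_mask(patches)).mean(dim=1)       # view 1: one random masking
v2 = encoder(random_token_mask(patches)).mean(dim=1)       # view 2: another random masking
loss = info_nce(v1, v2)
loss.backward()
```

Because masking never shifts frequency content, the positive pairs always agree on pitch, which is one plausible reason this objective preserves pitch information better than pitch-shifting augmentations would.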
If this is right
- Linear evaluation after masked contrastive pre-training yields competitive key detection accuracy without fine-tuning the encoder.
- Shallow wide MLPs trained on the extracted embeddings achieve state-of-the-art results while avoiding complex augmentation strategies (see the probe sketch after this list).
- The learned representations naturally encode robustness to standard music audio augmentations.
- Self-supervised pre-training can serve as an effective foundation for other pitch-sensitive music information retrieval tasks.
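A minimal sketch of the two evaluation regimes in the first two bullets, assuming a 24-class major/minor key taxonomy, a 128-dimensional embedding, and an arbitrary hidden width of 4096; the paper's actual dimensions, datasets, and optimization details are not reproduced here. In both cases the pre-trained encoder stays frozen and only the head is trained.

```python
import torch
import torch.nn as nn

D, N_KEYS = 128, 24                        # embedding dim and 12 tonics x {major, minor}: assumed
emb = torch.randn(512, D)                  # stand-in for frozen embeddings from the encoder
labels = torch.randint(0, N_KEYS, (512,))  # stand-in key labels

linear_probe = nn.Linear(D, N_KEYS)        # "linear evaluation": a single linear layer
wide_mlp = nn.Sequential(                  # "shallow but wide": one hidden layer, large width
    nn.Linear(D, 4096), nn.ReLU(), nn.Linear(4096, N_KEYS))

for head in (linear_probe, wide_mlp):
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(100):                   # the embeddings never change; only the head learns
        opt.zero_grad()
        nn.functional.cross_entropy(head(emb), labels).backward()
        opt.step()
```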
Where Pith is reading between the lines
- Music foundation model designers could prioritize masked contrastive objectives when targeting tasks that require fine-grained pitch information.
- The same embeddings may transfer to neighboring tasks such as chord recognition or melody extraction with minimal additional tuning.
- Probing experiments could isolate which acoustic features the masking-plus-contrastive process captures most effectively.
- Similar pre-training patterns might improve pitch sensitivity in non-music audio domains that rely on precise frequency content.
Load-bearing premise
The performance gains in key detection stem specifically from the masked contrastive pre-training design rather than from unmeasured differences in model size, training data, or evaluation details.
What would settle it
Apply a non-masked contrastive or reconstruction-based pre-training method to the same base model and data, then measure whether the resulting embeddings reach equal or higher key detection accuracy under the identical supervised evaluation protocol.
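A sketch of what such a controlled comparison could look like, with everything except the pre-training objective held fixed: same architecture, same initialization, same unlabeled data, and the same frozen-feature probe. All shapes, objectives, and training budgets are stand-ins; a real study would swap in the paper's model and benchmark datasets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pretrain_x = torch.randn(256, 32, 128)     # shared unlabeled Mel-patch batches (stand-in)
key_x = torch.randn(128, 32, 128)          # labeled key-detection set (stand-in)
key_y = torch.randint(0, 24, (128,))

def make_encoder():
    torch.manual_seed(1)                   # identical init and capacity in every arm
    return torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                               torch.nn.Linear(256, 128))

def masked_contrastive_loss(enc, x):
    def view(t):                           # zero out random tokens (simplified masking)
        keep = (torch.rand(t.shape[:2]) > 0.5).float().unsqueeze(-1)
        return enc(t * keep).mean(dim=1)
    z1 = F.normalize(view(x), dim=-1)
    z2 = F.normalize(view(x), dim=-1)
    return F.cross_entropy(z1 @ z2.t() / 0.1, torch.arange(x.size(0)))

decoder = torch.nn.Linear(128, 128)        # fixed random decoder; the comparison is about encoders
def reconstruction_loss(enc, x):           # non-masked, reconstruction-based alternative
    return F.mse_loss(decoder(enc(x)), x)

def probe_accuracy(enc):                   # identical downstream protocol for both arms
    with torch.no_grad():
        feats = enc(key_x).mean(dim=1)
    head = torch.nn.Linear(128, 24)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(head(feats), key_y).backward()
        opt.step()
    return (head(feats).argmax(-1) == key_y).float().mean().item()

for name, loss_fn in [("masked_contrastive", masked_contrastive_loss),
                      ("reconstruction", reconstruction_loss)]:
    enc = make_encoder()
    opt = torch.optim.AdamW(enc.parameters(), lr=1e-3)
    for _ in range(100):                   # identical budget per arm
        opt.zero_grad()
        loss_fn(enc, pretrain_x).backward()
        opt.step()
    print(name, probe_accuracy(enc))
```

On real data, equal probe accuracy across arms would undercut the attribution to the masked contrastive design; a clear gap in favor of the masked arm would support it.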
read the original abstract
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
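One way to operationalize the abstract's robustness claim is an invariance probe: embed clean and augmented versions of the same clips and compare cosine similarities. The sketch below is not the paper's analysis (which uses T-SNE projections); it applies hypothetical perturbations directly in log-Mel patch space for brevity, whereas a faithful probe would augment the waveform before the spectrogram frontend.

```python
import torch
import torch.nn.functional as F

def embed(encoder, mel):                       # (batch, patches, dim) -> (batch, dim), unit norm
    return F.normalize(encoder(mel).mean(dim=1), dim=-1)

# Hypothetical perturbations in log-Mel space, standing in for audio augmentations.
def with_gain(mel, db=3.0):
    return mel + db                            # uniform offset in log space ~ a gain change
def with_noise(mel, sigma=0.1):
    return mel + sigma * torch.randn_like(mel)

encoder = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                              torch.nn.Linear(256, 128))  # stand-in for the pre-trained model
mel = torch.randn(16, 64, 128)
z = embed(encoder, mel)
for name, aug in [("gain", with_gain), ("noise", with_noise)]:
    z_aug = embed(encoder, aug(mel))
    print(name, (z * z_aug).sum(dim=-1).mean().item())     # mean cosine similarity; ~1.0 = invariant
```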
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that masked contrastive pre-training on Mel spectrograms produces pitch-sensitive representations enabling competitive key detection performance under linear evaluation out of the box, and state-of-the-art results when a shallow wide MLP is trained on the frozen embeddings in the supervised setting. It further claims this approach requires no sophisticated data augmentations, that the representations naturally encode common augmentations, and that the pre-training design directly impacts pitch sensitivity, establishing self-supervised pre-training as effective for pitch-sensitive MIR tasks.
Significance. If the central empirical claims hold after proper controls, the work would be significant for music information retrieval by providing the first systematic evidence that pre-training objective choice affects pitch sensitivity in music foundation models, with direct implications for key detection and related tasks where prior self-supervised models have underperformed. The robustness analysis and observation that representations encode augmentations without explicit training would offer useful design insights for future models.
major comments (1)
- [Abstract] The claim that masked contrastive embeddings 'uniquely enable' SOTA performance in the supervised setting is load-bearing for the paper's contribution but rests on an unablated attribution. No controls are described that hold model capacity, training data volume, evaluation splits, and downstream protocol fixed while varying only the pre-training objective (e.g., masked contrastive vs. reconstruction or supervised baselines). Without these isolations, performance deltas cannot be attributed specifically to the masked contrastive design rather than to the shallow wide MLP, data scale, or other unstated factors.
minor comments (1)
- [Abstract] The description of 'competitive performance' and 'SOTA' would be strengthened by immediate mention of the exact metrics (e.g., accuracy or weighted accuracy), datasets, and number of runs or statistical tests used, as these details are required to evaluate the linear-evaluation and MLP results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment on the abstract below.
read point-by-point responses
- Referee: [Abstract] The claim that masked contrastive embeddings 'uniquely enable' SOTA performance in the supervised setting is load-bearing for the paper's contribution but rests on an unablated attribution. No controls are described that hold model capacity, training data volume, evaluation splits, and downstream protocol fixed while varying only the pre-training objective (e.g., masked contrastive vs. reconstruction or supervised baselines). Without these isolations, performance deltas cannot be attributed specifically to the masked contrastive design rather than to the shallow wide MLP, data scale, or other unstated factors.
- Authors: We acknowledge the referee's point that the term 'uniquely enable' implies a stronger causal attribution than our experiments strictly isolate. Our manuscript presents comparisons to prior self-supervised approaches via linear evaluation on the same key detection benchmarks and shows that masked contrastive pre-training on Mel spectrograms yields competitive out-of-the-box performance without the augmentation policies used in other works. The supervised MLP results build on frozen embeddings from this pre-training. However, we did not include a controlled ablation that fixes model capacity, exact training data volume, splits, and protocol while swapping only the pre-training objective against reconstruction or supervised baselines. To address this, we will revise the abstract to replace 'uniquely enable' with 'enable' and add a limitations paragraph discussing potential confounding factors such as data scale and model architecture. We maintain that the systematic analysis of pitch sensitivity through masking and contrastive objectives, together with the robustness results showing natural encoding of augmentations, supports the broader claim that pre-training design impacts pitch-sensitive MIR tasks.
- Revision: partial
Circularity Check
No circularity; claims rest on reported empirical comparisons without self-referential definitions or load-bearing self-citations
full rationale
The paper describes an empirical pipeline: masked contrastive pre-training on Mel spectrograms produces embeddings, linear evaluation shows competitive key detection, and a shallow wide MLP on frozen features reaches SOTA. No equations, uniqueness theorems, or ansatzes are invoked that reduce the performance result to a fit or prior self-citation by construction. The abstract and described study contain no self-definitional steps (e.g., no parameter fitted to the target metric then renamed as prediction) and no load-bearing self-citations that presuppose the SOTA outcome. Attribution of gains to the pre-training objective versus model capacity or protocol is an experimental-design question, not a circularity in the derivation chain. The work is self-contained against external benchmarks via reported comparisons.
Reference graph
Works this paper leans on
- [1] INTRODUCTION: The key of a musical piece defines its tonal center and harmonic structure, shaping tension, resolution, and overall coherence. Accurate key detection is thus a fundamental task in Music Information Retrieval (MIR), with applications in playlist generation, DJ mixing, and large-scale music similarity search. These use cases demand robus...
- [2] RELATED WORK: Music key detection has long been a core challenge in MIR. Prior work can be grouped into traditional template matching methods, end-to-end deep learning models, and more recent foundation models. 2.1. Traditional Approaches: Early methods relied on template matching, where time-frequency features such as chromagrams or spectrograms are compa...
- [3] METHOD, 3.1. Myna Framework: Myna is a simple contrastive learning framework that uses token masking as its sole augmentation, originally designed for efficient music representation learning [15]. It replaces traditional augmentations (e.g., pitch shifting, delay, reverb) with random patch masking (Figure 1). This strategy preserves pitch while improving ...
- [4] RESULTS: As shown in Table 1, KeyMyna outperforms InceptionKeyNet despite using less data, a simpler architecture, and minimal augmentation (only pitch shifting). [Footnote 1: http://www.cp.jku.at/people/korzeniowski/bb.zip] Fig. 2 caption: Myna-Vertical is robust to augmentations: we show T-SNE projections of 100 randomly-selected samples from the GTZAN d...
- [5] LIMITATIONS AND FUTURE WORK, 5.1. Limitations: KeyMyna in its current form is only able to track a global key, meaning it is unable to track key modulations within a song, as its predictions are aggregated via averaging. This limitation is manageable for many genres, such as pop, rock, and electronic music, but struggles with pieces that feature key modulat...
- [6] CONCLUSION: We presented KeyMyna, a systematic study of self-supervised pretraining for music key detection. Using Myna-Vertical, a ViT model trained on Mel spectrograms with vertical patches, we showed that shallow MLPs trained on frozen embeddings achieve state-of-the-art results on key detection benchmarks. Our findings demonstrate that masked contrasti...
- [7] Ibrahim Sha’ath, “Estimation of key in digital music recordings,” Master’s Thesis, 2011.
- [8] Filip Korzeniowski and Gerhard Widmer, “End-to-end musical key estimation using a convolutional neural network,” in Proceedings of the 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 966–970.
- [9] Filip Korzeniowski and Gerhard Widmer, “Genre-agnostic key classification with convolutional neural networks,” in Proceedings of the International Society for Music Information Retrieval Conference, Paris, France, 2018.
- [10] Stefan A. Baumann, “Deeper Convolutional Neural Networks and Broad Augmentation Policies Improve Performance in Musical Key Estimation,” in Proceedings of the International Society for Music Information Retrieval Conference, Online, Nov. 2021, pp. 42–49, ISMIR.
- [11] Steffen Pauws, “Musical key extraction from audio,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2004.
- [12] David Temperley, “What’s key for key? The Krumhansl-Schmuckler key-finding algorithm reconsidered,” Music Perception, vol. 17, no. 1, pp. 65–100, 1999.
- [13] Katy Noland and Mark Sandler, “Signal processing parameters for tonality estimation,” in Proceedings of the Audio Engineering Society (AES) Convention, Audio Engineering Society, 2007.
- [14] Ángel Faraldo, Emilia Gómez, Sergi Jordà, and Perfecto Herrera, “Key estimation in electronic dance music,” in Proceedings of the European Conference on Information Retrieval (ECIR), Padua, Italy, Springer, 2016, pp. 335–347.
- [15] Joshua Albrecht and Daniel Shanahan, “The use of large corpora to train a new type of key-finding algorithm: An improved treatment of the minor mode,” Music Perception: An Interdisciplinary Journal, vol. 31, no. 1, pp. 59–67, 2013.
- [16] Yizhi LI, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, and Jie Fu, “MERT: Acoustic music understanding model with large-scale self-supervised training,” in Proceedings..., 2024.
- [17] Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, and Andreas F. Ehmann, “Supervised and unsupervised learning of audio representations for music understanding,” 2022.
- [18] Minz Won, Yun-Ning Hung, and Duc Le, “A foundation model for music informatics,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1226–1230.
- [19] Janne Spijkervet and John Ashley Burgoyne, “Contrastive learning of musical representations,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 673–681.
- [20] Rodrigo Castellon, Chris Donahue, and Percy Liang, “Codified audio language modeling learns useful representations for music information retrieval,” in Proceedings of the International Society for Music Information Retrieval Conference, 2021.
- [21] Ori Yonay, Tracy Hammond, and Tianbao Yang, “Myna: Masking-based contrastive learning of musical representations,” arXiv preprint arXiv:2502.12511, 2025.
- [22] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton, “A simple framework for contrastive learning of visual representations,” CoRR, vol. abs/2002.05709, 2020.
- [23] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov, “Better plain ViT baselines for ImageNet-1k,” 2022.
- [24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- [25] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer, “Scaling vision transformers,” CoRR, vol. abs/2106.04560, 2021.
- [26] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
- [27] Peter Knees, Ángel Faraldo, Perfecto Herrera, Richard Vogl, Sebastian Böck, Florian Hörschläger, and Mickael Le Goff, “Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections,” in Proceedings of the International Society for Music Information Retrieval Conference, Málaga, Spain, October 2015.