Masked Contrastive Pre-Training Improves Music Audio Key Detection
Pith reviewed 2026-05-10 16:20 UTC · model grok-4.3
The pith
Masked contrastive pre-training on Mel spectrograms produces embeddings that reach state-of-the-art music key detection with simple classifiers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked contrastive pre-training on Mel spectrograms yields embeddings with high pitch sensitivity. These embeddings support competitive key detection accuracy under linear evaluation and enable state-of-the-art supervised performance when paired with shallow but wide multi-layer perceptrons. The same representations exhibit built-in robustness to common audio augmentations, establishing masked contrastive pre-training as an effective route for pitch-sensitive music information retrieval tasks.
What carries the argument
Masked contrastive pre-training applied to Mel spectrograms, which produces pitch-sensitive embeddings by contrasting independently masked views of the same audio; random token masking stands in for the usual augmentation pipeline.
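For concreteness, here is a minimal sketch of the masking-plus-contrastive idea in PyTorch: two independently masked views of the same batch of Mel-spectrogram patch tokens are encoded and pulled together with an InfoNCE loss. The toy encoder, patch shape, keep ratio, and temperature are illustrative assumptions, not the authors' Myna configuration.

```python
import torch
import torch.nn.functional as F

def random_token_mask(tokens, keep_ratio=0.5):
    """Keep a random subset of patch tokens; masking is the sole 'augmentation'."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]      # random subset per example
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE between two views; matching rows are the positive pairs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

# Toy stand-in for a ViT encoder: per-token MLP, then mean-pool to one embedding.
encoder = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                              torch.nn.Linear(256, 128))

patches = torch.randn(8, 64, 128)   # (batch, patch tokens, patch dim) from Mel spectrograms
v1 = encoder(random_token_mask(patches)).mean(dim=1)       # view 1: one random masking
v2 = encoder(random_token_mask(patches)).mean(dim=1)       # view 2: another random masking
loss = info_nce(v1, v2)
loss.backward()
```

Because masking never shifts frequency content, the positive pairs always agree on pitch, which is one plausible reason this objective preserves pitch information better than pitch-shifting augmentations would.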
If this is right
- Linear evaluation after masked contrastive pre-training yields competitive key detection accuracy without fine-tuning the encoder.
- Shallow wide MLPs trained on the extracted embeddings achieve state-of-the-art results while avoiding complex augmentation strategies (see the probe sketch after this list).
- The learned representations naturally encode robustness to standard music audio augmentations.
- Self-supervised pre-training can serve as an effective foundation for other pitch-sensitive music information retrieval tasks.
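A minimal sketch of the two evaluation regimes in the first two bullets, assuming a 24-class major/minor key taxonomy, a 128-dimensional embedding, and an arbitrary hidden width of 4096; the paper's actual dimensions, datasets, and optimization details are not reproduced here. In both cases the pre-trained encoder stays frozen and only the head is trained.

```python
import torch
import torch.nn as nn

D, N_KEYS = 128, 24                        # embedding dim and 12 tonics x {major, minor}: assumed
emb = torch.randn(512, D)                  # stand-in for frozen embeddings from the encoder
labels = torch.randint(0, N_KEYS, (512,))  # stand-in key labels

linear_probe = nn.Linear(D, N_KEYS)        # "linear evaluation": a single linear layer
wide_mlp = nn.Sequential(                  # "shallow but wide": one hidden layer, large width
    nn.Linear(D, 4096), nn.ReLU(), nn.Linear(4096, N_KEYS))

for head in (linear_probe, wide_mlp):
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(100):                   # the embeddings never change; only the head learns
        opt.zero_grad()
        nn.functional.cross_entropy(head(emb), labels).backward()
        opt.step()
```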
Where Pith is reading between the lines
- Music foundation model designers could prioritize masked contrastive objectives when targeting tasks that require fine-grained pitch information.
- The same embeddings may transfer to neighboring tasks such as chord recognition or melody extraction with minimal additional tuning.
- Probing experiments could isolate which acoustic features the masking-plus-contrastive process captures most effectively.
- Similar pre-training patterns might improve pitch sensitivity in non-music audio domains that rely on precise frequency content.
Load-bearing premise
The performance gains in key detection stem specifically from the masked contrastive pre-training design rather than from unmeasured differences in model size, training data, or evaluation details.
What would settle it
Apply a non-masked contrastive or reconstruction-based pre-training method to the same base model and data, then measure whether the resulting embeddings reach equal or higher key detection accuracy under the identical supervised evaluation protocol.
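A sketch of what such a controlled comparison could look like, with everything except the pre-training objective held fixed: same architecture, same initialization, same unlabeled data, and the same frozen-feature probe. All shapes, objectives, and training budgets are stand-ins; a real study would swap in the paper's model and benchmark datasets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pretrain_x = torch.randn(256, 32, 128)     # shared unlabeled Mel-patch batches (stand-in)
key_x = torch.randn(128, 32, 128)          # labeled key-detection set (stand-in)
key_y = torch.randint(0, 24, (128,))

def make_encoder():
    torch.manual_seed(1)                   # identical init and capacity in every arm
    return torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                               torch.nn.Linear(256, 128))

def masked_contrastive_loss(enc, x):
    def view(t):                           # zero out random tokens (simplified masking)
        keep = (torch.rand(t.shape[:2]) > 0.5).float().unsqueeze(-1)
        return enc(t * keep).mean(dim=1)
    z1 = F.normalize(view(x), dim=-1)
    z2 = F.normalize(view(x), dim=-1)
    return F.cross_entropy(z1 @ z2.t() / 0.1, torch.arange(x.size(0)))

decoder = torch.nn.Linear(128, 128)        # fixed random decoder; the comparison is about encoders
def reconstruction_loss(enc, x):           # non-masked, reconstruction-based alternative
    return F.mse_loss(decoder(enc(x)), x)

def probe_accuracy(enc):                   # identical downstream protocol for both arms
    with torch.no_grad():
        feats = enc(key_x).mean(dim=1)
    head = torch.nn.Linear(128, 24)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(head(feats), key_y).backward()
        opt.step()
    return (head(feats).argmax(-1) == key_y).float().mean().item()

for name, loss_fn in [("masked_contrastive", masked_contrastive_loss),
                      ("reconstruction", reconstruction_loss)]:
    enc = make_encoder()
    opt = torch.optim.AdamW(enc.parameters(), lr=1e-3)
    for _ in range(100):                   # identical budget per arm
        opt.zero_grad()
        loss_fn(enc, pretrain_x).backward()
        opt.step()
    print(name, probe_accuracy(enc))
```

On real data, equal probe accuracy across arms would undercut the attribution to the masked contrastive design; a clear gap in favor of the masked arm would support it.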
read the original abstract
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.
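One way to operationalize the abstract's robustness claim is an invariance probe: embed clean and augmented versions of the same clips and compare cosine similarities. The sketch below is not the paper's analysis (which uses T-SNE projections); it applies hypothetical perturbations directly in log-Mel patch space for brevity, whereas a faithful probe would augment the waveform before the spectrogram frontend.

```python
import torch
import torch.nn.functional as F

def embed(encoder, mel):                       # (batch, patches, dim) -> (batch, dim), unit norm
    return F.normalize(encoder(mel).mean(dim=1), dim=-1)

# Hypothetical perturbations in log-Mel space, standing in for audio augmentations.
def with_gain(mel, db=3.0):
    return mel + db                            # uniform offset in log space ~ a gain change
def with_noise(mel, sigma=0.1):
    return mel + sigma * torch.randn_like(mel)

encoder = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                              torch.nn.Linear(256, 128))  # stand-in for the pre-trained model
mel = torch.randn(16, 64, 128)
z = embed(encoder, mel)
for name, aug in [("gain", with_gain), ("noise", with_noise)]:
    z_aug = embed(encoder, aug(mel))
    print(name, (z * z_aug).sum(dim=-1).mean().item())     # mean cosine similarity; ~1.0 = invariant
```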
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that masked contrastive pre-training on Mel spectrograms produces pitch-sensitive representations enabling competitive key detection performance under linear evaluation out of the box, and state-of-the-art results when a shallow wide MLP is trained on the frozen embeddings in the supervised setting. It further claims this approach requires no sophisticated data augmentations, that the representations naturally encode common augmentations, and that the pre-training design directly impacts pitch sensitivity, establishing self-supervised pre-training as effective for pitch-sensitive MIR tasks.
Significance. If the central empirical claims hold after proper controls, the work would be significant for music information retrieval by providing the first systematic evidence that pre-training objective choice affects pitch sensitivity in music foundation models, with direct implications for key detection and related tasks where prior self-supervised models have underperformed. The robustness analysis and observation that representations encode augmentations without explicit training would offer useful design insights for future models.
major comments (1)
- [Abstract] The claim that masked contrastive embeddings 'uniquely enable' SOTA performance in the supervised setting is load-bearing for the paper's contribution but rests on an unablated attribution. No controls are described that hold model capacity, training data volume, evaluation splits, and downstream protocol fixed while varying only the pre-training objective (e.g., masked contrastive vs. reconstruction or supervised baselines). Without these isolations, performance deltas cannot be attributed specifically to the masked contrastive design rather than to the shallow wide MLP, data scale, or other unstated factors.
minor comments (1)
- [Abstract] The description of 'competitive performance' and 'SOTA' would be strengthened by immediate mention of the exact metrics (e.g., accuracy or weighted accuracy), datasets, and number of runs or statistical tests used, as these details are required to evaluate the linear-evaluation and MLP results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment on the abstract below.
read point-by-point responses
- Referee: [Abstract] The claim that masked contrastive embeddings 'uniquely enable' SOTA performance in the supervised setting is load-bearing for the paper's contribution but rests on an unablated attribution. No controls are described that hold model capacity, training data volume, evaluation splits, and downstream protocol fixed while varying only the pre-training objective (e.g., masked contrastive vs. reconstruction or supervised baselines). Without these isolations, performance deltas cannot be attributed specifically to the masked contrastive design rather than to the shallow wide MLP, data scale, or other unstated factors.
- Authors: We acknowledge the referee's point that the term 'uniquely enable' implies a stronger causal attribution than our experiments strictly isolate. Our manuscript presents comparisons to prior self-supervised approaches via linear evaluation on the same key detection benchmarks and shows that masked contrastive pre-training on Mel spectrograms yields competitive out-of-the-box performance without the augmentation policies used in other works. The supervised MLP results build on frozen embeddings from this pre-training. However, we did not include a controlled ablation that fixes model capacity, exact training data volume, splits, and protocol while swapping only the pre-training objective against reconstruction or supervised baselines. To address this, we will revise the abstract to replace 'uniquely enable' with 'enable' and add a limitations paragraph discussing potential confounding factors such as data scale and model architecture. We maintain that the systematic analysis of pitch sensitivity through masking and contrastive objectives, together with the robustness results showing natural encoding of augmentations, supports the broader claim that pre-training design impacts pitch-sensitive MIR tasks.
- Revision: partial
Circularity Check
No circularity; claims rest on reported empirical comparisons without self-referential definitions or load-bearing self-citations
full rationale
The paper describes an empirical pipeline: masked contrastive pre-training on Mel spectrograms produces embeddings, linear evaluation shows competitive key detection, and a shallow wide MLP on frozen features reaches SOTA. No equations, uniqueness theorems, or ansatzes are invoked that reduce the performance result to a fit or prior self-citation by construction. The abstract and described study contain no self-definitional steps (e.g., no parameter fitted to the target metric then renamed as prediction) and no load-bearing self-citations that presuppose the SOTA outcome. Attribution of gains to the pre-training objective versus model capacity or protocol is an experimental-design question, not a circularity in the derivation chain. The work is self-contained against external benchmarks via reported comparisons.
Reference graph
Works this paper leans on
- [1] INTRODUCTION: The key of a musical piece defines its tonal center and harmonic structure, shaping tension, resolution, and overall coherence. Accurate key detection is thus a fundamental task in Music Information Retrieval (MIR), with applications in playlist generation, DJ mixing, and large-scale music similarity search. These use cases demand robus...
- [2] RELATED WORK: Music key detection has long been a core challenge in MIR. Prior work can be grouped into traditional template matching methods, end-to-end deep learning models, and more recent foundation models. 2.1. Traditional Approaches: Early methods relied on template matching, where time-frequency features such as chromagrams or spectrograms are compa...
- [3] METHOD, 3.1. Myna Framework: Myna is a simple contrastive learning framework that uses token masking as its sole augmentation, originally designed for efficient music representation learning [15]. It replaces traditional augmentations (e.g., pitch shifting, delay, reverb) with random patch masking (Figure 1). This strategy preserves pitch while improving ...
- [4] RESULTS: As shown in Table 1, KeyMyna outperforms InceptionKeyNet despite using less data, a simpler architecture, and minimal augmentation (only pitch shifting). [Footnote 1: http://www.cp.jku.at/people/korzeniowski/bb.zip] Fig. 2 caption: Myna-Vertical is robust to augmentations: we show T-SNE projections of 100 randomly-selected samples from the GTZAN d...
- [5] LIMITATIONS AND FUTURE WORK, 5.1. Limitations: KeyMyna in its current form is only able to track a global key, meaning it is unable to track key modulations within a song, as its predictions are aggregated via averaging. This limitation is manageable for many genres, such as pop, rock, and electronic music, but struggles with pieces that feature key modulat...
- [6] CONCLUSION: We presented KeyMyna, a systematic study of self-supervised pretraining for music key detection. Using Myna-Vertical, a ViT model trained on Mel spectrograms with vertical patches, we showed that shallow MLPs trained on frozen embeddings achieve state-of-the-art results on key detection benchmarks. Our findings demonstrate that masked contrasti...
- [7] Ibrahim Sha’ath, “Estimation of key in digital music recordings,” Master’s Thesis, 2011.
- [8] Filip Korzeniowski and Gerhard Widmer, “End-to-end musical key estimation using a convolutional neural network,” in Proceedings of the 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 966–970.
- [9] Filip Korzeniowski and Gerhard Widmer, “Genre-agnostic key classification with convolutional neural networks,” in Proceedings of the International Society for Music Information Retrieval Conference, Paris, France, 2018.
- [10] Stefan A. Baumann, “Deeper Convolutional Neural Networks and Broad Augmentation Policies Improve Performance in Musical Key Estimation,” in Proceedings of the International Society for Music Information Retrieval Conference, Online, Nov. 2021, pp. 42–49, ISMIR.
- [11] Steffen Pauws, “Musical key extraction from audio,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2004.
- [12] David Temperley, “What’s key for key? The Krumhansl-Schmuckler key-finding algorithm reconsidered,” Music Perception, vol. 17, no. 1, pp. 65–100, 1999.
- [13] Katy Noland and Mark Sandler, “Signal processing parameters for tonality estimation,” in Proceedings of the Audio Engineering Society (AES) Convention, Audio Engineering Society, 2007.
- [14] Ángel Faraldo, Emilia Gómez, Sergi Jordà, and Perfecto Herrera, “Key estimation in electronic dance music,” in Proceedings of the European Conference on Information Retrieval (ECIR), Padua, Italy, Springer, 2016, pp. 335–347.
- [15] Joshua Albrecht and Daniel Shanahan, “The use of large corpora to train a new type of key-finding algorithm: An improved treatment of the minor mode,” Music Perception: An Interdisciplinary Journal, vol. 31, no. 1, pp. 59–67, 2013.
- [16] Yizhi LI, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, and Jie Fu, “MERT: Acoustic music understanding model with large-scale self-supervised training,” in Proceedings..., 2024.
- [17] Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, and Andreas F. Ehmann, “Supervised and unsupervised learning of audio representations for music understanding,” 2022.
- [18] Minz Won, Yun-Ning Hung, and Duc Le, “A foundation model for music informatics,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1226–1230.
- [19] Janne Spijkervet and John Ashley Burgoyne, “Contrastive learning of musical representations,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 673–681.
- [20] Rodrigo Castellon, Chris Donahue, and Percy Liang, “Codified audio language modeling learns useful representations for music information retrieval,” in Proceedings of the International Society for Music Information Retrieval Conference, 2021.
- [21] Ori Yonay, Tracy Hammond, and Tianbao Yang, “Myna: Masking-based contrastive learning of musical representations,” arXiv preprint arXiv:2502.12511, 2025.
- [22] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton, “A simple framework for contrastive learning of visual representations,” CoRR, vol. abs/2002.05709, 2020.
- [23] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov, “Better plain ViT baselines for ImageNet-1k,” 2022.
- [24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- [25] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer, “Scaling vision transformers,” CoRR, vol. abs/2106.04560, 2021.
- [26] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
- [27] Peter Knees, Ángel Faraldo, Perfecto Herrera, Richard Vogl, Sebastian Böck, Florian Hörschläger, and Mickael Le Goff, “Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections,” in Proceedings of the International Society for Music Information Retrieval Conference, Málaga, Spain, October 2015.