Frequency-Aware Self-Supervised Music Representation Learning

Jerry Li; Junan Zhang; Lauri Juvela; Yicheng Gu; Zhizheng Wu

arxiv: 2606.25713 · v2 · pith:OF75XY3Knew · submitted 2026-06-24 · 💻 cs.SD

Pith reviewed 2026-06-25 19:23 UTC · model grok-4.3

classification 💻 cs.SD

keywords self-supervised learning · music information retrieval · spectrograms · JEPA · 2D audio representations · MARBLE benchmark · representation learning

The pith

A visual JEPA trained on 2D spectrograms outperforms 1D sequence models on music information retrieval tasks under linear probing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that music audio should be processed as 2D time-frequency grids rather than flattened 1D sequences because this structure matches how music is produced in MIDI workflows and preserves spatial relationships that current SSL models discard. It adapts a Joint-Embedding Predictive Architecture to directly predict latent embeddings of masked 2D spectrogram patches from visible context, adding music-specific changes to the architecture, training scheme, and inference. The resulting model, PupuJEPA, is evaluated through linear probing on the MARBLE benchmark. Ablation studies and attention-map case studies are presented to show that the 2D approach and modifications yield more effective representations than prior 1D SSL baselines. The central claim is that this frequency-aware 2D predictive setup produces representations that transfer better across multiple MIR tasks.

Core claim

PupuJEPA is a visual Joint-Embedding Predictive Architecture trained directly on 2D spectrograms. It learns by predicting the latent embeddings of masked 2D patches from unmasked context, rather than by applying masked language modeling to 1D sequences, and it adds domain-specific modifications to the model, training, and inference. Under linear probing on the MARBLE benchmark, the resulting representations outperform 1D sequence-based SSL models across multiple MIR tasks, and attention maps indicate that the model captures musically meaningful time-frequency patterns.

What carries the argument

Joint-Embedding Predictive Architecture (JEPA) that predicts latent embeddings of masked 2D spectrogram patches from visible context instead of masked language modeling on 1D sequences.
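To make the objective concrete, here is a minimal sketch of latent prediction on masked 2D patches, written in plain Python. It is not the paper's implementation: `encode` and `predict` are toy stand-ins for the learned encoder and predictor, and the patch size, mask ratio, and all function names are assumptions made for illustration. The point is only that the loss is computed between predicted and target embeddings of masked patches, and that each patch keeps its (frequency, time) grid position.

```python
import random

def patchify(spec, ph, pw):
    """Split a freq x time grid into non-overlapping ph x pw patches.

    Returns {(row, col): flattened patch}, so every patch keeps its
    2D time-frequency position.
    """
    n_freq, n_time = len(spec), len(spec[0])
    assert n_freq % ph == 0 and n_time % pw == 0, "grid must divide evenly"
    return {(r, c): [spec[r * ph + i][c * pw + j]
                     for i in range(ph) for j in range(pw)]
            for r in range(n_freq // ph) for c in range(n_time // pw)}

def encode(patch):
    """Toy stand-in for an encoder (assumption): (mean, peak) of the patch."""
    return [sum(patch) / len(patch), max(patch)]

def predict(context, pos):
    """Toy stand-in for a predictor (assumption): average of the visible
    embeddings, weighted by 2D grid distance to the masked position."""
    dim = len(next(iter(context.values())))
    total, out = 0.0, [0.0] * dim
    for (r, c), emb in context.items():
        w = 1.0 / (1 + abs(r - pos[0]) + abs(c - pos[1]))
        total += w
        out = [o + w * e for o, e in zip(out, emb)]
    return [o / total for o in out]

def jepa_loss(spec, ph, pw, mask_ratio, rng):
    """Mean squared error between predicted and target embeddings,
    computed only at the masked patch positions."""
    patches = patchify(spec, ph, pw)
    positions = sorted(patches)
    assert len(positions) >= 2, "need at least one masked and one visible patch"
    n_masked = min(len(positions) - 1, max(1, int(mask_ratio * len(positions))))
    masked = set(rng.sample(positions, n_masked))
    # Targets live in embedding space, not in raw spectrogram values.
    targets = {p: encode(patches[p]) for p in masked}
    context = {p: encode(patches[p]) for p in positions if p not in masked}
    sq_err, count = 0.0, 0
    for p, tgt in targets.items():
        pred = predict(context, p)
        sq_err += sum((a - b) ** 2 for a, b in zip(pred, tgt))
        count += len(tgt)
    return sq_err / count, masked

# Synthetic 8 x 16 "spectrogram" (made-up values, not real audio).
rng = random.Random(0)
spec = [[float((f * 3 + t) % 7) for t in range(16)] for f in range(8)]
loss, masked = jepa_loss(spec, ph=4, pw=4, mask_ratio=0.5, rng=rng)
```

In a real system the encoder and predictor would be trained networks and the loss would drive their updates; the sketch only fixes the shape of the computation.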

If this is right

PupuJEPA achieves higher linear-probing accuracy than 1D SSL models on multiple MIR tasks in the MARBLE benchmark.
Ablation studies indicate that the domain-specific modifications to architecture, training, and inference each contribute to the gains.
Attention maps extracted from the trained model highlight musically meaningful patterns within the 2D time-frequency domain.
The results suggest that predicting embeddings of masked 2D patches is a viable alternative to masked language modeling for music SSL.
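Linear probing, the evaluation protocol behind these points, can be sketched as follows. This is a generic illustration, not the MARBLE protocol: the patch embeddings are made-up numbers standing in for frozen encoder outputs, the pooling is plain global average pooling, and the probe is a binary logistic regression trained by gradient descent. Only the probe's weights change; the embeddings are never updated.

```python
import math

def global_average_pool(patch_embeddings):
    """Average patch embeddings into one clip-level feature vector."""
    n, dim = len(patch_embeddings), len(patch_embeddings[0])
    return [sum(e[d] for e in patch_embeddings) / n for d in range(dim)]

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression probe on frozen features."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_predict(w, b, x):
    """Predict class 1 when the linear score is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical frozen patch embeddings for four clips (two patches each).
clips = [
    [[0.0, 1.0], [0.2, 0.8]],
    [[0.1, 0.9], [0.3, 1.1]],
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.9, 0.1], [1.1, 0.3]],
]
labels = [0, 0, 1, 1]

features = [global_average_pool(c) for c in clips]
w, b = train_linear_probe(features, labels)
accuracy = sum(probe_predict(w, b, x) == y
               for x, y in zip(features, labels)) / len(labels)
```

Because the probe is linear, its accuracy reflects how linearly separable the frozen representations already are, which is why the protocol is used to compare SSL models.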

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same 2D patch-prediction approach could be tested on non-music audio tasks where time-frequency structure is also salient.
Because MIDI and spectrograms share a time-frequency grid layout, the learned representations may align more naturally with symbolic music data.
Scaling the context window or combining 2D spectrogram inputs with raw waveform branches might further improve results.

Load-bearing premise

That the 2D time-frequency grid structure plus music-specific adaptations to a visual JEPA are sufficient to extract more musically meaningful patterns than 1D sequence models.

What would settle it

A re-run of the MARBLE linear-probing experiments in which PupuJEPA does not exceed the performance of the strongest 1D SSL baselines on the reported tasks.

Figures

Figures reproduced from arXiv: 2606.25713 by Jerry Li, Junan Zhang, Lauri Juvela, Yicheng Gu, Zhizheng Wu.

**Figure 1.** Illustration of the fundamental intuition behind PupuJEPA. The top panel displays a multitrack project in a Digital …

**Figure 2.** Overview of the PupuJEPA architecture and training scheme. The target encoder encodes the masked spectrogram …

**Figure 3.** Illustration of different masking strategies.

**Figure 4.** Illustration of different inference paradigms for global tasks. “GAP” means global average pooling. The top and bottom …

**Figure 5.** Illustration of different patch aggregation strategies.

**Figure 6.** Illustration of PupuJEPA’s predictor attention maps. The top panels display a multitrack project in a DAW, while the …

**Figure 1.** These tools were used to improve paper writing and …

Original abstract

Self-supervised learning (SSL) has emerged as an essential paradigm for music information retrieval (MIR). While current SSL models achieve state-of-the-art performance across various MIR tasks, they typically treat audio as 1D sequences, either operating on time-domain waveforms or on flattened time-frequency-domain spectrograms. This discards the rich spatial and structural information in time-frequency representations and overlooks a fundamental intuition in music production. In particular, music is naturally represented as time-frequency grids in MIDI-based workflows, a structure that tightly corresponds to 2D spectrograms and inherently makes many MIR tasks trivial. Motivated by this intuition, we propose PupuJEPA, a visual Joint-Embedding Predictive Architecture (JEPA) that is trained directly on 2D spectrograms. Instead of applying masked language modeling (MLM) to 1D sequences, PupuJEPA learns robust representations by predicting the latent embeddings of masked 2D spectrogram patches from unmasked contexts. To optimally adapt such a visual framework to music signals, we also apply domain-specific modifications to model architecture, training scheme, and inference paradigm, with comprehensive ablation studies showing their effectiveness. Evaluations on the MARBLE benchmark show that PupuJEPA outperforms the 1D sequence-based SSL models across multiple MIR tasks in linear probing. Additionally, case studies of the attention maps also confirm that PupuJEPA captures musically meaningful patterns within the 2D time-frequency domain. Codes and checkpoints are available at: https://www.yichenggu.com/PupuJEPA/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PupuJEPA beats 1D SSL baselines on MARBLE linear probes by running JEPA on 2D spectrograms with music tweaks, but the ablations do not rule out that the JEPA recipe itself drives the gains rather than the 2D grid.

The main point is that this paper takes the visual JEPA predictor, points it at 2D spectrogram patches instead of 1D sequences, adds some audio-specific changes to architecture and training, and gets higher scores than prior 1D models across several MARBLE tasks.

What is actually new is the PupuJEPA setup itself—the 2D patch masking and latent prediction on time-frequency grids, plus the listed domain adaptations. The attention map examples give a bit of qualitative backing that the model attends to musically relevant patterns. Releasing code and checkpoints is also useful for anyone who wants to test it.

The softer spot is exactly the one the stress-test note flags. The 1D baselines come from earlier masked language modeling work, not a controlled 1D-JEPA run on flattened spectrograms inside the same predictive framework. The paper's ablations cover their own modifications but do not close that gap, so it remains unclear how much credit belongs to keeping the 2D structure versus simply using JEPA or the training recipe. That does not sink the results, but it does limit how strongly one can claim the 2D representation is the decisive factor.

This is for MIR people who want concrete SSL options that move past 1D sequences and who value benchmark numbers plus open artifacts. The work shows clear enough thinking and enough experimental detail to deserve referee time, even if the attribution to 2D needs tightening in revision.

Referee Report

1 major / 1 minor

Summary. The paper proposes PupuJEPA, a visual JEPA model adapted for music by training directly on 2D spectrograms to predict latent embeddings of masked 2D patches. It incorporates domain-specific modifications to architecture, training, and inference, with ablations claimed to show their value. On the MARBLE benchmark, it reports outperforming prior 1D sequence-based SSL models in linear probing across multiple MIR tasks, and attention maps are presented as evidence that it captures musically meaningful 2D time-frequency patterns. Code and checkpoints are released.

Significance. If the performance gains are attributable to the 2D spectrogram structure rather than confounding factors in the training recipe, the work would provide evidence that preserving time-frequency grid structure in SSL yields representations better aligned with MIR tasks than 1D sequence models. The release of code and checkpoints strengthens reproducibility.

major comments (1)

[§4] §4 and ablation tables: the central claim attributes gains to 2D spectrogram processing under the JEPA objective, yet the reported comparisons use 1D MLM baselines from prior work rather than a controlled 1D-JEPA variant (same predictor, masking, and latent target construction applied to flattened spectrograms). Without this ablation the attribution to 2D structure versus the JEPA framework or other modifications remains under-supported.

minor comments (1)

The abstract states that 'comprehensive ablation studies' demonstrate effectiveness of the modifications, but the main text should explicitly list which hyperparameters were varied and report the exact metric deltas for each.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on strengthening the attribution of gains to the 2D spectrogram structure. We address the major comment below and will revise the manuscript accordingly.

Referee: [§4] §4 and ablation tables: the central claim attributes gains to 2D spectrogram processing under the JEPA objective, yet the reported comparisons use 1D MLM baselines from prior work rather than a controlled 1D-JEPA variant (same predictor, masking, and latent target construction applied to flattened spectrograms). Without this ablation the attribution to 2D structure versus the JEPA framework or other modifications remains under-supported.

Authors: We agree that the current comparisons to prior 1D MLM models do not fully isolate the contribution of 2D processing from the JEPA objective itself. To address this, the revised manuscript will include a new controlled 1D-JEPA baseline: spectrograms will be flattened to 1D sequences and trained with the identical predictor architecture, masking strategy, and latent target construction as PupuJEPA. Results from this ablation will be added to §4 and the corresponding tables to better support the claim that the 2D time-frequency grid structure is a key factor.

Revision: yes
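The difference between the 2D input and the proposed flattened 1D control can be illustrated with a small tokenization sketch. The source does not say how the spectrograms would be flattened, so the 1D variant below is an assumption: each token spans every frequency bin for a short block of frames. The function names, patch sizes, and synthetic grid are likewise hypothetical.

```python
def tokens_2d(spec, ph, pw):
    """One token per ph x pw patch, indexed by (freq_block, time_block)."""
    n_freq, n_time = len(spec), len(spec[0])
    return [((r, c), [spec[r * ph + i][c * pw + j]
                      for i in range(ph) for j in range(pw)])
            for r in range(n_freq // ph) for c in range(n_time // pw)]

def tokens_1d(spec, pw):
    """One token per block of pw frames, spanning all frequency bins
    (assumed flattening; indexed by time only)."""
    n_freq, n_time = len(spec), len(spec[0])
    return [(c, [spec[f][c * pw + j]
                 for f in range(n_freq) for j in range(pw)])
            for c in range(n_time // pw)]

# Synthetic 8 x 16 grid; each value encodes its own (freq, time) index.
spec = [[float(f * 100 + t) for t in range(16)] for f in range(8)]
grid_tokens = tokens_2d(spec, ph=4, pw=4)   # positions carry two indices
seq_tokens = tokens_1d(spec, pw=4)          # positions carry one index
```

Both tokenizations cover the same values; what changes is whether a masked token can be located along the frequency axis. Holding the predictor, masking, and targets fixed while swapping only this step is what would isolate the effect of the 2D grid.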

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks and ablations

The paper advances an empirical architecture (PupuJEPA) adapted from visual JEPA to 2D spectrograms, with domain-specific modifications justified by ablation studies and evaluated on the external MARBLE benchmark. No derivation chain, equation, or prediction reduces to its own inputs by construction; there are no self-definitional relations, fitted parameters renamed as predictions, or load-bearing self-citations. The central performance claims are falsifiable against independent baselines and benchmarks, making the work self-contained against external evidence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The performance claim depends on the domain assumption that 2D spectrograms are the natural representation for music and on the effectiveness of unspecified domain modifications; no free parameters are explicitly fitted in the abstract, and the new model itself is the primary invented entity.

free parameters (1)

domain-specific modification hyperparameters
Architecture, training, and inference choices adapted for music, selected via ablations mentioned in the abstract.

axioms (1)

domain assumption: Music is naturally represented as time-frequency grids that correspond to MIDI-based workflows.
Stated in the abstract as the core motivation for using 2D spectrograms.

invented entities (1)

PupuJEPA (no independent evidence)
purpose: A JEPA variant trained on 2D music spectrograms
New model introduced by the paper.
