Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning
2 Pith papers cite this work.
fields: cs.SD (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
A single ViT encoder with JEPA pretraining and staged specialization performs speaker diarization, phonetic encoding, and dynamic source separation in a shared latent space, reporting 15% DER and high separation accuracy on synthetic VoxCeleb2 mixtures.
citing papers explorer
- Frequency-Aware Self-Supervised Music Representation Learning
  PupuJEPA applies a visual JEPA framework to 2D spectrograms with music-specific adaptations and outperforms 1D SSL models on the MARBLE benchmark for multiple MIR tasks.
- Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
  A single ViT encoder with JEPA pretraining and staged specialization performs speaker diarization, phonetic encoding, and dynamic source separation in a shared latent space, reporting 15% DER and high separation accuracy on synthetic VoxCeleb2 mixtures.