
arxiv: 2510.25955 · v4 · pith:423333Z7new · submitted 2025-10-29 · 📡 eess.AS

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

classification 📡 eess.AS
keywords spear, audio, speech, learning, representations, models, teacher, unified

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
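
To make the multi-codebook quantisation step concrete, here is a minimal sketch of how one continuous teacher frame could be turned into several discrete tokens. It is an illustration under stated assumptions, not the authors' method: the abstract does not specify the quantiser, so this sketch swaps in a simple product-quantisation-style scheme (split the frame into equal chunks, pick the nearest codeword per chunk). The names `nearest_index` and `quantise`, the toy codebooks, and the frame values are all hypothetical.

```python
def nearest_index(vec, codebook):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def quantise(frame, codebooks):
    """Map one teacher frame to one token per codebook.

    Assumption: the frame is split into len(codebooks) equal-width chunks and
    each chunk is matched against its own codebook. The paper's quantiser may
    work differently.
    """
    n = len(codebooks)
    if len(frame) % n != 0:
        raise ValueError("frame length must be divisible by the number of codebooks")
    width = len(frame) // n
    return [
        nearest_index(frame[k * width:(k + 1) * width], codebooks[k])
        for k in range(n)
    ]


# Toy example: two codebooks, each holding two 2-dimensional codewords.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
frame = [0.1, -0.2, 4.0, 4.5]  # hypothetical 4-dimensional teacher feature
tokens = quantise(frame, codebooks)
print(tokens)  # [0, 1]
```

In a setup like the one the abstract describes, token sequences of this kind from the speech teacher and the general-audio teacher would serve as prediction targets for the student at masked positions; how the two losses are weighted against each other is not stated in the abstract and is left out of the sketch.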

This paper has not been read by Pith yet.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

    cs.CV 2026-06 unverdicted novelty 6.0

    MJEPA applies a single shared ViT encoder and one predictive objective within and across audio-visual modalities, reporting gains of more than 6.8 mAP on AudioSet-20K and competitive video results with 10x less data.

  2. Alethia: A Foundational Encoder for Voice Deepfakes

    cs.SD 2026-04 unverdicted novelty 6.0

    Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...

  3. From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning

    eess.AS 2026-07 unverdicted novelty 3.0

    A survey that organizes audio SSL into five objective paradigms, relates their demands to architectural biases, and interprets downstream applications as tests of generalization.