SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Baoxiang Li; Chao Zhang; Phil Woodland; Wen Wu; Xiaoyu Yang; Yifan Yang; Zengrui Jin; Ziyun Cui

arxiv: 2510.25955 · v4 · pith:423333Z7new · submitted 2025-10-29 · 📡 eess.AS

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

classification 📡 eess.AS

keywords spear, audio, speech, learning, representations, models, teacher, unified

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
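The two mechanisms named in the abstract, multi-codebook quantisation of teacher features and joint prediction of both token streams, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the quantiser is a plain product-quantisation-style nearest-codeword lookup, the function names (quantise, cross_entropy, asymmetric_loss) are hypothetical, and since the abstract does not define the form of the asymmetry, it is modelled here simply as unequal weights on the two teachers' losses.

```python
import math


def quantise(frame, codebooks):
    """Map one teacher feature vector to one token index per codebook.

    Assumption: the vector is split into equal-width sub-vectors and each
    sub-vector is matched to its nearest codeword (squared Euclidean distance).
    """
    width = len(frame) // len(codebooks)
    tokens = []
    for i, codebook in enumerate(codebooks):
        sub = frame[i * width:(i + 1) * width]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def asymmetric_loss(speech_logits, speech_tokens, audio_logits, audio_tokens,
                    speech_weight=1.0, audio_weight=0.5):
    """Weighted sum of mean per-codebook cross-entropies for the two teachers.

    The weights are hypothetical placeholders; the paper's actual asymmetric
    loss may take a different form.
    """
    speech = sum(cross_entropy(l, t) for l, t in zip(speech_logits, speech_tokens))
    audio = sum(cross_entropy(l, t) for l, t in zip(audio_logits, audio_tokens))
    speech /= len(speech_tokens)
    audio /= len(audio_tokens)
    return speech_weight * speech + audio_weight * audio


# Toy example: a 4-dimensional frame, two codebooks with two codewords each.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
tokens = quantise([0.0, 0.1, 1.0, 0.9], codebooks)
print(tokens)  # → [0, 1]
```

In a real masked-prediction setup the logits would come from the student model at masked positions, with one prediction head per codebook; the sketch only shows how discrete targets and a combined loss fit together.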

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning
cs.CV 2026-06 unverdicted novelty 6.0

MJEPA applies a single shared ViT encoder and one predictive objective within and across audio-visual modalities, reporting >6.8 mAP gains on AudioSet-20K and competitive video results with 10x less data.
Alethia: A Foundational Encoder for Voice Deepfakes
cs.SD 2026-04 unverdicted novelty 6.0

Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...
From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning
eess.AS 2026-07 unverdicted novelty 3.0

A survey that organizes audio SSL into five objective paradigms, relates their demands to architectural biases, and interprets downstream applications as tests of generalization.