SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Furu Wei; Jinyu Li; Lirong Dai; Long Zhou; Sanyuan Chen; Shujie Liu; Shuo Ren; Xun Gong; Yu Wu; Zhuoyuan Yao

arXiv: 2209.15329 · v3 · pith:ZZC24FBAnew · submitted 2022-09-30 · cs.CL · cs.AI · eess.AS

Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

classification cs.CL · cs.AI · eess.AS

keywords speech · text · data · pre-training · speechlm · discrete · tokenizers · including

How to boost speech pre-training with textual data remains an unsolved problem because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) that explicitly aligns speech and text pre-training through a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text in the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
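To make the idea of a shared discrete representation concrete, here is a minimal Python sketch of how text and speech might each be mapped to unit sequences. It is an illustration under stated assumptions, not the authors' implementation: the toy lexicon, the `<unk>` symbol, the function names, and the nearest-centroid lookup standing in for a trained hidden-unit tokenizer are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lexicon mapping words to phoneme units (not from the paper).
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phoneme_units(
    text: str,
    lexicon: Dict[str, List[str]] = LEXICON,
    unk: str = "<unk>",
) -> List[str]:
    """Convert text to phoneme units via lexicon lookup (assumed scheme)."""
    units: List[str] = []
    for word in text.lower().split():
        # Out-of-lexicon words map to a single placeholder unit.
        units.extend(lexicon.get(word, [unk]))
    return units


def frames_to_hidden_units(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each feature frame the index of its nearest centroid.

    A stand-in for a trained hidden-unit tokenizer: real systems would
    cluster learned speech features, not raw toy vectors.
    """
    units: List[int] = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


if __name__ == "__main__":
    print(text_to_phoneme_units("Hello world"))
    toy_frames = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]
    toy_centroids = [[0.0, 0.0], [1.0, 1.0]]
    print(frames_to_hidden_units(toy_frames, toy_centroids))
```

Once both modalities are expressed as sequences drawn from one unit vocabulary, a single network can be trained on either kind of sequence, which is the role the abstract assigns to the unified Transformer.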

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
cs.CL · 2026-07 · unverdicted · novelty 4.0

JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.