
arxiv: 2209.15329 · v3 · pith:ZZC24FBAnew · submitted 2022-09-30 · 💻 cs.CL · cs.AI · eess.AS

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

classification 💻 cs.CL cs.AI eess.AS
keywords speech text data pre-training speechlm discrete tokenizers including

How to boost speech pre-training with textual data is an unsolved problem because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
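
The idea of mapping both modalities into one discrete unit vocabulary can be illustrated with a toy sketch. This is not the released SpeechLM code: the lexicon, the phoneme inventory, and the helper names (`text_to_units`, `frames_to_units`, `units_to_ids`) are hypothetical, and the frame labels stand in for the output of a trained phoneme-unit tokenizer. Collapsing repeated frame labels is a simplification chosen for this sketch; the abstract does not say how the sequence lengths of the two modalities are reconciled.

```python
from itertools import groupby

# Hypothetical toy lexicon and phoneme inventory (assumptions, not from the paper).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
}
PHONEMES = sorted({p for phones in LEXICON.values() for p in phones})
UNIT_TO_ID = {p: i for i, p in enumerate(PHONEMES)}  # one vocabulary for both modalities


def text_to_units(sentence):
    """Text side: look each word up in the lexicon to get phoneme units."""
    return [p for word in sentence.lower().split() for p in LEXICON[word]]


def frames_to_units(frame_labels):
    """Speech side: collapse runs of identical frame-level labels into units.

    `frame_labels` stands in for the per-frame output of a trained tokenizer.
    """
    return [label for label, _ in groupby(frame_labels)]


def units_to_ids(units):
    """Map unit symbols to integer ids in the shared vocabulary."""
    return [UNIT_TO_ID[u] for u in units]


# Pretend per-frame tokenizer output for an utterance of the word "speech".
speech_frames = ["S", "S", "P", "IY", "IY", "IY", "CH", "CH"]

speech_ids = units_to_ids(frames_to_units(speech_frames))
text_ids = units_to_ids(text_to_units("speech"))
print(speech_ids == text_ids)  # both modalities yield the same id sequence
```

In this toy setting, unlabeled speech and unpaired text end up as sequences over the same ids, which is the property a single Transformer could exploit during pre-training.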



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

    cs.CL 2026-07 unverdicted novelty 4.0

    JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.