
arxiv: 2310.05919 · v1 · pith:NBDPCRYPnew · submitted 2023-10-09 · 💻 cs.CL · eess.AS

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

classification: cs.CL, eess.AS
keywords: speech, models, data, representations, text, language, pre-trained, shared

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data. With as little as 1 hour of labeled speech data, our proposed approach achieves performance on spoken language understanding tasks (specifically, sentiment analysis and named entity recognition) comparable to that of previous methods that fine-tune speech-only pre-trained models on 10 times more data. Beyond the proof-of-concept study, we also analyze the latent representations. We find that the bottom layers of speech-text models are largely task-agnostic and align speech and text representations into a shared space, while the top layers are more task-specific.
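The text-to-speech transfer idea in the abstract can be illustrated with a toy sketch. This is not the paper's model or data: the embeddings below are synthetic, the "speech" vectors are simply noisy copies of the "text" vectors standing in for a shared space, and every name (`make_shared_space_data`, `train_logreg`, `run_demo`) is hypothetical. It only shows why a classifier trained on one modality can be applied to the other when both are encoded close together.

```python
import numpy as np


def make_shared_space_data(n, dim, rng, speech_noise=0.5):
    """Synthetic stand-in for a joint speech-text encoder (assumption, not the paper's model).

    Each utterance gets a binary label, a "text" embedding drawn around a class
    centroid, and a "speech" embedding that is the text embedding plus noise.
    """
    labels = rng.integers(0, 2, size=n)
    centroids = np.stack([-np.ones(dim), np.ones(dim)])
    text = centroids[labels] + rng.normal(scale=1.0, size=(n, dim))
    speech = text + rng.normal(scale=speech_noise, size=(n, dim))
    return text, speech, labels


def train_logreg(x, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression on embeddings x with labels y."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(w, b, x, y):
    """Fraction of examples whose predicted class matches the label."""
    pred = ((x @ w + b) > 0).astype(int)
    return float((pred == y).mean())


def run_demo(seed=0, dim=16):
    """Train on text embeddings only, then evaluate on held-out text and speech."""
    rng = np.random.default_rng(seed)
    train_text, _, train_y = make_shared_space_data(1000, dim, rng)
    test_text, test_speech, test_y = make_shared_space_data(500, dim, rng)
    w, b = train_logreg(train_text, train_y)
    return accuracy(w, b, test_text, test_y), accuracy(w, b, test_speech, test_y)


if __name__ == "__main__":
    text_acc, speech_acc = run_demo()
    print(f"text accuracy: {text_acc:.3f}, speech accuracy: {speech_acc:.3f}")
```

In this toy setting the classifier never sees a speech embedding during training, so any accuracy on the speech test set comes entirely from the two modalities sharing a space. The numbers it prints say nothing about the results reported in the paper.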
