DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering

Abdelrahman Mohamed; Guan-Ting Lin; Ho-Lam Chung; Hsuan-Jui Chen; Hung-yi Lee; Lin-shan Lee; Shang-Wen Li; Shu-wen Yang; Shuyan Dong; Yung-Sung Chuang

arXiv: 2203.04911 · v3 · pith:VL3US4GCnew · submitted 2022-03-09 · cs.CL · cs.SD · eess.AS

Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee


keywords: spoken, data, dual, question, words, adaptive, answering, answers


Spoken Question Answering (SQA) is the task of finding the answer to a question in a spoken document, which is crucial for personal assistants replying to user queries. Existing SQA methods all rely on Automatic Speech Recognition (ASR) transcripts. ASR must be trained on massive annotated datasets that are prohibitively time-consuming and costly to collect for low-resource languages. More importantly, the answers often include named entities or out-of-vocabulary words that ASR cannot recognize correctly. ASR also aims to minimize recognition errors equally over all words, including many function words irrelevant to the SQA task. SQA without ASR transcripts (textless SQA) is therefore highly desirable, although known to be very difficult. This work proposes Discrete Spoken Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned on the SQA downstream task. The time intervals of spoken answers can be predicted directly from spoken documents. We also release a new SQA benchmark corpus, NMSQA, with data from more realistic scenarios. We show empirically that DUAL yields results comparable to those obtained by cascading ASR and a text QA model, and that it is robust to real-world data. Our code and model will be open-sourced.
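
To make the idea of predicting answer time intervals from discrete units concrete, here is a minimal sketch of how a predicted span over a unit sequence might be mapped back to seconds. It is an illustration, not the authors' implementation: the abstract does not specify these details. The 20 ms frame stride, the merging of repeated consecutive units, and the names `collapse_units` and `span_to_interval` are all assumptions.

```python
from itertools import groupby

# Assumption: each frame-level unit covers 20 ms of audio. The abstract does
# not state the frame rate; this value is illustrative only.
FRAME_STRIDE_S = 0.02


def collapse_units(frame_units):
    """Merge runs of identical consecutive unit IDs.

    Returns (units, starts, ends), where starts[i] and ends[i] are the
    frame offsets (end exclusive) covered by collapsed unit i.
    """
    units, starts, ends = [], [], []
    pos = 0
    for unit, run in groupby(frame_units):
        length = sum(1 for _ in run)
        units.append(unit)
        starts.append(pos)
        ends.append(pos + length)
        pos += length
    return units, starts, ends


def span_to_interval(start_idx, end_idx, starts, ends, stride=FRAME_STRIDE_S):
    """Convert an inclusive span over collapsed units to (start_s, end_s)."""
    if not 0 <= start_idx <= end_idx < len(starts):
        raise ValueError("span indices out of range")
    return starts[start_idx] * stride, ends[end_idx] * stride


# Hypothetical frame-level unit IDs for a short spoken document.
frame_units = [5, 5, 5, 9, 9, 2, 2, 2, 2, 7]
units, starts, ends = collapse_units(frame_units)

# Suppose a QA model (not shown) predicts the answer spans collapsed units 1..2.
answer_start_s, answer_end_s = span_to_interval(1, 2, starts, ends)
```

Keeping the frame offsets alongside the collapsed sequence is what lets a span predicted over units be reported as a time interval in the original audio.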


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DisSpeech: Low-Resource Controllable Mandarin Stuttered Speech Synthesis for ASR Augmentation
cs.SD 2026-06 unverdicted novelty 6.0

DisSpeech synthesizes controllable stuttered Mandarin speech via discrete tokens and stuttering event labels to augment ASR datasets, improving recognition to 4.19% CER on stuttered tasks with minimal impact on fluent speech.