End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

Chenyu You; Fenglin Liu; Nuo Chen; Shen Ge; Xian Wu; Yuexian Zou

arxiv: 2204.14272 · v1 · pith:ZX7JS47Vnew · submitted 2022-04-29 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

classification 💻 cs.CL · cs.AI · cs.SD · eess.AS

keywords: answering, conversational, question, spoken, speech, systems, dataset, information

In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via conversation. Therefore, we propose a new Spoken Conversational Question Answering (SCQA) task, which aims to enable systems to model complex dialogue flows given speech documents. In this task, our main objective is to build a system that handles conversational questions based on audio recordings, and to explore the plausibility of providing systems with more cues from different modalities during information gathering. To this end, instead of directly adopting automatically generated speech transcripts, which are highly noisy, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, which encourages better alignment between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of existing state-of-the-art methods degrades significantly on our dataset, demonstrating the necessity of cross-modal information integration. Our experimental results demonstrate that our proposed method achieves superior performance on spoken conversational question answering tasks.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
cs.SD 2025-11 unverdicted novelty 5.0

CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.