Multi-task Learning Of Deep Neural Networks For Audio Visual Automatic Speech Recognition

Abhinav Thanda; Shankar M Venkatesan

arxiv: 1701.02477 · v1 · pith:4F442CAZnew · submitted 2017-01-10 · cs.CL · cs.AI · cs.CV · cs.LG

keywords: model, visual, audio-visual, automatic, av-asr, base-line, compared, features

Multi-task learning (MTL) involves the simultaneous training of two or more related tasks over shared representations. In this work, we apply MTL to audio-visual automatic speech recognition (AV-ASR). Our primary task is to learn a mapping between audio-visual fused features and frame labels obtained from an acoustic GMM/HMM model. This is combined with an auxiliary task that maps visual features to frame labels obtained from a separate visual GMM/HMM model. The MTL model is tested at various levels of babble noise, and the results are compared with a baseline hybrid DNN-HMM AV-ASR model. Our results indicate that MTL is especially useful at higher levels of noise. Compared to the baseline, a relative improvement in WER of up to 7% is reported at -3 dB SNR.
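
The two-task setup described in the abstract can be illustrated with a minimal sketch of a combined training objective: a frame-level cross-entropy for the primary (audio-visual) task plus a weighted cross-entropy for the auxiliary (visual-only) task. The abstract does not say how the two losses are combined, so the weighted sum, the `aux_weight` parameter, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the target frame label."""
    return -math.log(softmax(logits)[label])


def mtl_loss(primary_logits, primary_label, aux_logits, aux_label, aux_weight=0.5):
    """Hypothetical MTL objective for one frame.

    primary_*: output and target of the audio-visual task
               (labels from the acoustic GMM/HMM alignment).
    aux_*:     output and target of the visual-only task
               (labels from the separate visual GMM/HMM alignment).
    aux_weight is an assumed hyperparameter; the abstract does not give one.
    """
    primary = cross_entropy(primary_logits, primary_label)
    auxiliary = cross_entropy(aux_logits, aux_label)
    return primary + aux_weight * auxiliary
```

With `aux_weight=0.0` the objective reduces to the primary cross-entropy alone, which makes it easy to compare against a single-task model in this sketch.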

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition
cs.SD · 2025-04 · unverdicted · novelty 4.0

MT-BCA-CNN achieves 97% accuracy and 95% F1-score on 27-class few-shot underwater acoustic target recognition by combining channel attention and multi-task learning on the Watkins Marine Life Dataset.

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models
cs.CV · 2019-06 · unverdicted · novelty 4.0

3D-2D-CNN-BLSTM with word-CTC reaches 1.3% WER on GRID seen-speaker lipreading (55% relative gain over LCANet) and 8.6% on unseen speakers (24.5% gain over LipNet).