Multi-task Learning Of Deep Neural Networks For Audio Visual Automatic Speech Recognition

2017 · cs.CL · arXiv 1701.02477

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Multi-task learning (MTL) involves the simultaneous training of two or more related tasks over shared representations. In this work, we apply MTL to audio-visual automatic speech recognition (AV-ASR). Our primary task is to learn a mapping between audio-visual fused features and frame labels obtained from an acoustic GMM/HMM model. This is combined with an auxiliary task that maps visual features to frame labels obtained from a separate visual GMM/HMM model. The MTL model is tested at various levels of babble noise, and the results are compared with a baseline hybrid DNN-HMM AV-ASR model. Our results indicate that MTL is especially useful at higher noise levels. Compared to the baseline, up to 7% relative improvement in WER is reported at -3 dB SNR.
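The shared-representation setup described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single shared hidden layer fed with the fused audio-visual features, two softmax heads (one per set of GMM/HMM frame labels), and a weighted sum of the two cross-entropy losses. The abstract does not say how the visual features are routed or how the tasks are weighted, so the routing, the names `init_params` and `mtl_loss`, the layer sizes, and the `aux_weight` value are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the reference frame labels.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def init_params(rng, n_in, n_hidden, n_primary, n_aux):
    # Hypothetical sizes: one shared layer plus one output layer per task.
    return {
        "W_shared": rng.normal(scale=0.1, size=(n_in, n_hidden)),
        "b_shared": np.zeros(n_hidden),
        "W_primary": rng.normal(scale=0.1, size=(n_hidden, n_primary)),
        "b_primary": np.zeros(n_primary),
        "W_aux": rng.normal(scale=0.1, size=(n_hidden, n_aux)),
        "b_aux": np.zeros(n_aux),
    }

def mtl_loss(x, y_primary, y_aux, params, aux_weight=0.3):
    # Shared representation used by both tasks (ReLU hidden layer).
    h = np.maximum(0.0, x @ params["W_shared"] + params["b_shared"])
    # Primary head: frame labels from the acoustic GMM/HMM alignment.
    p_primary = softmax(h @ params["W_primary"] + params["b_primary"])
    # Auxiliary head: frame labels from the visual GMM/HMM alignment.
    p_aux = softmax(h @ params["W_aux"] + params["b_aux"])
    loss_primary = cross_entropy(p_primary, y_primary)
    loss_aux = cross_entropy(p_aux, y_aux)
    # aux_weight is an assumed interpolation weight, not a value from the paper.
    total = loss_primary + aux_weight * loss_aux
    return total, loss_primary, loss_aux

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, n_in=20, n_hidden=16, n_primary=5, n_aux=3)
    x = rng.normal(size=(8, 20))             # 8 frames of fused features
    y_primary = rng.integers(0, 5, size=8)   # acoustic-model frame labels
    y_aux = rng.integers(0, 3, size=8)       # visual-model frame labels
    print(mtl_loss(x, y_primary, y_aux, params))
```

Setting `aux_weight=0.0` reduces the objective to the primary loss alone, which is one way to compare against a single-task model of the same shape. At test time only the primary head would be needed for decoding in a setup like this.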

representative citing papers

A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

cs.SD · 2025-04-17 · unverdicted · novelty 4.0

MT-BCA-CNN achieves 97% accuracy and 95% F1-score on 27-class few-shot underwater acoustic target recognition by combining channel attention and multi-task learning on the Watkins Marine Life Dataset.

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

cs.CV · 2019-06-25 · unverdicted · novelty 4.0

3D-2D-CNN-BLSTM with word-CTC reaches 1.3% WER on GRID seen-speaker lipreading (55% relative gain over LCANet) and 8.6% on unseen speakers (24.5% gain over LipNet).

citing papers explorer

Showing 2 of 2 citing papers.

A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

cs.SD · 2025-04-17 · unverdicted · none · ref 30 · internal anchor

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

cs.CV · 2019-06-25 · unverdicted · none · ref 8 · internal anchor
