Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.
ASR Setup and Baselines. We consider TIMIT [44] and Aurora-4 [45] for training ASR systems to study the robustness of the proposed method to speaker, channel, and noise
Cited by 1 Pith paper (field: cs.CL; year: 2019; verdict: unverdicted):
Transfer Learning from Audio-Visual Grounding to Speech Recognition