Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.
Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2019 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.