Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.
Two-stream convolutional net- works for action recognition in videos,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2019 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.