Transfer Learning from Audio-Visual Grounding to Speech Recognition

David Harwath; James Glass; Wei-Ning Hsu

arXiv: 1907.04355 · v1 · pith:OGFQFWNSnew · submitted 2019-07-09 · 💻 cs.CL · cs.LG · cs.SD · eess.AS

Pith reviewed 2026-05-25 00:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.SD · eess.AS

keywords transfer learning · audio visual grounding · speech recognition · phonetic features · domain adaptation · feature extraction · multimodal learning

The pith

Grounding models trained on image-speech semantic correlation extract phonetic features for speech recognition without any transcripts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper explores transferring knowledge from models that learn to associate images with spoken descriptions based on meaning alone. Because speech meaning comes mostly from its words, these models end up keeping the sound patterns of words while ignoring things like who is speaking or the recording conditions. The authors test features from different layers of these models as inputs to speech recognizers and find that early layers hold more detailed phonetic information while later layers are more stable across different recording environments. Importantly, the grounding models never see any speech recognition training data, suggesting they could work for entirely new domains.

Core claim

Transfer learning from audio-visual grounding models, trained to tell whether pairs of images and speech are semantically correlated without using textual transcripts, can distill robust phonetic features. Layers closer to the input retain more phonetic information, while deeper layers exhibit greater invariance to domain shift. These features enable speech recognition even though the grounding models were never trained on speech recognition data.

What carries the argument

Audio-visual grounding models trained to predict semantic correlation between an image and a speech utterance, which learn to preserve phonetic content tied to lexical meaning.
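
As a rough illustration of how such a model might be trained, the sketch below scores matched image-speech pairs against mismatched ones with a margin ranking loss. The summary here does not state the exact objective, so the hinge formulation, the dot-product similarity, the in-batch imposter sampling, and the function names are all assumptions made for illustration.

```python
import random


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_ranking_loss(speech_embs, image_embs, margin=1.0, rng=None):
    """Average hinge loss over a batch of matched (speech, image) pairs.

    Hypothetical objective: each matched pair should score at least
    `margin` higher than the same speech paired with an imposter image
    and the same image paired with an imposter utterance.
    """
    rng = rng or random.Random(0)
    n = len(speech_embs)
    if n < 2:
        raise ValueError("need at least two pairs to sample imposters")
    total = 0.0
    for i in range(n):
        j = rng.choice([m for m in range(n) if m != i])  # imposter image index
        k = rng.choice([m for m in range(n) if m != i])  # imposter speech index
        matched = dot(speech_embs[i], image_embs[i])
        total += max(0.0, margin - matched + dot(speech_embs[i], image_embs[j]))
        total += max(0.0, margin - matched + dot(speech_embs[k], image_embs[i]))
    return total / n
```

No transcript appears anywhere in this loss: the only supervision is which image goes with which utterance, which is the property the argument relies on.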

If this is right

Layers nearer the input of the grounding model provide features with higher phonetic content for speech recognition training.
Deeper layers of the grounding model produce features more invariant to changes in speaker or channel.
Speech recognition systems can be built using features from grounding models without access to any labeled speech recognition data.
The approach applies to new domains where no speech recognition training data exists.
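
The layer-wise setup behind these points can be sketched in a few lines: a frozen extractor exposes the activations of every layer, and each layer's output becomes the input feature stream for its own recognizer. The toy below uses small fully connected ReLU layers purely for illustration (the actual grounding models are convolutional), and `FrozenExtractor` and `features_by_layer` are hypothetical names.

```python
class FrozenExtractor:
    """Toy stand-in for a trained grounding model's speech branch."""

    def __init__(self, layers):
        # Stored as tuples to signal that weights are not updated
        # while the downstream recognizers are trained.
        self.layers = tuple((tuple(map(tuple, w)), tuple(b)) for w, b in layers)

    def activations(self, frame):
        """Return the ReLU output of every layer for one input frame."""
        outputs, h = [], list(frame)
        for weights, bias in self.layers:
            h = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
                 for row, b in zip(weights, bias)]
            outputs.append(h)
        return outputs


def features_by_layer(extractor, utterance):
    """Map layer depth (1 = closest to input) to per-frame features."""
    per_layer = {}
    for frame in utterance:
        for depth, act in enumerate(extractor.activations(frame), start=1):
            per_layer.setdefault(depth, []).append(act)
    return per_layer
```

Under this sketch, one recognizer would be trained on `features_by_layer(...)[1]`, another on `[2]`, and so on, so that differences in recognition accuracy can be attributed to the layer rather than to the recognizer.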

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such features could enable speech recognition in languages or domains lacking transcribed data by leveraging readily available image-speech pairs.
Combining features from multiple layers might balance phonetic detail with domain robustness.
Similar grounding approaches could transfer to other tasks like speaker-independent emotion recognition.
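
The multi-layer combination suggested above could be as simple as concatenating, frame by frame, the features from a shallow and a deep layer. The sketch below assumes every layer yields the same number of frames per utterance, which may not hold if deeper layers downsample in time; `concat_layer_features` is a hypothetical helper, not something the paper describes.

```python
def concat_layer_features(per_layer):
    """Concatenate per-frame feature vectors from several layers.

    `per_layer` is a list with one entry per layer; each entry is a list
    of per-frame feature vectors for the same utterance.
    """
    lengths = {len(frames) for frames in per_layer}
    if len(lengths) != 1:
        raise ValueError("all layers must provide the same number of frames")
    return [[x for vec in frame_group for x in vec]
            for frame_group in zip(*per_layer)]
```

For example, `concat_layer_features([shallow, deep])` yields one vector per frame whose length is the sum of the two layers' feature dimensions.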

Load-bearing premise

Speech semantics are largely determined by lexical content, so that models matching images to speech will keep phonetic details while discarding uncorrelated factors like speaker identity.

What would settle it

The claim would be falsified if a speech recognizer trained on these grounding features achieved word error rates no better than one trained on random features, or if its error rates on out-of-domain test sets showed no improvement over a baseline feature set.
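
A minimal sketch of this check, assuming word error rate is computed as word-level edit distance divided by reference length. Comparing against a baseline on out-of-domain data is one reading of the second condition, and `is_falsified` and its parameter names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def is_falsified(wer_grounded, wer_random, wer_grounded_ood, wer_baseline_ood):
    """True if grounding features fail either condition.

    Condition 1: no better than random features overall.
    Condition 2: no better than the baseline on out-of-domain data.
    """
    return wer_grounded >= wer_random or wer_grounded_ood >= wer_baseline_ood
```

In practice the error rates would be averaged over a full test set rather than computed on single utterances.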

Figures

Figures reproduced from arXiv: 1907.04355 by David Harwath, James Glass, Wei-Ning Hsu.

**Figure 1.** Graphical illustration of audio-visual grounding model training (left), ResDAVEnet architecture (center), and feature distillation pipeline for speech recognition (right).

**Figure 2.** Frame-level t-SNE projections for four different acoustic representations, color coded for phonetic manner class, speaker identity, and noise/environment type. Visually, the ResDAVEnet features encode the least amount of speaker and environment information.

Original abstract

Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts. As semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel. To study the properties of features distilled from different layers, we use them as input separately to train multiple speech recognition models. Empirical results demonstrate that layers closer to input retain more phonetic information, while following layers exhibit greater invariance to domain shift. Moreover, while most previous studies include training data for speech recognition for feature extractor training, our grounding models are not trained on any of those data, indicating more universal applicability to new domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows you can pull phonetic features for ASR out of audio-visual grounding models trained only on image-speech pairs with no transcripts or ASR data, and that early layers keep more phonetics while later ones gain domain invariance.

The main thing here is a clean transfer setup: train a contrastive grounding model on image-speech pairs to decide if they match semantically, then freeze it and feed its layer activations into separate ASR heads. Because the grounding objective only cares about lexical semantics, the claim is that phonetic detail survives while speaker and channel noise gets stripped. They test this by comparing layers and by checking cross-domain performance, and the results line up with the prediction: earlier layers are more phonetic, later ones more invariant.

The fact that none of the ASR training data touched the feature extractor is the real distinction from prior work that usually mixes the two stages. That part is new and worth noting. The experiments appear to be run properly with the right controls for the question they ask. The premise that lexical content dominates the semantics signal is stated up front and the protocol actually tests whether phonetic information leaks through, so there is no hidden circularity.

The soft spot is that the abstract gives no numbers, no baseline comparisons, and no dataset sizes, so it is hard to tell how large the practical gain is or whether the invariance holds up under stronger domain shifts. If the full paper has those details and the gains are modest but consistent, the work still stands as a useful proof of concept.

This is for people working on unsupervised or low-resource ASR and cross-modal transfer. It is not a breakthrough result but it is a solid, self-contained idea that deserves referee time to check the numbers and the exact layer analysis.

Referee Report

2 major / 0 minor

Summary. The paper proposes distilling phonetic features for automatic speech recognition (ASR) from audio-visual grounding models trained solely on image-speech semantic correlation pairs, without any textual transcripts or ASR training data. The central claim is that because speech semantics are largely lexical, the contrastive grounding objective preserves phonetic content while discarding uncorrelated factors like speaker and channel; experiments using layer activations as input to separate ASR models are said to show that early layers retain more phonetic information while later layers exhibit greater domain invariance, supporting more universal applicability to new domains.

Significance. If the empirical results hold under rigorous controls, the work offers a concrete demonstration that semantic grounding can yield usable phonetic representations without direct supervision on ASR data, which could aid low-resource or domain-shift scenarios. The layer-wise analysis provides a testable prediction about feature properties that aligns with the training objective.

major comments (2)

[Abstract] The premise that 'semantics of speech are largely determined by its lexical content' is stated without citation or supporting argument, yet it is load-bearing for the claim that the grounding model will preserve phonetic information while disregarding speaker/channel factors. The empirical protocol (layer activations to ASR heads) directly tests survival of phonetic content but does not independently validate the lexical-dominance assumption.
[Abstract] The manuscript reports 'empirical results' on layer properties and domain invariance but supplies no details on model architectures, datasets, baselines, quantitative metrics (e.g., WER, phone error rate), or experimental controls. Without these, it is not possible to determine whether the data support the stated claims about phonetic retention and domain invariance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and the recommendation for major revision. The comments highlight important areas for clarification regarding the foundational premise and experimental transparency. We respond to each point below and outline the revisions we will make.

Referee: [Abstract] The premise that 'semantics of speech are largely determined by its lexical content' is stated without citation or supporting argument, yet it is load-bearing for the claim that the grounding model will preserve phonetic information while disregarding speaker/channel factors. The empirical protocol (layer activations to ASR heads) directly tests survival of phonetic content but does not independently validate the lexical-dominance assumption.

Authors: We agree that the premise would be strengthened by explicit support. This assumption is standard in speech processing, as lexical identity is the primary carrier of semantic content while speaker, channel, and prosodic factors are largely orthogonal to it. In the revision we will insert a supporting sentence with citations to relevant literature on lexical semantics in speech. The layer-wise experiments test whether phonetic content survives the contrastive objective; the observed domain-invariance trend is consistent with the model discarding uncorrelated factors, but we do not claim the protocol constitutes an independent test of the lexical-dominance premise itself.

Revision: yes

Referee: [Abstract] The manuscript reports 'empirical results' on layer properties and domain invariance but supplies no details on model architectures, datasets, baselines, quantitative metrics (e.g., WER, phone error rate), or experimental controls. Without these, it is not possible to determine whether the data support the stated claims about phonetic retention and domain invariance.

Authors: The abstract is intentionally concise; the body of the manuscript (Sections 3–4) describes the grounding architecture, the image-speech datasets used for training, the ASR back-ends, the layer-extraction protocol, and the WER-based evaluation on both in-domain and out-of-domain test sets. To make these elements immediately visible, we will expand the abstract with one or two key quantitative results and insert a compact experimental-summary table early in the paper. These changes will allow readers to assess the claims without first reading the full experimental sections.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's chain consists of independently training audio-visual grounding models on image-speech pairs (no transcripts or ASR data), extracting activations from successive layers, and training separate ASR heads on those fixed features to measure retained phonetic content and domain invariance. This protocol is a direct empirical test of the stated premise that lexical semantics dominate and the contrastive objective discards uncorrelated factors; the outcome can falsify the premise rather than being forced by construction. No self-definitional loops, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the derivation. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that semantic correlation between image and speech forces the model to preserve phonetic information; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Semantics of speech are largely determined by its lexical content.
Invoked in the abstract to explain why grounding models preserve phonetic information while ignoring speaker and channel factors.

pith-pipeline@v0.9.0 · 5695 in / 1155 out tokens · 27769 ms · 2026-05-25T00:14:11.801846+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors

[1] Introduction. Robustness of automatic speech recognition (ASR) systems is essential to generalization of using speech as interfaces for human computer interaction. Thanks to the strong modeling capacity of neural networks, recent studies [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] have demonstrated that by providing supervised examples as abundant and diverse as...
[2] Learning Spoken Languages through Audio-Visual Grounding. In this section, we describe in detail the source task as well as the DAVEnet model, and then review several analysis studies which lay the foundation for our work. 2.1. Audio-Visual Grounding. Inspired by the fact that humans learn to speak before being able to read or write, audio-visual grounding...
[3] Transfer Learning to Speech Recognition. 3.1. Distilling Robust Feature Extractors for ASR. Both DAVEnet variants are trained on the Places Audio Caption dataset (PlacesAudCap) [21], derived from the Places205 scene classification dataset [28]. PlacesAudCap is composed of over 400K image and unscripted spoken caption pairs collected from 2,954 speakers via ...
[4] Related Work. Transfer learning has a long history in the field of machine learning [19]. More recently, deep neural network models have been shown to be extremely effective for learning representations of data with a high degree of re-usability across many different tasks and domains. Perhaps the most well-known example of this is the use of the Imag...
[5] Experiments. 5.1. ASR Setup and Baselines. We consider TIMIT [44] and Aurora-4 [45] for training ASR systems to study robustness of the proposed method to speaker, channel, and noise. TIMIT contains 5.4 hours of 16kHz broadband recordings of read speech from 630 speakers, of which about 70% are male. Recordings from male speakers are used for training ASR...
[6] Concluding Discussion and Future Work. In this paper, we present a successful example of transfer learning from a weakly supervised semantic grounding task to robust ASR. We achieve cross-dataset transferability, which is an important milestone toward building a generalized feature extractor to be used in many tasks and domains like BERT. In addition...
[7] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in ICML, 2016.
[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP, 2018.
[9] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) improves speech recognition,” in ICML Workshop on Deep Learning for Audio, Speech and Language, 2013.
[10] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deep neural network acoustic modeling,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015.
[11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,” in Interspeech, 2017.
[13] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation,” in ASRU, 2017.
[14] W.-N. Hsu, H. Tang, and J. Glass, “Unsupervised adaptation with interpretable disentangled representations for distant conversational speech recognition,” in Interspeech, 2018.
[15] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation,” in Interspeech, 2018.
[16] S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, and L. Xie, “Training augmentation with adversarial examples for robust speech recognition,” arXiv preprint arXiv:1806.02782, 2018.
[17] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, 1994.
[18] B. E. Kingsbury and N. Morgan, “Recognizing reverberant speech with RASTA-PLP,” in ICASSP, 1997.
[19] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2016.
[20] J. Fredes, J. Novoa, S. King, R. M. Stern, and N. B. Yoma, “Locally normalized filter banks applied to deep neural-network-based robust speech recognition,” IEEE Signal Processing Letters, 2017.
[21] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, 2017.
[22] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in NIPS, 2017.
[23] W.-N. Hsu and J. Glass, “Extracting domain invariant features by unsupervised learning for robust automatic speech recognition,” in ICASSP, 2018.
[24] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” in Interspeech, 2019.
[25] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, 2010.
[26] D. Harwath, A. Torralba, and J. Glass, “Unsupervised learning of spoken language with visual context,” in NIPS, 2016.
[27] D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. Glass, “Jointly discovering visual objects and spoken words from raw sensory input,” in ECCV, 2018.
[28] D. Harwath and J. R. Glass, “Learning word-like units from joint audio-visual analysis,” in ACL, 2017.
[29] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition, 2015.
[30] A. Jansen, M. Plakal, R. Pandya, D. P. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, “Unsupervised learning of semantic audio representations,” in ICASSP, 2018.
[31] D. Harwath and J. Glass, “Towards visually grounded sub-word speech unit discovery,” in ICASSP, 2019.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[33] J. Drexler and J. Glass, “Analysis of audio-visual features for unsupervised speech recognition,” in Grounded Language Understanding Workshop, 2017.
[34] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014.
[35] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene CNNs,” in ICLR, 2015.
[36] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large scale hierarchical image database,” in CVPR, 2009.
[37] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in CVPR Workshop, 2014.
[38] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
[39] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
[40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
[41] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in EMNLP, 2014.
[42] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in NAACL, 2018.
[43] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, 2018.
[44] R. A. Yeh, M. N. Do, and A. G. Schwing, “Unsupervised textual grounding: Linking words to image concepts,” in CVPR, 2018.
[45] T. Gupta, K. Shih, S. Singh, and D. Hoiem, “Aligned image-word representations improve inductive transfer across vision-language tasks,” in ICCV, 2017.
[46] E. Chuangsuwanich, Y. Zhang, and J. Glass, “Multilingual data selection for training stacked bottleneck features,” in ICASSP, 2013.
[47] H. Kamper, G. Shakhnarovich, and K. Livescu, “Semantic speech retrieval with a visually grounded model of untranscribed speech,” IEEE/ACM Trans. Audio, Speech & Language Processing, 2019.
[48] G. Chrupala, L. Gelderloos, and A. Alishahi, “Representations of language in a model of visually grounded speech signal,” in ACL, 2017.
[49] A. Alishahi, M. Barking, and G. Chrupala, “Encoding of phonology in a recurrent neural model of grounded speech,” in CoNLL, 2017.
[50] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, 1990.
[51] D. Pearce and J. Picone, “Aurora working group: DSR front end LVCSR evaluation AU/384/02,” Inst. for Signal & Inform. Process., Mississippi State Univ., Tech. Rep., 2002.
[52] J. Garofalo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete,” Linguistic Data Consortium, Philadelphia, 2007.
[53] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011.
[54] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in KDD, 2016.
[55] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, 2014.
[56] Y. Zhang, G. Chen, D. Yu, K. Yaco, S. Khudanpur, and J. Glass, “Highway long short-term memory RNNs for distant speech recognition,” in ICASSP, 2016.
[57] W.-N. Hsu and J. Glass, “Scalable factorized hierarchical variational autoencoder training,” in Interspeech, 2018.
[58] L. van der Maaten and G. Hinton, “Visualizing high-dimensional data using t-SNE,” JMLR, 2008.

[1] [1]

Introduction Robustness of automatic speech recognition (ASR) systems is essential to generalization of using speech as interfaces for hu- man computer interaction. Thanks to the strong modeling ca- pacity of neural networks, recent studies [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] have demonstrated that by providing supervised ex- amples as abundant and diverse as...

work page

[2] [2]

Learning Spoken Languages through Audio-Visual Grounding In this section, we describe in detail the source task as well as the DA VEnet model, and then review several analysis studies which lay the foundation for our work. 2.1. Audio-Visual Grounding Inspired by the fact that humans learn to speak before being able to read or write, audio-visual grounding...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

Transfer Learning to Speech Recognition 3.1. Distilling Robust Feature Extractors for ASR Both DA VEnet variants are trained on the Places Audio Caption dataset (PlacesAudCap) [21], derived from the Places205 scene classiﬁcation dataset [28]. PlacesAudCap is composed of over 400K image and unscripted spoken caption pairs collected from 2,954 speakers via ...

work page

[4] [4]

Related Work Transfer learning has a long history in the ﬁeld of machine learning [19]. More recently, deep neural network models have been shown to be extremely effective for learning representa- tions of data with a high degree of re-usability across many dif- ferent tasks and domains. Perhaps the most well-known exam- ple of this is the use of the Imag...

work page

[5] [5]

ASR Setup and Baselines We consider TIMIT [44] and Aurora-4 [45] for training ASR systems to study robustness of the proposed method to speaker, channel, and noise

Experiments 5.1. ASR Setup and Baselines We consider TIMIT [44] and Aurora-4 [45] for training ASR systems to study robustness of the proposed method to speaker, channel, and noise. TIMIT contains 5.4 hours of 16kHz broad- band recordings of read speech from 630 speakers, of which about 70% are male. Recordings from male speakers are used for training ASR...

work page

[6] [6]

We achieve cross-dataset transferability, which is an important milestone toward building a generalized feature ex- tractor to be used in many tasks and domains like BERT

Concluding Discussion and Future Work In this paper, we present a successful example of transfer learn- ing from a weakly supervised semantic grounding task to ro- bust ASR. We achieve cross-dataset transferability, which is an important milestone toward building a generalized feature ex- tractor to be used in many tasks and domains like BERT. In addition...

work page

[7] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in ICML, 2016

[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP, 2018

[9] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) improves speech recognition,” in ICML Workshop on Deep Learning for Audio, Speech and Language, 2013

[10] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deep neural network acoustic modeling,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015

[11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015

[12] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,” in Interspeech, 2017

[13] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation,” in ASRU, 2017

[14] W.-N. Hsu, H. Tang, and J. Glass, “Unsupervised adaptation with interpretable disentangled representations for distant conversational speech recognition,” in Interspeech, 2018

[15] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “A multi-discriminator CycleGAN for unsupervised non-parallel speech domain adaptation,” in Interspeech, 2018

[16] S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, and L. Xie, “Training augmentation with adversarial examples for robust speech recognition,” arXiv preprint arXiv:1806.02782, 2018

[17] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, 1994

[18] B. E. Kingsbury and N. Morgan, “Recognizing reverberant speech with RASTA-PLP,” in ICASSP, 1997

[19] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2016

[20] J. Fredes, J. Novoa, S. King, R. M. Stern, and N. B. Yoma, “Locally normalized filter banks applied to deep neural-network-based robust speech recognition,” IEEE Signal Processing Letters, 2017

[21] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, 2017

[22] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in NIPS, 2017

[23] W.-N. Hsu and J. Glass, “Extracting domain invariant features by unsupervised learning for robust automatic speech recognition,” in ICASSP, 2018

[24] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” in Interspeech, 2019

[25] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, 2010

[26] D. Harwath, A. Torralba, and J. Glass, “Unsupervised learning of spoken language with visual context,” in NIPS, 2016

[27] D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. Glass, “Jointly discovering visual objects and spoken words from raw sensory input,” in ECCV, 2018

[28] D. Harwath and J. R. Glass, “Learning word-like units from joint audio-visual analysis,” in ACL, 2017

[29] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition, 2015

[30] A. Jansen, M. Plakal, R. Pandya, D. P. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, “Unsupervised learning of semantic audio representations,” in ICASSP, 2018

[31] D. Harwath and J. Glass, “Towards visually grounded sub-word speech unit discovery,” in ICASSP, 2019

[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015

[33] J. Drexler and J. Glass, “Analysis of audio-visual features for unsupervised speech recognition,” in Grounded Language Understanding Workshop, 2017

[34] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014

[35] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene CNNs,” in ICLR, 2015

[36] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large scale hierarchical image database,” in CVPR, 2009

[37] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in CVPR Workshop, 2014

[38] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015

[39] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014

[40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013

[41] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in EMNLP, 2014

[42] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in NAACL, 2018

[43] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, 2018

[44] R. A. Yeh, M. N. Do, and A. G. Schwing, “Unsupervised textual grounding: Linking words to image concepts,” in CVPR, 2018

[45] T. Gupta, K. Shih, S. Singh, and D. Hoiem, “Aligned image-word representations improve inductive transfer across vision-language tasks,” in ICCV, 2017

[46] E. Chuangsuwanich, Y. Zhang, and J. Glass, “Multilingual data selection for training stacked bottleneck features,” in ICASSP, 2013

[47] H. Kamper, G. Shakhnarovich, and K. Livescu, “Semantic speech retrieval with a visually grounded model of untranscribed speech,” IEEE/ACM Trans. Audio, Speech & Language Processing, 2019

[48] G. Chrupala, L. Gelderloos, and A. Alishahi, “Representations of language in a model of visually grounded speech signal,” in ACL, 2017

[49] A. Alishahi, M. Barking, and G. Chrupala, “Encoding of phonology in a recurrent neural model of grounded speech,” in CoNLL, 2017

[50] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, 1990

[51] D. Pearce and J. Picone, “Aurora working group: DSR front end LVCSR evaluation AU/384/02,” Inst. for Signal & Inform. Process., Mississippi State Univ., Tech. Rep., 2002

[52] J. Garofalo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete,” Linguistic Data Consortium, Philadelphia, 2007

[53] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011

[54] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in KDD, 2016

[55] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, 2014

[56] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, “Highway long short-term memory RNNs for distant speech recognition,” in ICASSP, 2016

[57] W.-N. Hsu and J. Glass, “Scalable factorized hierarchical variational autoencoder training,” in Interspeech, 2018

[58] L. van der Maaten and G. Hinton, “Visualizing high-dimensional data using t-SNE,” JMLR, 2008