Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Dhananjay Ram; Herv\'e Bourlard; Lesly Miculicich

arxiv: 1907.00443 · v1 · pith:RG3LJSMBnew · submitted 2019-06-30 · 💻 cs.CL · cs.HC· cs.LG· cs.SD· eess.AS

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Dhananjay Ram , Lesly Miculicich , Herv\'e Bourlard This is my paper

Pith reviewed 2026-05-25 12:36 UTC · model grok-4.3

classification 💻 cs.CL cs.HCcs.LGcs.SDeess.AS

keywords bottleneck featuresquery by examplespoken term detectionresidual networksmultilingual featuresdynamic time warpingGlobalPhoneQUESST 2014

0 comments

The pith

Residual networks produce better multilingual bottleneck features for query-by-example spoken term detection than feedforward networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how bottleneck features from neural networks affect query-by-example spoken term detection performance. It first compares features extracted from monolingual and multilingual feedforward networks. The authors then replace the feedforward networks with residual networks to generate the bottleneck features. Experiments on the QUESST 2014 database show that the ResNet versions deliver higher detection accuracy. All networks are trained on the GlobalPhone corpus before the features are fed into dynamic time warping for matching.

Core claim

Bottleneck features estimated with residual networks outperform the corresponding feedforward-network features in query-by-example spoken term detection. The study first evaluates monolingual and multilingual feedforward networks, then demonstrates that switching to residual networks yields significant gains when the networks are trained on the GlobalPhone corpus and evaluated on the challenging QUESST 2014 database.

What carries the argument

Residual networks (ResNet) used to estimate bottleneck features for dynamic time warping template matching.

If this is right

Multilingual training of the networks improves detection over monolingual training.
ResNet architecture produces measurable gains over feedforward networks on the same data.
The resulting features support effective matching on the difficult QUESST 2014 evaluation set.
GlobalPhone training supplies the multilingual coverage needed for cross-language term detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ResNet-based feature extraction could be tested on other spoken-term or keyword-spotting benchmarks.
Architectural upgrades from feedforward to residual layers may transfer to different audio feature pipelines.
Further accuracy lifts might appear if deeper residual blocks are combined with the existing multilingual training regime.

Load-bearing premise

The performance differences observed on QUESST 2014 are caused by the choice of multilingual training and ResNet architecture rather than by other uncontrolled factors in network training or DTW implementation.

What would settle it

Re-train identical models on the same GlobalPhone data, change only the architecture from feedforward to ResNet, and check whether QbE-STD accuracy on QUESST 2014 remains statistically unchanged.

read the original abstract

State of the art solutions to query by example spoken term detection (QbE-STD) usually rely on bottleneck feature representation of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present a study on QbE-STD performance using several monolingual as well as multilingual bottleneck features extracted from feed forward networks. Then, we propose to employ residual networks (ResNet) to estimate the bottleneck features and show significant improvements over the corresponding feed forward network based features. The neural networks are trained on GlobalPhone corpus and QbE-STD experiments are performed on a very challenging QUESST 2014 database.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ResNet multilingual bottlenecks improve QbE-STD on QUESST 2014 over feedforward ones, but gains need matched training controls to attribute cleanly to the architecture.

read the letter

The one thing to know is that this paper finds ResNet-based multilingual bottleneck features outperform the corresponding feedforward network features for query-by-example spoken term detection on QUESST 2014, with all networks trained on GlobalPhone data and evaluated via DTW matching. The work compares several monolingual and multilingual setups first with standard feedforward nets, then swaps in residual networks for the bottleneck extractor. What is new is the specific use of ResNets in this pipeline and the direct head-to-head on the multilingual case. It does a solid job of including the monolingual baselines so the multilingual benefit is visible and by testing on a known hard, multi-language dataset rather than a toy one. The citation pattern covers the expected prior work on bottleneck features and QbE-STD without obvious gaps. The soft spot is the experimental controls. The stress-test concern lands: the abstract gives no evidence that the feedforward and ResNet models were trained under identical conditions on optimizer, learning-rate schedule, data order, or early stopping. Any difference there could produce the reported gains without the residual connections being responsible. The claim of significant improvements also lacks numbers, confidence intervals, or even basic scores in the abstract, which makes the size of the effect impossible to judge. If the full paper supplies those details and shows the training was matched, the central result holds up better. This is a paper for people already building or tuning spoken term detection systems in multilingual settings. Someone choosing a feature front-end for a DTW pipeline would get practical value from the comparisons. It deserves a serious referee because the task and data are standard and the question is testable, even though the current writeup is light on the controls and numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript studies query-by-example spoken term detection (QbE-STD) by extracting bottleneck features from feed-forward networks and residual networks (ResNet) trained on the GlobalPhone corpus. It compares monolingual and multilingual training regimes and reports that ResNet-based features yield significant improvements over the corresponding feed-forward features when used with DTW matching on the QUESST 2014 evaluation set.

Significance. If the performance gains are shown to arise specifically from the ResNet architecture under matched training conditions, the work would supply a concrete, reproducible baseline for multilingual bottleneck features in QbE-STD. The use of a standard corpus (GlobalPhone) and a challenging public benchmark (QUESST 2014) is a positive aspect that would facilitate future comparisons.

major comments (2)

[Abstract and §4] Abstract and §4 (experimental results): the claim of 'significant improvements' is presented without any numerical values, confidence intervals, or statistical significance tests. This absence prevents assessment of whether the observed differences are large enough or reliable enough to support the central architectural claim.
[§3 and §4] §3 (network training) and §4: no statement confirms that the feed-forward and ResNet extractors were trained with identical optimizer, learning-rate schedule, batch size, data ordering, or early-stopping criterion. Because the central claim attributes gains to the residual connections rather than to any of these uncontrolled factors, the lack of such controls is load-bearing for the reported conclusion.

minor comments (2)

[Title and Introduction] The title emphasizes 'Multilingual' features, yet the abstract and experiments also present monolingual results; a short clarifying sentence in the introduction would help readers understand the intended scope.
[§2] Notation for the bottleneck dimension and the DTW distance measure should be defined once at first use rather than assumed from prior QbE-STD literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (experimental results): the claim of 'significant improvements' is presented without any numerical values, confidence intervals, or statistical significance tests. This absence prevents assessment of whether the observed differences are large enough or reliable enough to support the central architectural claim.

Authors: We agree that the abstract and results section should include explicit numerical values to support the claim. In the revised manuscript we will update the abstract to report the key performance metrics (e.g., actual ATWV or EER figures from the QUESST 2014 experiments) and will add confidence intervals together with statistical significance tests (paired t-test or similar) in §4. revision: yes
Referee: [§3 and §4] §3 (network training) and §4: no statement confirms that the feed-forward and ResNet extractors were trained with identical optimizer, learning-rate schedule, batch size, data ordering, or early-stopping criterion. Because the central claim attributes gains to the residual connections rather than to any of these uncontrolled factors, the lack of such controls is load-bearing for the reported conclusion.

Authors: We acknowledge the manuscript does not explicitly state the training controls. Both networks were in fact trained under identical conditions (same optimizer, learning-rate schedule, batch size, data ordering via the same random seed, and early-stopping criterion on validation loss). We will add a dedicated paragraph in §3 documenting these matched settings so that the comparison isolates the effect of the residual connections. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of network architectures on external benchmarks

full rationale

The paper reports an empirical study comparing bottleneck features extracted from feed-forward networks versus residual networks, trained on GlobalPhone and evaluated via DTW on the independent QUESST 2014 corpus. No equations, derivations, or first-principles predictions appear; the central claim is a measured performance delta between two trained models. This is a standard experimental result whose validity rests on training controls and statistical significance rather than any definitional or self-referential reduction. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatzes are present. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work rests on the standard domain assumption that bottleneck features are useful for DTW-based matching.

axioms (1)

domain assumption Bottleneck features extracted from neural networks trained on speech data provide useful representations for dynamic time warping template matching in QbE-STD.
This premise is required for the entire experimental pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5654 in / 1179 out tokens · 29308 ms · 2026-05-25T12:36:28.010621+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 5 internal anchors

[1]

INTRODUCTION Query-by-example spoken term detection (QbE-STD) is the task of detecting audio documents from an archive, which contain a spoken query provided by a user. In contrast to tex- tual queries in keyword spotting, QbE-STD requires spoken queries which enables a language independant search with- out the need of a full speech recognition system. Th...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[2]

Generally, the network architecture consists of a shared part and sev- eral task-dependent parts

MULTITASK LEARNING Multitask learning [14, 15] have been used to exploit similar- ities across tasks resulting in an improved learning efﬁciency when compared to training each task separately. Generally, the network architecture consists of a shared part and sev- eral task-dependent parts. In order to obtain multilingual bot- tleneck features we model pho...

work page
[3]

FEED FORW ARD NETWORKS Feed forward networks have been traditionally used to obtain bottleneck features for speech related tasks [5, 13, 14]. Here, we describe the different architectures employed in this study as shown in Figure 1: (a) Monolingual: our monolingual FFN architecture, con- sists of 3 fully connected layers of 1024 neurons each, followed by ...

work page
[4]

Skipping layers effectively simpli- ﬁes the training and gives ﬂexibility to the network

RESIDUAL NETWORKS A Residual Network [18] is a CNN with shortcut connections between its stacked layers. Skipping layers effectively simpli- ﬁes the training and gives ﬂexibility to the network. Given an input matrix x and an output matrix y, it models the function y = f(x)+ x in each stacked layer, wheref(.) represents two convolutional layers with a non...

work page
[5]

Then, we present the details of training different neural networks

EXPERIMENTAL SETUP In this section, we describe the databases and the pre- processing steps to perform the experiments. Then, we present the details of training different neural networks. 5.1. Databases GlobalPhone Corpus: GlobalPhone [21] is a multilingual speech database consisting of high quality recordings of read speech with corresponding transcripti...

work page 2014
[6]

Previously, the best performance on QUESST 2014 database was obtained using monolingual bot- tleneck features estimated using FFNs [5]

EXPERIMENTAL ANALYSIS In this section, we report and analyze the QbE-STD perfor- mance using various bottleneck features estimated from our FFN and ResNet models. Previously, the best performance on QUESST 2014 database was obtained using monolingual bot- tleneck features estimated using FFNs [5]. We implemented those models to compare with multilingual f...

work page 2014
[7]

We present a performance analysis of these features using both ResNets and FFNs

CONCLUSIONS We proposed a ResNet based neural network architecture to estimate monolingual as well as multilingual bottleneck fea- tures for QbE-STD. We present a performance analysis of these features using both ResNets and FFNs. It shows that additional languages for training improves performance and the ResNets perform better than FFNs for both monolin...

work page
[8]

Unsupervised pat- tern discovery in speech,

Alex S Park and James R Glass, “Unsupervised pat- tern discovery in speech,” IEEE Transactions on Audio, Speech, and Language Processing , vol. 16, no. 1, pp. 186–197, 2008

work page 2008
[9]

Model-based unsu- pervised spoken term detection with spoken queries,

Chun-an Chan and Lin-shan Lee, “Model-based unsu- pervised spoken term detection with spoken queries,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1330–1342, 2013

work page 2013
[10]

Unsupervised spoken keyword spotting via segmental dtw on gaus- sian posteriorgrams,

Yaodong Zhang and James R Glass, “Unsupervised spoken keyword spotting via segmental dtw on gaus- sian posteriorgrams,” in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) , 2009, pp. 398–403

work page 2009
[11]

High- performance query-by-example spoken term detection on the SWS 2013 evaluation,

Luis Javier Rodriguez-Fuentes, Amparo Varona, Mike Penagarikano, Germ´an Bordel, and Mireia Diez, “High- performance query-by-example spoken term detection on the SWS 2013 evaluation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7819–7823

work page 2013
[12]

Coping with channel mismatch in query- by-example-BUT QUESST 2014,

Igor Sz ¨oke, Miroslav Sk ´acel, Luk ´aˇs Burget, and Jan ˇCernock`y, “Coping with channel mismatch in query- by-example-BUT QUESST 2014,” in 2015 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5838–5842

work page 2014
[13]

Unsupervised bottleneck features for low-resource query-by-example spoken term detec- tion.,

Hongjie Chen, Cheung-Chi Leung, Lei Xie, Bin Ma, and Haizhou Li, “Unsupervised bottleneck features for low-resource query-by-example spoken term detec- tion.,” in INTERSPEECH, 2016, pp. 923–927

work page 2016
[14]

CNN based query by example spoken term detection,

Dhananjay Ram, Lesly Miculicich, and Herv ´e Bourlard, “CNN based query by example spoken term detection,” in Proceedings of the Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018

work page 2018
[15]

Query-by-example spoken term detection using pho- netic posteriorgram templates,

Timothy J Hazen, Wade Shen, and Christopher White, “Query-by-example spoken term detection using pho- netic posteriorgram templates,” in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 421–426

work page 2009
[16]

2, Springer, 2007

Meinard M ¨uller, Information retrieval for music and motion, vol. 2, Springer, 2007

work page 2007
[17]

Sparse subspace modeling for query by example spo- ken term detection,

Dhananjay Ram, Afsaneh Asaei, and Herv ´e Bourlard, “Sparse subspace modeling for query by example spo- ken term detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 26, no. 6, pp. 1130–1143, June 2018

work page 2018
[18]

Subspace regularized dynamic time warping for spo- ken query detection,

Dhananjay Ram, Afsaneh Asaei, and Herv ´e Bourlard, “Subspace regularized dynamic time warping for spo- ken query detection,” in Workshop on Signal Process- ing with Adaptive Sparse Structured Representations (SPARS), 2017

work page 2017
[19]

Re- ducing the dimensionality of data with neural networks,

Geoffrey E Hinton and Ruslan R Salakhutdinov, “Re- ducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006

work page 2006
[20]

Improved bottle- neck features using pretrained deep neural networks,

Dong Yu and Michael L Seltzer, “Improved bottle- neck features using pretrained deep neural networks,” in Twelfth annual conference of the international speech communication association, 2011

work page 2011
[21]

The language- independent bottleneck features,

Karel Vesel `y, Martin Karaﬁ ´at, Franti ˇsek Gr ´ezl, Milo ˇs Janda, and Ekaterina Egorova, “The language- independent bottleneck features,” in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341

work page 2012
[22]

Multitask learning,

Rich Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997

work page 1997
[23]

Con- volutional neural networks for speech recognition,

Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, “Con- volutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and lan- guage processing, vol. 22, no. 10, pp. 1533–1545, 2014

work page 2014
[24]

Convolutional, long short-term memory, fully connected deep neural networks,

Tara N Sainath, Oriol Vinyals, Andrew Senior, and Has ¸im Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Sig- nal Processing (ICASSP). IEEE, 2015, pp. 4580–4584

work page 2015
[25]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

work page 2016
[26]

Achieving Human Parity in Conversational Speech Recognition

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[27]

Very deep convolutional networks for end-to-end speech recognition,

Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2017, pp. 4845–4849

work page 2017
[28]

Globalphone: A multilingual text & speech database in 20 languages,

Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in 2013 IEEE International Confer- ence on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8126–8130

work page 2013
[29]

Query by exam- ple search on speech at mediaeval 2014.,

Xavier Anguera, Luis Javier Rodriguez-Fuentes, Igor Sz¨oke, Andi Buzo, and Florian Metze, “Query by exam- ple search on speech at mediaeval 2014.,” inMediaEval, 2014

work page 2014
[30]

The Kaldi speech recognition toolkit,

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Han- nemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011

work page 2011
[31]

Deep neural networks for acoustic mod- eling in speech recognition: The shared views of four research groups,

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Se- nior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic mod- eling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE , vol. 29, no. 6, pp. 82–97, 2012

work page 2012
[32]

An investigation of deep neural networks for multilingual speech recognition training and adaptation,

Sibo Tong, Philip N Garner, and Herv ´e Bourlard, “An investigation of deep neural networks for multilingual speech recognition training and adaptation,” in Pro- ceedings of the Eighteenth Annual Conference of the International Speech Communication Association (IN- TERSPEECH), 2017

work page 2017
[33]

Py- torch,

Adam Paszke, Sam Gross, and Soumith Chintala, “Py- torch,” 2017, [online] http://pytorch.org/

work page 2017
[34]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Adam: A Method for Stochastic Optimization

Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[36]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy, “Batch nor- malization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[37]

thesis, Faculty of Information Technology BUT, 2008

Petr Schwarz, Phoneme recognition based on long tem- poral context , Ph.D. thesis, Faculty of Information Technology BUT, 2008

work page 2008
[38]

Speechdat (e)-eastern european telephone speech databases,

Petr Poll ´ak, Jerome Boudy, Khalid Choukri, Henk Van Den Heuvel, Klara Vicsi, Attila Virag, Rainer Siemund, Wojciech Majewski, Piotr Staroniewicz, Herbert Tropf, et al., “Speechdat (e)-eastern european telephone speech databases,” in the Proc. of XLDB 2000, Workshop on V ery Large Telephone Speech Databases . Citeseer, 2000

work page 2000
[39]

Mediaeval 2013 spoken web search task: system per- formance measures,

Luis J Rodriguez-Fuentes and Mikel Penagarikano, “Mediaeval 2013 spoken web search task: system per- formance measures,” n. TR-2013-1, Department of Electricity and Electronics, University of the Basque Country, 2013

work page 2013

[1] [1]

INTRODUCTION Query-by-example spoken term detection (QbE-STD) is the task of detecting audio documents from an archive, which contain a spoken query provided by a user. In contrast to tex- tual queries in keyword spotting, QbE-STD requires spoken queries which enables a language independant search with- out the need of a full speech recognition system. Th...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[2] [2]

Generally, the network architecture consists of a shared part and sev- eral task-dependent parts

MULTITASK LEARNING Multitask learning [14, 15] have been used to exploit similar- ities across tasks resulting in an improved learning efﬁciency when compared to training each task separately. Generally, the network architecture consists of a shared part and sev- eral task-dependent parts. In order to obtain multilingual bot- tleneck features we model pho...

work page

[3] [3]

FEED FORW ARD NETWORKS Feed forward networks have been traditionally used to obtain bottleneck features for speech related tasks [5, 13, 14]. Here, we describe the different architectures employed in this study as shown in Figure 1: (a) Monolingual: our monolingual FFN architecture, con- sists of 3 fully connected layers of 1024 neurons each, followed by ...

work page

[4] [4]

Skipping layers effectively simpli- ﬁes the training and gives ﬂexibility to the network

RESIDUAL NETWORKS A Residual Network [18] is a CNN with shortcut connections between its stacked layers. Skipping layers effectively simpli- ﬁes the training and gives ﬂexibility to the network. Given an input matrix x and an output matrix y, it models the function y = f(x)+ x in each stacked layer, wheref(.) represents two convolutional layers with a non...

work page

[5] [5]

Then, we present the details of training different neural networks

EXPERIMENTAL SETUP In this section, we describe the databases and the pre- processing steps to perform the experiments. Then, we present the details of training different neural networks. 5.1. Databases GlobalPhone Corpus: GlobalPhone [21] is a multilingual speech database consisting of high quality recordings of read speech with corresponding transcripti...

work page 2014

[6] [6]

Previously, the best performance on QUESST 2014 database was obtained using monolingual bot- tleneck features estimated using FFNs [5]

EXPERIMENTAL ANALYSIS In this section, we report and analyze the QbE-STD perfor- mance using various bottleneck features estimated from our FFN and ResNet models. Previously, the best performance on QUESST 2014 database was obtained using monolingual bot- tleneck features estimated using FFNs [5]. We implemented those models to compare with multilingual f...

work page 2014

[7] [7]

We present a performance analysis of these features using both ResNets and FFNs

CONCLUSIONS We proposed a ResNet based neural network architecture to estimate monolingual as well as multilingual bottleneck fea- tures for QbE-STD. We present a performance analysis of these features using both ResNets and FFNs. It shows that additional languages for training improves performance and the ResNets perform better than FFNs for both monolin...

work page

[8] [8]

Unsupervised pat- tern discovery in speech,

Alex S Park and James R Glass, “Unsupervised pat- tern discovery in speech,” IEEE Transactions on Audio, Speech, and Language Processing , vol. 16, no. 1, pp. 186–197, 2008

work page 2008

[9] [9]

Model-based unsu- pervised spoken term detection with spoken queries,

Chun-an Chan and Lin-shan Lee, “Model-based unsu- pervised spoken term detection with spoken queries,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1330–1342, 2013

work page 2013

[10] [10]

Unsupervised spoken keyword spotting via segmental dtw on gaus- sian posteriorgrams,

Yaodong Zhang and James R Glass, “Unsupervised spoken keyword spotting via segmental dtw on gaus- sian posteriorgrams,” in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) , 2009, pp. 398–403

work page 2009

[11] [11]

High- performance query-by-example spoken term detection on the SWS 2013 evaluation,

Luis Javier Rodriguez-Fuentes, Amparo Varona, Mike Penagarikano, Germ´an Bordel, and Mireia Diez, “High- performance query-by-example spoken term detection on the SWS 2013 evaluation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7819–7823

work page 2013

[12] [12]

Coping with channel mismatch in query- by-example-BUT QUESST 2014,

Igor Sz ¨oke, Miroslav Sk ´acel, Luk ´aˇs Burget, and Jan ˇCernock`y, “Coping with channel mismatch in query- by-example-BUT QUESST 2014,” in 2015 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5838–5842

work page 2014

[13] [13]

Unsupervised bottleneck features for low-resource query-by-example spoken term detec- tion.,

Hongjie Chen, Cheung-Chi Leung, Lei Xie, Bin Ma, and Haizhou Li, “Unsupervised bottleneck features for low-resource query-by-example spoken term detec- tion.,” in INTERSPEECH, 2016, pp. 923–927

work page 2016

[14] [14]

CNN based query by example spoken term detection,

Dhananjay Ram, Lesly Miculicich, and Herv ´e Bourlard, “CNN based query by example spoken term detection,” in Proceedings of the Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018

work page 2018

[15] [15]

Query-by-example spoken term detection using pho- netic posteriorgram templates,

Timothy J Hazen, Wade Shen, and Christopher White, “Query-by-example spoken term detection using pho- netic posteriorgram templates,” in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 421–426

work page 2009

[16] [16]

2, Springer, 2007

Meinard M ¨uller, Information retrieval for music and motion, vol. 2, Springer, 2007

work page 2007

[17] [17]

Sparse subspace modeling for query by example spo- ken term detection,

Dhananjay Ram, Afsaneh Asaei, and Herv ´e Bourlard, “Sparse subspace modeling for query by example spo- ken term detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 26, no. 6, pp. 1130–1143, June 2018

work page 2018

[18] [18]

Subspace regularized dynamic time warping for spo- ken query detection,

Dhananjay Ram, Afsaneh Asaei, and Herv ´e Bourlard, “Subspace regularized dynamic time warping for spo- ken query detection,” in Workshop on Signal Process- ing with Adaptive Sparse Structured Representations (SPARS), 2017

work page 2017

[19] [19]

Re- ducing the dimensionality of data with neural networks,

Geoffrey E Hinton and Ruslan R Salakhutdinov, “Re- ducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006

work page 2006

[20] [20]

Improved bottle- neck features using pretrained deep neural networks,

Dong Yu and Michael L Seltzer, “Improved bottle- neck features using pretrained deep neural networks,” in Twelfth annual conference of the international speech communication association, 2011

work page 2011

[21] [21]

The language- independent bottleneck features,

Karel Vesel `y, Martin Karaﬁ ´at, Franti ˇsek Gr ´ezl, Milo ˇs Janda, and Ekaterina Egorova, “The language- independent bottleneck features,” in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341

work page 2012

[22] [22]

Multitask learning,

Rich Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997

work page 1997

[23] [23]

Con- volutional neural networks for speech recognition,

Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, “Con- volutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and lan- guage processing, vol. 22, no. 10, pp. 1533–1545, 2014

work page 2014

[24] [24]

Convolutional, long short-term memory, fully connected deep neural networks,

Tara N Sainath, Oriol Vinyals, Andrew Senior, and Has ¸im Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Sig- nal Processing (ICASSP). IEEE, 2015, pp. 4580–4584

work page 2015

[25] [25]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

work page 2016

[26] [26]

Achieving Human Parity in Conversational Speech Recognition

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[27] [27]

Very deep convolutional networks for end-to-end speech recognition,

Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2017, pp. 4845–4849

work page 2017

[28] [28]

Globalphone: A multilingual text & speech database in 20 languages,

Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in 2013 IEEE International Confer- ence on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8126–8130

work page 2013

[29] [29]

Query by exam- ple search on speech at mediaeval 2014.,

Xavier Anguera, Luis Javier Rodriguez-Fuentes, Igor Sz¨oke, Andi Buzo, and Florian Metze, “Query by exam- ple search on speech at mediaeval 2014.,” inMediaEval, 2014

work page 2014

[30] [30]

The Kaldi speech recognition toolkit,

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Han- nemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011

work page 2011

[31] [31]

Deep neural networks for acoustic mod- eling in speech recognition: The shared views of four research groups,

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Se- nior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic mod- eling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE , vol. 29, no. 6, pp. 82–97, 2012

work page 2012

[32] [32]

An investigation of deep neural networks for multilingual speech recognition training and adaptation,

Sibo Tong, Philip N Garner, and Herv ´e Bourlard, “An investigation of deep neural networks for multilingual speech recognition training and adaptation,” in Pro- ceedings of the Eighteenth Annual Conference of the International Speech Communication Association (IN- TERSPEECH), 2017

work page 2017

[33] [33]

Py- torch,

Adam Paszke, Sam Gross, and Soumith Chintala, “Py- torch,” 2017, [online] http://pytorch.org/

work page 2017

[34] [34]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[35] [35]

Adam: A Method for Stochastic Optimization

Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[36] [36]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy, “Batch nor- malization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[37] [37]

thesis, Faculty of Information Technology BUT, 2008

Petr Schwarz, Phoneme recognition based on long tem- poral context , Ph.D. thesis, Faculty of Information Technology BUT, 2008

work page 2008

[38] [38]

Speechdat (e)-eastern european telephone speech databases,

Petr Poll ´ak, Jerome Boudy, Khalid Choukri, Henk Van Den Heuvel, Klara Vicsi, Attila Virag, Rainer Siemund, Wojciech Majewski, Piotr Staroniewicz, Herbert Tropf, et al., “Speechdat (e)-eastern european telephone speech databases,” in the Proc. of XLDB 2000, Workshop on V ery Large Telephone Speech Databases . Citeseer, 2000

work page 2000

[39] [39]

Mediaeval 2013 spoken web search task: system per- formance measures,

Luis J Rodriguez-Fuentes and Mikel Penagarikano, “Mediaeval 2013 spoken web search task: system per- formance measures,” n. TR-2013-1, Department of Electricity and Electronics, University of the Basque Country, 2013

work page 2013