arxiv: 1907.00443 · v1 · pith:RG3LJSMBnew · submitted 2019-06-30 · 💻 cs.CL · cs.HC · cs.LG · cs.SD · eess.AS

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Pith reviewed 2026-05-25 12:36 UTC · model grok-4.3

classification 💻 cs.CL · cs.HC · cs.LG · cs.SD · eess.AS
keywords bottleneck features · query by example · spoken term detection · residual networks · multilingual features · dynamic time warping · GlobalPhone · QUESST 2014

The pith

Residual networks produce better multilingual bottleneck features for query-by-example spoken term detection than feedforward networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how bottleneck features from neural networks affect query-by-example spoken term detection performance. It first compares features extracted from monolingual and multilingual feedforward networks. The authors then replace the feedforward networks with residual networks to generate the bottleneck features. Experiments on the QUESST 2014 database show that the ResNet versions deliver higher detection accuracy. All networks are trained on the GlobalPhone corpus before the features are fed into dynamic time warping for matching.

Core claim

Bottleneck features estimated with residual networks outperform the corresponding feedforward-network features in query-by-example spoken term detection. The study first evaluates monolingual and multilingual feedforward networks, then demonstrates that switching to residual networks yields significant gains when the networks are trained on the GlobalPhone corpus and evaluated on the challenging QUESST 2014 database.

What carries the argument

Residual networks (ResNet) used to estimate bottleneck features for dynamic time warping template matching.
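
As a rough illustration of the DTW template matching mentioned above, the sketch below aligns a query's feature frames against any span of a longer document and returns a length-normalised cost. It is a minimal sketch, not the paper's implementation: the cosine frame distance, the step pattern, the normalisation by query length, and the function names are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def subsequence_dtw_cost(query, document):
    """Lowest cost of aligning the whole query to any span of the document.

    query and document are lists of feature frames (lists of floats).
    The cost is divided by the query length (an assumed normalisation).
    """
    m, n = len(query), len(document)
    dist = [[cosine_distance(q, d) for d in document] for q in query]
    acc = [[math.inf] * n for _ in range(m)]
    acc[0] = list(dist[0])  # the match may start at any document frame
    for i in range(1, m):
        acc[i][0] = acc[i - 1][0] + dist[i][0]
        for j in range(1, n):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = dist[i][j] + best_prev
    return min(acc[-1]) / m  # the match may end at any document frame


# Toy frames: the query is an exact excerpt of the document.
document = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
query = document[1:3]
matched_cost = subsequence_dtw_cost(query, document)
```

In a sketch like this a lower cost indicates a closer match, so an excerpt of the document scores near zero while unrelated frames score higher.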

If this is right

  • Multilingual training of the networks improves detection over monolingual training.
  • ResNet architecture produces measurable gains over feedforward networks on the same data.
  • The resulting features support effective matching on the difficult QUESST 2014 evaluation set.
  • GlobalPhone training supplies the multilingual coverage needed for cross-language term detection.
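
The paper's section on residual networks describes each stacked layer as modelling y = f(x) + x, where f(.) stands for two convolutional layers. The sketch below illustrates only that shortcut structure. It is a simplification under stated assumptions: it works on a 1-D signal rather than the paper's input matrix, the ReLU between the two convolutions and the odd-length kernels are choices made here, and all names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd-length kernel so the padding is symmetric.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + t] for t, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Element-wise rectifier (an assumed choice of non-linearity)."""
    return [max(0.0, v) for v in values]


def residual_block(x, kernel_a, kernel_b):
    """Return y = f(x) + x, with f = conv -> ReLU -> conv (assumed form)."""
    fx = conv1d_same(relu(conv1d_same(x, kernel_a)), kernel_b)
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 3.0]
# With all-zero kernels f(x) is zero, so the block reduces to the identity.
identity_out = residual_block(x, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
# With pass-through kernels f(x) is relu(x), so y = relu(x) + x.
passthrough_out = residual_block(x, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

The zero-kernel case shows why the shortcut matters in this sketch: even when the convolutional path contributes nothing, the input still passes through unchanged.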

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ResNet-based feature extraction could be tested on other spoken-term or keyword-spotting benchmarks.
  • Architectural upgrades from feedforward to residual layers may transfer to different audio feature pipelines.
  • Further accuracy lifts might appear if deeper residual blocks are combined with the existing multilingual training regime.

Load-bearing premise

The performance differences observed on QUESST 2014 are caused by the choice of multilingual training and ResNet architecture rather than by other uncontrolled factors in network training or DTW implementation.

What would settle it

Re-train identical models on the same GlobalPhone data, changing only the architecture from feedforward to ResNet. If QbE-STD accuracy on QUESST 2014 then remains statistically unchanged, the architectural claim fails.
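
One way such a check could be run is a paired bootstrap over per-query scores from the two systems. The sketch below is an illustration under assumptions, not a procedure from the paper: it presumes a hypothetical per-query score is available for each system, and the function name, resample count, and percentile interval are choices made here.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean paired difference (b - a)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-query scores for a feedforward (a) and a ResNet (b) system.
ffn_scores = [0.1, 0.2, 0.3]
resnet_scores = [0.2, 0.3, 0.4]
lower, upper = paired_bootstrap_ci(ffn_scores, resnet_scores)
```

If the interval for the mean difference excludes zero, the two systems differ on that score at the chosen level; an interval that straddles zero would be consistent with the "statistically unchanged" outcome.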

Original abstract

State of the art solutions to query by example spoken term detection (QbE-STD) usually rely on bottleneck feature representation of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present a study on QbE-STD performance using several monolingual as well as multilingual bottleneck features extracted from feed forward networks. Then, we propose to employ residual networks (ResNet) to estimate the bottleneck features and show significant improvements over the corresponding feed forward network based features. The neural networks are trained on GlobalPhone corpus and QbE-STD experiments are performed on a very challenging QUESST 2014 database.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies query-by-example spoken term detection (QbE-STD) by extracting bottleneck features from feed-forward networks and residual networks (ResNet) trained on the GlobalPhone corpus. It compares monolingual and multilingual training regimes and reports that ResNet-based features yield significant improvements over the corresponding feed-forward features when used with DTW matching on the QUESST 2014 evaluation set.

Significance. If the performance gains are shown to arise specifically from the ResNet architecture under matched training conditions, the work would supply a concrete, reproducible baseline for multilingual bottleneck features in QbE-STD. The use of a standard corpus (GlobalPhone) and a challenging public benchmark (QUESST 2014) is a positive aspect that would facilitate future comparisons.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (experimental results): the claim of 'significant improvements' is presented without any numerical values, confidence intervals, or statistical significance tests. This absence prevents assessment of whether the observed differences are large enough or reliable enough to support the central architectural claim.
  2. [§3 and §4] §3 (network training) and §4: no statement confirms that the feed-forward and ResNet extractors were trained with identical optimizer, learning-rate schedule, batch size, data ordering, or early-stopping criterion. Because the central claim attributes gains to the residual connections rather than to any of these uncontrolled factors, the lack of such controls is load-bearing for the reported conclusion.
minor comments (2)
  1. [Title and Introduction] The title emphasizes 'Multilingual' features, yet the abstract and experiments also present monolingual results; a short clarifying sentence in the introduction would help readers understand the intended scope.
  2. [§2] Notation for the bottleneck dimension and the DTW distance measure should be defined once at first use rather than assumed from prior QbE-STD literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experimental results): the claim of 'significant improvements' is presented without any numerical values, confidence intervals, or statistical significance tests. This absence prevents assessment of whether the observed differences are large enough or reliable enough to support the central architectural claim.

    Authors: We agree that the abstract and results section should include explicit numerical values to support the claim. In the revised manuscript we will update the abstract to report the key performance metrics (e.g., actual ATWV or EER figures from the QUESST 2014 experiments) and will add confidence intervals together with statistical significance tests (paired t-test or similar) in §4. revision: yes

  2. Referee: [§3 and §4] §3 (network training) and §4: no statement confirms that the feed-forward and ResNet extractors were trained with identical optimizer, learning-rate schedule, batch size, data ordering, or early-stopping criterion. Because the central claim attributes gains to the residual connections rather than to any of these uncontrolled factors, the lack of such controls is load-bearing for the reported conclusion.

    Authors: We acknowledge the manuscript does not explicitly state the training controls. Both networks were in fact trained under identical conditions (same optimizer, learning-rate schedule, batch size, data ordering via the same random seed, and early-stopping criterion on validation loss). We will add a dedicated paragraph in §3 documenting these matched settings so that the comparison isolates the effect of the residual connections. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of network architectures on external benchmarks

Full rationale

The paper reports an empirical study comparing bottleneck features extracted from feed-forward networks versus residual networks, trained on GlobalPhone and evaluated via DTW on the independent QUESST 2014 corpus. No equations, derivations, or first-principles predictions appear; the central claim is a measured performance delta between two trained models. This is a standard experimental result whose validity rests on training controls and statistical significance rather than any definitional or self-referential reduction. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatzes are present. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work rests on the standard domain assumption that bottleneck features are useful for DTW-based matching.

axioms (1)
  • domain assumption: Bottleneck features extracted from neural networks trained on speech data provide useful representations for dynamic time warping template matching in QbE-STD.
    This premise is required for the entire experimental pipeline described in the abstract.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 5 internal anchors

  1. [1]

    INTRODUCTION Query-by-example spoken term detection (QbE-STD) is the task of detecting audio documents from an archive, which contain a spoken query provided by a user. In contrast to textual queries in keyword spotting, QbE-STD requires spoken queries which enables a language independent search without the need of a full speech recognition system. Th...

  2. [2]

    Generally, the network architecture consists of a shared part and several task-dependent parts

    MULTITASK LEARNING Multitask learning [14, 15] has been used to exploit similarities across tasks resulting in an improved learning efficiency when compared to training each task separately. Generally, the network architecture consists of a shared part and several task-dependent parts. In order to obtain multilingual bottleneck features we model pho...

  3. [3]

    FEED FORWARD NETWORKS Feed forward networks have been traditionally used to obtain bottleneck features for speech related tasks [5, 13, 14]. Here, we describe the different architectures employed in this study as shown in Figure 1: (a) Monolingual: our monolingual FFN architecture consists of 3 fully connected layers of 1024 neurons each, followed by ...

  4. [4]

    Skipping layers effectively simplifies the training and gives flexibility to the network

    RESIDUAL NETWORKS A Residual Network [18] is a CNN with shortcut connections between its stacked layers. Skipping layers effectively simplifies the training and gives flexibility to the network. Given an input matrix x and an output matrix y, it models the function y = f(x) + x in each stacked layer, where f(.) represents two convolutional layers with a non...

  5. [5]

    Then, we present the details of training different neural networks

    EXPERIMENTAL SETUP In this section, we describe the databases and the pre-processing steps to perform the experiments. Then, we present the details of training different neural networks. 5.1. Databases GlobalPhone Corpus: GlobalPhone [21] is a multilingual speech database consisting of high quality recordings of read speech with corresponding transcripti...

  6. [6]

    Previously, the best performance on QUESST 2014 database was obtained using monolingual bottleneck features estimated using FFNs [5]

    EXPERIMENTAL ANALYSIS In this section, we report and analyze the QbE-STD performance using various bottleneck features estimated from our FFN and ResNet models. Previously, the best performance on QUESST 2014 database was obtained using monolingual bottleneck features estimated using FFNs [5]. We implemented those models to compare with multilingual f...

  7. [7]

    We present a performance analysis of these features using both ResNets and FFNs

    CONCLUSIONS We proposed a ResNet based neural network architecture to estimate monolingual as well as multilingual bottleneck features for QbE-STD. We present a performance analysis of these features using both ResNets and FFNs. It shows that additional languages for training improve performance and the ResNets perform better than FFNs for both monolin...

  8. [8]

    Unsupervised pattern discovery in speech,

    Alex S Park and James R Glass, “Unsupervised pattern discovery in speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, 2008

  9. [9]

    Model-based unsupervised spoken term detection with spoken queries,

    Chun-an Chan and Lin-shan Lee, “Model-based unsupervised spoken term detection with spoken queries,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1330–1342, 2013

  10. [10]

    Unsupervised spoken keyword spotting via segmental dtw on gaussian posteriorgrams,

    Yaodong Zhang and James R Glass, “Unsupervised spoken keyword spotting via segmental dtw on gaussian posteriorgrams,” in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 398–403

  11. [11]

    High-performance query-by-example spoken term detection on the SWS 2013 evaluation,

    Luis Javier Rodriguez-Fuentes, Amparo Varona, Mike Penagarikano, Germán Bordel, and Mireia Diez, “High-performance query-by-example spoken term detection on the SWS 2013 evaluation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7819–7823

  12. [12]

    Coping with channel mismatch in query-by-example-BUT QUESST 2014,

    Igor Szöke, Miroslav Skácel, Lukáš Burget, and Jan Černocký, “Coping with channel mismatch in query-by-example-BUT QUESST 2014,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5838–5842

  13. [13]

    Unsupervised bottleneck features for low-resource query-by-example spoken term detection.,

    Hongjie Chen, Cheung-Chi Leung, Lei Xie, Bin Ma, and Haizhou Li, “Unsupervised bottleneck features for low-resource query-by-example spoken term detection.,” in INTERSPEECH, 2016, pp. 923–927

  14. [14]

    CNN based query by example spoken term detection,

    Dhananjay Ram, Lesly Miculicich, and Hervé Bourlard, “CNN based query by example spoken term detection,” in Proceedings of the Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018

  15. [15]

    Query-by-example spoken term detection using phonetic posteriorgram templates,

    Timothy J Hazen, Wade Shen, and Christopher White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009, pp. 421–426

  16. [16]

    Information retrieval for music and motion

    Meinard Müller, Information retrieval for music and motion, vol. 2, Springer, 2007

  17. [17]

    Sparse subspace modeling for query by example spoken term detection,

    Dhananjay Ram, Afsaneh Asaei, and Hervé Bourlard, “Sparse subspace modeling for query by example spoken term detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 6, pp. 1130–1143, June 2018

  18. [18]

    Subspace regularized dynamic time warping for spoken query detection,

    Dhananjay Ram, Afsaneh Asaei, and Hervé Bourlard, “Subspace regularized dynamic time warping for spoken query detection,” in Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS), 2017

  19. [19]

    Reducing the dimensionality of data with neural networks,

    Geoffrey E Hinton and Ruslan R Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006

  20. [20]

    Improved bottleneck features using pretrained deep neural networks,

    Dong Yu and Michael L Seltzer, “Improved bottleneck features using pretrained deep neural networks,” in Twelfth annual conference of the international speech communication association, 2011

  21. [21]

    The language-independent bottleneck features,

    Karel Veselý, Martin Karafiát, František Grézl, Miloš Janda, and Ekaterina Egorova, “The language-independent bottleneck features,” in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341

  22. [22]

    Multitask learning,

    Rich Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997

  23. [23]

    Convolutional neural networks for speech recognition,

    Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014

  24. [24]

    Convolutional, long short-term memory, fully connected deep neural networks,

    Tara N Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584

  25. [25]

    Deep residual learning for image recognition,

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  26. [26]

    Achieving Human Parity in Conversational Speech Recognition

    Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016

  27. [27]

    Very deep convolutional networks for end-to-end speech recognition,

    Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4845–4849

  28. [28]

    Globalphone: A multilingual text & speech database in 20 languages,

    Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, “Globalphone: A multilingual text & speech database in 20 languages,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8126–8130

  29. [29]

    Query by example search on speech at mediaeval 2014.,

    Xavier Anguera, Luis Javier Rodriguez-Fuentes, Igor Szöke, Andi Buzo, and Florian Metze, “Query by example search on speech at mediaeval 2014.,” in MediaEval, 2014

  30. [30]

    The Kaldi speech recognition toolkit,

    Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011

  31. [31]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,

    Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012

  32. [32]

    An investigation of deep neural networks for multilingual speech recognition training and adaptation,

    Sibo Tong, Philip N Garner, and Hervé Bourlard, “An investigation of deep neural networks for multilingual speech recognition training and adaptation,” in Proceedings of the Eighteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017

  33. [33]

    Pytorch,

    Adam Paszke, Sam Gross, and Soumith Chintala, “Pytorch,” 2017, [online] http://pytorch.org/

  34. [34]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

  35. [35]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  36. [36]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015

  37. [37]

    Phoneme recognition based on long temporal context

    Petr Schwarz, Phoneme recognition based on long temporal context, Ph.D. thesis, Faculty of Information Technology BUT, 2008

  38. [38]

    Speechdat (e)-eastern european telephone speech databases,

    Petr Pollák, Jerome Boudy, Khalid Choukri, Henk Van Den Heuvel, Klara Vicsi, Attila Virag, Rainer Siemund, Wojciech Majewski, Piotr Staroniewicz, Herbert Tropf, et al., “Speechdat (e)-eastern european telephone speech databases,” in the Proc. of XLDB 2000, Workshop on Very Large Telephone Speech Databases. Citeseer, 2000

  39. [39]

    Mediaeval 2013 spoken web search task: system performance measures,

    Luis J Rodriguez-Fuentes and Mikel Penagarikano, “Mediaeval 2013 spoken web search task: system performance measures,” n. TR-2013-1, Department of Electricity and Electronics, University of the Basque Country, 2013