arXiv: 1906.09417 · v2 · pith:VG5UVAN3new · submitted 2019-06-22 · 💻 cs.SD · cs.HC · cs.LG · eess.AS

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Pith reviewed 2026-05-25 18:11 UTC · model grok-4.3

classification 💻 cs.SD · cs.HC · cs.LG · eess.AS
keywords keyword spotting · hearing assistive devices · multi-task learning · deep residual network · external speaker detection · own-voice detection · Google Speech Commands Dataset

The pith

A multi-task deep residual network extends keyword spotting to jointly detect external speakers, improving KWS accuracy by around 32% relative over a system that ignores external speakers, for use in hearing assistive devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a standard small-footprint KWS residual network can be extended through multi-task learning to perform both keyword spotting and own-voice versus external-speaker classification. This joint training adds a negligible number of parameters yet yields a relative KWS accuracy gain of around 32% on a corpus derived from the Google Speech Commands Dataset that emulates hearing-aid capture. The approach addresses a practical problem: speaker-independent KWS systems can be triggered by anyone, which is unacceptable when only the device wearer should issue commands. A reader would care because the method keeps the model small enough for tiny assistive devices while enforcing user-only control.

Core claim

By following a multi-task learning scheme, a state-of-the-art deep residual network for small-footprint KWS is extended to jointly perform KWS and users' own-voice/external speaker detection with a negligible increase in the number of parameters; on a speech corpus generated from the Google Speech Commands Dataset to emulate hearing aids as the capturing device, this yields a KWS accuracy relative improvement of around 32% with respect to a system that does not deal with external speakers.
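To make the phrase "relative improvement" concrete, the arithmetic can be sketched as follows. The two accuracy values are hypothetical, chosen only so that the result lands near 32%; they are not figures from the paper, and this page does not report the underlying absolute accuracies.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative accuracy improvement of new_acc over baseline_acc, as a fraction."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies (NOT from the paper), for illustration only.
baseline = 0.70      # assumed accuracy of a system that ignores external speakers
multi_task = 0.924   # assumed accuracy of the multi-task system

gain = relative_improvement(baseline, multi_task)
print(f"relative gain: {gain:.1%}, absolute gain: {multi_task - baseline:.3f}")
```

The same relative gain corresponds to different absolute gains depending on the baseline, which is why both numbers are useful when judging the claim.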

What carries the argument

Multi-task deep residual network that jointly performs keyword spotting and own-voice/external speaker detection.
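A minimal NumPy sketch of the idea, assuming a shared trunk whose output feeds two small softmax heads. The embedding width, the number of keyword classes, and the point where the heads branch off are all assumptions made for illustration; the paper's actual layer sizes are not given on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 45     # assumed width of the shared embedding (hypothetical)
N_KEYWORDS = 12   # assumed number of KWS classes (hypothetical)
N_SPEAKER = 2     # own voice vs. external speaker


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_head(in_dim, out_dim):
    """One linear layer: weight matrix plus bias vector."""
    return {"W": rng.normal(0.0, 0.01, (in_dim, out_dim)), "b": np.zeros(out_dim)}


kws_head = init_head(FEAT_DIM, N_KEYWORDS)
spk_head = init_head(FEAT_DIM, N_SPEAKER)


def forward(shared_features):
    """shared_features: (batch, FEAT_DIM) array standing in for the trunk output."""
    kws_post = softmax(shared_features @ kws_head["W"] + kws_head["b"])
    spk_post = softmax(shared_features @ spk_head["W"] + spk_head["b"])
    return kws_post, spk_post


def n_params(head):
    return head["W"].size + head["b"].size


features = rng.normal(size=(4, FEAT_DIM))   # random stand-in for trunk output
kws_post, spk_post = forward(features)
extra = n_params(spk_head)                  # parameters added by the second head
print(kws_post.shape, spk_post.shape, extra)
```

Under these assumed sizes the second head adds only FEAT_DIM * 2 + 2 parameters, which illustrates why a shared trunk can keep the parameter increase small.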

If this is right

  • The joint task prevents external speakers from triggering the device while preserving small model size.
  • The 32% relative KWS accuracy gain is obtained with negligible added parameters.
  • The same architecture can be applied to other small electronic devices that must restrict voice commands to the owner.
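One way the two outputs could be combined at decision time is sketched below: a keyword is accepted only when the own-voice posterior clears a threshold. This decision rule, the function name, and the threshold value are assumptions for illustration; the page does not say how the paper turns the two outputs into a final decision.

```python
from typing import Optional, Sequence


def gated_keyword(kws_post: Sequence[float],
                  own_voice_prob: float,
                  labels: Sequence[str],
                  own_voice_threshold: float = 0.5) -> Optional[str]:
    """Return the most likely keyword, or None if the speaker looks external.

    The thresholding rule is a hypothetical choice, not the paper's method.
    """
    if len(kws_post) != len(labels):
        raise ValueError("kws_post and labels must have the same length")
    if own_voice_prob < own_voice_threshold:
        return None  # reject: probably an external speaker
    best = max(range(len(labels)), key=lambda i: kws_post[i])
    return labels[best]


labels = ["yes", "no"]
print(gated_keyword([0.1, 0.9], own_voice_prob=0.8, labels=labels))
print(gated_keyword([0.1, 0.9], own_voice_prob=0.2, labels=labels))
```

Raising the threshold trades missed commands from the wearer against fewer false triggers from bystanders, which is the trade-off a detection error trade-off curve makes visible.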

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could be combined with on-device noise suppression or beamforming already present in hearing aids to test additive gains.
  • Validation on real hardware recordings in varied rooms and with multiple simultaneous talkers would be the next concrete step.
  • The approach may transfer to other medical or wearable voice interfaces that require owner-only activation.

Load-bearing premise

The artificially generated speech corpus from the Google Speech Commands Dataset, modified to emulate hearing aids as the capturing device, sufficiently represents real-world acoustic conditions and speaker scenarios for hearing assistive devices.

What would settle it

Running the trained multi-task network on real recordings captured by actual hearing aids in the presence of external speakers would show whether the reported 32% relative KWS accuracy gain holds outside the emulated dataset.

Figures

Figures reproduced from arXiv: 1906.09417 by Iván López-Espejo, Jesper Jensen, Zheng-Hua Tan.

Figure 1: Diagram of the multi-task deep residual network for KWS with own-voice/external speaker detection.
… linear unit (ReLU) activation function and a batch normalization layer. Convolutional layers in the residual blocks apply dilated convolutions with a dilation rate of $(2^{\lfloor l/3 \rfloor}, 2^{\lfloor l/3 \rfloor})$, where $l = 0, \ldots, 11$ refers to the successive layers of this type and $\lfloor \cdot \rfloor$ denotes the floor function. Then, a non…
Figure 2: Every external speaker can be located in one of the 48 equidistantly spaced points (black dots) on a circumference of 1.9 meter radius. An actual person wearing a 2-microphone behind-the-ear hearing aid in her left ear is seated in the center of the circumference. The blue and red dots symbolize the front and rear microphones, respectively, of the hearing aid.
… ratio can be understood as a measure of the in…
Figure 3: Detection error trade-off curves for own-voice/external speaker detection.
Figure 4: Normalized external speaker detection accuracy as a function of the external speaker angle with respect to the user of the hearing assistive device. The user's head is centered at the origin and faces towards 0°.
… different random model initialization. Overall (i.e., over the whole test set) own-voice/external speaker detection accuracy results are also broken down by accuracies measured separately on the own-v…
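The speaker layout described in the Figure 2 caption (48 equidistant points on a circle of 1.9 m radius around the wearer) can be sketched as below. The coordinate convention, with the wearer at the origin and 0° along the positive x-axis, is an assumption made for illustration.

```python
import math

N_POINTS = 48     # from the Figure 2 caption
RADIUS_M = 1.9    # from the Figure 2 caption


def speaker_positions(n: int = N_POINTS, radius: float = RADIUS_M):
    """Return (angle_deg, x, y) for n equally spaced points on a circle.

    Assumed convention: wearer at the origin, 0 degrees along +x.
    """
    step = 360.0 / n
    positions = []
    for k in range(n):
        angle = k * step
        rad = math.radians(angle)
        positions.append((angle, radius * math.cos(rad), radius * math.sin(rad)))
    return positions


positions = speaker_positions()
print(len(positions), positions[1][0])  # number of points and angular step in degrees
```

With 48 points the angular spacing is 360 / 48 = 7.5 degrees, which sets the resolution of an accuracy-versus-angle plot such as Figure 4.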

Original abstract

Keyword spotting (KWS) is experiencing an upswing due to the pervasiveness of small electronic devices that allow interaction with them via speech. Often, KWS systems are speaker-independent, which means that any person --user or not-- might trigger them. For applications like KWS for hearing assistive devices this is unacceptable, as only the user must be allowed to handle them. In this paper we propose KWS for hearing assistive devices that is robust to external speakers. A state-of-the-art deep residual network for small-footprint KWS is regarded as a basis to build upon. By following a multi-task learning scheme, this system is extended to jointly perform KWS and users' own-voice/external speaker detection with a negligible increase in the number of parameters. For experiments, we generate from the Google Speech Commands Dataset a speech corpus emulating hearing aids as a capturing device. Our results show that this multi-task deep residual network is able to achieve a KWS accuracy relative improvement of around 32% with respect to a system that does not deal with external speakers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes extending a state-of-the-art deep residual network for small-footprint keyword spotting (KWS) via multi-task learning to jointly perform KWS and user own-voice versus external-speaker detection. This targets hearing assistive devices where only the device wearer should activate the system. A speech corpus is synthetically generated from the Google Speech Commands Dataset to emulate hearing-aid capture; the multi-task model is reported to deliver an approximately 32% relative KWS accuracy gain over a single-task baseline while adding negligible parameters.

Significance. If the reported gain proves robust under real acoustic conditions, the multi-task formulation offers an efficient route to speaker-specific KWS for hearing aids. The negligible parameter overhead is a practical strength if the result replicates.

major comments (2)
  1. [Abstract] Abstract: the 32% relative KWS accuracy improvement is stated without any description of the baseline architecture, the multi-task loss weighting values, statistical significance tests, or error bars on the accuracy figures. These omissions make the central empirical claim impossible to assess from the given information.
  2. [Abstract] Abstract / data-generation paragraph: the evaluation corpus is produced by modifying the Google Speech Commands Dataset to emulate hearing-aid microphone response, placement, and external-speaker mixing, yet the procedure is described only at a high level. Because the claimed robustness rests on faithful reproduction of near-field user voice versus far-field external-speaker geometry and frequency response, any mismatch directly undermines the measured 32% gain; real-device recordings or a sensitivity analysis of the simulation parameters are required.
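The error bars requested in the first comment could be reported along the following lines, assuming accuracy is measured over several training runs with different random initializations. The run accuracies below are hypothetical, and the 1.96 multiplier is a normal approximation; with only a handful of runs a Student-t quantile would give a wider, more honest interval.

```python
import statistics


def summarize_runs(accuracies):
    """Mean, sample standard deviation, and standard error over repeated runs."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)   # sample standard deviation (n - 1)
    sem = std / n ** 0.5                 # standard error of the mean
    return mean, std, sem


# Hypothetical accuracies from five runs (NOT results from the paper).
runs = [0.912, 0.918, 0.905, 0.921, 0.909]
mean, std, sem = summarize_runs(runs)
half_width = 1.96 * sem                  # normal-approximation 95% half-width
print(f"accuracy = {mean:.3f} +/- {half_width:.3f} (n = {len(runs)})")
```

Reporting the interval for both the baseline and the multi-task system shows whether the gap between them exceeds run-to-run variation.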

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, proposing revisions to improve clarity while defending the core contributions of the multi-task approach for hearing-assistive KWS.

point-by-point responses
  1. Referee: [Abstract] Abstract: the 32% relative KWS accuracy improvement is stated without any description of the baseline architecture, the multi-task loss weighting values, statistical significance tests, or error bars on the accuracy figures. These omissions make the central empirical claim impossible to assess from the given information.

    Authors: Abstracts are constrained by length, so full details appear in the manuscript body: the baseline is the referenced state-of-the-art deep residual network (Section 3), loss weighting is specified in the multi-task training setup (Section 4.2), and results are compared in Section 5. We will revise the abstract to briefly note the single-task baseline and multi-task extension, and add error bars to accuracy figures in the results section. Statistical significance testing can be included if supported by the experimental protocol.

    revision: partial

  2. Referee: [Abstract] Abstract / data-generation paragraph: the evaluation corpus is produced by modifying the Google Speech Commands Dataset to emulate hearing-aid microphone response, placement, and external-speaker mixing, yet the procedure is described only at a high level. Because the claimed robustness rests on faithful reproduction of near-field user voice versus far-field external-speaker geometry and frequency response, any mismatch directly undermines the measured 32% gain; real-device recordings or a sensitivity analysis of the simulation parameters are required.

    Authors: The manuscript body (Section 4.1) provides the generation procedure beyond the abstract summary. We will expand this description with explicit parameters for microphone emulation, placement, and mixing, and add a sensitivity analysis on key factors such as distance ratios and frequency responses. Real-device recordings are not feasible in this work due to hardware constraints, but the synthetic corpus follows established acoustic modeling to isolate the own-voice vs. external-speaker distinction; the relative gain holds under these controlled conditions.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical KWS accuracy gain measured on held-out synthetic data

full rationale

The paper trains a multi-task residual network on a corpus generated from Google Speech Commands and reports a 32% relative KWS accuracy improvement versus a single-task baseline. This is a standard empirical comparison; the reported figure is obtained by training and testing on data partitions, not by any equation that reduces the output to a fitted parameter or self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that joint training of KWS and speaker detection tasks yields additive gains without task interference, plus the representativeness of the synthetic corpus; no new entities are postulated.

free parameters (1)
  • multi-task loss weighting
    The relative weight between KWS loss and speaker detection loss is a training hyperparameter that must be chosen or tuned.
axioms (1)
  • domain assumption: The residual network architecture can be extended to an additional output head for speaker classification with negligible parameter increase.
    Invoked when the paper states the system is extended by following a multi-task learning scheme.
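The free parameter listed above can be illustrated with a fixed-weight combination of two cross-entropy losses. This is a sketch under an assumed form (a convex combination with a single weight); the page does not state how the paper actually combines the two losses, and the posteriors and labels below are made up for the example.

```python
import numpy as np


def cross_entropy(posteriors, targets):
    """Mean negative log-likelihood.

    posteriors: (batch, classes) class probabilities; targets: integer labels.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    targets = np.asarray(targets)
    picked = posteriors[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(np.clip(picked, 1e-12, 1.0))))


def multitask_loss(kws_post, kws_targets, spk_post, spk_targets, weight=0.5):
    """weight * KWS loss + (1 - weight) * speaker loss (assumed form)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    l_kws = cross_entropy(kws_post, kws_targets)
    l_spk = cross_entropy(spk_post, spk_targets)
    return weight * l_kws + (1.0 - weight) * l_spk


# Made-up posteriors and labels for two utterances.
kws_post = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
kws_targets = [0, 1]
spk_post = [[0.9, 0.1], [0.4, 0.6]]
spk_targets = [0, 1]

loss = multitask_loss(kws_post, kws_targets, spk_post, spk_targets, weight=0.5)
print(f"combined loss: {loss:.4f}")
```

Setting the weight to 1 recovers single-task KWS training, so sweeping it is one simple way to run the sensitivity analysis the referee asks for.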

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 3 internal anchors

  1. [1]

    Introduction Keyword spotting (KWS) aims at detecting a series of words from an audio stream comprising speech. This technology has become a popular research topic as it is considered a keystone for voice-based activation of virtual assistants (e.g., smart speakers) by means of keywords or wake-up-words [1]. Similarly, KWS may allow a hearing impaired p...

  2. [2]

    Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

    Deep Residual Learning for KWS In this section we briefly review the deep residual network for small-footprint KWS proposed in [9] (res15), as this is regarded as a basis to build upon. This architecture is based on the work of He et al. [11], where the authors proposed residual learning to tackle the performance degradation that occurs when CNNs are ...

  3. [3]

    Let X be the input speech features to the model

    Multi-task Learning for KWS and Own-Voice/External Speaker Detection We employ the state-of-the-art res15 described in the previous section to perform KWS on hearing assistive devices. Let X be the input speech features to the model. In order to also let the deep residual network detect whether the user, S_u, or an external speaker, S_e, is trying to trig...

  4. [4]

    yes”, “no

    Experimental Framework 4.1. Hearing Aid Speech Database The Google Speech Commands Dataset (GSCD) [10] is a speech database comprising 105,829 one second long utterances from a total of 2,618 different speakers. As each utterance contains only one word among a set of 35 possible words, this database is well suited for research on KWS. The GSCD also prov...

  5. [5]

    Results We test the multi-task architecture by making use of the dual-microphone signal (Dual) from the hearing assistive device and compare it with using the single-microphone signal from the front (Front) and rear (Rear) microphones, respectively. To assess the KWS performance of existing systems, which do not take the potential presence of external ...

  6. [6]

    Conclusions In this paper we have proposed a multi-task learning strategy to carry out KWS for hearing assistive devices that is robust to external speakers. This robustness is important for practical applications like the one assessed here, where our approach has been able to significantly outperform a state-of-the-art small-footprint KWS system. Further...

  7. [7]

    Acknowledgements This work was supported, in part, by the Oticon Foundation

  8. [8]

    Alexa, Siri, Cortana, and more: An introduction to voice assistants,

    M. B. Hoy, “Alexa, Siri, Cortana, and more: An introduction to voice assistants,” Medical Reference Services Quarterly, vol. 37, pp. 81–88, 2018

  9. [9]

    Convolutional recurrent neural networks for small-footprint keyword spotting,

    S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2017 – 18th Annual Conference of the International Speech Communication Association, August 20-24, Stockholm, Sweden, 2017, pp. 1606–1610

  10. [10]

    Query-by-example keyword spotting using long short-term memory networks,

    G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proceedings of ICASSP 2015 – 40th IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, Australia, 2015, pp. 5236–5240

  11. [11]

    Query-by-example spoken term detection using phonetic posteriorgram templates,

    T. J. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in Proceedings of ASRU 2009 – IEEE Workshop on Automatic Speech Recognition & Understanding, December 13-17, Merano, Italy, 2009, pp. 421–426

  12. [12]

    Small-footprint keyword spotting using deep neural networks,

    G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proceedings of ICASSP 2014 – 39th IEEE International Conference on Acoustics, Speech and Signal Processing, May 4-9, Florence, Italy, 2014, pp. 4087–4091

  13. [13]

    Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,

    R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” in Proceedings of ICASSP 2015 – 40th IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, Australia, 2015, pp. 4704–4708

  14. [14]

    Convolutional neural networks for small-footprint keyword spotting,

    T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2015 – 16th Annual Conference of the International Speech Communication Association, September 6-10, Dresden, Germany, 2015, pp. 1478–1482

  15. [15]

    An experimental analysis of the power consumption of convolutional neural networks for keyword spotting,

    R. Tang, W. Wang, Z. Tu, and J. Lin, “An experimental analysis of the power consumption of convolutional neural networks for keyword spotting,” in Proceedings of ICASSP 2018 – 43rd IEEE International Conference on Acoustics, Speech and Signal Processing, April 15-20, Calgary, Canada, 2018, pp. 5479–5483

  16. [16]

    Deep residual learning for small-footprint keyword spotting,

    R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in Proceedings of ICASSP 2018 – 43rd IEEE International Conference on Acoustics, Speech and Signal Processing, April 15-20, Calgary, Canada, 2018, pp. 5484–5488

  17. [17]

    Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

    P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv:1804.03209v1, 2018

  18. [18]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of CVPR 2016 – Conference on Computer Vision and Pattern Recognition, June 26-July 1, Las Vegas, USA, 2016, pp. 770–778

  19. [19]

    Dynamic task prioritization for multitask learning,

    M. Guo, A. Haque, D.-A. Huang, S. Yeung, and L. Fei-Fei, “Dynamic task prioritization for multitask learning,” in Proceedings of ECCV 2018 – European Conference on Computer Vision, September 8-14, Munich, Germany, 2018, pp. 270–287

  20. [20]

    GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks

    Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” arXiv:1711.02257v4, 2018

  21. [21]

    Personalized signal-independent beamforming for binaural hearing aids,

    A. H. Moore, J. M. de Haan, M. S. Pedersen, P. A. Naylor, M. Brookes, and J. Jensen, “Personalized signal-independent beamforming for binaural hearing aids,” in review

  22. [22]

    Simple audio recognition,

    TensorFlow.org Tutorials, “Simple audio recognition,” https://www.tensorflow.org/tutorials/sequences/audio_recognition

  23. [23]

    Chollet et al., “Keras,” https://keras.io, 2015

    F. Chollet et al., “Keras,” https://keras.io, 2015