arxiv: 1907.01448 · v1 · pith:VHKE6BIOnew · submitted 2019-07-02 · 📡 eess.AS · cs.SD

Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification

Pith reviewed 2026-05-25 10:22 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords sub-band CNN · spoken term classification · convolutional neural networks · computational efficiency · Speech Commands dataset · small-footprint models · acoustic feature maps

The pith

Sub-band CNN applies different kernels to acoustic feature sub-bands to cut computation by up to 49% while maintaining accuracy on spoken term classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Unlike standard CNNs, it applies different convolutional kernels on each feature sub-band because the spatial invariance of 2D kernels does not fit acoustic applications well. On the publicly available Speech Commands dataset, the sub-band CNN reduces computation by 39.7% for commands classification and 49.3% for digits classification compared to a baseline full-band CNN, with accuracy maintained. The efficiency gain is especially useful for small-footprint models in acoustic tasks.

Core claim

The sub-band CNN architecture applies different convolutional kernels on each feature sub-band to make overall computation more efficient for spoken term classification. Experimental results on the Speech Commands dataset show that this reduces computation by 39.7% on commands classification and 49.3% on digits classification while maintaining accuracy compared to a full band CNN baseline.

What carries the argument

Sub-band CNN architecture applying different convolutional kernels on each feature sub-band to improve computational efficiency for acoustic feature maps.
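To make the idea of per-band kernels concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the band boundaries, kernel sizes, input shape, and the helper names `conv2d_valid` and `sub_band_conv` are all assumptions chosen for illustration. Each kernel is applied only to its own slice of the feature axis, instead of one kernel sliding over the whole axis.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D cross-correlation of a (time, feature) map with a single kernel."""
    kt, kf = k.shape
    out_t, out_f = x.shape[0] - kt + 1, x.shape[1] - kf + 1
    out = np.zeros((out_t, out_f))
    for i in range(out_t):
        for j in range(out_f):
            out[i, j] = np.sum(x[i:i + kt, j:j + kf] * k)
    return out

def sub_band_conv(x, kernels, band_edges):
    """Apply kernels[b] only to the feature slice [lo, hi) given by band_edges[b]."""
    return [conv2d_valid(x[:, lo:hi], k) for k, (lo, hi) in zip(kernels, band_edges)]

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 12))       # hypothetical input: 10 frames x 12 feature bins
band_edges = [(0, 6), (6, 12)]          # assumed split into two non-overlapping bands
kernels = [rng.standard_normal((3, 3)) for _ in band_edges]  # one distinct kernel per band
outputs = sub_band_conv(x, kernels, band_edges)
```

Each entry of `outputs` is the response of one band to its own kernel; how the per-band outputs are then combined is a separate design choice that this sketch leaves open.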

Load-bearing premise

That applying different convolutional kernels on each feature sub-band preserves classification accuracy while lowering compute cost.

What would settle it

Running the sub-band CNN and full-band CNN on the Speech Commands dataset and finding that the sub-band version either requires more computation or achieves lower accuracy.

Figures

Figures reproduced from arXiv: 1907.01448 by Chao Wang, Chieh-Chi Kao, Ming Sun, Shiv Vitaladevuni, Yixin Gao.

Figure 1: CNN models with different weight sharing methods. (a) The baseline model proposed in [6]. (b) Applying the multi-band approach proposed in [15] to the baseline model. (c) The proposed overlapped sub-band CNN. For ease of illustration, the x-axis in this figure is the feature axis, which differs from the conventional layout.
Figure 2: Accuracy curve of different weight sharing methods on subsets of the Google Speech Commands dataset [1]. Each data point represents an average of five trials, and the error bar is the sample standard deviation of the five trials.
Figure 3: Accuracy curve of experiments on the number of sub-bands and concatenation methods for sub-band features. Commands classification is used as the testbed. Each data point represents an average of five trials, and the error bar is the sample standard deviation of the five trials.
Original abstract

This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does not fit acoustic applications well since the meaning of a specific 2D kernel varies a lot along the feature axis in an input feature map. We propose a sub-band CNN architecture to apply different convolutional kernels on each feature sub-band, which makes the overall computation more efficient. Experimental results show that the computational efficiency brought by sub-band CNN is more beneficial for small-footprint models. Compared to a baseline full band CNN for spoken term classification on a publicly available Speech Commands dataset, the proposed sub-band CNN architecture reduces the computation by 39.7% on commands classification, and 49.3% on digits classification with accuracy maintained.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a sub-band CNN architecture for spoken term classification on the Speech Commands dataset. It motivates the approach by noting that spatial invariance of standard 2D convolutional kernels fits acoustic feature maps poorly, and claims that applying different kernels per sub-band reduces computation (39.7% for commands classification, 49.3% for digits classification) while maintaining accuracy relative to a full-band baseline CNN, with particular benefits for small-footprint models.

Significance. If the empirical results hold under proper validation, the work could provide a useful efficiency technique for resource-constrained acoustic applications such as keyword spotting on embedded devices. The use of a public dataset allows direct comparison, and the focus on compute reduction without accuracy loss addresses a practical constraint in the field.

major comments (2)
  1. [Abstract] Abstract and experimental results: the central claim that accuracy is 'maintained' while achieving the stated compute reductions supplies no error bars, dataset splits, training protocol details, number of runs, or statistical tests. This absence makes the accuracy-maintenance assertion unverifiable from the reported text and is load-bearing for the efficiency claim.
  2. [Results] Results section: without reported variance across runs or baseline implementation specifics (e.g., exact kernel sizes, layer counts, or FLOPs calculation method), it is impossible to confirm the 39.7% and 49.3% reductions or to rule out that accuracy differences fall within experimental noise.
minor comments (1)
  1. [Introduction] The motivation paragraph on spatial invariance could benefit from a short concrete illustration (e.g., how a kernel's meaning changes across frequency bands) to strengthen the premise before the empirical test.
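The kind of reporting the major comments ask for can be sketched in a few lines of Python. This is an illustration only, assuming per-trial accuracies are available as plain lists; the function names are hypothetical, and the overlap check is a coarse heuristic rather than a statistical significance test.

```python
import statistics

def summarize(accuracies):
    """Return (mean, sample standard deviation) over independent trials."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def relative_reduction(baseline_cost, new_cost):
    """Fractional compute saving of new_cost relative to baseline_cost."""
    return (baseline_cost - new_cost) / baseline_cost

def intervals_overlap(runs_a, runs_b):
    """True if the mean +/- one sample std intervals of the two models overlap.

    Coarse check only; it does not replace a proper significance test.
    """
    (mean_a, std_a), (mean_b, std_b) = summarize(runs_a), summarize(runs_b)
    return abs(mean_a - mean_b) <= std_a + std_b
```

Calling `summarize` on the per-seed accuracies of each model and `relative_reduction` on their operation counts would give the mean, error bar, and percentage saving in one place, so a reader can judge whether an accuracy gap is larger than the run-to-run spread.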

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concerns regarding experimental reproducibility and statistical rigor are valid, and we will revise the manuscript to address them by adding the requested details on variance, protocol, and implementation. This will strengthen the presentation of our efficiency claims without altering the core technical contribution.

  1. Referee: [Abstract] Abstract and experimental results: the central claim that accuracy is 'maintained' while achieving the stated compute reductions supplies no error bars, dataset splits, training protocol details, number of runs, or statistical tests. This absence makes the accuracy-maintenance assertion unverifiable from the reported text and is load-bearing for the efficiency claim.

    Authors: We agree that the absence of these details limits verifiability. In the revised manuscript we will report mean accuracy and standard deviation over multiple independent training runs (with different random seeds), specify the exact train/validation/test splits from the Speech Commands dataset, describe the full training protocol (optimizer, learning rate, batch size, epochs, and regularization), state the number of runs performed, and include any statistical significance tests used to support the claim that accuracy is maintained within experimental variation. revision: yes

  2. Referee: [Results] Results section: without reported variance across runs or baseline implementation specifics (e.g., exact kernel sizes, layer counts, or FLOPs calculation method), it is impossible to confirm the 39.7% and 49.3% reductions or to rule out that accuracy differences fall within experimental noise.

    Authors: We concur that implementation specifics and variance measures are required for independent verification. The revision will explicitly list the architecture details (kernel sizes, number of layers, channel counts) for both the full-band baseline and sub-band models, describe the precise method used to compute FLOPs, and report accuracy and compute results with error bars across the same set of multiple runs. This will allow direct assessment of whether observed differences lie within noise. revision: yes
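One common way to make a compute figure checkable is to count multiply-accumulate operations (MACs) per convolutional layer. The sketch below is a generic counting rule, not the method used in the paper, which is not stated in the text reviewed here. The second function assumes a grouped arrangement in which each band's feature maps are processed only by that band's filters; that arrangement and all names are assumptions for illustration.

```python
def conv_macs(out_t, out_f, k_t, k_f, c_in, c_out):
    """MACs of one dense 2D conv layer: every output element of every output
    channel needs k_t * k_f * c_in multiply-accumulates."""
    return out_t * out_f * k_t * k_f * c_in * c_out

def per_band_macs(out_t, out_f, k_t, k_f, c_in, c_out, n_bands):
    """MACs when channels are split evenly into n_bands independent groups
    (assumed layout): each group maps c_in/n_bands to c_out/n_bands channels."""
    if c_in % n_bands or c_out % n_bands:
        raise ValueError("channel counts must be divisible by n_bands")
    per_group = conv_macs(out_t, out_f, k_t, k_f, c_in // n_bands, c_out // n_bands)
    return n_bands * per_group

# Hypothetical layer sizes, chosen only to show the arithmetic.
dense = conv_macs(10, 5, 3, 3, 8, 16)
grouped = per_band_macs(10, 5, 3, 3, 8, 16, n_bands=4)
saving = (dense - grouped) / dense
```

Under this assumed layout the cost of the layer falls by a factor of `n_bands`; the saving for a whole network depends on which layers are split and on how band outputs are merged, so it will generally be smaller than the per-layer figure.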

Circularity Check

0 steps flagged

No significant circularity

The paper contains no derivation chain, equations, or fitted parameters. Its central claim is an empirical result: on the Speech Commands dataset the sub-band CNN achieves stated compute reductions (39.7% commands, 49.3% digits) while accuracy is maintained relative to a full-band baseline. The premise about 2-D kernel spatial invariance is offered only as motivation; the experiment itself tests the efficiency-accuracy trade-off. No self-citation, ansatz, or uniqueness theorem is load-bearing. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard CNN training and a frequency-axis split whose boundaries are not stated.
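Since the split boundaries are not stated, the following is one possible way to define them, offered purely as an assumption. The hypothetical helper `band_edges` divides the feature axis into contiguous bands and optionally widens each band by a fixed number of bins on both sides, clipped to the axis limits, which yields overlapping bands.

```python
def band_edges(n_features, n_bands, overlap=0):
    """Return [lo, hi) index pairs splitting range(n_features) into n_bands
    contiguous bands, each widened by `overlap` bins per side (clipped)."""
    if not 1 <= n_bands <= n_features:
        raise ValueError("need 1 <= n_bands <= n_features")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    base, extra = divmod(n_features, n_bands)
    edges, start = [], 0
    for b in range(n_bands):
        width = base + (1 if b < extra else 0)  # spread any remainder over the first bands
        lo = max(0, start - overlap)
        hi = min(n_features, start + width + overlap)
        edges.append((lo, hi))
        start += width
    return edges
```

For example, `band_edges(40, 4)` gives four disjoint bands of ten bins each, while `band_edges(40, 4, overlap=2)` gives `[(0, 12), (8, 22), (18, 32), (28, 40)]`.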

pith-pipeline@v0.9.0 · 5705 in / 1109 out tokens · 28465 ms · 2026-05-25T10:22:34.623129+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

  1. [1]

    Introduction With the rapid development of public available datasets (e.g. spoken term classification [1], speaker identification [2, 3], acoustic event classification/detection [4, 5], etc.), state-of-the-art models for various acoustic applications can be trained with a large amount of annotated data. CNN-based architectures have achieved state-of-the-art...

  2. [2]

    cnn-trad-fpool3

    Sub-band CNN We show implementation details of the proposed sub-band CNN in this section. We chose the “cnn-trad-fpool3” model proposed in [6] as our baseline model. We used the implementation of “cnn-trad-fpool3” in Tensorflow official package [27] as the baseline, which is slightly different from the original model described in [6]. As shown in Fig. 1a...

  3. [3]

    Audio Recognition

    Experimental Results 3.1. Datasets We tested the proposed model on Google Speech Commands dataset [1], which has 35 words in the latest version (v0.02). We chose two subsets as our testbed for spoken term classification tasks. For the formulation of subsets, we use the same setup as the “Audio Recognition” tutorial in the official Tensorflow package [28]...

  4. [4]

    We compare the proposed sub-band CNNs to full band CNNs and another weight sharing approach on two spoken term classification tasks

    Conclusions In this paper, we proposed a sub-band CNN architecture and explored it for spoken term classification. We compare the proposed sub-band CNNs to full band CNNs and another weight sharing approach on two spoken term classification tasks. The proposed architecture of sub-band CNNs reduces the computation by 39.7% on commands classification, and ...

  5. [5]

    Acknowledgement The authors would like to thank Weiran Wang, Krishna Puvvada, and Wei-Ning Hsu for useful discussions

  6. [6]

    Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

    P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” CoRR, vol. abs/1804.03209, 2018

  7. [7]

    Voxceleb: a large-scale speaker identification dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620

  8. [8]

    Voxceleb2: Deep speaker recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in INTERSPEECH, 2018, pp. 1086–1090

  9. [9]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE ICASSP, 2017, pp. 776–780

  10. [10]

    DCASE 2017 challenge setup: Tasks, datasets and baseline system,

    A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 85–92

  11. [11]

    Convolutional neural networks for small-footprint keyword spotting,

    T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2015, pp. 1478–1482

  12. [12]

    Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition,

    O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition,” in IEEE ICASSP, 2012, pp. 4277–4280

  13. [13]

    Very deep convolutional neural networks for noise robust speech recognition,

    Y. Qian, M. Bi, T. Tan, and K. Yu, “Very deep convolutional neural networks for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263–2276, Dec 2016

  14. [14]

    Deep convolutional neural networks and data augmentation for acoustic event recognition,

    N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool, “Deep convolutional neural networks and data augmentation for acoustic event recognition,” in INTERSPEECH, 2016, pp. 2982–2986

  15. [15]

    CNN architectures for large-scale audio classification,

    S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in IEEE ICASSP, 2017, pp. 131–135

  16. [16]

    Semi-supervised acoustic event detection based on tri-training,

    B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Semi-supervised acoustic event detection based on tri-training,” in IEEE ICASSP, 2019, pp. 750–754

  17. [17]

    Hierarchical residual-pyramidal model for large context based media presence detection,

    Q. Tang, M. Sun, C. Kao, V. Rozgic, and C. Wang, “Hierarchical residual-pyramidal model for large context based media presence detection,” in IEEE ICASSP, 2019, pp. 3312–3316

  18. [18]

    Rare sound event detection using 1D convolutional recurrent neural networks,

    H. Lim, J. Park, and Y. Han, “Rare sound event detection using 1D convolutional recurrent neural networks,” DCASE2017 Challenge, Tech. Rep., September 2017

  19. [19]

    R-CRNN: region-based convolutional recurrent neural network for audio event detection,

    C. Kao, W. Wang, M. Sun, and C. Wang, “R-CRNN: region-based convolutional recurrent neural network for audio event detection,” in INTERSPEECH, 2018, pp. 1358–1362

  20. [20]

    Multi-scale multi-band densenets for audio source separation,

    N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 21–25

  21. [21]

    Small-footprint keyword spotting using deep neural networks,

    G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in IEEE ICASSP, 2014, pp. 4087–4091

  22. [22]

    Streaming small-footprint keyword spotting using sequence-to-sequence models,

    Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 474–481

  23. [23]

    Convolutional recurrent neural networks for small-footprint keyword spotting,

    S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” in INTERSPEECH, 2017, pp. 1606–1610

  24. [24]

    Deep residual learning for small-footprint keyword spotting,

    R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in IEEE ICASSP, 2018, pp. 5484–5488

  25. [25]

    Model compression applied to small-footprint keyword spotting,

    G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni, “Model compression applied to small-footprint keyword spotting,” in INTERSPEECH, 2016, pp. 1878–1882

  26. [26]

    Compressed time delay neural network for small-footprint keyword spotting,

    M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” in INTERSPEECH, 2017, pp. 3607–3611

  27. [27]

    Knowledge distillation for small-footprint highway networks,

    L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” in IEEE ICASSP, 2017, pp. 4820–4824

  28. [28]

    Compression of end-to-end models,

    R. Pang, T. Sainath, R. Prabhavalkar, S. Gupta, Y. Wu, S. Zhang, and C.-C. Chiu, “Compression of end-to-end models,” in INTERSPEECH, 2018, pp. 27–31

  29. [29]

    Compression of acoustic event detection models with quantized distillation,

    B. Shi, M. Sun, C. Kao, V. Rozgic, S. Matsoukas, and C. Wang, “Compression of acoustic event detection models with quantized distillation,” to appear in INTERSPEECH, 2019

  30. [30]

    Exploring convolutional neural network structures and optimization techniques for speech recognition,

    O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in INTERSPEECH, 2013, pp. 3366–3370

  31. [31]

    Subspectralnet - using sub-spectrogram based convolutional neural networks for acoustic scene classification,

    S. S. R. Phaye, E. Benetos, and Y. Wang, “Subspectralnet - using sub-spectrogram based convolutional neural networks for acoustic scene classification,” in IEEE ICASSP, 2019, pp. 825–829

  32. [32]

    TensorFlow: Large-scale machine learning on heterogeneous systems,

    M. Abadi, A. Agarwal et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

  33. [33]

    TensorFlow: Simple audio recognition

    “TensorFlow: Simple audio recognition.” [Online]. Available: https://www.tensorflow.org/tutorials/sequences/audio_recognition/